By Donavyn Coffey, Wired, 28 April 2021. Te Hiku Media gathered huge swathes of Māori language data. Corporates are now trying to get the rights to it.
In March 2018, Peter-Lucas Jones and the ten other staff at Te Hiku Media, a small non-profit radio station nestled just below New Zealand’s most northern tip, were in disbelief. In ten days, thanks to a competition it had started, Māori speakers across New Zealand had recorded over 300 hours of annotated audio in their mother tongue. It was enough data to build language tech for te reo Māori, the Māori language – including automatic speech recognition and speech-to-text.
The small staff of Māori language broadcasters and one engineer were about to become pioneers in indigenous speech recognition technology. But building the tools was only half the battle. Te Hiku soon found itself fending off corporate entities trying to develop their own indigenous data sets and resisting detrimental western approaches to data sharing. Guarding their data became the priority because the only people truly interested in revitalising the Māori language were the Māori people themselves.
Languages around the world are dying – the UN estimates that an indigenous language dies every two weeks. Racist assimilation policies are largely to blame. Well into the 20th century, Māori children were often punished with shame or physical beatings when they spoke their native language in schools. As a result, when that generation reached adulthood, many chose not to pass on the language to their own children to protect them from the same types of persecution. This was a major cause of Māori language decline between 1920 and 1960. Now, the fluent population within many indigenous groups is both shrinking and aging. The language – and the traditional knowledge embedded in it – are both at risk of extinction.
Jones, the CEO of Te Hiku, and Keoni Mahelona, the chief technology officer, started to see a need for speech recognition after they digitised the massive audio collection Te Hiku had accumulated over 30 years of radio broadcasting. “We’d captured all these idiomatic phrases, colloquialism and unique phrases,” Jones says. It was the native sound of their language – one less adulterated by English and time. But to make this resource useful to Māori people living across the country and the world, Te Hiku would need to transcribe the audio. To transcribe the thousands of hours of Māori audio, they’d need to teach the computer to speak their language.
The tools for building speech-to-text systems – which allow Te Hiku to transcribe their radio content – and other speech recognition technology are fairly accessible, such as Mozilla’s open-source tool Deep Speech. The real challenge for indigenous communities is a lack of annotated data to build with. Creating speech recognition tools from scratch, with no prior data, typically requires around 10,000 hours of annotated audio, according to Kelly Davis, cofounder of Coqui, a start-up for open-source speech technology. That’s an extremely daunting, if not impossible, requirement for small indigenous languages with little prior documentation.
But with just its initial 320 hours of data, Te Hiku was able to build a speech-to-text engine with an initial word error rate of 14 per cent, according to Mahelona, a Native Hawaiian who’s been working at Te Hiku for seven years. For reference, Google’s ASR achieves a word error rate of 6.7 per cent with a 12,500-hour data set, according to one 2018 conference abstract. “The fact that they are getting word error rates that low for just over 300 hours, for a language that basically didn’t have speech recognition before, that’s very impressive,” Davis says.
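For readers unfamiliar with the metric, word error rate is conventionally computed as the minimum number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of that standard calculation in Python. It is not Te Hiku's or Google's evaluation code; the function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not taken from any ASR toolkit.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Made-up example: one word dropped from a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.1%}")
```

On this definition, a word error rate of 14 per cent means roughly one word in seven needs correcting when the transcript is compared with a human reference.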
Mahelona and Jones started presenting their success at conferences. It’s not important that they were the first to build ASR tools for an indigenous language, Mahelona says, “but that we proved it was possible.” Language revitalisation experts from other indigenous communities, including the Mohawk in southeastern Canada and the Native People of Hawaii, have approached Te Hiku about using its code and mimicking its strategy. “Technology is a force multiplier,” says Nathan Brinklow, professor of Mohawk at Queen’s University, Canada. “They are leading the way. But this is something regular people can do.”
By the summer of 2018, Te Hiku had already reduced its word error rate to ten per cent. (The tech has not been externally validated.) And that’s around the time it got a request from Lion Bridge, an American company that, according to its website, specialises in “translation and localisation solutions for global enterprises.”
“They basically sell globalisation as a service,” Mahelona says. He says that on behalf of a client, Lion Bridge contacted several Māori academics and radio groups to offer $45 (US) an hour to anyone who would provide Māori audio. All they had to do was speak Māori into their phone, Mahelona recalled. “We realised that $45 could seem like a lot to some members of our community,” Mahelona says. Lion Bridge did not respond to a request for comment for this article.
So after Te Hiku rejected the offer from Lion Bridge, Mahelona and Jones published their rejection along with a video explaining why and the risk in selling their language to an American corporation. The Te Hiku team see data as the final frontier for colonisation. “They suppressed our languages and physically beat it out of our grandparents,” Jones says. “And now they want to sell our language back to us as a service.”
Te Hiku is adamant that the only people who should profit from the Māori language are the Māori people themselves. And Te Hiku guards that right fiercely by maintaining sovereignty over the Māori data it has gathered over 30 years. “We don’t trade our values for anything,” Mahelona says. “We aren’t going to sell the data or give it away for research.”
Selling or giving away the data invites western corporations to mine their language – and the thousands of years of traditional knowledge therein – for commercial opportunity, Jones says. It would mean entrusting data scientists with no connection to the language to develop the very tools that will shape the future of the language. And worst of all, it would mean that Māori would miss out on the economic opportunities created using the language that belongs to them, much like they didn’t see the economic benefits of the land that belonged to them. “We are guarding against history repeating itself,” Jones says. Protecting their data ensures the Māori people maintain the right to self-determination.
Te Hiku has since fielded around a dozen requests for its data or its ASR model. In late 2018, Davis was still working with open-source speech tech at Mozilla. He approached the team at Te Hiku, who he’d been working with on and off for over a year, about adding its data to Mozilla’s open-source database, Common Voice. Again, the team was quick to decline.
“While we recognise the value of open-source, we also realise the majority of [our] people don’t have the resources to take advantage of it,” Jones says. Since the Māori people haven’t been afforded the same opportunities for education and advancement as many of the people who regularly make use of open-source databases, Jones says making their data open-source doesn’t work to the benefit of his people. After hearing Te Hiku’s explanation, “a light bulb went off,” Davis says. “It makes complete sense” why they would want to retain control over their data.
Where Te Hiku does form partnerships, namely with universities, the terms are meticulously laid out based on Te Hiku’s data license. According to the license, the project must directly benefit the Māori people and any project created using Māori data belongs to the Māori people. This ensures that future economic opportunities always belong to the communities from which the data was gathered.
Thanks to a $13 million grant in 2019, the Te Hiku team includes five additional data scientists and five new Māori language experts. It is now developing and refining language tools that not only preserve the language, but restore the integrity of the original sound. Its newest language app, which just reached the demo stage, is intended to help current speakers refine pronunciation and remove some of the influence of English. “We are decolonising the sound of our language,” Jones says. “We want to speak the native sound into the future of our language.”
Speed is key. There are technologies, such as semi-supervised learning, which requires very little labeled data, that could eventually allow tech companies to develop language services without seeking out cultural knowledge, Mahelona says. In the meantime, the team at Te Hiku is in a rush to develop the necessary tools first – spell check, grammar assistants, virtual language tutors. Whatever it takes, “we need to create better alternatives,” Mahelona says. “We want to provide a better place [online] for all indigenous people.”