Preserving indigenous languages with AI
- Māori people’s indigenous language, known as te reo, is being preserved with AI.
- Te Hiku Media is developing automatic speech recognition (ASR) models for te reo, which is a Polynesian language.
- The efforts have spurred similar ASR projects now underway by Native Hawaiians and the Mohawk people in southeastern Canada.
The United Nations says that 96% of the world’s approximately 6,700 languages are spoken by only 3% of the world’s population. Although indigenous peoples make up less than 6% of the global population, they speak more than 4,000 of the world’s languages.
More concerning still, some 3,000 indigenous languages could disappear before the end of the century. Put simply, indigenous languages are on a path to becoming endangered or extinct if no proper preservation efforts are made.
Traditionally, these languages are passed down from generation to generation. However, as society evolves and dominant languages like English, Mandarin and Spanish take over everyday communication, speakers increasingly focus on mastering those languages instead.
When that happens, indigenous languages can end up being forgotten. Today, native speakers of some indigenous languages number only a few hundred or fewer. While there is work underway to preserve and safeguard these languages, a language that no one uses in conversation will eventually become a memory, and then a historical curiosity.
How AI can save indigenous languages
In New Zealand, a local broadcaster is now focused on preserving the Māori people’s indigenous language, known as te reo, with AI. Specifically, the broadcaster, Te Hiku Media, is developing automatic speech recognition (ASR) models for te reo, which is a Polynesian language. The work is built on an ethical, transparent method of speech data collection and analysis designed to maintain data sovereignty for the Māori people.
Built using the open source Nvidia NeMo toolkit for ASR and Nvidia A100 Tensor Core GPUs, the speech-to-text models transcribe te reo with 92% accuracy. They can also transcribe bilingual speech using English and te reo with 82% accuracy. They’re pivotal tools, made by and for the Māori people, that are helping preserve and amplify their stories.
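For readers curious what ASR inference with NeMo looks like in practice, here is a minimal sketch of loading a pre-trained checkpoint and transcribing audio. Te Hiku Media’s te reo models are not publicly named in this article, so the checkpoint below is a publicly available English stand-in and the audio paths are hypothetical.

```python
# Minimal sketch of NeMo-based speech-to-text inference (illustrative only).
# The te reo Māori checkpoints are Te Hiku Media's own; the public English
# model below is a stand-in, and the audio file paths are hypothetical.
import nemo.collections.asr as nemo_asr

# Load a pre-trained ASR model from NVIDIA's public model catalog.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_ctc_small"  # stand-in for a te reo model
)

# Transcribe a small batch of audio clips (placeholder paths).
clips = ["archive_clip_001.wav", "archive_clip_002.wav"]
transcripts = asr_model.transcribe(clips)

for path, text in zip(clips, transcripts):
    print(path, "->", text)
```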
“There’s immense value in using Nvidia’s open source technologies to build the tools we need to ultimately achieve our mission, which is the preservation, promotion and revitalization of te reo Māori,” said Keoni Mahelona, chief technology officer at Te Hiku Media, who leads a team of data scientists and developers, as well as Māori language experts and data curators, working on the project.
“We’re also helping guide the industry on ethical ways of using data and technologies to ensure they’re used for the empowerment of marginalized communities,” added Mahelona, a Native Hawaiian now living in New Zealand.
Given concerns around privacy, copyright and other issues with uploading video and audio to popular global platforms, Te Hiku Media decided to build its own content distribution platform. Called Whare Kōrero, the platform now holds more than 30 years’ worth of digitized, archival material featuring about 1,000 hours of te reo native speakers.
“It’s an invaluable resource of acoustic data,” Mahelona said.
Despite the success of the platform, the team soon realized that manual transcription demanded a great deal of time and effort from limited resources. The organization therefore turned to trustworthy AI, using ASR to accelerate its work.
“No one would have a clue that there are eight Nvidia A100 GPUs in our derelict, rundown, musty-smelling building in the far north of New Zealand — training and building Māori language models,” Mahelona said. “But the work has been game-changing for us.”
A clear and transparent process
To collect speech data in a transparent, ethically compliant, community-oriented way, Te Hiku Media began by explaining its cause to elders, garnering their support and asking them to come to the station to read phrases aloud.
“It was really important that we had the support of the elders and that we recorded their voices because that’s the sort of content we want to transcribe,” Mahelona said. “But eventually these efforts didn’t scale — we needed second-language learners, kids, middle-aged people and a lot more speech data in general.”
In addition to other open source AI tools, Te Hiku Media now uses the Nvidia NeMo toolkit’s ASR module for speech AI throughout its entire pipeline. The NeMo toolkit comprises building blocks called neural modules and includes pre-trained models for language model development.
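As a rough illustration of how such a pipeline is typically fed, NeMo’s ASR recipes expect training data as a JSON-lines manifest pairing audio files with transcripts, which a pre-trained model can then be fine-tuned on. The sketch below shows that format; the file names, durations, phrases and checkpoint are assumptions for illustration, not Te Hiku Media’s actual configuration.

```python
# Sketch of preparing data for NeMo ASR fine-tuning (illustrative only).
# File names, durations, transcripts and the checkpoint are placeholders.
import json
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# NeMo expects a JSON-lines manifest: one {audio, duration, text} record per line.
records = [
    {"audio_filepath": "clips/elder_001.wav", "duration": 4.2, "text": "kia ora koutou katoa"},
    {"audio_filepath": "clips/elder_002.wav", "duration": 3.7, "text": "tēnā koutou"},
]
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Start from a pre-trained checkpoint and point it at the new manifest.
model = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_en_conformer_ctc_small")
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "labels": model.decoder.vocabulary,  # reuse the model's existing character set
    "batch_size": 8,
}))
# Training itself would then run under a PyTorch Lightning Trainer (omitted here).
```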
These efforts have spurred similar ASR projects now underway among Native Hawaiians and the Mohawk people in southeastern Canada.
The success of preserving these languages with AI showcases one of the many benefits the technology can bring to society. While the initial process can be challenging, once the models have been trained on the data, they can transcribe new speech far faster than manual efforts ever could.
Just as AI use cases are expanding translation capabilities, AI’s ability to help preserve indigenous languages could be the solution needed to ensure these languages are not forgotten.
“It’s indigenous-led work in trustworthy AI that’s inspiring other indigenous groups to think: ‘If they can do it, we can do it, too,’” Mahelona concluded.