The development of generative artificial intelligence (AI) models that learn from massive data sets scraped from the web to create original text, images, videos, and more has raised growing concerns about plagiarism, unethical sourcing of data, and cultural appropriation. While these technologies can help preserve and revive Indigenous languages, experts say that harvesting data without consent risks abuse, distortion of Indigenous culture, and the deprivation of minorities' rights.
Karaitiana Taiuru, a Māori ethicist and honorary academic at the University of Auckland, said, “Data is like our land and natural resources. If Indigenous peoples don’t have sovereignty of their own data, they will simply be re-colonised in this information society.” Taiuru’s comments come after OpenAI trained Whisper, its speech recognition model, on 680,000 hours of audio from the web, including 1,381 hours of te reo Māori.
Many Indigenous languages are under threat of disappearing, the United Nations has warned, taking cultures, knowledge, and traditions with them. In New Zealand, where the Māori language is enjoying a revival, the government aims to have one million basic speakers by 2040. That means digital systems using Māori will be rolled out in increasing numbers, said Peter-Lucas Jones, CEO of Te Hiku Media, a non-profit that runs Māori broadcasts and archives and promotes the language.
But it was “concerning” to see a non-Māori organization roll out a speech model using the language, he said. With these large AI models, Jones explained, data is being scraped from the internet with little regard for any bias it may contain, let alone for any associated intellectual property rights.
Indigenous leaders were angered when Air New Zealand sought to trademark a logo with the words “kia ora” – meaning “hello” or “good health” in Māori – highlighting tensions over attempts by outside groups to co-opt their language and culture. Critics warn that Indigenous groups, who are generally not involved in the design or testing of AI systems, are at risk from bias embedded within algorithms, while generative AI models may also spread incorrect information.
Indigenous data and knowledge need protection, said Taiuru. There is growing recognition of that need: the World Intellectual Property Organization outlined measures in 2006 to provide intellectual property protection for “traditional knowledge and folklore,” and federally recognized tribes in the US can restrict data collection on their reservations. However, data collection “can fly under the radar and avoid the jurisdiction of a tribe,” said Michael Running Wolf, an AI ethicist and Native American who founded the non-profit Indigenous in AI.