Meta, a leading tech company, has developed new AI models that were trained using the Bible to recognize and generate speech in over 1,000 languages. The company aims to employ these algorithms in efforts to preserve languages that are at risk of disappearing.
Currently, there are approximately 7,000 languages spoken worldwide. To empower developers working with various languages, Meta is making its language models publicly available through GitHub, a popular code hosting service. This move encourages the creation of diverse and innovative speech applications.
The models were trained on two datasets. The first contains audio recordings of the New Testament in 1,107 languages, while the second comprises unlabeled New Testament audio recordings in 3,809 languages. According to Meta research scientist Michael Auli, training on these datasets means the models can be used to build speech systems with minimal data.
While widely spoken languages like English have extensive and reliable datasets, the same cannot be said for languages with small speaker populations, such as those with only 1,000 speakers. Meta’s language models offer a way around this data scarcity, enabling the development of speech applications for languages that lack adequate resources.
The researchers assert that their models can transcribe and generate speech in over 1,000 languages and can identify more than 4,000 languages. Furthermore, when compared to rival models like OpenAI Whisper, Meta’s version exhibited a significantly lower error rate despite covering 11 times as many languages.
However, the scientists acknowledge that the models may occasionally mistranscribe specific words or phrases. Additionally, their speech recognition models produced slightly more biased words than other models, though only 0.7% more.
Chris Emezue, a researcher at Masakhane, an organization focused on natural-language processing for African languages, expressed concerns about the use of religious texts such as the Bible as the basis for training these models. He believes that the Bible carries inherent biases and misrepresentations, which could affect the accuracy and neutrality of the models’ outputs.
This development poses an important question: Is Meta’s advancement in language models a step forward, or does its utilization of religious text for training introduce controversial elements that hinder its overall impact? The conversation around the ethical considerations and potential biases involved in training language models remains ongoing.