FROM OUR BLOG

Breaking the AI Language Barrier

Jul 28, 2025

TLDR: AI is trained primarily on English-language data, meaning it excels at spoken and written tasks in English but fails to perform in most of the 7,000 languages used globally. For AI to become a universally used tool, it needs to work in the world’s languages – and it needs curated, local language datasets to do so. Enter Aris: we work with communities around the world to collect unique multimodal data (like audio, text, image, video, and more) designed to train multilingual AI. Our goal is to empower people to understand and participate in the AI data economy, so they can contribute to building AI that works for everyone, everywhere.

If you’re building a tool for the world to use, it needs to speak the world’s languages — but most AI doesn’t. Roughly 90% of generative AI training data is in English, and according to one study, over 80% of LLMs’ non-English training data comes from low-quality translations of English content. When AI appears “multilingual,” it’s often just translating English content, and doing it poorly. Without high-quality datasets, AI fails to understand cultural context, dialectal nuance, or even basic fluency.

Only 20% of the world actually speaks English. This dramatic imbalance between AI and real life leaves 6 billion speakers of the world's more than 7,000 other languages unable to properly adopt generative tools. For companies creating technology to serve the whole world, that’s a problem.

The Linguistic Digital Divide

Generative AI tools have been trained primarily on internet data, which may widen the digital divide between those who speak one of a few data-rich languages (like English, Mandarin Chinese, Spanish, or Italian) and those who do not. Tier 2 languages with millions of speakers are considered “low-resource” – not because they’re rare, but because the internet doesn’t reflect how they’re spoken, written, or used in everyday life. 

Cantonese, which is spoken by over 85 million people, has distinct spoken and written forms. Much of the Cantonese available online is formal written text, which is a poor basis for training models on casual speech and produces stilted, unusable output.

Languages with many dialects, like Arabic, present a different challenge. Though Modern Standard Arabic may have sufficient online representation, there are approximately 30 Arabic dialects spoken around the world, which vary so greatly in pronunciation, vocabulary, and grammar that some are mutually unintelligible. A speaker of an Egyptian, Levantine, or Gulf Arabic dialect may be unable to understand AI trained on Modern Standard Arabic.
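
In practice, dialect-aware data work usually means labeling samples at the variety level rather than with a single catch-all “ar” tag. Here is a minimal sketch using ISO 639-3 codes that distinguish Arabic varieties; the file names and record layout are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: label collected samples at the dialect level using ISO 639-3
# codes instead of one generic "ar" tag. The records and field names below are
# illustrative assumptions, not a real schema.
ARABIC_VARIETIES = {
    "arb": "Modern Standard Arabic",
    "arz": "Egyptian Arabic",
    "apc": "North Levantine Arabic",
    "afb": "Gulf Arabic",
}

samples = [
    {"audio": "clip_001.wav", "lang": "arz"},  # casual Egyptian speech
    {"audio": "clip_002.wav", "lang": "arb"},  # news-register MSA
]

for sample in samples:
    label = ARABIC_VARIETIES.get(sample["lang"], "unlabeled variety")
    print(f'{sample["audio"]} -> {label}')
```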

Tier 3 languages with smaller speaker populations, like Kikuyu (roughly 7 million speakers in Kenya) and Quechua (8-10 million speakers across South America), represent the most challenging frontier. Though some grassroots efforts have aimed to capture Tier 3 language data for local models, these datasets are minuscule compared with the massive volumes of data behind English-language models.

How Does Insufficient Data Affect Language Models?

LLMs trained on insufficient datasets exhibit several characteristic failure modes:

  • Lack of fluency: LLMs struggle with basic fluency in under-resourced languages and stumble over everyday tasks (one lightweight way to spot this is sketched below).

  • Cultural and Contextual Misunderstandings: Even in high-resource languages, models often fail to grasp cultural context.

  • Bias Toward English-Centric Thinking: LLMs often prioritize English worldviews, leading to distorted answers in other languages.

  • High Hallucination Rate: When translating between languages, LLMs “hallucinate” more frequently, introducing errors that range from minor inaccuracies to complete nonsense.

The lack of digital resources in under-resourced languages creates a vicious cycle: limited internet content means minimal training data, which produces poor AI performance, which discourages digital content creation in these languages. This is both a missed opportunity and a growing divide that limits global AI adoption.
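
To make the fluency failure concrete, here is a minimal sketch of one proxy check: whether a model’s replies to prompts written in a target language actually come back in that language. It assumes you have already collected (prompt, response) pairs from whatever model you are evaluating; the sample responses and the choice of the langdetect library are illustrative, not a prescribed methodology.

```python
# Minimal sketch: flag model responses that drift out of the target language.
# Assumes (illustratively) that responses to prompts written in the target
# language have already been collected from the model under test.
# Requires: pip install langdetect
from langdetect import detect


def off_language_rate(responses: list[str], target_lang: str) -> float:
    """Fraction of responses whose detected language is not target_lang (ISO 639-1)."""
    if not responses:
        return 0.0
    misses = 0
    for text in responses:
        try:
            if detect(text) != target_lang:
                misses += 1
        except Exception:  # empty or undetectable text counts as a miss
            misses += 1
    return misses / len(responses)


# Illustrative data: Swahili ("sw") prompts whose replies sometimes drift into English.
sample_responses = [
    "Habari! Ninaweza kukusaidia vipi leo?",      # stays in Swahili
    "Sorry, I can only answer that in English.",  # drifts into English
]
print(f"Off-language rate: {off_language_rate(sample_responses, 'sw'):.0%}")
```

A check like this is only a crude proxy; it says nothing about grammaticality or cultural accuracy, but language drift is often the first symptom that appears when a model has little training data in a language.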

A Scalable Solution

The real AI race isn’t just about perfecting English performance — it’s about creating truly multilingual AI that thinks natively across languages. The end goal is semantic fluency — models that understand jokes, sarcasm, idioms, and cultural nuance. 

The path forward requires:

  • High-quality training data curated for AI across languages and dialects.

  • Active involvement from native speakers.

  • Continuous data contributions, especially from communities with low digital representation.

How Aris Delivers

Aris is collaborating with vetted networks of native speakers to gather curated datasets designed to create multilingual models. Unlike scraping or translations, our methodology captures how languages are actually spoken, written, and understood in context. Through our mobile-first, contributor-focused approach, we provide AI builders with the scalable, permissioned data they need to train models that truly speak the world’s languages.

If you’re building multilingual AI — whether for local tools or global platforms — let’s talk. Partner with Aris to access the data that makes it possible.

Let’s build AI for anyone, anywhere.

Stay connected to us.

Stay in the loop with the latest in AI, data, and Aris.

Fuel your models with richer, more representative data. Start with a demo:

Email

demo@arisdata.com

Aris is on a mission to be the world’s leading platform for multimodal ground truth data, enabling enterprises and empowering the future of AI.

Copyright Aris 2025. All rights reserved.
