Bid to include African datasets in LLM training gains traction 🌍 🧠

#machine learning
Researchers ratchet up efforts to embed African perspectives into LLM training datasets as AI ethics activists push for a more representative global AI ecosystem.

Seth Onyango, bird story agency 

Efforts to imbue Large Language Models (LLMs) with a deeper understanding of African perspectives and narratives have recently gained traction. 

This includes curating expansive datasets that include African languages, literature, and historical texts, many of which are underrepresented in, or entirely absent from, current AI models.

This drive comes amidst a glaring lack of representation of Africa's 1,000+ languages and content on the internet, where they make up a mere 1% of all languages present.

Miguel Botero, Director of Social Impact at Biografika, asserts the disparity is palpable. He notes that the current imbalance in AI training data has led to a skewed representation of global knowledge and experiences. 

Roughly a third of global languages stem from Africa, where, by some estimates, as many as 2,000 languages are spoken.

Of these, 75 are spoken by populations exceeding one million, while many remain solely oral, lacking written or digitized formats. 

This situation underscores some of the problems in developing digital databases and creating LLMs, especially given the complexity and high costs associated with crafting LLMs.


Botero posits that this not only limits AI's effectiveness but also perpetuates a narrow, often inaccurate portrayal of diverse cultures and societies in the global south.

"African languages make up only about 0.1% of the languages represented on the internet. That means that anything you ask an LLM will produce and uphold a vision of humanity, where Africa in this case, only influences the reply marginally," he said. 

"So, everything you ask an LLM will be driven by the views and the general image of humanity (that) has been created by data from the global north. That means Africa is not influencing the replies that are given by these LLMs."

He notes the same thing applies to Wikipedia, where only a small fraction of articles on the online encyclopaedia are produced by people from Africa or Latin America - communities that he terms the "global majority".

Moky Makura, Executive Director at Africa No Filter, is leading an effort to address this issue through her organisation's support of African content that works to change a narrow and mostly negative view of Africa. Working with a variety of platforms and creatives to populate the internet with authentic African content, she insists that African content creators need to flood the internet with better stories, told better, about the continent.

In 2022, the AI chatbot ChatGPT was hailed as one of the year's most impressive technological innovations upon its release. In a bid to make it less toxic, OpenAI used outsourced Kenyan labourers earning less than $2 per hour, according to a TIME investigation.

Still, ChatGPT reproduces cliches about Africa. In December last year, Dr Ibraheem Dooba, Director of Research and Publication at APS, sought to know how ChatGPT sees a successful African. The results, although shocking, are not surprising.

The successful man turned out to be a European-looking young man in a suit, while the Africans who contribute to his success are themselves partially clothed.

Several experiments have shown that ChatGPT, the world's pre-eminent chatbot, tends to portray African nations and their rulers in a poor light, hinting at themes of poverty, illness, or graft.

Despite multiple prompts, ChatGPT won't get it right. 

"In sum, without some serious prompt engineering, ChatGPT 4 default responses can be useless," Dooba concluded.

Biografika is currently developing a visibility report focused on making the stories of changemakers from the global majority more accessible and impactful. 

Key to Biografika's strategy is the acknowledgement of the vast array of knowledge production methods that exist outside the Western, text-based paradigm. 

Many communities in Africa and other parts of the global majority rely on oral traditions, music, and other forms of expression that are often overlooked in digital archives and datasets. 

By advocating for the inclusion of these multimodal forms of knowledge, Biografika aims to challenge and expand how information is collected, archived, and utilised in AI development.

Germany, meanwhile, is mulling a collaboration with data collection institutes such as the NGO Afrobarometer, the Konrad Adenauer Foundation said.

"A glaring lack of data does not only exist in the area of languages, but statistically representative data on various socio-economic categories of social coexistence are also incomplete in many African countries and are often plagued by difficulties in the collection processes," the foundation wrote in a recent paper.

"However, such representative native data from specific African contexts are indispensable to develop accurate AI models."

bird story agency.
