Stop Linguistic Colonial Exploitation
The Tusoma Project focuses on ending linguistic colonial exploitation when collecting audio and video language data for ChatGPT and other AI tools. We achieve this by fairly compensating contributors, who are primarily refugees. This income and support are designed to lift them out of poverty holistically for at least one year, addressing their immediate needs while empowering their futures.
The second story is about Agnes, a woman with HIV who worked as a prostitute and participated in a 20-year study. Despite $20 million in funding, Agnes remained in poverty. Any donations that could have allowed her to retire from sex work were blocked—she was required to remain constantly exposed to the virus for research purposes. After two decades, Agnes disappeared into obscurity, her plight a stark example of exploitation. Our mission seeks to combat such injustices.
Focus on Endangered Languages
Our project prioritizes the collection of thousands of hours of endangered language data. Among the 27 languages we have identified, five or six are considered critically endangered. We collaborate with other language preservation initiatives and put these languages first, so that cultural and linguistic diversity is preserved.
One key aspect of our work is identifying gaps in research on endangered languages. As our understanding deepens, the languages we focus on may shift. An immediate priority is conducting additional research to finalize a grant application for the humanitarian grant, due in February. Pre-funding is essential to support the team in this effort, as we require resources to write the grant itself.
Holistic Humanitarian Goals
Our primary goal is humanitarian: to transform lives through income that functions as a universal basic income or universal microgrant. Beyond financial support, case management is critical. Each refugee will receive a phone, internet access, and personalized case management. This holistic support ensures they produce high-quality audio and video data while addressing their resettlement needs and improving their overall well-being and mental health.
Supplementary funds will cover additional needs such as medical costs, administrative fees for camps, and aid for malnutrition or hunger. The majority of charitable funding will go toward essential expenses like phones, internet access, and income for contributors. Technical funding will focus on developers in Africa who build machine learning models and apps, ensuring local expertise is leveraged.
Technical Development and Funding
Our focus for humanitarian funding includes capturing, archiving, and securely storing data using Amazon Web Services. This funding does not cover machine learning developers, as separate funding from corporations like Google and Mozilla will support the technical effort. Developers will act as advisors, ensuring data quality by addressing representation and diversity, such as including women’s voices and various accents.
If we aim for 2,000 hours of video per language, the income for refugees will far exceed a living wage. The number of hours each refugee can contribute will be assessed on a case-by-case basis. Leveraging modern technology, including smartphones, makes video collection as accessible as audio, while offering additional benefits like capturing facial expressions, gestures, and group dynamics. Generative video is the future, and this data future-proofs linguistic preservation.
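The relationship between the hours target and contributor income can be sketched as simple arithmetic. The snippet below is only an illustration: the 2,000-hour target comes from the text above, while the number of contributors and the hourly rate are hypothetical placeholders, not project figures.

```python
def income_per_contributor(total_hours, contributors, hourly_rate):
    """Return (hours per contributor, income per contributor) for one language.

    Assumes hours are split evenly, which the project does not promise:
    actual hours are assessed case by case.
    """
    if contributors <= 0:
        raise ValueError("contributors must be positive")
    hours_each = total_hours / contributors
    return hours_each, hours_each * hourly_rate


TARGET_HOURS = 2000      # per-language target stated in the text
CONTRIBUTORS = 20        # hypothetical: not a project figure
HOURLY_RATE_USD = 15.0   # hypothetical: not a project figure

hours_each, income_each = income_per_contributor(
    TARGET_HOURS, CONTRIBUTORS, HOURLY_RATE_USD
)
print(f"{hours_each:.0f} hours each, ${income_each:,.2f} per contributor")
```

Whether a given result exceeds a living wage depends on the local cost of living and on the actual rate and headcount, so the placeholders should be replaced with real figures before drawing any conclusion.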
Ethical Data Collection
We are committed to ethical practices in data collection. Individuals provide data for their personal benefit and the betterment of humanity, and we ensure it is not misused or sold without proper compensation. The primary beneficiary is the individual contributor, with any secondary profits being a lower priority.
If a company like Google were to seek this data but refuse fair compensation, for example by paying only a dollar an hour or demanding data for free, we would reject the offer. Our commitment to humanitarian principles is unwavering. If pressured to commercialize or prioritize profit over purpose, the project would cease.
Content Creation with Refugees
We will collaborate with UNHCR to identify individuals suited for video content creation. Phones are integral to this process, enabling refugees to tell their stories. These narratives will include how they became refugees, how they are escaping poverty, and how this project transforms their lives. We emphasize health, particularly women’s health and family well-being, as key goals.
Additional funding will support the development of translation apps. For example, existing data on Kinyarwanda can be transformed into a functional app. These apps will identify spoken African languages, enabling users to seamlessly communicate. A bank employee, for instance, could use the app to recognize and translate up to 25 languages in real time.
Fighting Exploitation
As technology becomes simpler and more affordable, it is imperative to ensure that funds directly benefit contributors, combating the tide of data collection exploitation. By focusing on grants that support refugees and impoverished individuals, we aim to break cycles of poverty rather than perpetuate them. Unlike exploitative practices seen in past studies, we prioritize informed consent, ethical data practices, and respect for individual dignity.
In addition to language data, refugees can financially benefit from other types of data they provide, such as DNA and health records. These contributions are governed by best practices for informed consent, ethical data collection, and strict data privacy protocols—standards we have extensively documented and published.
Conclusion
Our project represents a transformative approach to linguistic preservation and humanitarian aid. By centering refugees as active contributors and beneficiaries, we combat exploitation while leveraging cutting-edge technology to preserve endangered languages. Our mission is built on dignity, fairness, and a commitment to creating a better future for the world’s most vulnerable populations.



