The Tusoma Project’s goal is to stop linguistic colonial exploitation when collecting audio and video language data for ChatGPT and other AI tools. We achieve this by fairly compensating contributors, who are primarily refugees. This income and support are designed to lift them out of poverty holistically for at least one year, addressing their immediate needs while empowering their futures.

Two stories illustrate the problem. First, for-profit technology companies have historically exploited workers, as recently seen in Kenya. Workers who helped build ChatGPT were underpaid, undervalued, and subjected to traumatizing work without mental health support or dignity. “Dignity” is a cornerstone of our mission. We aim to economically uplift individuals, foster self-esteem and confidence, and create compassion-driven, humanitarian outcomes.

The second story is about Agnes, a woman with HIV who worked as a prostitute and participated in a 20-year study. Despite $20 million in funding, Agnes remained in poverty. Any donations that could have allowed her to retire from sex work were blocked—she was required to remain constantly exposed to the virus for research purposes. After two decades, Agnes disappeared into obscurity, her plight a stark example of exploitation. Our mission seeks to combat such injustices.

These two stories do not even touch on the millions of artworks shared online that were taken without payment. To date, lawsuits seeking compensation have not succeeded.

Focus on Endangered Languages

Our project prioritizes the collection of thousands of hours of endangered language data. Among the 27 languages we have identified, 5 or 6 are considered critically endangered. Collaborating with other language preservation initiatives, we prioritize these languages, ensuring cultural and linguistic diversity is preserved.

One key aspect of our work is identifying gaps in research on endangered languages. As our understanding deepens, the languages we focus on may shift. An immediate priority is conducting additional research to finalize a grant application for the humanitarian grant, due in February. Pre-funding is essential to support the team in this effort, as we require resources to write the grant itself.

Holistic Humanitarian Goals

Our primary goal is humanitarian: to transform lives through income that functions as a universal basic income or universal microgrant. Beyond financial support, case management is critical. Each refugee will receive a phone, internet access, and personalized case management. This holistic support ensures they produce high-quality audio and video data while addressing their resettlement needs and improving their overall well-being and mental health.

Supplementary funds will cover additional needs such as medical costs, administrative fees for camps, and aid for malnutrition or hunger. The majority of charitable funding will go toward essential expenses like phones, internet access, and income for contributors. Technical funding will focus on developers in Africa who build machine learning models and apps, ensuring local expertise is leveraged.

Technical Development and Funding

Our focus for humanitarian funding includes capturing, archiving, and securely storing data using Amazon Web Services. This funding does not cover machine learning developers, as separate funding from corporations like Google and Mozilla will support the technical effort. Developers will act as advisors, ensuring data quality by addressing representation and diversity, such as including women’s voices and various accents.

If we aim for 2,000 hours of video per language, the income for refugees will far exceed a living wage. The number of hours each refugee can contribute will be assessed on a case-by-case basis. Leveraging modern technology, including smartphones, makes video collection as accessible as audio, while offering additional benefits like capturing facial expressions, gestures, and group dynamics. Generative video is the future, and this data future-proofs linguistic preservation.

Ethical Data Collection

We are committed to ethical practices in data collection. Individuals provide data for their personal benefit and the betterment of humanity, and we ensure that it is not misused or sold without proper compensation. The primary beneficiary is the individual contributor.

Our commitment to humanitarian principles is unwavering. If we were pressured to commercialize or to prioritize profit over helping our contributors, we would end the project.

Content Creation with Refugees

We will collaborate with UNHCR to identify individuals suited for video content creation. Distributing smartphones and credits for free internet is integral to this process, enabling refugees to tell their stories. These narratives will include how they became refugees, how they are escaping poverty, and their hopes and dreams for the future. We emphasize health, particularly women’s health and family well-being, as key goals. We aim to collect narratives from some of the most marginalized people on the planet, from LGBTQIA people fearing for their safety to sex workers in ghettos.

Additional technical funding (as opposed to humanitarian funding) will support the development of translation apps. For example, existing data on Kinyarwanda can be transformed into a functional app. These apps will identify spoken African languages, enabling users to seamlessly communicate. A bank employee, for instance, could use the app to recognize and translate up to 25 languages in real time.

Fighting Exploitation

As technology becomes simpler and more affordable, it is imperative to ensure that funds directly benefit contributors. This counters the tide of exploitative data collection carried out by OpenAI, which changed from a not-for-profit to a for-profit company. By focusing on grants that support refugees and impoverished individuals, we aim to break cycles of poverty rather than perpetuate them. In contrast to the exploitative practices seen in past data-training efforts, we prioritize ethical data collection.

In addition to language data, refugees can financially benefit from other types of data they provide, such as DNA and health records. These contributions are governed by best practices for informed consent, ethical data collection, and strict data privacy protocols—standards we have extensively documented and published.

From Public Outrage to Meaningful Change

It is also important to point out that articles like this do not, on their own, create change. Thousands of impoverished research subjects remain in poverty after being studied.

In the case of Agnes, who spent decades trapped in sex work while contributing to medical research, no laws were broken and no reforms were implemented. The cycle of exploitation continues. (Note: the Tusoma Project is trying to help with this issue.) For workers in Kenya, it’s unclear if anything has changed.

The perfect example of outrage not equating to change is Martin Shkreli, infamously known as “Pharma Bro,” who sparked outrage by raising the price of the drug Daraprim from $13.50 to $750 per pill. Despite his being labeled “the most hated man in America,” the price of Daraprim was never reduced. It is not a crime to impose exorbitant price hikes on critical medications.

Similarly, the healthcare industry has yet to implement reforms despite public outrage following the shooting of UnitedHealthcare’s CEO. Public support for these three issues has not spurred changes to the system. These stories illustrate the gap between public outrage and meaningful reform.

Conclusion

The Tusoma Project represents a transformative approach to linguistic preservation and humanitarian aid. By centering refugees as partners, subject matter experts, active contributors and beneficiaries, we combat exploitation while leveraging cutting-edge technology to preserve endangered languages. Our mission is built on dignity, fairness, and a commitment to creating a better future for as many people as we can, with a focus on the most marginalized within the marginalized (like trans people and sex workers in refugee camps in Africa).
