Masakhane MT: Decolonise Science

Rationale

When it comes to scientific communication and education, language matters. Discussing science in local indigenous languages not only reaches more people who do not speak English or French as a first language, but also integrates the facts and methods of science into cultures that have been denied it in the past. As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.”


During the COVID19 pandemic, many African governments did not communicate about COVID19 in the most widespread languages of their countries. ∀ et al (2020) demonstrated that machine translation tools failed to translate COVID19 surveys because the only data available to train the models was religious data. Furthermore, they noted that scientific words did not exist in the respective African languages. This highlights the failure


Thus, we will build a multilingual parallel corpus of African research, by translating African pre-print research papers released on AfricArxiv into 6 diverse African languages.

Proposed Dataset


When it comes to scientific communication, language matters. Jantjies (2016) demonstrates how language matters in STEM education: students perform better when taught mathematics in their home language. Language also matters in scientific communication because it can dehumanise the people science chooses to study. Robyn Humphreys, at the #LanguageMatters seminar at UCT Heritage 2020, noted: “During the continent’s colonial past, language – including scientific language – was used to control and subjugate and justify marginalisation and invasive research practices”.


Discussing science in local indigenous languages not only reaches more people who do not speak English as a first language, it also integrates the facts and methods of science into cultures that have been denied it in the past. As sociology professor Kwesi Kwaa Prah put it in a 2007 report to the Foundation for Human Rights in South Africa, “Without literacy in the languages of the masses, science and technology cannot be culturally-owned by Africans. Africans will remain mere consumers, incapable of creating competitive goods, services and value-additions in this era of globalization.” (Prah, Kwesi Kwaa, 2007). When science becomes "foreign" or something non-African, when one has to assume another identity just to theorize and practice science, it is a subjugation of the mind: mental colonization.


There is a substantial amount of distrust in science, in particular by many black South Africans who can cite many examples of how it has been abused for oppression in the past. In addition, the communication and education of science was weaponized by the oppressive apartheid government in South Africa, and that has left many seeds of distrust in citizens who only experience science being discussed in English.


Through government-funded efforts, European-derived languages such as Afrikaans, English, French, and Portuguese have been used as vessels of science, but African indigenous languages have not been given the same treatment. Modern digital tools like machine learning offer new, low-cost opportunities for scientific terms and ideas to be communicated in African indigenous languages.


During the COVID19 pandemic, many African governments did not communicate about COVID19 in the most widespread languages of their countries. ∀ et al (2020) demonstrated the difficulty of translating COVID19 surveys when the only data available to train the models was religious data. Furthermore, they noted that scientific words did not exist in the respective African languages.


Thus, we propose to build a multilingual parallel corpus of African scientific research, by translating African papers released on AfricArxiv into multiple African languages.


Use cases:

  • A machine translation tool for AfricArxiv to aid translation of their research to and from African languages

  • Terminology developed will be submitted to respective boards for addition to official language glossaries for further improvements to scientific communication

  • A machine translation tool for African universities to ensure accessibility of their publications

  • A machine translation tool for scientific journalists to assist in widely distributing their work on the African continent

  • More generally, the datasets developed would be a welcome addition


The selection of languages for this grant was based on the following factors:

  • Prevalence of usage of the languages in question

  • Existing relationships with co-ordinators, trusted translation partners, journalists and linguists for the languages in the Masakhane community

  • The lack of existing open source translation data in non-religious contexts

  • The geographic diversity of the languages to be representative of the continent


Based on the above, we have selected the following 6 languages: isiZulu, Northern Sotho, Yoruba, Hausa, Luganda, Amharic.

Team & Partners

Core Team

  • Jade Abbott (Principal Investigator) - Masakhane, Retro Rabbit

  • Dr Johanna Havemann - AfricArxiv

  • Sibusiso Biyela - ScienceLink

Specifications and Deliverables for Proposed Data and Documentation

The dataset will take the form of a parallel corpus for each language pair targeted. The data to be translated will be preprint papers published on AfricArxiv, which currently hosts approximately 600 articles from a variety of domains (Life Sciences, Engineering, Law, Social Sciences, Mathematics). The length of each article varies, but on average approximately 400 sentences can be extracted from each paper. We will translate 180 papers from a variety of domains, so we estimate the corpus for each language at 72,000 parallel sentence pairs (180 papers × 400 sentences). Given that we are targeting 6 languages, the estimated total size of the multilingual dataset is 432,000 parallel sentence pairs (6 × 72,000).
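As a quick check of the arithmetic, the estimate can be recomputed from the inputs stated above (180 papers, an average of roughly 400 extractable sentences per paper, 6 target languages). This is only a back-of-envelope sketch; the function and variable names are illustrative, and the real counts will depend on the papers selected.

```python
# Back-of-envelope corpus size estimate from the figures quoted in the text.
PAPERS_TO_TRANSLATE = 180
AVG_SENTENCES_PER_PAPER = 400  # approximate average; actual papers will vary
TARGET_LANGUAGES = [
    "isiZulu", "Northern Sotho", "Yoruba", "Hausa", "Luganda", "Amharic",
]


def estimate_corpus_size(papers, sentences_per_paper, n_languages):
    """Return (sentence pairs per language, sentence pairs across all languages)."""
    per_language = papers * sentences_per_paper
    total = per_language * n_languages
    return per_language, total


per_language, total = estimate_corpus_size(
    PAPERS_TO_TRANSLATE, AVG_SENTENCES_PER_PAPER, len(TARGET_LANGUAGES)
)
print(f"per language: {per_language:,}")   # per language: 72,000
print(f"all languages: {total:,}")         # all languages: 432,000
```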


Ethics and Bias:

Africans widely view ethics as a necessary component in the creation of mutually beneficial, harmonious social, economic, and political relationships between individuals, their environment, and the community. In particular, ubu-Ntu ethics is notable for its strong focus on enriching relationships in ways that affirm human rights and human dignity through the equitable distribution of power, and individual and communal participation in mutually beneficial goals. Corpora creation activity, in order to align with these local ethical principles, should affirm and enrich the human worth and dignity of Africans. To this end, great care should be taken to ensure that the corpora created are used for aims and goals shaped by Africans, and in particular that they are not exploited by non-African big tech organizations. For a dataset centered on African knowledge, it is difficult to conceptualize use (or rather misuse) of the dataset for any work beyond education and scientific communication. However, there exists discourse on the potential misuse of scientific data to generate fake scientific content that appears real and can be used for misinformation.

  • Masakhane Ethical Manifesto: This project will facilitate the creation of the Masakhane Ethical Manifesto, which can be used in the future as a reference for any NLP dataset creation efforts on the African continent.

  • Paper Selection: The data curation process will ensure the representation of African countries, a diversity of research topics, and gender parity as far as possible

  • Translator Selection: We will engage with the Masakhane Consortium ethics partners and translation partners in order to ensure African-centricity and gender parity, using their developed frameworks. Additionally, we will ensure that the translation agencies used in the project are black-owned African businesses.


Quality Control

  • A pilot run will be performed with our translation partners to translate a sample paper. The translations from the pilot run will be assessed by our linguistic partners and data curators to ensure quality and format of translations and allow for feedback to the translation partners, before work continues.

  • Spell-checkers will be used, where available for the languages in question.

  • Our translation partners will be trained translators, rather than crowd-sourced individuals, in order to minimize quality issues. While this increases the cost of translation, it is necessary because the subject matter, being research, is complex.


Challenges

Terminology creation for scientific terms is non-trivial. To mitigate this problem:

  • We’re involving African scientific journalists and African linguists to assist in the creation of terminology where translators are unable to translate a term

  • The translators will mark scientific terms they are unable to translate, which can then be handled separately by the terminology experts

  • The AfricArxiv team will facilitate contact with the paper authors in order to disambiguate complex scientific terminology
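The hand-off of untranslatable terms from translators to terminology experts could be supported by a small script. The sketch below is illustrative only: it assumes a hypothetical convention in which translators wrap each term they could not translate in double square brackets (for example `[[gene expression]]`). The proposal does not specify a marking convention, so the pattern and the function name `collect_marked_terms` are assumptions.

```python
import re
from collections import Counter

# Hypothetical convention (not specified in the proposal): translators wrap
# each source term they could not translate in double square brackets.
MARKED_TERM = re.compile(r"\[\[(.+?)\]\]")


def collect_marked_terms(translated_sentences):
    """Count every bracket-marked term so terminology experts can prioritise."""
    counts = Counter()
    for sentence in translated_sentences:
        for term in MARKED_TERM.findall(sentence):
            # Normalise case and surrounding whitespace before counting.
            counts[term.strip().lower()] += 1
    return counts


# Placeholder sentences; real input would be the translators' output.
sample = [
    "<translated text> [[mitochondria]] <translated text>",
    "<translated text> [[Mitochondria]] <translated text> [[gene expression]]",
]
terms = collect_marked_terms(sample)
for term, count in terms.most_common():
    print(term, count)
```

Sorting by frequency lets the terminology experts handle first the terms that block the most sentences.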

Pathway(s) to Impact and Intended Beneficiaries

This dataset will not only facilitate tools and products being developed, but also creation of terminology.


Glossaries - Language terms are useful only as long as they are standardised and used by several entities and organisations. The dataset will allow greater collaboration between organisations that translate scientific terms for outreach, and in instances where scientists work with indigenous communities in their research. Our collaboration with ScienceLink will ensure that these glossaries are widely disseminated.


Tools and Products

  • Specifically, the Masakhane Web platform (currently in development) will allow open access to translation models that Masakhane builds using this dataset. This will allow a human-in-the-loop mechanism to evaluate the models and contribute further data.

  • Translation tools for scientific purposes, for use by students, universities, science communicators and media, and textbook publishers.

  • Teaching aids, to assist teachers in educating students in multiple languages

  • Language models can be trained on the monolingual portions of the datasets, which can be used to support downstream tasks of classifiers, or conversational agents.


Through a decolonial lens, as described in earlier sections, the creation of tools that support the cultural ownership and African-centricity of science is imperative to empowering Africans to move beyond being mere consumers of goods and become creators. By enabling scientific communication of African knowledge in African languages, this work facilitates decolonisation.


Taking into account the SDGs, we note that providing translation tools that include African language groups in global discussions cuts across all SDGs, and specifically addresses SDG 4 (Quality Education) and SDG 17 (Partnerships for the Goals). SDG 4: Quality Education states that remote learning is out of reach for 500 million students. This is in part because remote learning favours English as the primary language of instruction. In order to ensure sustainability of the educational system as the digital world moves forward, it’s imperative that scientific communication in African languages is facilitated in a way that is scalable. SDG 17: Partnerships for the Goals are facilitated by machine translation tools.


Under both of these frameworks (Decolonial and SDG), it is clear that the artifacts generated from such a dataset empower African populations. The creation of these datasets oils the wheels for research, commercial and non-profit work alike, and provides fuel for the “virtuous cycle” described in (∀, 2020).


Potential Constraints

While we understand this dataset may not be sufficient to perfectly train a multilingual translation tool, it facilitates the first step of many by lowering the translation burden: authors can post-edit and correct a machine translation, and refer to the developed vocabularies, rather than translate from scratch.


Even with the datasets available, businesses and NGOs would still need to acquire funding to build tools based on them. That said, open-source initiatives such as Masakhane have done much to oil the wheels by reducing the effort required to build such tools, as well as by developing expertise across the continent.

Accessibility, Data Management, and Licensing

  • The dataset will be published on Zenodo and will be stored in the StatMT format for parallel corpora. We will publish both per-language-pair datasets and a combined multilingual dataset.

  • Since AfricArXiv preprint publications are published under CC BY 4.0, we do not anticipate any copyright issues

  • Any publications describing methodology, as well as any datasheets, will be licensed under the required CC BY 4.0. This will be legally detailed in the Masakhane consortium agreement.
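To illustrate the storage format mentioned in the list above, here is a minimal sketch of writing and reading one language pair. It assumes the StatMT format means what is common for StatMT-distributed corpora: two plain-text UTF-8 files, one sentence per line, aligned by line number. The file naming scheme, the function names, and the language codes (`en`, `zu`) are assumptions for illustration, not part of the proposal.

```python
import tempfile
from pathlib import Path


def write_parallel_corpus(pairs, out_dir, src_lang, tgt_lang, stem="corpus"):
    """Write (source, target) sentence pairs to two line-aligned UTF-8 files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: <stem>.<src>-<tgt>.<lang>
    src_path = out_dir / f"{stem}.{src_lang}-{tgt_lang}.{src_lang}"
    tgt_path = out_dir / f"{stem}.{src_lang}-{tgt_lang}.{tgt_lang}"
    with src_path.open("w", encoding="utf-8", newline="\n") as src_f, \
            tgt_path.open("w", encoding="utf-8", newline="\n") as tgt_f:
        for src, tgt in pairs:
            # Collapse internal whitespace (including newlines) so that each
            # sentence occupies exactly one line and line numbers stay aligned.
            src_f.write(" ".join(src.split()) + "\n")
            tgt_f.write(" ".join(tgt.split()) + "\n")
    return src_path, tgt_path


def read_parallel_corpus(src_path, tgt_path):
    """Read two line-aligned files back into a list of (source, target) pairs."""
    with open(src_path, encoding="utf-8", newline="\n") as src_f:
        src_lines = [line.rstrip("\n") for line in src_f]
    with open(tgt_path, encoding="utf-8", newline="\n") as tgt_f:
        tgt_lines = [line.rstrip("\n") for line in tgt_f]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files are not line-aligned")
    return list(zip(src_lines, tgt_lines))


# Usage with placeholder target text (real translations come from translators).
with tempfile.TemporaryDirectory() as tmp:
    pairs = [
        ("Language matters.", "<isiZulu translation 1>"),
        ("Students perform better\nin their home language.", "<isiZulu translation 2>"),
    ]
    src_file, tgt_file = write_parallel_corpus(pairs, tmp, "en", "zu")
    round_trip = read_parallel_corpus(src_file, tgt_file)
    print(len(round_trip))
```

Keeping one file per language, rather than a single tab-separated file, means the monolingual side of each pair can be reused directly.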

Risks, Including Ethics and Privacy

  • Given the source datasets are already under CC BY 4.0, we do not foresee any privacy risks

  • Discussion of ethics exists in the specifications section

Sustainability Plan

As part of the work, this grant will facilitate the establishment of the Masakhane Legal Entity which will continue to support independent researchers of Masakhane beyond this grant. Additionally, it will facilitate the creation of the Masakhane Ethical Manifesto which can be used into the future as a reference for any NLP dataset creation efforts on the African continent.


The Masakhane community has developed Masakhane Web, a platform for deploying Masakhane models for research and for evaluating the practical capabilities of the models built by the community. The Masakhane community will train and deploy models with these datasets. The system includes a feature to collect further data from real-world participants.