Kenyan author Ngugi wa Thiong'o, in his book Decolonising the Mind, states: “The effect of a cultural bomb is to annihilate a people's belief in their names, in their languages, in their environment, in their heritage of struggle, in their unity, in their capacities and ultimately in themselves.” When a technology treats something as simple and fundamental as your name as an error, it in turn robs you of your personhood and reinforces the colonial narrative that you are other.
Named entity recognition (NER) is a core NLP task in information extraction, and NER systems are a requirement for numerous products, from spell-checkers to localized voice and dialogue systems and conversational agents, that need to identify African names, places and people for information retrieval. Currently, the majority of existing NER datasets for African languages come from WikiNER; they are automatically annotated and very noisy, since the text quality for African languages is not verified. Only a few African languages have human-annotated NER datasets. To our knowledge, the only open-source part-of-speech (POS) datasets that exist cover a small subset of the languages of South Africa, plus Yoruba, Naija, Wolof and Bambara (Universal Dependencies).
Pre-trained language models such as BERT and XLM-RoBERTa are producing state-of-the-art NLP results, which would undoubtedly benefit African NLP. Beyond its direct uses, NER is also a popular benchmark for evaluating such language models. For these reasons, we have chosen to develop a broad POS and NER corpus for 20 African languages based on news data.
Peter Nabende (Makerere University) - Principal Investigator
Jonathan Mukiibi (Makerere University)
David Ifeoluwa Adelani (Masakhane; Saarland University)
Jade Abbott (Masakhane; Retro Rabbit)
Daniel D’souza (Masakhane)
Constantine Lignos (Masakhane; Brandeis University)
Sascha Heyer (IO Annotator)
Proposed Dataset and Use Cases
Named entity recognition (NER) is a critical task in information extraction (IE), enabling the identification of named entities (people, organizations, locations, etc.) in text. NER allows information about entities to be searched and aggregated, which enables broader applications such as voice assistants or chat bots that can correctly identify local names and locations.
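As a rough illustration of what NER output looks like, the sketch below groups token-level tags into entity spans. It assumes a BIO tagging scheme and the label names PER and LOC, neither of which is specified in this proposal; the function name bio_to_spans and the example sentence are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, label) pairs.

    Assumes tags of the form "B-XXX", "I-XXX" or "O" (an assumption,
    not a scheme fixed by the proposal).
    """
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_entity:
            # Close any open entity, then start a new one. An "I-" tag with
            # no matching open entity is treated as the start of an entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" tag: close any open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Illustrative sentence and tags (not taken from the proposed corpus).
tokens = ["Wangari", "Maathai", "was", "born", "in", "Nyeri", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('Wangari Maathai', 'PER'), ('Nyeri', 'LOC')]
```

Grouping tags into spans like this is what lets a search or aggregation tool treat "Wangari Maathai" as one person rather than two unrelated tokens.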
While there have been significant projects to annotate NER data and build systems for North American, European, and Asian languages, there have only been limited efforts to do so for African languages, leaving those languages underserved. Many of the resources required to enable NER annotation, such as large text collections and the language-specific tools, still do not exist for these languages. The few resources for NER in African languages that do exist are both limited in availability, often requiring expensive data purchases and not allowing redistribution, and limited in scope, covering few languages.
We propose to perform NER and part of speech (POS) annotation on a wide variety of African languages that represent the continent’s linguistic diversity and make this annotation broadly available to the research community. The result of this annotation will be a profound transformation of the possibilities for developing language technology for the languages of Africa. Performing this annotation will enable the speakers of these languages to search and access more information in their native languages and will provide a seed from which to develop more language technology (search tools, chat bots, speech recognition) in languages that have been thus far underserved.
A few NER datasets have been created for African languages. The largest is WikiAnn, which supports 282 languages, but its entities are automatically annotated and it has fewer than 10,000 tokens for any of the indigenous African languages. Also, the quality of the annotated text was not verified. Other datasets are the SADiLaR NER dataset [Eiselen, 2016] for the eleven South African languages, based on government data, and small corpora for Yoruba [Alabi et al., 2020] and Hausa [Hedderich et al., 2020] in the news domain. The LORELEI project is also creating NER datasets for a few African languages such as Yoruba, Hausa, Amharic, Somali, Twi, Swahili, Wolof and Zulu via LDC [Strassel and Tracey, 2016; Tracey et al., 2019], but they are not publicly available.
Our goal is to create a large NER corpus for over 20 African languages, with at least 50,000 tokens per language (depending on the availability of data), in the news domain, because several standard NER corpora such as the English CoNLL dataset [Tjong Kim Sang and De Meulder, 2003] have been annotated in the same domain. Having our NER corpus in the news domain will encourage cross-lingual comparison with other benchmark datasets. To the best of our knowledge, this will be the first large-scale collection of NER datasets for many African languages. The news corpus will be collected from news websites written in indigenous African languages. For the languages that do not have news articles, we will translate content obtained from indigenous news websites written in English or French into the target language, because we want models trained on the dataset to be able to correctly identify African names and locations.
POS tags have traditionally been annotated jointly with NER because they were an important feature for training NER models before neural networks became able to identify entities without such features. However, it has recently been shown for some African languages, such as isiXhosa, that simpler machine learning models that include these features (e.g. a CRF model [Lafferty et al., 2001]) give better performance than the most powerful neural network models [Loubser and Puttkammer, 2020; Hedderich et al., 2020]. Therefore, we will also be jointly annotating POS and NER for all the African languages.
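To make the idea of joint POS and NER annotation concrete, the sketch below reads a CoNLL-style file with one token per line and turns each token into a feature dictionary of the kind a CRF tagger could consume. The three-column layout (token, POS tag, NER tag), the Universal Dependencies style POS labels, the BIO entity labels, and the helper names read_conll and token_features are all assumptions made for illustration; the proposal does not fix a file format, and the English sample sentences are invented. Training the CRF itself is not shown.

```python
from typing import Dict, List, Tuple

# One sentence is a list of (token, POS tag, NER tag) triples.
Sentence = List[Tuple[str, str, str]]


def read_conll(text: str) -> List[Sentence]:
    """Parse whitespace-separated 'token POS NER' lines; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, ner = line.split()
        current.append((token, pos, ner))
    if current:
        sentences.append(current)
    return sentences


def token_features(sentence: Sentence, i: int) -> Dict[str, object]:
    """Build simple features for token i, including POS tags of its neighbours."""
    token, pos, _ = sentence[i]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "pos": pos,
        # POS context: sentence-boundary markers are hypothetical placeholders.
        "prev_pos": sentence[i - 1][1] if i > 0 else "BOS",
        "next_pos": sentence[i + 1][1] if i < len(sentence) - 1 else "EOS",
    }


# Invented sample data in the assumed three-column layout.
SAMPLE = """\
Amina PROPN B-PER
lives VERB O
in ADP O
Kano PROPN B-LOC
. PUNCT O

Kano PROPN B-LOC
is AUX O
large ADJ O
. PUNCT O
"""

sentences = read_conll(SAMPLE)
n_tokens = sum(len(s) for s in sentences)
features = [token_features(sentences[0], i) for i in range(len(sentences[0]))]
print(n_tokens)     # 9
print(features[3])  # features for "Kano" in the first sentence
```

The token count computed here is the same quantity as the per-language token target mentioned above, and the "pos", "prev_pos" and "next_pos" entries are where the POS annotation would feed into a feature-based model.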
We consider 20 very diverse languages spoken in about 35 African countries and four regions of Africa (West, Central, East and Southern). Masakhane has sourced local coordinators to assist with the annotation. Languages to be included: Bambara, Chichewa, Ewe, Fon, Ghomala, Hausa, Igbo, Kinyarwanda, Luganda, Luo, Moore, Naija Pidgin, Setswana, Shona, Swahili, Twi, Wolof, Xhosa, Yoruba, Zulu.
Pathway(s) to Impact and Intended Beneficiaries
The immediate benefit from the data collection and annotation that we propose will be that new NLP systems can be created in these 20 languages, enabling richer information extraction and search capabilities in languages with very limited NLP technology to date. However, the impact in development of NLP tools will extend far beyond developing POS taggers and NER systems. The annotated corpora will provide tasks that can immediately be used to evaluate the quality of word embeddings in these languages. The development of high-quality embeddings in these languages would enable the development of countless other downstream capabilities, as researchers will be able to use this annotation to guide the construction of better transfer learning methodologies for contextualized embeddings, enabling the BERT revolution to come to previously underserved languages.
The annotated POS and NER datasets will create benchmark sets that can be used within African universities so that students can learn to build language technology in their mother languages, an opportunity previously denied to them due to lack of annotated data. From this new generation of NLP students can come tools that impact the speakers of African languages in their everyday lives: phone keyboards in their native languages, new or improved translation and transliteration resources, and the general ability to access digital resources in their own languages. More globally, the creation of these benchmark datasets will have the impact of stirring interest in African languages, leading to a critical mass of African and non-African NLP researchers working on them.
Accessibility, Data Management, and Licensing
The dataset will be published on the Masakhane GitHub page. We will publish the monolingual news corpora, the parallel datasets for the translated languages, and the NER- and POS-annotated datasets. All annotated datasets will be published under the CC-BY-4.0 license. Only the monolingual datasets for which we have obtained copyright permission will be published. At the time of writing this grant, we have six languages with datasets under CC-BY-4.0, and two other languages (Ghomala and Ewe) have received permission from the site owners (by email). We are in discussion with the BBC to release the crawled monolingual data if the sentences are shuffled. At the moment, the BBC permission policy covers non-commercial use only. Our legal team will be in discussion with the BBC and other news website owners to get full permission before the annotation of the data.
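One possible reading of "shuffled" is that sentences from all crawled articles are pooled and their order randomized, so that the original articles cannot be read back from the released file. The sketch below illustrates that reading only; the function name shuffle_sentences, the fixed seed, and the sample sentences are hypothetical, and the actual release conditions would be whatever is agreed with the site owners.

```python
import random
from typing import List


def shuffle_sentences(sentences: List[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy of the sentences, leaving the input unchanged.

    A fixed seed makes the output repeatable; whether a repeatable order is
    acceptable for a release is a policy question not settled here.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


# Invented placeholder sentences standing in for crawled news text.
corpus = [
    "Sentence one of article A.",
    "Sentence two of article A.",
    "Sentence one of article B.",
    "Sentence two of article B.",
]
released = shuffle_sentences(corpus, seed=13)
print(len(released))
```

Shuffling at the sentence level keeps every sentence usable for sentence-level annotation such as POS and NER, while document-level structure is deliberately discarded.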
Risks, Including Ethics and Privacy
Ethics and Bias. A potential ethical issue with using news data is that it often represents Western narratives and non-African entities. For this reason, we use data published by African authors from the countries where the languages are spoken. Additionally, Africans have historically been exploited by technology and, as a result, there is a fear that datasets from the continent will be misused. To mitigate this risk, our dataset will be published with the Masakhane Ethics Manifesto and a Datasheet outlining appropriate use of the dataset.
Privacy. Given that all sources are already publicly available as news, we do not anticipate any privacy risks.
The monolingual data, parallel data, and annotated datasets for all twenty (20) African languages will be uploaded to and maintained on the Masakhane GitHub page. The language-specific datasets that have already been collected are also stored on the same GitHub page. As part of the work, this grant will facilitate the establishment of the Masakhane Consortium (described in detail in Appendix A), which will continue to support independent researchers of Masakhane beyond this grant. Additionally, it will facilitate the creation of the Masakhane Ethics Manifesto, which can be used in the future as a reference for any NLP dataset creation efforts on the African continent.
We do not see any direct impact on the creation of the parallel translation dataset, since the annotators and translators will be working remotely.
Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, and Cristina España Bonet. 2020. Massive vs. curated word embeddings for low-resourced languages. The case of Yorùbá and Twi. In 12th International Conference on Language Resources and Evaluation (LREC).
Roald Eiselen. 2016. Government domain named entity recognition for South African languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3344–3348, Portorož, Slovenia. European Language Resources Association (ELRA).
Michael A. Hedderich, David Ifeoluwa Adelani, Dawei Zhu, Jesujoba O. Alabi, Udia Markus, and Dietrich Klakow. 2020. Transfer learning and distant supervision for multilingual transformer models: A study on African languages. In EMNLP.
Ọlájídé Ishola and Dan Zeman. 2020. Yorùbá Dependency Treebank (YTB). In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5180–5188, Marseille, France. European Language Resources Association.
John D. Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.
Melinda Loubser and Martin J. Puttkammer. 2020. Viability of neural networks for core technologies for resource-scarce languages. Information, 11:41.
David Mueller, Nicholas Andrews, and Mark Dredze. 2020. Sources of transfer in multilingual named entity recognition. In ACL.
Stephanie Strassel and Jennifer Tracey. 2016. LORELEI language packs: Data, tools, and resources for technology development in low resource languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3273–3280, Portoroz, Slovenia. European Language Resources Association (ELRA).
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
Jennifer Tracey, Stephanie Strassel, Ann Bies, Zhiyi Song, Michael Arrigo, Kira Griffitt, Dana Delgado, Dave Graff, Seth Kulick, Justin Mott, and Neil Kuster. 2019. Corpus building for low resource languages in the DARPA LORELEI program. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages, pages 48–55, Dublin, Ireland. European Association for Machine Translation.
∀ et al. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of EMNLP.