MakerereNLP:
Text & Speech for East Africa
Abstract
The project aims to deliver open, accessible and high-quality text and speech datasets for low-resourced East African languages from Uganda, Tanzania and Kenya. Taking advantage of advances in NLP and voice technology requires large corpora of high-quality text and speech data. This project aims to provide such data for Luganda, Runyankore-Rukiga, Acholi, Swahili and a subset of Luhya languages that are spoken across the border between Uganda and Kenya. In collaboration with natural language experts in the three countries, we plan to deliver datasets composed of:
(a) Parallel text corpora for Luganda (900,000 sentences), Swahili (900,000 sentences), Runyankore-Rukiga (200,000 sentences), Acholi (200,000 sentences) and Lumasaaba (200,000 sentences), obtained from various sources.
(b) Speech datasets for Luganda (1,000 hours) and Swahili (1,000 hours) on the Common Voice platform. The voice data will be collected on Common Voice, using the text corpora in (a) as sentence prompts, through established voice communities in the three countries. Mozilla Common Voice is a well-established platform for crowdsourcing voice contributions and making the voice data freely available. The speech data for Luganda and Swahili will be geared towards training speech-to-text engines for SDG-relevant use cases and general-purpose ASR models that could be used in tasks such as powering aids for people with impairments and developing AI tutors to support early education. The monolingual and parallel text corpora will support several NLP applications, including natural language classification, topic classification, sentiment analysis, spell checking and correction, and machine translation.
Team
Makerere University Artificial Intelligence (AI) and Data Science lab
Mozilla Foundation
Mbarara University of Science and Technology
TAVODET Youth Development (TYD) Innovation Incubator
Maseno University
Quotidian Data
United States International University-Africa (USIU-Africa)
Kabarak University
Makerere University Department of African Languages
Masakhane Consortium
Proposed Dataset and Use Cases
Problem
All native languages in East Africa, including the languages of our research focus (Swahili, Luganda, Runyankore-Rukiga, Luo/Acholi and the Luhya languages of Uganda), are considered low-resourced languages. This implies that these languages lack monolingual or parallel corpora large enough for building NLP applications. If AI and NLP-aided tools are to be applied to the preservation of African languages, to educational applications for communities with lower literacy levels, to monitoring demographic and political processes, and to emergency response through NLP and voice recognition technology, the first step is the creation and curation of high-quality datasets for these languages. The datasets will help fill the gap for several AI-for-social-good tasks, which include building voice recognition models for monitoring radio broadcasts for topics of public interest, building spoken dialog systems, and supporting emergency response, for example in the current COVID-19 pandemic, where access to vulnerable people is restricted.
The majority of speech recognition and translation platforms are developed and maintained by a small number of companies, such as Google and Amazon, who keep their technology proprietary. These companies are primarily driven by commercial potential and therefore dedicate most of their resources to the most commonly spoken, typically Western, languages. Meanwhile, the available text and speech resources for East African languages are limited and sometimes domain specific [Appen, ref15]. Low-resourced languages and their speakers are left out of the benefits of this technological revolution. As an example, Google Translate is not available for East African languages, with the exception of Swahili. Even for Swahili, which is a widely spoken East African language {ref}, the accuracy of the translations is subpar [ref for this], and Swahili is still considered a low-resourced language [1] because of the lack of data for NLP tasks. This means that users who are not proficient in English or another mainstream language cannot consume the huge amount of information available on the internet.
Another compounding factor is that while some, albeit limited, datasets for Ugandan, and more generally African, languages [ref] are available, they are typically hidden behind institutional repositories or paywalls [ref]. It is therefore important that more openly licensed datasets are developed and made available for use in the development of NLP applications and voice recognition technologies. Efforts have been made to build language resources for East African languages, for example under the AI4D language fellowship [ref] [ref AI4D and AI4Good https://github.com/AI-Lab-Makerere/Data4Good ]. Work has started in Tanzania on building a text corpus for Swahili, although this has been limited by the imbalanced topic distribution of the collected news dataset. At the AI Lab at Makerere University, we have made progress towards building NLP and speech resources for Luganda {ref}. These include a speech-to-text parallel corpus built from radio recordings and a Luganda sentence collection published under a Creative Commons licence. In collaboration with Mozilla and GIZ, Luganda has been added to the Common Voice platform {ref}, and efforts are underway to organize communities to provide voice contributions.
A lot more needs to be done before we can build robust NLP and speech recognition models for East African languages. Moreover, such resources need to be openly available so that researchers and machine learning enthusiasts in East Africa and throughout the world can improve them, add to them and use them in developing supplementary algorithmic approaches. Existing tools also tend to work better for men than for women and struggle to understand people with different accents, all of which is a result of biases within the data on which they are trained.
Solution
Premised on the above unmet need, the overarching goal of this project is to develop open, accessible and high-quality text and voice datasets for East African languages, specifically Luganda (spoken by 8.5 million people in Uganda), Swahili (150 million speakers, mainly in East Africa), Runyankore-Rukiga (spoken by 5.8 million people in Uganda) and select Luhya languages (6.8 million speakers in Uganda). These datasets can be used for building natural language processing models for tasks that range from spell and grammar checking, sentiment analysis, topic modeling, text summarization, and misinformation and fake news detection to machine translation and automatic speech recognition.
Text Data
Our main goal for this work is to organise and drive the creation, collection, annotation, storage and maintenance of freely available, open-source datasets for 7 of the more than 279 under-resourced languages indigenous to the East African countries of Uganda, Kenya and Tanzania. In Uganda, text resources will be developed for four languages and language groups: Luganda, Runyankore-Rukiga, Luo/Acholi and Lumasaaba, which have been chosen from the Central, Western, Northern and Eastern parts of Uganda for regional representation. In all three countries, text resources will be collected towards building a corpus for Swahili, the official native language in all three countries. Swahili is also the lingua franca of the African Great Lakes region [https://en.wikipedia.org/wiki/Swahili_language]. Text data for all languages will be sourced from newspapers, openly available novels and plays, original sentence contributions, sentence transcripts, online repositories such as Bible translations and Wikipedia pages, and language transcripts available in university repositories.
Speech Data
Our research aims to contribute to the availability of speech data for training speech models in native East African languages. We have added Luganda as one of the languages on the Common Voice platform, and part of the dataset created here is a first step towards the sentence composition task for the platform. As part of this project, we expect to ramp up the Luganda donation activities and to add Swahili to the Common Voice platform. We will build open-source datasets to enable various downstream tasks, such as voice recognition technology, targeting Luganda, Runyankore-Rukiga, Swahili and other languages. Planned sources include:
Common Voice contributions (including a campaign to add Swahili to Common Voice) and radio recordings.
Common Voice platform datasets, including the profile of speakers.
Partnerships with radio stations (whose data we will open source), newspapers, news reporters and institutes of languages.
Specifications and Deliverables for Proposed Data and Documentation
Our specific aims are:
1. Collect and curate high-quality text corpora that support multilingual and cross-lingual NLP research
In Uganda, we aim to build text corpora for Luganda, Runyankore-Rukiga, Acholi and Lumasaaba. We will expand the existing dataset of 16,500 Luganda-English parallel sentences collected by the Makerere University AI Lab to 900,000 sentences. We will target 200,000 sentences each for Acholi, Runyankore-Rukiga and Lumasaaba, building on the existing 10,000, 10,000 and 1,000 parallel sentences for each language respectively. The text collection for Luganda, Acholi and Lumasaaba will be led by Makerere University, while the collection for Runyankore-Rukiga will be led by Mbarara University of Science and Technology. The collection and validation process will in all cases be supported by the Department of African Languages at Makerere University.
In all three countries, we will collect text contributions to build an English-Swahili corpus of 900,000 sentence pairs. While Swahili is an official language in all three countries, there are significant variations in how the language is adopted, spoken and hence written across the countries. For example, whereas in Tanzania Swahili is the language of early education in schools, in Kenya and Uganda it is only a subject and the main language of instruction is English. To capture the different dialects across the three countries, a number of institutions will be involved in the collection process. We plan to collect copyright-free texts through crowdsourcing on Common Voice, public Facebook page comments, public WhatsApp groups or targeted apps, and to obtain non-limiting permission to use copyrighted texts.
The main activities for this objective are:
Create protocols and guides for text data collection.
Develop software tools to support the collection, validation, annotation, translation and archiving of the text datasets.
Identify various sources of Creative Commons (CC0) licensed text data for all languages and collect (scrape, translate, crowdsource) text contributions following the guides.
Expand and build NLP clubs for Luganda, Acholi, Lumasaaba, Runyankore-Rukiga and Swahili around the different partner university communities.
Prepare the datasets and publish them through journals and datasheets for datasets [ref].
2. Collection of speech data at scale for Luganda and Swahili languages on the Mozilla Common Voice platform
The aim of this objective is to expand the Luganda voice contributions on the Common Voice platform and drive the addition of Swahili as another East African language on the platform. With more than 8.5 million speakers, Luganda is by far the most widely spoken local language in Uganda. It is therefore the first candidate of choice at the Makerere AI Lab for developing voice recognition tools, as there are multiple avenues to source contributions across the country. In September 2020, Luganda was launched on the Common Voice platform, and we are driving contributions towards 40 hours of speech based on the 6,500 CC sentences collected so far.
Swahili has more than 150 million speakers, the majority of them in East Africa, where it is spoken either as a mother tongue or as a fluent second language. In Tanzania, it is the language of administration and primary education, while in Kenya it is the main language for these purposes after English. There are about 15 main Swahili dialects {ref12}, which provide diversity and justify the choice of Swahili as a major East African language for building voice technologies.
With this project, we plan to drive contributions for both languages to 1,000 hours of speech from between 200 and 500 diverse contributors, which will make it possible to build limited-vocabulary continuous speech recognition models that can be applied to specific technical domains {ref 6} such as agriculture, health and education. The text corpora for Luganda and Swahili built under the previous objective will be used as the sentence inputs to the Common Voice platform in this objective.
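As a rough sanity check on these targets, the sketch below works out what 1,000 hours implies per contributor and in number of recorded clips. The function names and the average clip length of 5 seconds are illustrative assumptions, not figures taken from the Common Voice platform or from this proposal.

```python
import math

TARGET_HOURS = 1000             # per-language target stated in the proposal
CONTRIBUTOR_RANGE = (200, 500)  # contributor range stated in the proposal
ASSUMED_CLIP_SECONDS = 5.0      # ASSUMPTION: average length of one recorded sentence


def hours_per_contributor(total_hours, contributors):
    """Average recording time each contributor would need to donate."""
    if contributors <= 0:
        raise ValueError("contributors must be positive")
    return total_hours / contributors


def clips_needed(total_hours, seconds_per_clip):
    """Number of clips needed to reach total_hours at a given clip length."""
    if seconds_per_clip <= 0:
        raise ValueError("seconds_per_clip must be positive")
    return math.ceil(total_hours * 3600 / seconds_per_clip)


if __name__ == "__main__":
    fewest, most = CONTRIBUTOR_RANGE
    print("Hours each with", fewest, "contributors:",
          hours_per_contributor(TARGET_HOURS, fewest))
    print("Hours each with", most, "contributors:",
          hours_per_contributor(TARGET_HOURS, most))
    print("Clips needed at the assumed clip length:",
          clips_needed(TARGET_HOURS, ASSUMED_CLIP_SECONDS))
```

With 200 to 500 contributors, each person would need to record between 5 and 2 hours on average. A different assumed clip length changes the clip count proportionally, so the figure is only indicative.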
The voice contributions will be driven through the Makerere AI Lab and the Department of African Languages in Uganda; TAVODET Youth Development (TYD), together with its language partners Dar es Salaam School of Journalism and Nelson Mandela African Institution of Science and Technology (NM-AIST), in Tanzania; and Quotidian Data (QD), Maseno University, United States International University-Africa (USIU-Africa) and Kabarak University in Kenya.
Through crowdsourcing of sentence and voice contributions, we will be able to build a dataset that diversifies AI and provides a more balanced and representative voice dataset. We will ensure that accents and dialects that tend to be under-represented in training datasets are represented for both Luganda and Kiswahili, and our NLP voice communities will also focus on a fair representation of the voices of women and elderly people.
A remaining question is how best to deal with bias so that the datasets are representative.
The main activities for this objective are:
Localization and launch of Swahili on the Common Voice platform, and addition of 900,000 Luganda and Kiswahili sentences to the platform.
Drive voice data collection and validation through the NLP clubs for voice data communities.
3. Annotation of data for Topic classification and Sentiment Analysis
Through the first objective, we will generate text corpora for five languages which can serve as the basis for numerous downstream tasks. A number of the tasks we are targeting, such as machine translation, will make use of end-to-end deep learning and therefore do not require any labelling after the (parallel) corpus has been generated and validated.
However, other tasks of interest to the consortium, specifically sentiment analysis and topic classification, require labels to be generated for the sentences.
Sentiment analysis - to understand the sentiments of people, we will annotate key texts in the text corpora from the first objective. This annotation will be done for the Luganda and Kiswahili datasets. We will also build a polarity lexicon for words in these two languages.
Topic classification - our prior work in data collection for Luganda (Makerere AI Lab) and Kiswahili (TYD through an AI4D Africa grant) has yielded a fairly good dataset. However, this dataset is not balanced in terms of topic distribution, e.g. across domains such as Agriculture, Health, Education, Legal, Sports, Politics and Finance. In this objective, we will expand on the datasets collected under the first objective and provide a more balanced, topic-labelled text dataset.
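One simple way to track progress towards a balanced topic distribution is to compute the share of sentences per topic and flag topics that fall below a chosen threshold. The sketch below is illustrative only: the function names and the `min_share` threshold are assumptions, and the topic names are the example domains listed above.

```python
from collections import Counter


def topic_shares(labels):
    """Return the fraction of labelled sentences that fall under each topic."""
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    return {topic: count / total for topic, count in Counter(labels).items()}


def imbalance_ratio(labels):
    """Ratio between the largest and the smallest topic (1.0 means balanced)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return max(counts.values()) / min(counts.values())


def under_represented(labels, topics, min_share):
    """List the expected topics whose share is below min_share (hypothetical threshold)."""
    shares = topic_shares(labels)
    return sorted(t for t in topics if shares.get(t, 0.0) < min_share)


if __name__ == "__main__":
    # Toy labels, not project data.
    sample = ["Health", "Health", "Health", "Sports"]
    print(topic_shares(sample))
    print(imbalance_ratio(sample))
    print(under_represented(sample, ["Health", "Sports", "Legal"], 0.3))
```

Topics returned by `under_represented` would be candidates for targeted collection in the next round.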
The main activities for this objective are:
4. Capacity Building, Relationship, Outreach and Sustainability
With our partnerships with various media houses, language experts, and other stakeholders through voice communities, this research will contribute significantly to increasing the amount of both text and speech data that can be leveraged for various NLP and speech recognition tasks. The datasets generated in this project will be made available to the public for further analysis and re-use and all results of our work will be publicly available. Under this objective, we will extensively work on building capacity through the NLP groups centered around university partners but with a firm reach in the community as a means to sustain the work beyond the initial grant.
The main activities under this objective will include:
Capacity building for masters students, language experts and local businesses.
Research dissemination through conferences and workshops.
Pathways to Impact and Intended Beneficiaries
Specific downstream tasks:
The text and speech corpora can support several downstream tasks:
Build Automatic Speech Recognition (ASR) models (all languages).
Build keyword spotting models for Luganda.
Build machine translation models (all languages).
Sentiment analysis, rumor detection, and misinformation and fake news detection on social media and code-switched Twitter data (all languages).
Our work primarily targets NLP tasks, e.g. Automatic Speech Recognition, for which end-to-end learning on labelled datasets gives satisfactory results. Another class of tasks, Natural Language Understanding (NLU) tasks, tends to require “knowledge” of syntax (language structure) and semantics (meaning) at various levels of granularity (i.e. sub-word, word, phrase, sentence and supra-sentential levels). The Corpus
Development of text processing and annotation tools.
Sourcing translators for open-source English text corpora in the public domain. Translators can also be used to translate among the languages under study.
Validators to check the quality of annotated data.
Collection of all kinds of unrestricted text (running until the end of the project).
Creation of guidelines for text data collection and of a storage and maintenance framework or routine.
Accessibility, Data Management, and Licensing
To ensure that the text and voice datasets created in this project adhere to the FAIR data principles, we will carry out several steps along the project timeline. The datasets and the associated metadata (including dataset datasheets) will be stored in a Dataverse repository, which will be created for the consortium partners and led by Makerere University. Dataverse is open-source data repository software that accepts a wide range of data types in different formats. The dataset and the associated metadata will be assigned a Digital Object Identifier (DOI), a permanent unique identifier that makes the data Findable. Once the data is made publicly available, the identifier will be published by Dataverse, making the data Accessible: the metadata can be located and retrieved, and the dataset downloaded, through the public machine-accessible interfaces provided by Dataverse [ref]. The data will also be mirrored on local servers as a backup. To make the data Interoperable, the metadata will be citable and will include domain-specific and file-level metadata that maps to metadata standards within the machine learning domain. To ensure the data is Reusable, the associated metadata will be published with a clear description of the datasets, the data collection, preprocessing and annotation processes, the use cases of the datasets, and any other information that supports understanding the context and composition of the data.
The data will be licensed under the CC BY-SA licence, under which credit must be given to the creator (BY) and the data can be used commercially and adapted, provided that adaptations are shared under identical terms (SA). Furthermore, to implement a FAIR data environment, we shall join the FORCE11 working group on FAIR data to guide us in data management and archival in contemporary data publishing environments.
Risks, Including Ethics and Privacy
a) All annotation guidelines shall be followed and reviewed every two months.
b) All translation teams shall consist of at least two members working independently so as to measure inter-annotator agreement. A high measure implies that annotators are following the guidelines strictly, and a very low measure implies that annotators require further training.
c) Use of annotation tools to speed up the work
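The proposal does not name a specific agreement measure. As one possible illustration for categorical annotations (for example sentiment or topic labels), the sketch below computes Cohen's kappa for two annotators who labelled the same items. The function name and the toy labels are assumptions for illustration; agreement between free-text translations would need a different measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy sentiment labels, not project data.
    annotator_1 = ["pos", "pos", "neg", "neg"]
    annotator_2 = ["pos", "neg", "neg", "neg"]
    print(cohen_kappa(annotator_1, annotator_2))
```

Values close to 1 indicate strong agreement, while values near 0 indicate agreement no better than chance, which would suggest that the guidelines or the annotator training need revisiting.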
Sustainability plan
The text and voice datasets will be maintained by each partner institution and will be expanded by masters and other postgraduate students as part of their research projects. The Swahili and Luganda voice datasets on the Common Voice platform will be sustained through the different university NLP clubs that will be formed as the project progresses. We will build on our experience with the Makerere NLP club to establish and maintain NLP clubs in the partner universities. The project partners will continue to work together voluntarily with AI communities, NLP clubs and academic institutions to contribute to the datasets. The project partners will also work closely with the institutes of languages and linguistics departments in the various partner countries, which will contribute greatly to the maintenance of the datasets beyond the lifetime of the project. The partners in the consortium are also part of the machine learning community and will continue to build NLP and speech recognition models from the datasets and to extend the work. Since we aim to have a good text corpus for Luganda, Acholi, Runyankore-Rukiga and Lumasaaba, we intend to move this text community to the Common Voice platform, which can also enable communities to make voice contributions in these diverse local languages.
To expand the annotated datasets, semi-supervised machine learning will be applied to new unlabelled data based on the existing labelled datasets. The protocols and tools used for this project will be open-sourced so that anyone can reuse and build upon them, ensuring sustainability beyond the scope of the project. The licensing adopted for the datasets will enable small entrepreneurs to innovate, create applications and train their locally built solutions, which will help with job and wealth creation.
An API for monetising access to the data will be built as a source of funds for sustaining data maintenance.