A Focus on Machine Translation for African Languages
Let's put Africa on the NLP map!
We need African researchers from ACROSS the continent to join our effort in building translation models for African languages. Masakhane means "We Build Together" in isiZulu and was inspired by the Deep Learning Indaba theme for 2018.
We want to develop baseline machine translation models from English to target African languages, with publicly available code and data. This problem is part data gathering, part developing the translation model, and part error analysis to understand what issues the models have. We are currently at over 30 benchmarks.
Once we, as a collaborative African NLP team, have trained baseline translation models for many of our languages, we will combine our datasets and apply transfer learning, fine-tuning across the languages.
We will then write and submit a paper covering all of our work to a top-tier NLP conference and, in doing so, put Africa on the NLP map once and for all.
Current State of NLP in Africa
Even in the forums which aim to widen NLP participation, Africa is barely represented - despite the fact that Africa has over 2000 languages. The Fourth Industrial Revolution in Africa cannot take place in English. It is imperative that NLP models be developed for the African continent.
In particular, for Africa to take part in the global conversation, we should be developing machine translation systems to translate the internet and its content into our languages and vice versa.
As per Martinus (2019), some problems facing machine translation of African languages are as follows:
- Focus: According to Alexander (2009), African society does not see hope for indigenous languages to be accepted as a primary mode of communication. As a result, there are few efforts to fund and focus on translation of these languages, despite their potential impact.
- Low Resourced: The lack of resources for African languages hinders the ability of researchers to do machine translation.
- Low Discoverability: The resources for African languages that do exist are hard to find. Often one needs to be associated with a specific academic institution in a specific country to gain access to the language data available for that country. This reduces the ability of countries and institutions to combine their knowledge and datasets to achieve better performance and innovations. The existing research itself is also hard to discover, since it is often published in smaller African conferences or journals, which are neither electronically available nor indexed by research tools such as Google Scholar.
- Lack of publicly-available benchmarks: Due to the low discoverability and the lack of research in the field, there are no publicly available benchmarks or leaderboards against which to compare new machine translation techniques.
- Reproducibility: The data and code of existing research are rarely shared, which means researchers cannot reproduce the results properly.
We propose to change that! Only by working together across the African continent can we do this!
But how will we do this?
- Building a Community - We share resources and ideas, so join our Google Group and our Slack. You don't need to have a fancy degree from a fancy university to contribute to the project, so please join us :) Everyone is welcome!
- Running Benchmarks - We've put together a Neural Machine Translation Google Colab notebook using Joey NMT, with parameters tweaked for low-resourced languages based on the findings in (Abbott, 2019). The idea is that you find a corpus (or combine multiple corpora) for an African language of your choice (preferably one you can speak) and train a baseline result for your language. As a useful default, we use the JW300 dataset, which is created from Jehovah's Witness texts. Religious texts aren't ideal for machine translation, so we suggest that you spend time searching for extra parallel data for your language. Sources can include governmental documents, literature, and news items. Here is a link to the African languages we've identified in the JW300 dataset.
- Writing Papers - Our research is already unearthing interesting findings. We write papers together to be submitted to workshops and conferences.
- Using Machine Translation as a stepping stone - Machine translation is the beginning. The long-term goal is to expand into the rest of NLP.
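The corpus preparation described under Running Benchmarks (combining several parallel corpora for one language pair before training a baseline) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `read_parallel`, `combine_corpora` and `split_corpus` are hypothetical, they are not part of the Masakhane notebook or of Joey NMT, and the sentences are placeholders rather than real data.

```python
import random


def read_parallel(src_lines, tgt_lines):
    """Pair source and target sentences, skipping pairs with an empty side."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def combine_corpora(*corpora):
    """Merge lists of (source, target) pairs, dropping exact duplicates.

    The first occurrence of each pair is kept, so input order is preserved.
    """
    seen = set()
    merged = []
    for corpus in corpora:
        for pair in corpus:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


def split_corpus(pairs, dev_size, test_size, seed=42):
    """Shuffle deterministically, then carve off dev and test sets."""
    if dev_size + test_size >= len(pairs):
        raise ValueError("not enough sentence pairs for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    dev = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return train, dev, test


if __name__ == "__main__":
    # Placeholder data; real corpora would be read from files.
    religious = read_parallel(
        ["Hello.", "Thank you.", ""],
        ["<target 1>", "<target 2>", "<target 3>"],
    )
    news = read_parallel(
        ["Thank you.", "Good morning.", "Good night."],
        ["<target 2>", "<target 4>", "<target 5>"],
    )
    corpus = combine_corpora(religious, news)
    train, dev, test = split_corpus(corpus, dev_size=1, test_size=1)
    print(len(corpus), len(train), len(dev), len(test))
```

Removing exact duplicates before splitting helps keep the same sentence pair from landing in both the training and test sets, which would otherwise inflate a benchmark score. Near-duplicates would need a stricter filter than this sketch provides.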
To begin, check out the GitHub README.
The community consists of 144 participants from 17 African countries and 2 countries outside Africa (USA and Germany), with diverse educational backgrounds and occupations. Currently, over 30 translation results for over 28 African languages have been published by over 25 contributors on GitHub.
The community recently submitted at least 8 papers to the AfricaNLP workshop at ICLR.
Where can I help?
You don't need to be an NLP researcher to join us! We want anyone passionate about African languages to join. There are many ways to help:
- Accessing or creating datasets
- Training Models
- Analysing how good our translation models are
- Mentoring budding NLP practitioners
- Being a storyteller - capturing our journey