A Focus on Machine Translation for African Languages
Let's put Africa on the NLP map!
We need African researchers from ACROSS the continent to join our effort in building translation models for African languages. Masakhane means "We Build Together" in isiZulu and was inspired by the Deep Learning Indaba theme for 2018.
We want to develop baseline machine translation models from English to target African languages, with publicly available code and data. This problem is part data gathering, part developing the translation model, and part error analysis to understand what issues the models have.
Once we, as a collaborative African NLP team, have trained baseline translation models for many of our languages, we will combine our datasets and apply transfer learning with fine-tuning across the languages.
We will then write and submit a paper, covering all of our work, to a top-tier NLP conference, and in doing so put Africa on the NLP map once and for all.
Current State of NLP from Africa
Even in the forums which aim to widen NLP participation, Africa is barely represented, despite the fact that Africa has over 2000 languages. The Fourth Industrial Revolution in Africa cannot take place in English. It is imperative that NLP models be developed for the African continent.
In particular, for Africa to take part in the global conversation, we should be developing machine translation systems to translate the internet and its content into our languages, and vice versa.
As per Martinus (2019), some problems facing machine translation of African languages are as follows:
- Focus: According to Alexander (2009), African society does not see hope for indigenous languages to be accepted as a more primary mode of communication. As a result, there are few efforts to fund and focus on translation of these languages, despite their potential impact.
- Low Resourced: The lack of resources for African languages hinders researchers' ability to do machine translation.
- Low Discoverability: The resources for African languages that do exist are hard to find. Often one needs to be associated with a specific academic institution in a specific country to gain access to the language data available for that country. This reduces the ability of countries and institutions to combine their knowledge and datasets to achieve better performance and innovations. The existing research itself is also often hard to discover, since it is frequently published in smaller African conferences or journals, which are neither electronically available nor indexed by research tools such as Google Scholar.
- Lack of publicly-available benchmarks: Due to the low discoverability and the lack of research in the field, there are no publicly available benchmarks or leaderboards against which to compare new machine translation techniques.
- Reproducibility: The data and code of existing research are rarely shared, which means researchers cannot reproduce the results properly.
We propose to change that! Only by working together across the African continent can we do this!
But how will we do this?
- We've put together a Neural Machine Translation Google Colab notebook using Joey NMT, with parameters tweaked for low-resourced languages based on findings in (Abbott, 2019). The idea is that you find a dataset (or combine multiple datasets), i.e. a parallel corpus, for an African language of your choice (preferably one you can speak), and train a baseline model for your language.
- To aid the early stages, we recommend looking at the JW300 dataset, which is created from Jehovah's Witnesses texts. Religious texts aren't ideal for machine translation, so we suggest you spend time searching for extra parallel data for your language. Sources can include governmental documents, literature, and news items. If you can't find parallel data, then go ahead and collect monolingual data and dictionaries; we have a variety of techniques to help us train unsupervised NMT systems. Here is a link to the African languages we've identified in the JW300 dataset.
- If your data is small enough, a model should be trainable in Google Colab, on a GPU, from your browser (this might take around 10 hours).
- If your data is too big, then training might take longer than 10 hours. If so, we are acquiring Google Cloud resources for the project, so contact us at firstname.lastname@example.org
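Before training, your parallel corpus needs to be split into train, dev, and test portions. The sketch below is a minimal illustration of that step (it is not part of the official notebook); the file paths and split sizes are hypothetical placeholders for your own data, and the corpus files are assumed to be aligned line-by-line.

```python
import random

def split_parallel_corpus(src_path, trg_path, dev_size=1000, test_size=1000, seed=42):
    """Split a line-aligned parallel corpus into train/dev/test pair lists."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.read().splitlines()
    with open(trg_path, encoding="utf-8") as f:
        trg_lines = f.read().splitlines()
    assert len(src_lines) == len(trg_lines), "corpus files must be aligned line-by-line"

    pairs = list(zip(src_lines, trg_lines))
    # Shuffle so dev/test aren't drawn from a single document or domain.
    random.Random(seed).shuffle(pairs)

    test = pairs[:test_size]
    dev = pairs[test_size:test_size + dev_size]
    train = pairs[test_size + dev_size:]
    return train, dev, test

# Hypothetical usage with English-isiXhosa files:
# train, dev, test = split_parallel_corpus("corpus.en", "corpus.xh")
```

Held-out dev and test sets matter here because the reported BLEU score must come from sentences the model never saw during training.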
After you've trained your model, please submit your notebook and your test BLEU score, with links to your data (and in particular your test sets), to email@example.com
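The BLEU score you report measures n-gram overlap between your model's output and the reference translations on the test set. The notebook computes it for you via a standard implementation, but as a self-contained illustration of what the number means, here is a simplified corpus-level BLEU in plain Python (no smoothing or detokenisation, so it is for intuition only, not for reporting):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Geometric mean of modified 1..max_n-gram precisions times a brevity penalty.

    Inputs are parallel lists of whitespace-tokenised sentence strings."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Clipped counts: an n-gram only matches as often as it appears in the reference.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages very short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)

print(corpus_bleu(["we build together"], ["we build together"]))  # perfect match: 100.0
```

Scores range from 0 to 100, and a perfect match yields 100; scores for low-resourced language pairs are typically far lower, which is exactly why shared test sets and baselines matter.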
- Baseline Results: Re-evaluating
- Transfer Learning Results: Re-evaluating