A grassroots NLP community for Africa, by Africans
Let's put Africa on the NLP map!
MASAKHANE is a research effort for natural language processing for African languages that is OPEN SOURCE, CONTINENT-WIDE, DISTRIBUTED and ONLINE. This project houses the community, data, code, results and research for building open baseline translation results for African languages (other NLP tasks coming soon...).
We need African researchers from ACROSS the continent to join our effort in building translation models for African languages. Masakhane means "We Build Together" in isiZulu and was inspired by the Deep Learning Indaba theme for 2018.
To build and facilitate a community of NLP researchers, to connect and grow it, to spur and share further research, to build helpful tools for applications in government, medicine, science and education, and to enable language preservation and increase its global visibility and relevance.
For NLP Research
To build data sets and tools to facilitate NLP research on African languages, and to pose new research problems to enrich the NLP research landscape.
For the Global Community of Researchers
To discover best practices for distributed research, to be applied by other emerging research communities.
Current State of NLP in Africa
Even in the forums which aim to widen NLP participation, Africa is barely represented, despite the fact that Africa has over 2000 languages. The 4th Industrial Revolution in Africa cannot take place in English. It is imperative that NLP models be developed for the African continent.
In particular, for Africa to take part in the global conversation, we should be developing machine translation systems to translate the internet and its content into our languages and vice versa.
As per Martinus (2019), some problems facing machine translation of African languages are as follows:
Focus: According to Alexander (2009), African society does not see hope for indigenous languages to be accepted as a primary mode of communication. As a result, there are few efforts to fund and focus on translation of these languages, despite their potential impact.
Low Resourced: The lack of resources for African languages hinders researchers' ability to do machine translation.
Low Discoverability: The resources for African languages that do exist are hard to find. Often one needs to be associated with a specific academic institution in a specific country to gain access to the language data available for that country. This reduces the ability of countries and institutions to combine their knowledge and datasets to achieve better performance and innovations. Often the existing research itself is hard to discover, since it is often published in smaller African conferences or journals, which are neither electronically available nor indexed by research tools such as Google Scholar.
Lack of publicly-available benchmarks: Due to the low discoverability and the lack of research in the field, there are no publicly available benchmarks or leaderboards against which to compare new NLP techniques.
Reproducibility: The data and code of existing research are rarely shared, which means researchers cannot reproduce the results properly.
We propose to change that! Only by working together across the African continent can we do this!
But how will we do this?
Building a Community - We share resources and ideas, so join our Google Group and our Slack. You don't need to have a fancy degree from a fancy university to contribute to the project, so please join us :) Everyone is welcome!
Running Benchmarks - We've put together a Neural Machine Translation Google Colab notebook using Joey NMT, with parameters tuned for low-resourced languages based on the findings in (Abbott, 2019). The idea is that you find a corpus (or combine multiple corpora) for an African language of your choice (preferably one you can speak) and train a baseline result for your language. As a useful default, we use the JW300 dataset, which is created from Jehovah's Witness texts. Religious texts aren't ideal for machine translation, so we suggest that you spend time searching for extra parallel data for your language. Sources can include governmental documents, literature, and news items. Here is a link to the African languages we've identified in the JW300 dataset.
Building Datasets - We are building expertise around data gathering and performing "data archeology" to discover and create datasets!
Writing Papers - Our research is already unearthing interesting findings. We write papers together to be submitted to workshops and conferences.
Using Machine Translation as a stepping stone - Machine translation is the beginning. The long-term goal is to expand into the rest of NLP.
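For the Running Benchmarks step, combining several parallel corpora before training could look roughly like the sketch below. It is not part of the Masakhane notebook or of Joey NMT: the function names (combine_corpora, split_corpus), the placeholder sentence pairs, the split sizes and the seed are all assumptions made for illustration. It merges corpora, drops empty and duplicate sentence pairs, and carves out held-out dev and test sets.

```python
import random


def combine_corpora(corpora):
    """Merge parallel corpora, dropping empty and duplicate sentence pairs.

    Each corpus is a list of (source, target) string pairs.
    (Hypothetical helper, not part of Joey NMT.)
    """
    seen = set()
    merged = []
    for corpus in corpora:
        for source, target in corpus:
            pair = (source.strip(), target.strip())
            if not pair[0] or not pair[1]:
                continue  # skip pairs with a missing side
            if pair in seen:
                continue  # skip exact duplicates across corpora
            seen.add(pair)
            merged.append(pair)
    return merged


def split_corpus(pairs, dev_size, test_size, seed=42):
    """Shuffle deterministically and split into train, dev and test lists."""
    if dev_size + test_size >= len(pairs):
        raise ValueError("not enough sentence pairs for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    dev = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return train, dev, test


# Placeholder data standing in for, e.g., a default corpus plus extra
# parallel text found elsewhere.
corpus_a = [("src 1", "tgt 1"), ("src 2", "tgt 2"), ("", "tgt 3")]
corpus_b = [("src 2 ", "tgt 2"), ("src 4", "tgt 4"), ("src 5", "tgt 5")]

merged = combine_corpora([corpus_a, corpus_b])
train, dev, test = split_corpus(merged, dev_size=1, test_size=1)
print(len(merged), len(train), len(dev), len(test))
```

Removing duplicates before splitting helps keep the same sentence pair from landing in both the training and test sets, which would otherwise inflate a baseline score.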
To begin, check out the GitHub README.
The community consists of >400 participants from 30 African countries with diverse educations and occupations, and >3 countries outside Africa. As of February 2020, over 49 translation results for over 38 African languages have been published by over 35 contributors on GitHub.
The EMNLP Findings paper describes our approach to low-resource NLP: participatory research.
Where can I help?
You don't need to be an NLP researcher to join us! We want anyone passionate about African languages to join. There are many ways to help:
Accessing or creating datasets
Analysing how good our translation models are
Mentoring budding NLP practitioners
Being a storyteller - capturing our journey