Previous Positions
Data Governance Fellowships
The Masakhane Research Foundation, through the support of FAIR Forward, an initiative of the German Development Cooperation (GIZ), is glad to announce two Data Governance Fellowship positions.
Data sovereignty, African-centricity, sustainability, inclusivity and ownership are values fundamental to Masakhane. As it stands, Masakhane is regularly involved in language data collection activities. These are often volunteer efforts driven by researchers in a bid to create the datasets they need for their own research. We have also participated in dataset creation activities that have been funded by various organisations. For example, we have received a handful of Lacuna Fund grants to create language datasets and have collaborated with other organisations in implementing this work.
While data governance is a topic that is regularly addressed in our various meetings, we believe it requires dedicated resources to adequately explore, document and disseminate our learnings. These learnings are often by-products of our various activities and therefore run the risk of not being intentionally documented. The ‘Data Governance’ Fellowships will be a step towards making this documentation intentional. We especially want to hear from women and historically marginalised candidates with a passion for inclusive AI for sustainable development and a strong background in digital and AI-related topics.
Activities and Responsibilities
Development of an African NLP Data Collection Handbook
This handbook will compile learnings and recommendations from the language data collection experiences of those amongst our ranks and others within our network who have undertaken this work. These learnings will broadly address community involvement, with a view to enabling participatory dataset creation, curation and management, by discussing issues such as:
Which ethical data collection practices partners should observe
How data collection practices can produce culturally relevant data that speaks to local histories and practices, taking into account African-centric methodologies. When working in communities without a strong culture of documentation, efforts to document languages should focus on content relevant to those communities.
How to structure dataset development activities so as to meaningfully involve and upskill local populations in language technology development through mentorship and opportunities to contribute to subsequent research work
How to acknowledge, and how much to pay, contributors and annotators who work on the dataset
How to develop software applications for use by local populations, and put in place iterative feedback loops for the improvement of these tools and the continuous collection of additional data
A guide on the development of new terminology for African languages in contexts such as science, where there have not been sustained efforts to develop this terminology
Management and Maintenance of an African NLP data catalogue
Lanfrica is a platform that catalogues and links African language resources in order to mitigate the difficulty encountered in discovering African works by creating a centralised catalogue. For instance, if you’re looking for resources (linguistic datasets or research papers) in a particular African language, Lanfrica will point you to the different sources on the web that have such datasets in the desired language.
This project has adopted a participatory and community-led approach, which is in line with the fundamental values of Masakhane.
As the project platform is already set up, this part of the work will focus on:
Expanding the number of sources, i.e. repositories and data providers that feed the Lanfrica platform (e.g. arXiv, AfricArXiv and Zenodo)
Increasing the number of available resources, i.e. papers and datasets linked
Carrying out consultations with stakeholders in the African NLP ecosystem as well as users of Lanfrica so as to establish a data governance policy for the platform
Case Studies on IP, Copyright and Fair Use in Africa
The legal landscape for fair use and access to data for research purposes is quite nascent in Africa. We have found that many researchers in the African context approach accessing data for research purposes with a Western mindset, which assumes fair use. This means going ahead and scraping available data from the web without asking for permission or taking the time to note any IP or copyright restrictions on the data. Through our work and collaboration with the Centre for Intellectual Property and Information Technology (CIPIT) at Strathmore University, we have been approaching access to existing data on a case-by-case basis. This often means approaching data owners, letting them know of our intentions and asking for permission, a process for which some templates exist. We have also explored partnership models that would enable us to access data, and these can also be documented. This work aims to document a minimum of 5 case studies detailing how our interactions, as well as those of others within our networks, have unfolded with several data owners in Africa. From these, we intend to derive a list of good practices, recommendations on how others can approach such interactions and insight into what not to try.
Tooling for Data Statements
Datasheets for datasets is a tool for documenting the datasets used for training and evaluating machine learning models. The aim of datasheets is to increase dataset transparency and facilitate better communication between dataset creators and dataset consumers (e.g., those using datasets to train machine learning models). Datasheets encourage dataset creators to carefully reflect on the dataset creation process, enabling them to uncover possible sources of bias in their data or unintentional assumptions that they’ve made. For dataset consumers, the information contained within datasheets can help ensure that the dataset is the right choice for the task at hand. Datasheets can optionally be exposed to end users for increased transparency and trust.
In our work, we have been encouraging members of Masakhane to upload the datasets created in the course of their work onto an African NLP community on Zenodo. Additionally, we ask that each dataset be accompanied by a datasheet. Having noted the lack of adherence to these requirements, this work would create a tool to streamline the process of creating metadata for datasets. We envisage a web tool that allows users to enter project information and then prompts them, step by step, to create a datasheet draft or markdown file. The tool starts with a few questions with drop-downs. Given these responses, it then generates a document that serves as version 1 of the datasheet.
The tool will be open source, and its use will be required for datasets that individuals wish to upload to the Zenodo African NLP community. As the community feature on Zenodo allows the curator to decline a request to publish a dataset, this requirement will be enforceable on the platform.
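As an illustration only, the question-then-draft flow described above could be sketched in Python along the following lines. This is a minimal sketch under stated assumptions, not the planned tool: the section headings follow the published "Datasheets for Datasets" proposal, while the question keys, prompts, drop-down options and the function names `render_datasheet` and `invalid_answers` are all hypothetical.

```python
from dataclasses import dataclass

# Section headings as proposed in "Datasheets for Datasets".
SECTIONS = (
    "Motivation",
    "Composition",
    "Collection Process",
    "Preprocessing/Cleaning/Labeling",
    "Uses",
    "Distribution",
    "Maintenance",
)


@dataclass(frozen=True)
class Question:
    key: str             # identifier used in the submitted responses
    section: str         # datasheet section the answer belongs to
    prompt: str          # text shown to the user
    options: tuple = ()  # drop-down choices; empty means free text


# Hypothetical starter questions; a real tool would define many more.
QUESTIONS = (
    Question("languages", "Composition", "Which language(s) does the dataset cover?"),
    Question("source", "Collection Process", "How was the data obtained?",
             ("Community contributions", "Existing archive", "Web data used with permission")),
    Question("licence", "Distribution", "Under which licence is the dataset released?",
             ("CC BY 4.0", "CC BY-SA 4.0", "Other")),
)

PLACEHOLDER = "_To be completed._"


def invalid_answers(answers):
    """Return the keys whose answer is not one of the drop-down options."""
    bad = []
    for q in QUESTIONS:
        answer = answers.get(q.key)
        if answer is not None and q.options and answer not in q.options:
            bad.append(q.key)
    return bad


def render_datasheet(title, answers):
    """Build a version 1 markdown draft, leaving placeholders for gaps."""
    lines = [f"# Datasheet: {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", ""]
        asked = [q for q in QUESTIONS if q.section == section]
        if not asked:
            lines += [PLACEHOLDER, ""]
        for q in asked:
            answer = str(answers.get(q.key, "")).strip() or PLACEHOLDER
            lines += [f"**{q.prompt}**", "", answer, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    responses = {"languages": "Kiswahili", "licence": "CC BY 4.0"}
    assert invalid_answers(responses) == []
    print(render_datasheet("Example corpus", responses))
```

Keeping the questions as plain data means curators could extend or reword them without touching the rendering logic, and unanswered sections stay visible as placeholders rather than being silently dropped from the draft.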
Results (Outputs):
Data Collection Handbook (a handbook on data collection best practices and recommendations), with a focus on African language data and lessons from the African context
Further development of an African NLP dataset catalogue platform, i.e. Lanfrica
Data governance policy for Lanfrica
Compilation of case studies on Intellectual Property, copyright and fair use of data in Africa
Technical tooling for streamlining the process of creating data statements/data sheets for datasets
Recommendations on Community-led Data Governance with a focus on African NLP
Professional Requirements of the Candidate:
High-level understanding of relevant fields such as Natural Language Processing and dataset creation for Machine Learning. Applicants with a humanities background are encouraged to apply.
Experience undertaking qualitative research (stakeholder interviews, questionnaires, etc.), using statistical methods and other digital tools for analysis of results and drawing insights.
Strong communication skills in English and efficient collaboration skills across digital platforms (having knowledge of an African language would be a plus)
Strong writing skills, ability to disseminate learnings and difficult concepts in written format and in plain, easy to understand language.
Master’s degree in a relevant field or equivalent research experience
Number of Individuals: 2
Time commitment: part-time (approximately 18 hours a week)
Remuneration: KES 2,730,240 before tax (approximately USD 20,984 at the current exchange rate of KES 130.11 to the US dollar). As the funding is received in KES, payment will be made using the exchange rate published by the Central Bank of Kenya on the day of payment.
Duration: 9 months
Location: Remote
To apply, please share a copy of your latest CV and a one-page (A4) motivation letter to mrf-employment@googlegroups.com before 22h00 GMT on Friday, 19 April 2024.
Linguist and Language Technology Resident
The Lacuna Fund will soon be putting out a 2nd call for African NLP datasets. Masakhane and the Lacuna Fund are jointly opening a position for a Resident who will work to build avenues and resources for greater multi-disciplinary collaboration with linguists. This work aims to open up further possibilities in language technologies, primarily in the creation of datasets, and to build relationships that can support applications to the 2nd Lacuna funding call for language datasets.
Through some interaction with several linguists who are already part of Masakhane, we have learnt that:
While both technologists and linguists have an innate sense that they can be of use to each other, there is room to further explore what resources can be created at the intersection of the two disciplines that would be of use in building language technology
Linguists, through their education and academic careers, often develop language resources as part of language documentation efforts. However, these resources are often not in a format that is easily accessible for Machine Learning, or are not online at all. Creating frameworks for collaboration between the two disciplines, with the expected outcome of digitising these resources, would be useful.
This will be a three-month role.
Given the majority representation of NLP researchers within Masakhane, the ideal candidate for this role is a linguist who has access to networks and communities of linguists and has collaborated on Masakhane projects. This positions them well to access the individuals with the profiles needed to partake in the research work.
Responsibilities
Design a plan for engagement of the linguists and NLP researcher stakeholder groups
Define the outputs/resources to be created (ideally resources that build knowledge of both fields and encourage collaboration, e.g. NLP for Linguists/Linguistics for NLP notebooks/tutorials)
Facilitate sessions with stakeholders
Compile outputs
Requirements
Previous work experience as a linguist
A good network of linguists and involvement with linguist communities, particularly in Africa
Knowledge of software commonly used by linguists in language documentation and other professional duties
Experience working with Masakhane NLP researchers
High-level grasp of NLP research work
Knowledge around language datasets for NLP
Event organisation and coordination
Number of Individuals: 1
Time commitment: 10 to 15 hours a week (will vary)
Remuneration: US$ 1000 per month
Duration: 3 months (October to January with a break for the holidays)
Location: Remote
How to apply: Email your CV and a 500-word motivation, highlighting relevant experience, to tgwadabe@yahoo.com
Application Deadline: 22 September 2021
Francophone Engagement Resident
The Lacuna Fund will soon be putting out a 2nd call for African NLP datasets. Masakhane and the Lacuna Fund are jointly opening a position to complement the 2nd call for proposals for AfricaNLP datasets, to enable more inclusion and facilitate more collaborations and partnerships across the continent.
This position is a Francophone Engagement Residency. Among members of our community, it is becoming increasingly apparent that, because English is the language of science, Francophone and Lusophone African researchers, and consequently their languages, are being left behind. In a bid to begin closing this gap, we propose a series of activities to be led by several members of our community who belong to these underrepresented groups.
Responsibilities
As part of the Francophone Engagement Residency, you would be responsible for:
The translation of existing Masakhane resources to French. More specifically:
The Masakhane website
NLP notebooks and READMEs of projects on GitHub
Talks and presentations that are central to communicating the mission of Masakhane
Application support and mentorship for the Lacuna Fund call
Advocating for the inclusion of Francophone African languages in the existing Lacuna-funded Masakhane projects (NER, MT), in addition to submissions for original works.
Evangelism about African NLP within Francophone machine learning communities, encouraging them to apply to the Lacuna Fund for grants
Requirements
Experience in community engagement
1+ years of administration experience (e.g. through AI community management or as part of a startup)
Good networks within the African AI community, especially the Francophone community
Fluent in French, spoken and written.
Excellent communicator
Existing participation in the Masakhane community
Number of Individuals: 3
Time commitment: 10 to 15 hours a week (will vary)
Remuneration: US$ 1000 per month
Duration: 3 working months (October to mid-Jan, allowing for a break in December)
Location: Remote
How to apply: Email your CV and a 500-word motivation, highlighting relevant experience, to tgwadabe@yahoo.com
Application Deadline: 22 September 2021
Mentorship and Collaboration Resident
The Lacuna Fund is putting out a 2nd call for African NLP datasets. Masakhane and the Lacuna Fund are jointly opening a position to complement the 2nd call for proposals for AfricaNLP datasets. The Mentorship and Collaboration Resident will work with individuals and teams to grow their ideas and facilitate more collaborations and partnerships across the continent, thereby enabling inclusion and stronger applications to the 2nd Lacuna call for language datasets.
This would entail several sets of activities whose main purpose will be to make individuals in the ecosystem aware of each other's work, and where applicable, encourage them to collaborate ahead of the next call for funding.
Responsibilities
Design a preliminary Expression of Interest (EoI) for individuals within the extended Masakhane network who have any ongoing work or ideas for work that could benefit from receiving funding via the Lacuna call.
Design a mentorship program such that each group of applicants is able to access a mentor with relevant experience
Organize several workshops bringing together applicants who are working in related fields.
Coordinate and organize communication sessions between relevant groups.
Publicise several calls for EoI, Mentorship programs, and workshops and maintain accessible information on each.
Requirements
High-level understanding of relevant fields such as Natural Language Processing, Linguistics and dataset creation for Machine Learning.
Experience with technical event organization such as workshops, hackathons, or mentorship sessions.
Established network and involvement in the African ML/NLP/Data science ecosystem and/or Masakhane
Strong communication and collaboration skills.
Number of Individuals: 2
Time commitment: 10 to 15 hours a week (will vary)
Remuneration: US$ 1000 per month
Duration: 3 working months (October to mid-Jan, allowing for a break in December)
Location: Remote
How to apply: Email your CV and a 500-word motivation, highlighting relevant experience, to tgwadabe@yahoo.com
Application Deadline: 22 September 2021