Open Positions
Data Governance Fellowships
The Masakhane Research Foundation, through the support of FAIR Forward, an initiative of the German Development Cooperation (GIZ), is glad to announce two Data Governance Fellowship positions.
Data sovereignty, African-centricity, sustainability, inclusivity and ownership are values fundamental to Masakhane. As it stands, Masakhane is regularly involved in language data collection activities. These are often volunteer efforts driven by researchers in a bid to create the datasets they need for their own research. We have also participated in dataset creation activities that have been funded by various organisations. For example, we have received a handful of Lacuna Fund grants to create language datasets and have collaborated with other organisations in implementing this work.
While data governance is a topic that is regularly addressed in our various meetings, we believe it requires dedicated resources to adequately explore, document and disseminate our learnings. These learnings are often by-products of our various activities and therefore run the risk of not being intentionally documented. The ‘Data Governance’ Fellowships will be a step towards documenting them intentionally. We especially want to hear from women and historically marginalised candidates with a passion for inclusive AI for sustainable development and a strong background in digital and AI-related topics.
Activities and Responsibilities
Development of an African NLP Data Collection Handbook
This handbook will compile learnings and recommendations from the language data collection experiences of those amongst our ranks and others within our network who have undertaken this work. These learnings will broadly address community involvement, with a view to enabling participatory dataset creation, curation and management, by discussing issues such as:
What ethical data collection practices should partners observe?
How can data collection practices result in the creation of culturally relevant data that speaks to local histories and practices, taking into account African-centric methodologies? When working in communities where there has not been a strong culture of documenting, efforts to document languages should focus on content relevant to the communities.
How to structure dataset development activities so as to meaningfully involve and upskill local populations in language technology development through mentorship and opportunities to contribute to subsequent research work
How to acknowledge contributors and annotators to the dataset, and how much to pay them
How to develop software applications for use by local populations, and put in place iterative feedback loops for the improvement of these tools and continuous collection of additional data
A guide on the development of new terminology for African languages in contexts such as science, where there have not been sustained efforts to develop this terminology
Management and Maintenance of an African NLP data catalogue
Lanfrica is a platform that catalogues and links African language resources in order to mitigate the difficulty encountered in discovering African works by creating a centralised catalogue. For instance, if you’re looking for resources (linguistic datasets or research papers) in a particular African language, Lanfrica will point you to the different sources on the web that have such datasets in the desired language.
This project has adopted a participatory and community-led approach, which is in line with the fundamental values of Masakhane.
As the project platform is already set up, this part of the work will focus on:
Expanding the number of sources (i.e. repositories and data providers that feed the Lanfrica platform, e.g. arXiv, AfricArXiv and Zenodo)
Increasing the number of available resources, i.e. linked papers and datasets
Carrying out consultations with stakeholders in the African NLP ecosystem, as well as users of Lanfrica, so as to establish a data governance policy for the platform
Case Studies on IP, Copyright and Fair Use in Africa
The legal landscape for fair use and access to data for research purposes is quite nascent in Africa. We have found that many researchers in the African context approach access to data for research purposes with a Western mindset, which assumes fair use. This means going ahead and scraping available data from the web without asking for permission or taking the time to note any IP or copyright restrictions on the data. Through our work and collaboration with the Centre for Intellectual Property and Information Technology (CIPIT) at Strathmore University, we have been approaching access to existing data on a case-by-case basis. This often means approaching data owners, letting them know of our intentions and asking for permission, a process for which some templates exist. We have also explored partnership models that would enable us to access data, and these can also be documented. This work aims to document a minimum of 5 case studies, accounts that will detail how our interactions, as well as those of others within our networks, have unfolded with several data owners in Africa. From these, we intend to derive a list of good practices, recommendations on how others can approach such interactions and insight into what not to try.
Tooling for Data Statements
Datasheets for datasets is a tool for documenting the datasets used for training and evaluating machine learning models. The aim of datasheets is to increase dataset transparency and facilitate better communication between dataset creators and dataset consumers (e.g., those using datasets to train machine learning models). Datasheets encourage dataset creators to carefully reflect on the dataset creation process, enabling them to uncover possible sources of bias in their data or unintentional assumptions that they’ve made. For dataset consumers, the information contained within datasheets can help ensure that the dataset is the right choice for the task at hand. Datasheets can optionally be exposed to end users for increased transparency and trust.
In our work, we have been encouraging members of Masakhane to upload the datasets created in the course of their work onto an African NLP community on Zenodo. Additionally, we ask that each dataset be accompanied by a datasheet. Having noted the lack of adherence to these requirements, this work would create a tool to streamline the process of creating metadata for datasets. We envisage a web tool that allows users to enter project information and then prompts them, step by step, to create a datasheet draft or markdown file. The tool starts with a few questions with drop-downs. Given these responses, it then generates a document that serves as version 1 of a datasheet.
The tool will be open source, and its use will be a requirement for datasets that individuals wish to upload to the Zenodo African NLP community. As the community feature on Zenodo allows the curator to decline a request to publish a dataset, this requirement will be enforceable on the platform.
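As a rough illustration of the step-by-step flow described above, the sketch below turns a handful of answers into a markdown datasheet draft. It is only a minimal sketch under stated assumptions, not the planned tool: the `ProjectInfo` fields, the `render_datasheet` function and the placeholder wording are hypothetical, and the section headings are taken from the published "Datasheets for Datasets" template rather than from any Masakhane specification.

```python
from dataclasses import dataclass, field

# Section headings as used in the "Datasheets for Datasets" template.
SECTIONS = [
    "Motivation",
    "Composition",
    "Collection Process",
    "Preprocessing/Cleaning/Labeling",
    "Uses",
    "Distribution",
    "Maintenance",
]

# Placeholder text for sections the user has not answered yet (hypothetical wording).
PLACEHOLDER = "TODO: to be completed by the dataset creator."


@dataclass
class ProjectInfo:
    """Hypothetical stand-in for the answers to the tool's initial questions."""
    name: str
    languages: list[str]
    licence: str
    task: str
    # Free-text answers keyed by section heading; missing sections get a placeholder.
    answers: dict[str, str] = field(default_factory=dict)


def render_datasheet(info: ProjectInfo) -> str:
    """Build a version 1 datasheet draft as a markdown string."""
    lines = [
        f"# Datasheet: {info.name}",
        "",
        f"- Languages: {', '.join(info.languages)}",
        f"- Licence: {info.licence}",
        f"- Intended task: {info.task}",
        "",
    ]
    for section in SECTIONS:
        lines.append(f"## {section}")
        lines.append("")
        lines.append(info.answers.get(section, PLACEHOLDER))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only; they do not describe a real dataset.
    example = ProjectInfo(
        name="Example Corpus",
        languages=["Swahili"],
        licence="CC-BY-4.0",
        task="machine translation",
        answers={"Motivation": "Created to illustrate the draft format."},
    )
    print(render_datasheet(example))
```

Keeping unanswered sections as visible placeholders means the draft can be uploaded alongside the dataset and completed later, while a curator can still see at a glance which parts are missing.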
Results (Outputs):
Data Collection Handbook (a handbook on data collection best practices and recommendations), with a focus on African language data and lessons from the African context
Further development of an African NLP dataset catalogue platform, i.e. Lanfrica
Data governance policy for Lanfrica
Compilation of case studies on Intellectual Property, copyright and fair use of data in Africa
Technical tooling for streamlining the process of creating data statements/data sheets for datasets
Recommendations on Community-led Data Governance with a focus on African NLP
Professional Requirements of the Candidate:
High-level understanding of relevant fields such as Natural Language Processing and dataset creation for Machine Learning. Applicants with a humanities background are encouraged to apply.
Experience undertaking qualitative research (stakeholder interviews, questionnaires, etc.), using statistical methods and other digital tools for analysis of results and drawing insights.
Strong communication skills in English and efficient collaboration skills across digital platforms (having knowledge of an African language would be a plus)
Strong writing skills, ability to disseminate learnings and difficult concepts in written format and in plain, easy to understand language.
Master’s degree in a relevant field or equivalent research experience
Number of Individuals: 2
Time commitment: part-time (approximately 18 hours a week)
Remuneration: KES 2,730,240 before tax (approximately USD 20,984 at the current exchange rate of KES 130.11 to the US dollar). As the funding is received in KES, payment will be made using the exchange rate made available by the Central Bank of Kenya on the day of payment.
Duration: 9 months
Location: Remote
To apply, please send a copy of your latest CV and a one-page (A4) motivation letter to mrf-employment@googlegroups.com before 22h00 GMT on Friday, 19 April 2024.