By Peter Marsh, Data Scientist, Climate System Analysis Group (CSAG), University of Cape Town, South Africa
At the Climate System Analysis Group (CSAG), I focus on building open-source tools to support scientific collaboration. As part of the HE²AT Center, I have worked on a project to make clinical health data easier to harmonise and analyse across large, multi-country research efforts.
That work led to the development of ‘The Mapping App’, an AI-assisted tool designed to speed up and scale the harmonisation of clinical data drawn from a wide range of African studies.
The scale of the problem
Until now, harmonising clinical data has been done manually, usually by a small team of experts working with just a few studies at a time. In contrast, the HE²AT Center aims to combine data from more than 100 African cohort studies, covering over 200,000 patients across 12 countries and 40 trials, into a single FAIR-aligned database. Working at that scale required a different approach, one that could grow without losing accuracy or oversight.
Building ‘The Mapping App’
To meet that challenge, we created the Mapping App. At the tool’s core is an ontology recommendation engine that suggests the most likely matches between incoming variables and a curated set of 155 target variables. During validation, it ranked the correct variable first in 82% of cases, and within the top five 92% of the time. We also built a confidence indicator and a variable transformation engine powered by a large language model (LLM). This engine reads expert-written instructions and converts them into Python-like code. In early testing, it could automatically transform 22% of variables without manual input.
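To make the validation figures concrete, here is a minimal sketch of how "ranked first" and "within the top five" accuracy can be computed for a recommendation engine. It is not the Mapping App's implementation: plain string similarity from Python's standard-library difflib stands in for the ontology recommendation engine, and the function names and variable names (rank_targets, top_k_accuracy, "wt_kg" and so on) are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def rank_targets(source_name, targets):
    """Return target variable names sorted from most to least similar.

    Assumption: simple character-level similarity is used here as a
    stand-in for the real recommendation engine.
    """
    def score(target):
        return SequenceMatcher(None, source_name.lower(), target.lower()).ratio()

    return sorted(targets, key=score, reverse=True)


def top_k_accuracy(rankings, truths, k):
    """Fraction of cases where the correct target is among the first k suggestions."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical target variables and incoming study variables with their
# expert-confirmed matches.
targets = ["body_weight_kg", "body_height_cm", "age_years", "systolic_bp_mmhg"]
incoming = {"wt_kg": "body_weight_kg", "height": "body_height_cm", "age": "age_years"}

rankings = [rank_targets(name, targets) for name in incoming]
truths = list(incoming.values())

print("top-1 accuracy:", top_k_accuracy(rankings, truths, k=1))
print("top-3 accuracy:", top_k_accuracy(rankings, truths, k=3))
```

Reporting both numbers matters for a tool with expert supervision: top-1 accuracy shows how often the first suggestion can be accepted as is, while top-five accuracy shows how often the expert only needs to pick from a short list rather than search all 155 target variables.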
The Mapping App allows researchers to work more efficiently with complex datasets by combining automated tools with expert supervision. It is already helping make clinical data easier to use in climate-health studies, particularly those focused on heat-related health outcomes.
Open-source and reusable
We built the Mapping App using open-source tools like Streamlit, which makes it easier to test, share, and adapt across teams. The tool is freely available on GitHub for others to explore or modify: https://github.com/csag-uct/Metadata-Harmonisation-Tool
Researchers at the University of the Witwatersrand (Wits) are now leading a new project that builds on this work, expanding the tool’s functionality to support different harmonisation needs.
Collaboration and expert guidance
Throughout the development process, we worked closely with experts in data harmonisation, including Katherine Johnston, Lyndon Zass, and Wei Kheng Teh from eLwazi. Their advice helped us define the project scope and build a tool that others across the HE²AT and DS-I Africa networks can use.
When we first proposed the idea, Katherine Johnston said, “If you can make this work, it will be a game changer.” That input helped guide the project from the very beginning.
Looking ahead
Standardising clinical data is critical for cross-study, collaborative research, especially in fields like climate and health, where linked datasets are essential. The Mapping App offers a practical way to make that work more efficient and scalable while allowing expert input where needed. We hope others in the field will use, adapt, or expand on the tool to support their research and collaboration goals.
