The Power of Open Health Data: Impact, Representation, and Knowledge Diffusion

This study evaluates four major open health data repositories using a novel two-degree citation methodology. It finds that open data consistently generates a ~10x indirect citation amplification across vastly different funding levels, but that significant disparities in global representation and persistent gender gaps in senior authorship show that data access alone cannot address structural inequities in research leadership.

Published 2026-03-24

Original authors: Gorijavolu, R., Armengol de la Hoz, M. A., Bielick, C., Cajas, S., Charpignon, M.-L., El Mir, A., Gichoya, J. W., Kwak, H. G., Madapati, K., Mattie, H., McCullum, L., Mwavu, R., Nair, V., Nakayama, L. F., Nanyonjo, J., Nazer, L., Patel, M. S., Sauer, C. M., Celi, L. A.

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the world of medical research as a massive library. For a long time, the most valuable books (patient health data) were locked in private vaults, accessible only to a few wealthy researchers. In recent years, governments and organizations have spent billions of dollars to unlock these vaults and put the books on open shelves, hoping that anyone, anywhere, could read them and discover new cures.

This paper is like a report card on how well that "open library" experiment is actually working. The authors asked two big questions:

How much value is this data creating? (Is it just sitting there, or is it sparking new ideas?)
Who is actually reading the books? (Is it just the same old group of experts, or are people from all over the world, including poorer countries, getting a seat at the table?)

Here is the breakdown of their findings using some simple analogies:

1. The "Snowball Effect" (Citation Amplification)

The researchers discovered something amazing about how ideas spread.

The Direct Impact: When a scientist uses the data to write a paper, that's one "direct" hit.
The Indirect Impact: But then, other scientists read that paper and use its ideas to write their own papers.

The study found a consistent "10x Snowball Effect." For every paper that directly cites the data, about 10 more papers go on to cite that paper in turn. It's like throwing a pebble into a pond; the initial splash is small, but the ripples spread out to touch the entire shore. This happened across all four major data "libraries" they studied, regardless of how much money was spent to build them.

2. The "Ticket Price" Problem (Funding vs. Output)

The researchers compared four different data libraries:

MIMIC: A small, free, easy-to-download library of ICU records (Cost: ~$14 million).
UK Biobank: A massive, long-term study of 500,000 people with DNA data (Cost: ~$525 million).
OpenSAFELY: A secure system for UK doctors' records (Cost: ~$53 million).
All of Us: A huge US project trying to enroll 1 million diverse people (Cost: ~$2.16 billion).

The Surprise: When they looked at how many research papers directly citing the data were generated for every $1 million spent, the cheapest library (MIMIC) was the clear winner. It produced about 689 papers per million dollars, while the most expensive one (All of Us) produced only about 1 paper per million dollars.

Why?
Think of it like building a house.

MIMIC was like taking an existing house, cleaning it up, and handing out keys. It was cheap and easy to use, so everyone rushed in.
All of Us was like building a new city from scratch, hiring thousands of workers, and planting trees. It's incredibly valuable for the future, but right now, the "construction costs" are so high that the "number of houses built per dollar" looks low.
The Lesson: You can't just compare the "number of papers" to the "total budget" without understanding what that budget actually paid for.

3. The "Who's in the Room?" Problem (Equity and Diversity)

This is where the story gets complicated. The authors wanted to know: Who is doing the research?

The "Global South" Gap:
- MIMIC was a huge success for global inclusion. Because it was free and easy to download, researchers from Low- and Middle-Income Countries (LMICs) (like India, Brazil, and Uganda) made up 42% of the authors. They weren't just helpers; they were the bosses (senior authors) too.
- All of Us, despite being the most expensive and diverse project, had very few researchers from these countries (only 4%). The barriers to entry (complex rules, cloud computing costs) kept them out.
The "Glass Ceiling" for Women:
- Across all four libraries, there was a persistent gap. Women were well-represented as the "first authors" (the people doing the heavy lifting and writing the paper), but they were much less likely to be the "last authors" (the senior professors or lab heads who get the credit and funding).
- It's like a sports team where the female players are great at scoring goals, but the coaches and team owners are almost always men. This isn't a problem with the data itself; it's a problem with the system of how science careers work.

4. The "Tool vs. Treasure" Paradox

One interesting twist was why MIMIC was so popular.

MIMIC is often used by computer scientists as a "practice field" (like a video game level) to test their AI algorithms. It's easy to use, so it gets cited a lot.
All of Us contains deep, rich data about real people's lives, but it's harder to use.
The Takeaway: Sometimes, the data that gets the most "fame" (citations) isn't the deepest or the most locally relevant data. It's the data that is easiest to play with.

The Bottom Line

This paper tells us that open data is a powerful engine for knowledge, creating a ripple effect that is 10 times bigger than the initial research.

However, access doesn't automatically mean fairness.

Making data free helps researchers from poorer countries get involved (like MIMIC did).
But simply having a diverse group of people in the room doesn't mean they are all getting the same opportunities to lead (the senior author gap).
And spending billions on a project doesn't guarantee it will produce more "papers" than a cheap, simple project, because some projects are building the foundation for the future, not just the immediate results.

In short: We need to keep opening the vaults, but we also need to make sure the keys are easy to find, and that everyone who walks in has a chance to become the captain of the ship, not just a passenger.

1. Problem Statement

Despite billions of dollars in public funding allocated to open health data repositories (e.g., MIMIC, UK Biobank, All of Us), there is no systematic framework to evaluate:

Downstream Scholarly Impact: Existing bibliometric methods typically only count "first-degree" citations (papers directly using the data), failing to capture the broader "second-degree" knowledge diffusion (papers citing the papers that used the data).
Research Community Composition: It is unclear whether open data truly levels the playing field for researchers in Low- and Middle-Income Countries (LMICs) or if it merely attracts them as participants rather than intellectual leaders.
Equity Dimensions: There is a lack of understanding regarding how gender and geographic diversity vary across different repository models and whether access policies translate into locally relevant knowledge production.

2. Methodology

The authors conducted a cross-sectional bibliometric analysis using the OpenAlex database (data retrieved Jan–Feb 2026).

Study Subjects: Four major open health data repositories representing distinct models:
1. MIMIC: Retrospective EHR data (Critical Care, Boston).
2. UK Biobank: Prospective cohort with genomics (UK).
3. OpenSAFELY: Federated EHR platform (Primary Care, England).
4. All of Us: Prospective national cohort with biobanking and community engagement (USA).
Two-Degree Citation Methodology:
- First-Degree: Identified all publications directly citing the repository's primary works ($n = 30{,}049$).
- Second-Degree: Identified all publications citing the first-degree papers ($n = 485{,}396$).
- Metric: Calculated a Citation Amplification Ratio (Second-degree mass / First-degree mass) to quantify indirect knowledge diffusion.
Normalization: All output metrics were normalized by total program funding (in millions of USD) to allow cross-repository comparison.
Demographic Analysis:
- Gender: Inferred via first names using the Genderize.io API (identification rate >98%).
- Geography: Institutional affiliations mapped to World Bank 2024 income classifications (HIC vs. LMIC).
- Positioning: Analyzed authors by position (First, Last/Senior, Middle) to distinguish participation from leadership.
Statistical Analysis: Used Pearson's chi-square tests with odds ratios (OR) and Cramér's V to assess demographic differences.
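
The two-degree citation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `cited_by` mapping, the paper IDs, and both function names are hypothetical, and the sketch assumes that "mass" means the number of unique papers and that papers already counted at the first degree are excluded from the second degree. The summary does not spell out either choice.

```python
from typing import Dict, Set, Tuple


def two_degree_sets(
    cited_by: Dict[str, Set[str]], primary_works: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (first_degree, second_degree) sets of paper IDs.

    cited_by maps a paper ID to the set of paper IDs that cite it.
    """
    # First degree: every paper citing at least one primary work.
    first: Set[str] = set()
    for work in primary_works:
        first |= cited_by.get(work, set())
    first -= primary_works

    # Second degree: every paper citing at least one first-degree paper.
    second: Set[str] = set()
    for paper in first:
        second |= cited_by.get(paper, set())
    # Assumption: papers already counted as primary or first-degree are excluded.
    second -= first | primary_works
    return first, second


def amplification_ratio(first: Set[str], second: Set[str]) -> float:
    """Second-degree count divided by first-degree count (assumed definition)."""
    if not first:
        raise ValueError("no first-degree papers")
    return len(second) / len(first)


# Toy citation graph with hypothetical IDs, for illustration only.
toy_cited_by = {
    "PRIMARY": {"A", "B"},
    "A": {"B", "C", "D"},
    "B": {"D", "E"},
}
first, second = two_degree_sets(toy_cited_by, {"PRIMARY"})
print(sorted(first), sorted(second), amplification_ratio(first, second))
```

In this toy graph, two papers cite the primary work and three further papers cite those two, so the ratio is 1.5. In practice the `cited_by` mapping would be built from a citation index such as OpenAlex, which the authors used.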

3. Key Contributions

Novel Metric: Introduced a two-degree citation methodology that reveals a consistent ~10× amplification of knowledge diffusion beyond direct data users, a metric previously unquantified in health data literature.
Equity Framework: Distinguished between Representational Equity (presence in authorship) and Transformative Equity (local capacity building and relevant knowledge production), arguing that high LMIC authorship does not automatically equate to the latter.
Comparative Analysis: Provided the first large-scale comparison of four distinct repository models, highlighting how design (retrospective vs. prospective, federated vs. centralized) and funding structure influence research community composition.

4. Key Results

A. Scholarly Impact & Funding Efficiency

Citation Amplification: All four repositories exhibited a remarkably consistent citation amplification ratio between 9.3× and 11.5×. This indicates that open data generates indirect scholarly impact roughly 10 times greater than its direct usage.
Funding-Normalized Output: There was an extreme disparity in efficiency when normalized by total funding:
- MIMIC: 689 first-degree papers per $1M (Total Funding: $14.4M).
- All of Us: 1 first-degree paper per $1M (Total Funding: $2.16B).
- Note: The authors attribute this to MIMIC being a low-cost, retrospective curation project, whereas All of Us includes massive costs for recruitment, biobanking, and community engagement not captured by citation metrics.
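
As a rough sanity check on these figures, the normalization can be run in both directions. The sketch below is illustrative only: the function names are hypothetical, and the "implied" paper counts are back-calculated from the rounded rates and funding totals quoted above, so they are approximations rather than numbers reported by the authors. The All of Us rate in particular is rounded to a single digit.

```python
def papers_per_million(n_papers: float, funding_musd: float) -> float:
    """First-degree papers per $1M of total program funding."""
    if funding_musd <= 0:
        raise ValueError("funding must be positive")
    return n_papers / funding_musd


def implied_papers(rate_per_musd: float, funding_musd: float) -> float:
    """Back out an approximate paper count from a rounded rate."""
    return rate_per_musd * funding_musd


# (papers per $1M, total funding in $M) as quoted in this summary.
reported = {
    "MIMIC": (689, 14.4),
    "All of Us": (1, 2160.0),
}

for name, (rate, funding) in reported.items():
    approx = implied_papers(rate, funding)
    print(f"{name}: roughly {approx:,.0f} first-degree papers implied")
```

The point of the exercise is that the per-dollar gap is driven mostly by the denominator: the two funding totals differ by a factor of about 150.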

B. Geographic Diversity (LMIC Representation)

MIMIC had the highest LMIC authorship (41.8%), followed by UK Biobank (22.9%), OpenSAFELY (17.4%), and All of Us (4.3%).
Leadership: LMIC researchers held significant leadership roles in MIMIC (43.8% of first authors, 41.5% of last/senior authors). In contrast, All of Us had only 4.1% first authors and 3.0% last authors from LMICs.
Odds Ratio: The odds of LMIC authorship in MIMIC were 17.7 times those in All of Us (OR = 17.7).
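
The statistics named in the methodology (Pearson's chi-square, odds ratio, Cramér's V) are straightforward to compute for a 2×2 table. The sketch below uses only the Python standard library. The counts are hypothetical: they were chosen to mirror the rounded LMIC shares above (41.8% and 4.3%) with 1,000 authors per repository, and they will not reproduce the reported odds ratio of 17.7, which depends on the actual author counts that this summary does not give.

```python
import math
from typing import Tuple


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for the table [[a, b], [c, d]]: (a/b) / (c/d)."""
    return (a * d) / (b * c)


def pearson_chi2_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic (no continuity correction) and p-value."""
    observed = [[a, b], [c, d]]
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    n = a + b + c + d
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def cramers_v_2x2(chi2: float, n: int) -> float:
    """Cramér's V; for a 2x2 table this reduces to sqrt(chi2 / n)."""
    return math.sqrt(chi2 / n)


# Hypothetical counts: rows = (MIMIC, All of Us), columns = (LMIC, HIC).
a, b, c, d = 418, 582, 43, 957
chi2, p_value = pearson_chi2_2x2(a, b, c, d)
print("OR:", round(odds_ratio(a, b, c, d), 2))
print("chi2:", round(chi2, 1), "p:", p_value)
print("Cramér's V:", round(cramers_v_2x2(chi2, a + b + c + d), 3))
```

An odds ratio compares odds, not percentages, which is why a roughly tenfold difference in LMIC share corresponds to a noticeably larger odds ratio.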

C. Gender Representation

Inverse Correlation: The repository with the highest LMIC representation (MIMIC) had the lowest female authorship (31.8%), while All of Us, which had the lowest LMIC representation, had the highest female authorship (43.2%).
Disciplinary Driver: The low female rate in MIMIC is partly explained by its primary citing field being Computer Science (43.3%) rather than Medicine, reflecting known gender gaps in CS.
Senior Authorship Gap: Across all repositories, women were consistently underrepresented in last-author (senior) positions compared to first-author positions. The gap ranged from 4.9 to 10.9 percentage points, indicating structural barriers to career advancement that persist regardless of data access.
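
The first-author versus last-author comparison can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: the function names and toy data are hypothetical, gender labels are taken as already inferred (`"F"`, `"M"`, or `None` when unknown), unknown labels are dropped from the denominator, and a sole author is counted as a first author.

```python
from typing import Dict, List, Optional


def author_position(index: int, n_authors: int) -> str:
    """Classify an author slot as first, middle, or last."""
    if index == 0:
        return "first"  # assumption: a sole author counts as first
    if index == n_authors - 1:
        return "last"
    return "middle"


def female_share_by_position(
    papers: List[List[Optional[str]]],
) -> Dict[str, float]:
    """Percentage of authors labeled "F" in each position."""
    counts = {pos: [0, 0] for pos in ("first", "middle", "last")}  # [female, known]
    for authors in papers:
        for i, gender in enumerate(authors):
            if gender is None:
                continue  # unknown labels are excluded from the denominator
            pos = author_position(i, len(authors))
            counts[pos][1] += 1
            if gender == "F":
                counts[pos][0] += 1
    return {
        pos: 100.0 * female / known
        for pos, (female, known) in counts.items()
        if known
    }


# Toy author lists (hypothetical), in byline order.
toy_papers = [
    ["F", "M", "M"],
    ["F", "F", "M"],
    ["M", "M", "F"],
    ["F", "M"],
]
shares = female_share_by_position(toy_papers)
gap_pp = shares["first"] - shares["last"]  # gap in percentage points
print(shares, gap_pp)
```

The gap is expressed in percentage points (first-author share minus last-author share), matching the units of the 4.9 to 10.9 range reported above.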

D. Intersectionality

In three of the four repositories, female representation was higher within the LMIC subgroup than the HIC subgroup, suggesting complex interactions between geography and gender in research participation.

5. Significance and Implications

Value Beyond Citations: While citation metrics show MIMIC is the most "efficient" per dollar, the study argues that programs like All of Us generate value (community engagement, diverse biobanking) that bibliometrics cannot capture.
Limitations of Access: Low-barrier access (like MIMIC) successfully attracts diverse researchers into leadership positions, but this does not guarantee locally relevant knowledge production. LMIC researchers often use US-centric data (e.g., Boston EHRs) to build models for Global North journals, which may not address local health needs.
Structural Inequities: The persistent gender gap in senior authorship suggests that data access policies alone cannot solve structural career inequities; active mentorship and community building are required.
Future Directions: Evaluations of open data investments must move beyond counting papers to assessing who is producing research, where they are located, and whether their work leads to transformative equity (e.g., local dataset creation and capacity building).

Conclusion: Open health data acts as a powerful catalyst for knowledge diffusion (~10× amplification), but its impact on equity is nuanced. While it can foster global leadership for LMIC researchers, it does not automatically resolve disciplinary gender gaps or ensure that research outcomes are relevant to the communities where the researchers reside.