Learning relationships in epidemiological data using graph neural networks


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a massive mystery: Who infected whom?

In the world of infectious diseases (like Bovine Tuberculosis in cows and badgers), you have two main clues:

The Life Story: Where the animals lived, when they were born, and who they hung out with.
The Genetic Fingerprint: A DNA scan of the bacteria inside them.

The problem is that the "Life Story" clues are often messy. Two cows might live on the same farm and look like they infected each other, but they could have actually caught the disease from two different sources. The "Genetic Fingerprint" is precise, but it's expensive to get for every single animal. Often, you have the life story for everyone, but the DNA only for a few.

This paper introduces a new, super-smart detective tool called a Graph Neural Network (GNN) to solve this puzzle.

The Old Way: The "Two-Person Interview"

Traditionally, scientists looked at two animals at a time (let's call them Cow A and Cow B). They asked: "Based on how close they lived and when they were born, are they related?"

The Flaw: This is like interviewing two suspects in separate rooms. You miss the big picture. If Cow A is clearly related to Cow B, and Cow B is clearly related to Cow C, logic dictates Cow A and Cow C are likely related too. But the old method treats every pair as a totally independent mystery, ignoring the connections between the other animals.

The New Way: The "Gossip Network" (Graph Neural Networks)

The authors suggest treating the whole outbreak like a giant social network or a family tree.

The Nodes (The People): Every infected cow or badger is a person in the network.
The Edges (The Relationships): The lines connecting them represent how close they are (physically, in time, or genetically).

The GNN is like a detective who doesn't just interview two people; they walk through the entire neighborhood.

Listening to the Gossip: When the GNN looks at Cow A, it doesn't just look at Cow A's life story. It asks, "Who is Cow A's best friend? What is that friend's life story? Who is that friend's friend?"
The "Embedding" (The Summary): The GNN creates a "summary card" for every animal. This card doesn't just say "Cow A lives here." It says, "Cow A lives here, but she is surrounded by a cluster of animals that all have very similar DNA, so she is likely part of that specific family tree."
Solving the Mystery: When a new animal (Cow Z) appears with no DNA test, the GNN looks at Cow Z's life story and compares it to the "summary cards" of everyone else. It can say, "Cow Z looks a lot like the 'Badger Group' over there, so even without a DNA test, I'm 90% sure she belongs to that transmission chain."

The Experiment: Training the Detective

The researchers tested this new detective on two types of cases:

1. The Synthetic Cases (The Perfect Crime Scene)
They created computer simulations with 2,000 animals where they knew exactly who infected whom.

Result: The GNN was the clear winner. It was much better at guessing the connections than the old "two-person interview" methods, because it used the "gossip" (the genetic data of the whole group) to make more accurate guesses about new animals.

2. The Real-World Cases (The Messy Crime Scene)
They tried it on real data from the UK:

Woodchester Park: A huge, open area with many animals and a lot of genetic variety.
- Result: The GNN did about as well as the simplest old method, but no better. Why? Because the area was so big and open that animals could have gotten sick from outside sources the model couldn't see. It was like trying to solve a mystery in a city where the suspects keep running away to other cities.
Cumbria: A small, contained outbreak.
- Result: Every method struggled here. The GNN kept a slight edge over the old methods, but because the group was so small, there wasn't enough "gossip" to go around. The detective needed a bigger crowd to learn from.

The Big Takeaway

The paper shows that context is king.

In epidemiology, you can't just look at two people in isolation. You have to look at the whole web of connections.

The Analogy: If you want to know if two strangers are related, asking them "Do you know each other?" is okay. But if you look at their entire family tree, their friends, their neighbors, and their shared history, you can figure it out with much higher certainty.

Why does this matter?
If we can predict who infected whom without testing everyone's DNA, we can stop outbreaks faster. We can target the specific "super-spreaders" or "hotspots" and stop the disease from spreading, saving money and lives. The Graph Neural Network is the tool that lets us see the invisible threads connecting the outbreak.

1. Problem Statement

In precision epidemiology, identifying transmission pathways (who-infected-whom) is critical for designing disease control strategies. While whole-genome sequencing (WGS) of pathogens provides powerful data to estimate the time to the most recent common ancestor (TMRCA) and genetic distance between hosts, traditional statistical approaches face significant limitations:

Data Incompleteness: Real-world datasets often contain unidentified hosts or missing metadata, making full transmission tree reconstruction infeasible.
Pairwise Independence Assumption: Conventional methods (e.g., logistic regression, random forests) treat host pairs as independent observations. They organize data as a flat list of pairs $(A, B), (A, C), (B, C)$, ignoring the intrinsic tree-like structure of infectious disease spread. This approach fails to utilize contextual information from other hosts in the dataset when predicting the relationship between a specific pair.
Inference Difficulty: Even with WGS, inferring exact transmission chains is difficult due to slow pathogen evolution (e.g., Mycobacterium bovis) and complex transmission dynamics (e.g., cross-species transmission between cattle and badgers).

The authors propose that Graph Neural Networks (GNNs) offer a natural architecture to model these datasets by treating infected hosts as nodes and their relationships as edges, thereby preserving the global relational structure of the data.

2. Methodology

Data Representation

The authors model the epidemiological dataset as a fully connected, undirected graph:

Nodes ($N$): Represent infected hosts. Each node $i$ has attributes $n_i$ (e.g., sampling time, spatial coordinates, species).
Edges ($E$): Represent relationships between host pairs $(i, j)$. Each edge has attributes $e_{ij}$ (e.g., physical distance, time difference, and crucially, the genetic distance between the pathogens of $i$ and $j$).
Task: The goal is an edge-level prediction task: predicting whether a new, unsequenced host ($H+1$) is closely genetically related to existing hosts based on the known genetic distances and metadata of the existing network.
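As a rough illustration of this representation, the sketch below builds a tiny fully connected, undirected graph in plain Python. The host records, the attribute names (t, x, y, species, snp) and all values are made up for illustration; none of them come from the paper's data. Edges that touch the unsequenced host carry spatial and temporal attributes but no genetic distance, and those are the edges the model would be asked to predict.

```python
from itertools import combinations
from math import hypot

# Hypothetical hosts: sampling time (years), coordinates (km), species.
hosts = {
    "A": {"t": 2015.2, "x": 0.0, "y": 0.0, "species": "cattle"},
    "B": {"t": 2015.9, "x": 3.0, "y": 4.0, "species": "badger"},
    "C": {"t": 2017.0, "x": 6.0, "y": 8.0, "species": "cattle"},
    "Z": {"t": 2017.5, "x": 3.0, "y": 0.0, "species": "cattle"},  # unsequenced
}

# Hypothetical pairwise genetic distances (SNPs), known for sequenced hosts only.
snp = {("A", "B"): 2, ("A", "C"): 11, ("B", "C"): 9}

def build_edges(hosts, snp):
    """One undirected edge per unordered host pair, with edge attributes e_ij."""
    edges = {}
    for i, j in combinations(sorted(hosts), 2):
        hi, hj = hosts[i], hosts[j]
        edges[(i, j)] = {
            "dist_km": hypot(hi["x"] - hj["x"], hi["y"] - hj["y"]),
            "dt_years": abs(hi["t"] - hj["t"]),
            "same_species": hi["species"] == hj["species"],
            "snp": snp.get((i, j)),  # None when either host is unsequenced
        }
    return edges

edges = build_edges(hosts, snp)

# Edges with unknown genetic distance are the ones to be predicted.
to_predict = [pair for pair, e in edges.items() if e["snp"] is None]
print(to_predict)  # [('A', 'Z'), ('B', 'Z'), ('C', 'Z')]
```

With four hosts the graph has six edges, three of which involve the unsequenced host Z.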

Graph Neural Network Architecture

The proposed GNN uses a message-passing mechanism (specifically the GeneralConv layer, torch_geometric.nn.conv.GeneralConv, from the PyTorch Geometric library) to generate node embeddings that incorporate global context:

Message Passing:
- For a target node $i$, the model aggregates information from all neighbors $j$.
- It transforms node attributes ($n_i, n_j$) and edge attributes ($e_{ij}$) using learned linear weights ($W, W', W''$) and biases.
- A "message" $m_{ij}$ is computed combining the transformed attributes of the source node, target node, and the edge connecting them.
Attention Mechanism:
- The model employs an attention mechanism to weight the importance of different neighbors.
- Attention scores $\alpha_{ij}$ are calculated using a ReLU activation and Softmax normalization, allowing the model to learn that certain neighbors (e.g., those in the same location/time) provide more relevant context than others.
Embedding Generation:
- The final embedding $\tilde{n}_i$ for host $i$ is a weighted sum of the messages from all neighbors. This embedding captures the local attributes of $i$ and the global context of the entire dataset.
Prediction Head:
- To predict the relationship between a pair $(i, j)$, the model concatenates the embeddings $(\tilde{n}_i, \tilde{n}_j)$ with the edge attributes $e_{ij}$ (excluding the target genetic distance if unknown).
- This vector is passed through a Multi-Layer Perceptron (MLP) to output a scalar probability $d_{pred}^{ij} \in [0, 1]$, representing the likelihood that the pair is closely related (genetically close).
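The four steps above (message, attention weight, weighted sum, prediction head) can be sketched in plain Python. This is a deliberately simplified illustration, not the authors' implementation and not a reproduction of PyTorch Geometric's GeneralConv layer. The weights are random rather than learned, biases are omitted, and every name, size and input value is an assumption made for the example. In particular, filling the unknown genetic distance of an unsequenced host with 0.0 is only a crude placeholder; how the paper encodes unknown distances is not stated here.

```python
import math
import random

random.seed(0)
D_NODE, D_EDGE, D_HID = 2, 3, 4  # hypothetical feature and embedding sizes

def linear(n_in, n_out):
    """Random matrix standing in for a learned linear map (no training here)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

W_tgt, W_src, W_edge = linear(D_NODE, D_HID), linear(D_NODE, D_HID), linear(D_EDGE, D_HID)
att = [random.uniform(-0.5, 0.5) for _ in range(D_HID)]

def embed(i, nodes, edges):
    """Return the embedding of node i and the attention weights over its neighbors."""
    messages, scores = [], []
    for j in nodes:
        if j == i:
            continue
        e_ij = edges[tuple(sorted((i, j)))]
        # Message m_ij: transformed target node + source node + edge attributes.
        parts = zip(matvec(W_tgt, nodes[i]), matvec(W_src, nodes[j]), matvec(W_edge, e_ij))
        m_ij = [a + b + c for a, b, c in parts]
        messages.append(m_ij)
        # Attention score: ReLU of a (hypothetical) linear scoring of the message.
        scores.append(max(0.0, sum(a * m for a, m in zip(att, m_ij))))
    alpha = softmax(scores)  # normalize over all neighbors of i
    emb = [sum(a * m[k] for a, m in zip(alpha, messages)) for k in range(D_HID)]
    return emb, alpha

W_hidden = linear(2 * D_HID + D_EDGE - 1, D_HID)
w_out = [random.uniform(-0.5, 0.5) for _ in range(D_HID)]

def predict(i, j, nodes, edges):
    """One-hidden-layer MLP on [embedding_i, embedding_j, edge attributes minus genetic distance]."""
    emb_i, _ = embed(i, nodes, edges)
    emb_j, _ = embed(j, nodes, edges)
    e_ij = edges[tuple(sorted((i, j)))][:-1]  # drop the last slot (genetic distance)
    hidden = [max(0.0, h) for h in matvec(W_hidden, emb_i + emb_j + e_ij)]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the output in (0, 1)

# Made-up, pre-scaled inputs. Node: [time, x]. Edge: [distance, time gap, genetic distance].
nodes = {"A": [0.1, 0.0], "B": [0.2, 0.3], "C": [0.9, 0.8], "Z": [0.8, 0.7]}
edges = {
    ("A", "B"): [0.3, 0.1, 0.1], ("A", "C"): [0.8, 0.8, 0.9], ("B", "C"): [0.5, 0.7, 0.8],
    # Z is unsequenced: genetic-distance slot set to 0.0 as a crude placeholder.
    ("A", "Z"): [0.7, 0.7, 0.0], ("B", "Z"): [0.4, 0.6, 0.0], ("C", "Z"): [0.1, 0.1, 0.0],
}
emb_z, alpha_z = embed("Z", nodes, edges)
p_cz = predict("C", "Z", nodes, edges)
```

Because the graph is fully connected, a single round of message passing already lets every host see every other host. That is one way to read the "global context" claim, although the number of layers the authors actually use is not given in this summary.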

Baseline Models

The GNN is compared against three standard pairwise models trained on the same data (without global context):

Logistic Regression (LR)
Random Forest (RF)
Boosted Regression Trees (BRT)

Datasets

The study utilizes five datasets:

Three Synthetic Datasets: Simulated bovine Tuberculosis (bTB) outbreaks in Great Britain with $H=2,000$ hosts (cattle and badgers) and perfect coverage.
Woodchester Park (Real): $H=241$ hosts (open system, high genetic diversity, endemic area).
Cumbria (Real): $H=63$ hosts (closed system, novel outbreak, low genetic diversity).
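To get a feel for how much relational data each dataset offers, it helps to count host pairs. Assuming every unordered pair of hosts forms one edge (as in the fully connected graph described under Data Representation), a dataset with $H$ hosts has $H(H-1)/2$ edges. The short calculation below applies that formula to the reported host counts. It gives an upper bound only: the number of edges actually used for training after any train/test split is not stated here.

```python
from math import comb

# Host counts reported for each dataset type.
datasets = {"synthetic": 2000, "Woodchester Park": 241, "Cumbria": 63}

# A fully connected, undirected graph on H hosts has H * (H - 1) / 2 edges.
pair_counts = {name: comb(h, 2) for name, h in datasets.items()}

for name, n_pairs in pair_counts.items():
    print(f"{name}: {n_pairs:,} host pairs")
# synthetic: 1,999,000 host pairs
# Woodchester Park: 28,920 host pairs
# Cumbria: 1,953 host pairs
```

The Cumbria outbreak therefore offers roughly a thousand times fewer pairs than a single synthetic dataset.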

3. Key Contributions

Novel Application of GNNs: The authors present this as the first application of Graph Neural Networks to infer transmission relationships in epidemiological data, moving beyond the "pairwise independence" assumption of traditional statistical models.
Contextual Learning: The study demonstrates that GNNs can leverage the global structure of the dataset. By treating the data as a graph, the model uses the known genetic distances between other pairs to inform predictions on new pairs, effectively acting as a "contextual prior."
Handling Missing Data: The framework is designed to handle scenarios where a new host has metadata but no WGS data, predicting their likely position in the transmission tree based on the network's existing structure.
Comprehensive Benchmarking: The authors provide a rigorous comparison of GNNs against established machine learning baselines across varying dataset sizes and complexities.

4. Results

Synthetic Datasets ($H=2,000$)

Performance: GNNs significantly outperformed all pairwise models (LR, RF, BRT).
- Balanced Accuracy (BA): GNNs achieved ~0.74–0.81, compared to ~0.60–0.68 for baselines.
- ROC-AUC: GNNs achieved ~0.85–0.87, compared to ~0.67–0.75 for baselines.
Variable Importance: Permutation importance analysis revealed that the Genetic Distance attribute (known distances between existing hosts) was the most critical variable for the GNN. This confirms the model successfully leverages the global network context to improve predictions on unseen hosts.
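For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed by hand. Balanced accuracy is the mean of the true-positive rate and the true-negative rate, and ROC-AUC is the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The labels, scores and the 0.5 decision threshold below are made up for illustration and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the true-positive rate and the true-negative rate."""
    tpr = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)
    tnr = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred)) / y_true.count(0)
    return 0.5 * (tpr + tnr)

def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = closely related pair) and predicted probabilities.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # hypothetical threshold of 0.5

ba = balanced_accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(round(ba, 3), round(auc, 3))  # 0.667 0.917
```

In this toy example plain accuracy would be 0.75, but balanced accuracy is lower (0.667) because one of the two positive pairs is missed. That property makes balanced accuracy a sensible choice whenever one class is much rarer than the other.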

Real-World Datasets

Woodchester Park ($H=241$):
- Performance was mixed. The GNN (BA: 0.789) performed similarly to Logistic Regression (BA: 0.798).
- Reasoning: The dataset had high genetic diversity (median 26 SNPs) and likely represented an "open system" with external infections. The limited sample size and high diversity meant that global context provided less additional value, and the Genetic Distance attribute was not statistically significant for the GNN.
Cumbria ($H=63$):
- Performance was poor across all models (BA: 0.61–0.71), though GNNs still held a slight edge.
- Reasoning: The extremely small sample size (only 63 hosts) resulted in a very small number of edges for training, making the model sensitive to train/test splits and hyperparameters. However, the Genetic Distance attribute did show significant explanatory power, suggesting the GNN could utilize context even in small, closed outbreaks.
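The variable-importance statements in this section rest on permutation importance. The generic idea is to shuffle one input variable across observations and measure how much the model's score drops. The sketch below shows that idea on a toy example. The scoring rule toy_model, the feature names context_snp and dist_km, and the data are all hypothetical, and the paper's exact permutation protocol and significance test are not described here.

```python
import random

def balanced_accuracy(y_true, y_pred):
    """Average of the true-positive rate and the true-negative rate."""
    tpr = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / y_true.count(1)
    tnr = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred)) / y_true.count(0)
    return 0.5 * (tpr + tnr)

def toy_model(row):
    """Hypothetical stand-in for a trained model; it only looks at genetic context."""
    return int(row["context_snp"] < 5)

def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in balanced accuracy after shuffling one feature across rows."""
    rng = random.Random(seed)
    baseline = balanced_accuracy(labels, [model(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        values = [r[feature] for r in rows]
        rng.shuffle(values)
        permuted = [{**r, feature: v} for r, v in zip(rows, values)]
        score = balanced_accuracy(labels, [model(r) for r in permuted])
        drops.append(baseline - score)
    return sum(drops) / n_repeats

# Made-up host pairs: genetic distance of surrounding context (SNPs), spatial distance (km).
rows = [
    {"context_snp": 1, "dist_km": 2.0}, {"context_snp": 2, "dist_km": 9.0},
    {"context_snp": 3, "dist_km": 4.0}, {"context_snp": 4, "dist_km": 7.0},
    {"context_snp": 9, "dist_km": 3.0}, {"context_snp": 12, "dist_km": 8.0},
    {"context_snp": 15, "dist_km": 1.0}, {"context_snp": 20, "dist_km": 6.0},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = closely related

imp_snp = permutation_importance(toy_model, rows, labels, "context_snp")
imp_dist = permutation_importance(toy_model, rows, labels, "dist_km")
```

Here shuffling dist_km changes nothing because the toy model ignores it, so its importance is exactly zero, while shuffling context_snp degrades the score. A variable reported as not significant, as Genetic Distance was for Woodchester Park, corresponds to the first situation: permuting it leaves performance essentially unchanged.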

5. Significance and Conclusion

Superiority in Large Datasets: The study concludes that GNNs are a superior modeling architecture for precision epidemiology when sufficient data is available. They effectively capture the non-independent nature of infectious disease spread, leading to more accurate identification of transmission clusters.
Small-Dataset Limitations: The performance advantage of GNNs diminishes as dataset size decreases. In small datasets, the "global context" is insufficient to overcome the noise, and simpler models may perform comparably.
Future Applications: The authors highlight that GNNs are flexible enough to handle incomplete metadata (e.g., including unsequenced reactors in a farm outbreak as nodes) and can be adapted for various tasks, such as:
- Identifying index cases (node-level task).
- Classifying overall outbreak properties (graph-level task).
- Predicting directed transmission (infector $\to$ infectee).

Final Takeaway: By treating epidemiological data as an intrinsically interconnected graph rather than a collection of independent pairs, GNNs can use the full dataset's structural information to infer transmission pathways. This makes them a powerful tool for outbreak investigation and control strategy design, particularly in large-scale surveillance scenarios.
