Linking Codon- and Protein-Level Mutation Scores to Population Genetics Reveals Heterogeneous Selection Efficiency Across Escherichia coli Lineages

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the genome of a bacterium like E. coli as a massive, ancient library containing millions of books (genes). Over millions of years, these books have been copied, edited, and sometimes scribbled on by random typos (mutations). Some typos are harmless, some make the story worse, and a few accidentally make it better.

This paper is like a massive detective story where the authors used a super-powerful microscope to look at 81,440 different copies of the E. coli library. Their goal was to answer two big questions:

How good is nature at "editing out" the bad typos?
Does this editing skill change depending on where the bacteria live?

Here is the story of their discovery, broken down with simple analogies.

1. The Two Types of Typos

In the bacterial library, there are two main kinds of typos:

Synonymous Typos (The "Silent" Scribbles): These are changes in the text that don't actually change the meaning of the sentence. It's like changing "colour" to "color." The story is the same, but the spelling is different. Scientists usually thought these were completely neutral (no effect).
Non-Synonymous Typos (The "Meaning-Changing" Scribbles): These change the actual words, altering the protein the gene makes. It's like changing "The cat sat" to "The bat sat." This can break the story or, occasionally, make it better.

2. The "DCA" Score: A Protein's "Fit Check"

To figure out if a typo is good or bad, the authors used a tool called Direct Coupling Analysis (DCA).

The Analogy: Imagine you are building a complex Lego castle. You have a specific spot for a red brick. If you put a blue brick there, the castle might wobble. If you put a red brick, it fits perfectly.
How it works: The DCA tool looks at millions of different versions of the same protein (like different Lego castles) to learn the "rules" of the structure. It gives every possible typo a score.
- Negative Score: "Great job! This fits perfectly." (Beneficial)
- Zero Score: "It's fine, doesn't matter much." (Neutral)
- Positive Score: "Bad idea! This will break the castle." (Deleterious)

3. The Big Discovery: Selection is a Filter

The authors looked at how these typos are distributed in the population.

The "Low Frequency" Zone: Most bad typos appear in just one or two bacteria and then disappear. They are like weeds that get pulled out immediately.
The "High Frequency" Zone: Only the good typos (or the harmless ones) survive long enough to become common in the population.

The Surprise: They found that while "silent" typos (synonymous) have a very small range of effects (they are mostly neutral), the "meaning-changing" typos (non-synonymous) have a massive range of effects. Some are so bad they are lethal; others are slightly helpful. The difference in "badness" spans six orders of magnitude (a million times difference), whereas the silent ones only span a tiny range.

4. The "Population Size" Problem: The Small Town vs. The Big City

This is the most fascinating part. The authors compared different groups of E. coli:

The Commensals (The Big City): These are the "normal" E. coli living in the guts of healthy animals. They have huge populations (millions of individuals).
The Pathogens (The Small Town): These are the dangerous ones, like Shigella (which causes dysentery). They live in a very specific, harsh environment and have tiny populations.

The Metaphor:
Imagine a town with a strict HOA (Homeowners Association) that immediately removes any house with a broken window.

In the Big City (Large Population): There are so many people that the HOA is very efficient. If a house has a broken window (a bad mutation), it gets fixed or the owner is kicked out immediately. The city stays pristine.
In the Small Town (Small Population): There are only a few houses. The HOA is overwhelmed or non-existent. A broken window might stay there for years because no one noticed, or the owner just moved away. The town accumulates broken windows (bad mutations).

The Result:
The "Small Town" bacteria (Shigella and other pathogens) are up to 10,000 times less efficient at cleaning up bad mutations than the "Big City" bacteria. Because their effective populations are so small, genetic drift (random chance) takes over. Bad mutations that would be quickly removed in a large population are allowed to survive and spread in these small, isolated groups.

5. Why This Matters

For Medicine: It explains why dangerous bacteria like Shigella can accumulate so many genetic errors. They aren't necessarily "trying" to get sick; they just live in such small, isolated groups that nature can't "edit" their mistakes effectively.
For Science: It shows that we can use the "real-world" data of more than 80,000 bacterial genomes to test our computer models. The computer models (DCA scores) predicted which mutations should be bad, and the population data largely agreed. It's a close match between theory and reality.

Summary

Think of evolution as a giant editing process.

Big populations are like a team of 10,000 editors who catch every single typo.
Small populations are like a single editor who is tired and misses a lot of mistakes.
The authors showed that E. coli bacteria living in "small town" lifestyles (pathogens) have accumulated a mountain of genetic mistakes because their "editors" (natural selection) are too overwhelmed to keep up, while the "big city" bacteria stay relatively error-free.

This study bridges the gap between protein chemistry (how a single mutation breaks a Lego castle) and population genetics (how many castles are in the room), giving us a clearer picture of how life evolves and adapts.

1. Problem Statement

Understanding the selective effects of individual mutations is fundamental to evolutionary biology, yet quantifying the Distribution of Fitness Effects (DFE) across a species remains challenging. Traditional population genetics often relies on binary classifications (synonymous vs. non-synonymous) or aggregate metrics like $K_a/K_s$, which treat all mutations within a class as equivalent. This approach fails to capture the heterogeneity of mutational effects based on specific amino acid changes, protein context, and epistatic interactions.

Furthermore, there is a need to bridge the gap between:

Protein-level predictors: Computational tools (like Direct Coupling Analysis) that predict mutational effects based on structural and evolutionary constraints.
Population-level data: Large-scale genomic datasets that reflect the interplay between natural selection and genetic drift.

The authors aim to leverage a massive dataset of Escherichia coli genomes to quantify how selection efficiency varies across different ecological niches (commensal vs. pathogenic) and to validate protein-level mutation scores against real-world polymorphism data.

2. Methodology

Data Collection and Processing:

Dataset: The study utilized 81,440 high-quality E. coli and Shigella genomes downloaded from Enterobase (February 2019).
Core Genome: They identified a core genome of 2,358 genes (present in >95% of genomes, <1% duplication, <800 amino acids) to ensure strong purifying selection and comparability.
Polymorphism Identification: They identified 458,443 oriented polymorphic codon sites by comparing alleles against outgroup species (E. albertii, E. fergusonii, Salmonella).
Clustering: Genomes were clustered into 579 distinct groups based on genetic distance (0.5% threshold), allowing for the analysis of subpopulations with varying effective population sizes ($N_e$).
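
The core-genome criteria listed above can be pictured as a simple filter. The following is a minimal sketch, not the authors' pipeline: the `GeneStats` record, its field names, and the reading of "<1% duplication" as the fraction of genomes carrying more than one copy are all assumptions made for illustration, and the four example genes are invented.

```python
from dataclasses import dataclass


@dataclass
class GeneStats:
    """Hypothetical per-gene summary; field names are illustrative only."""
    name: str
    presence_fraction: float     # fraction of genomes carrying the gene
    duplication_fraction: float  # assumed: fraction of genomes with >1 copy
    length_aa: int               # protein length in amino acids


def is_core_gene(gene, min_presence=0.95, max_duplication=0.01, max_length_aa=800):
    """Apply the three thresholds quoted in the text (>95%, <1%, <800 aa)."""
    return (
        gene.presence_fraction > min_presence
        and gene.duplication_fraction < max_duplication
        and gene.length_aa < max_length_aa
    )


# Invented toy data: only geneA passes all three thresholds.
genes = [
    GeneStats("geneA", 0.99, 0.001, 350),
    GeneStats("geneB", 0.80, 0.000, 200),   # present in too few genomes
    GeneStats("geneC", 0.97, 0.050, 400),   # duplicated too often
    GeneStats("geneD", 0.99, 0.002, 1200),  # too long
]
core = [g.name for g in genes if is_core_gene(g)]
print(core)
```

Keeping the thresholds as keyword arguments makes it easy to see how the size of the core genome would change under looser or stricter cut-offs.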

Scoring Systems:

Codon Scores (Synonymous): Calculated as the log-ratio of ancestral to mutated codon usage frequencies. Negative scores indicate a shift toward preferred codons (potentially beneficial), while positive scores indicate a shift toward rare codons (potentially deleterious).
DCA Scores (Non-synonymous): Computed with Direct Coupling Analysis (DCA), a statistical-physics model trained on Multiple Sequence Alignments (MSAs) of homologous proteins.
- DCA scores represent a "statistical energy" difference.
- Negative scores: Predicted beneficial (higher probability in the evolutionary model).
- Positive scores: Predicted deleterious.
- Scores account for site-specific conservation and epistatic couplings (interactions between residues).
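
Both scores can be written in a few lines. This is a minimal sketch under stated assumptions: the codon score follows the log-ratio described above, and the DCA score uses the generic Potts-model form of a statistical energy, $E = -\sum_i h_i(a_i) - \sum_{i<j} J_{ij}(a_i, a_j)$, with the score taken as mutant energy minus wild-type energy. The function names, the data layout for `h` and `J`, and the toy parameters are hypothetical; the paper's actual inference procedure and parameters are not reproduced here.

```python
import math


def codon_score(freq_ancestral, freq_mutated):
    """Log-ratio of ancestral to mutated codon usage frequency.

    Positive: shift toward a rarer codon. Negative: shift toward a preferred one.
    """
    return math.log(freq_ancestral / freq_mutated)


def potts_energy(seq, h, J):
    """Statistical energy of a sequence under a Potts model.

    h[i][a] is the field for state a at site i;
    J[(i, j)][a][b] is the coupling between state a at i and state b at j (i < j).
    """
    n = len(seq)
    energy = -sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[(i, j)][seq[i]][seq[j]]
    return energy


def dca_score(seq, pos, new_state, h, J):
    """Energy difference mutant - wild type; positive means predicted deleterious."""
    mutant = list(seq)
    mutant[pos] = new_state
    return potts_energy(mutant, h, J) - potts_energy(seq, h, J)


# Toy model: 3 sites, 2 states per site (invented numbers, not fitted to data).
h = [[0.5, -0.5], [0.2, -0.2], [0.0, 0.0]]
J = {
    (0, 1): [[0.3, -0.3], [-0.3, 0.3]],
    (0, 2): [[0.0, 0.0], [0.0, 0.0]],
    (1, 2): [[0.1, -0.1], [-0.1, 0.1]],
}
wild_type = [0, 0, 0]
print(codon_score(0.4, 0.1))            # positive: move to a rarer codon
print(dca_score(wild_type, 0, 1, h, J))  # positive: disfavoured by fields and couplings
```

Because the couplings `J` enter the energy, the same amino acid change can receive different scores in different sequence backgrounds, which is how the score reflects epistasis.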

Population Genetic Modeling:

Site Frequency Spectrum (SFS): The authors analyzed the distribution of allele frequencies for different mutation classes.
Inference Methods:
- Poisson Random Field (PRF): Used for smaller subsamples (e.g., specific clusters) to infer selection intensity ($N_e s$).
- Curve-Fitting: Used for the full dataset (81k genomes) to fit observed SFS against theoretical Wright-Fisher diffusion models.
Selection Intensity: Defined as $\gamma = N_e s$. The study inferred the DFE by fitting the observed SFS of mutations with specific scores against a neutral reference (synonymous mutations with near-zero codon scores).
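
To make the link between $\gamma$ and the SFS concrete, here is a minimal sketch of the expected SFS under the textbook Poisson Random Field diffusion result, integrated numerically with a midpoint rule. It is not the authors' fitting code. The function name, the `theta` and `steps` parameters, and in particular the factor of 2 in the exponent (which depends on ploidy and model convention) are assumptions. The neutral case reduces to the familiar $\theta / i$.

```python
import math


def expected_sfs(n, gamma, theta=1.0, steps=5000):
    """Expected number of sites with the derived allele in i of n samples, i = 1..n-1.

    Uses the stationary density theta * w(x) / (x * (1 - x)) with
    w(x) = (1 - exp(-2*gamma*(1-x))) / (1 - exp(-2*gamma)),
    which tends to theta / x as gamma -> 0.
    Only moderate |gamma| is supported; very large values overflow expm1.
    """
    sfs = []
    for i in range(1, n):
        binom = math.comb(n, i)
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) / steps  # midpoint, so x never hits 0 or 1
            if abs(gamma) < 1e-8:
                density = 1.0 / x
            else:
                weight = math.expm1(-2.0 * gamma * (1.0 - x)) / math.expm1(-2.0 * gamma)
                density = weight / (x * (1.0 - x))
            total += binom * x**i * (1.0 - x) ** (n - i) * density
        sfs.append(theta * total / steps)
    return sfs


neutral = expected_sfs(10, 0.0)
deleterious = expected_sfs(10, -5.0)
# Ratio to neutral falls with frequency: purifying selection depletes common variants.
ratios = [d / m for d, m in zip(deleterious, neutral)]
print([round(r, 4) for r in ratios])
```

A curve-fitting approach in this spirit would vary `gamma` until the ratio of the observed SFS to the neutral reference matches the predicted ratio; the optimizer and likelihood used in the paper are not specified here.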

3. Key Contributions

Integration of DCA with Population Genetics: This is one of the first studies to use DCA scores as a continuous latent variable to predict the probability of a mutation being beneficial, neutral, or deleterious, and to validate these predictions against a massive polymorphism dataset.
Quantification of Synonymous Selection: The study demonstrates that synonymous mutations are not strictly neutral; their selection intensities span a single order of magnitude, whereas non-synonymous mutations span six orders of magnitude.
Heterogeneity of Selection Efficiency: The authors provide a quantitative framework to measure how selection efficiency varies across E. coli lineages, linking it directly to effective population size ( $N_e$ ) and ecological lifestyle.
Benchmarking Predictors: The study establishes population genetics data as a rigorous benchmark for assessing the accuracy of protein variant fitness predictors (DCA).

4. Key Results

A. Mutation Frequency and Selection Signatures

Synonymous vs. Non-synonymous: Random mutations are ~30% synonymous, and observed mutations at low frequencies are also ~30% synonymous. As frequency increases, however, the proportion of synonymous mutations rises dramatically (89.3% at fixation), indicating strong purifying selection against non-synonymous variants.
DCA Score Shifts:
- Low-frequency non-synonymous mutations show a DCA distribution biased toward positive (deleterious) values but less so than random mutations.
- High-frequency (near fixation) mutations show a DCA distribution centered around 0, indicating that only neutral or beneficial mutations survive to high frequencies.
- The "enrichment slope" (the shift in DCA distribution with frequency) correlates strongly with genetic diversity ($\theta_W$).
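
The diversity measure $\theta_W$ mentioned above is Watterson's estimator, $\theta_W = S / a_n$ with $a_n = \sum_{i=1}^{n-1} 1/i$, where $S$ is the number of segregating sites in $n$ sampled genomes. The sketch below implements that standard formula; the function name and the optional per-site normalisation are illustrative choices, and how the paper counted sites per cluster is not reproduced here.

```python
def watterson_theta(num_segregating_sites, num_samples, num_sites=None):
    """Watterson's estimator of genetic diversity.

    theta_W = S / a_n, where a_n is the (n-1)-th harmonic number.
    If num_sites is given, the estimate is returned per site.
    """
    if num_samples < 2:
        raise ValueError("need at least two sampled genomes")
    a_n = sum(1.0 / i for i in range(1, num_samples))
    theta = num_segregating_sites / a_n
    return theta / num_sites if num_sites else theta


# With n = 4 genomes, a_n = 1 + 1/2 + 1/3 = 11/6, so S = 11 gives theta_W = 6.
print(watterson_theta(11, 4))
```

Dividing by $a_n$ corrects for the fact that larger samples reveal more segregating sites, which is what makes clusters of very different sizes comparable.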

B. Selection Efficiency Across Lineages

Pathotypes: The study compared four lifestyles: Commensal/ExPEC (extra-intestinal), Shigella, EIEC (entero-invasive), and STEC/EHEC (Shiga-toxin producing).
Shigella Reduction: Shigella lineages exhibit the lowest selection efficiency, with an effective population size estimated at 1/500 to 1/10,000 of that of the E. coli species as a whole. This corresponds to a reduction in selection efficiency of up to 10,000-fold compared to commensal strains.
Correlation: There is a strong correlation between the DCA enrichment slope and genetic diversity ($r \approx 0.68$), consistent with smaller populations accumulating more deleterious mutations because genetic drift overpowers selection.
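
Why a smaller population tolerates more deleterious mutations can be illustrated with the textbook diffusion approximation for the fixation probability of a new mutation in a haploid population of size $N$: $u = (1 - e^{-2s}) / (1 - e^{-2Ns})$. This formula is a standard population-genetics result, not something taken from the paper, and the population sizes and selection coefficient below are invented for illustration.

```python
import math


def fixation_probability(pop_size, s):
    """Diffusion approximation for a single new mutant in a haploid population.

    u = (1 - exp(-2s)) / (1 - exp(-2*N*s)); tends to 1/N as s -> 0.
    """
    if abs(s) < 1e-12:
        return 1.0 / pop_size
    exponent = -2.0 * pop_size * s
    if exponent > 700.0:
        # Strongly deleterious relative to drift: fixation is effectively impossible,
        # and exp() would overflow a float here.
        return 0.0
    return math.expm1(-2.0 * s) / math.expm1(exponent)


s = -0.001  # the same mildly deleterious mutation in both populations
for pop_size in (100, 10**6):
    relative = pop_size * fixation_probability(pop_size, s)  # relative to neutral 1/N
    print(pop_size, relative)
```

In the small population the mutation fixes almost as often as a neutral one ($N s = -0.1$), while in the large population ($N s = -1000$) it essentially never does. The same $s$ is "visible" to selection only when $|N s|$ is well above 1.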

C. Distribution of Fitness Effects (DFE)

Scale of Effects: The DFE for non-synonymous mutations spans 6 orders of magnitude ($N_e s$ from $-10^5$ to $+1$).
Lethality: Mutations with high positive DCA scores are often lethal or strongly deleterious ($N_e s < -10$), as they are rarely observed even at low frequencies.
Synonymous Effects: Synonymous mutations generally have weak effects ($|N_e s| < 10$, often $<1$), but selection is detectable in large, diverse populations and becomes indistinguishable from neutrality in small populations (like Shigella).

5. Significance and Implications

Evolutionary Insight: The study confirms that the "effective" population size varies drastically within a single bacterial species depending on the ecological niche. Pathogens with reduced genomes and intracellular lifestyles (like Shigella) suffer from reduced selection efficiency, leading to the accumulation of deleterious mutations (Muller's ratchet).
Methodological Advancement: By treating mutation scores (Codon and DCA) as continuous variables rather than binary categories, the authors can resolve the DFE with much higher resolution. This validates DCA as a powerful tool for predicting mutational effects in natural populations.
Medical Relevance: Understanding the selection efficiency of pathogenic lineages helps explain their evolutionary trajectories. The reduced selection efficiency in Shigella and other InPEC (Intestinal Pathogenic E. coli) suggests they are more prone to genomic degradation and may have different adaptive potentials compared to commensal strains.
Bridge Between Fields: The work successfully bridges statistical physics/protein bioinformatics (DCA) and population genetics, showing that protein-level constraints and population-level drift jointly shape the genomic landscape of bacteria.

In conclusion, this paper provides a comprehensive quantitative map of selection in E. coli, demonstrating that selection efficiency is not uniform but is heavily modulated by population size and ecological lifestyle, and that protein-level predictors can be rigorously validated using large-scale population genomic data.
