A Context-Aware Single-Cell Proteomics Analysis pipeline.

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to understand a massive, chaotic city by looking at individual houses one by one. In the world of biology, these "houses" are cells, and the things inside them that tell us what they are doing are proteins.

For a long time, scientists could easily read the "blueprints" of these houses (the DNA/RNA), but reading the actual "furniture and appliances" (the proteins) inside each house was incredibly difficult. Now, new technology allows us to take a snapshot of hundreds to thousands of proteins in a single cell. But there's a problem: we don't have a good map or a reliable guide to interpret these snapshots.

This paper introduces CASPA (Context-Aware Single-Cell Proteomics Analysis), a new "smart guide" designed to make sense of these protein snapshots automatically.

Here is how it works, explained through simple analogies:

1. The Problem: The "Noisy Room" and the "Confused Librarian"

Imagine you walk into a room where people are talking, but the room is also filled with dust, echoes, and random objects floating in the air (this is like ambient protein contamination in science).

Old Methods: Previous tools tried to organize this room using rules made for a different kind of room (like a library of books/RNA). They assumed that if you couldn't see a piece of furniture, it wasn't there. But in protein science, "not seeing it" can mean the furniture really is absent, or only that it was hidden from view when the snapshot was taken.
The Confusion: If a cell is a "garbage collector" (a macrophage) that has eaten a piece of a "brick wall" (another cell), old tools would think the garbage collector is a brick wall. They get confused by the "evidence" inside the cell.

2. The Solution: The "Smart Detective" (CASPA)

The authors built a pipeline (a step-by-step automated process) that acts like a super-smart detective who knows the rules of the specific city they are investigating.

Step A: Cleaning the Evidence (Adaptive Quality Control)

Instead of using a rigid rule like "throw away any house with fewer than 500 items," CASPA looks at the whole neighborhood. If the neighborhood is generally messy, it adjusts its standards. If a batch of data looks weird (like a whole street of houses made of the wrong material), it flags it without throwing away the whole street. It's like a detective who knows, "Okay, this specific street is under construction, so the mess is normal here, but that other street looks suspicious."

Step B: Fixing the "Echoes" (Iterative Batch Correction)

Imagine taking photos of the same city at different times of day. The lighting changes, making the buildings look different.

Old way: Try to fix the lighting once and hope it's perfect.
CASPA way: It keeps adjusting the lighting (correcting for technical errors) and checks a "mixing meter." It keeps tweaking until the buildings from different days look like they belong in the same neighborhood. It stops exactly when the mix is just right, not too little and not too much.

Step C: The "Three-Round Interview" (Context-Aware AI)

This is the most creative part. The pipeline uses a Large Language Model (LLM), basically a very smart AI that has read biology textbooks, to label the cells. But instead of just asking the AI, "What is this?", it uses a three-round interview strategy:

Round 0 (The Briefing): Before showing the AI any data, they tell it the story: "We are looking at a baby's brain," or "We are looking at a pancreas that was just injured by a toxin." The AI thinks, "Okay, in a baby brain, you won't see fully grown adult neurons yet," or "In an injured pancreas, you'll see cells eating debris." It sets the rules of the game based on the context.
Round 1 (The Investigation): Now they show the AI the protein data. Because the AI already knows the context, it doesn't get tricked. If it sees "brick wall" proteins inside a "garbage collector" cell, it thinks, "Ah, this garbage collector is eating bricks," instead of "This is a brick wall."
Round 2 (The Double-Check): If the AI is unsure, it asks itself, "What specific clues would prove this?" and looks for those clues before making a final call.

3. The Results: Solving the "Phagocytosis" Puzzle

The paper tested this on four different "cities": a developing brain, a brain tumor, a skin tumor, and an injured pancreas.

The Brain Test: The AI correctly realized that in a 13-week-old fetus, you shouldn't label a cell a "mature astrocyte" (an adult brain cell). It used the "baby" context to give the correct label: "astroglial progenitor."
The "Eating" Test: The brain tumor dataset profiled neutrophils (a type of immune cell), and some of them had eaten other cells. Old tools called them "contaminants" or "debris." CASPA, knowing these are immune cells that eat things, correctly identified them as "phagocytic neutrophils" (cells that have eaten) or "lytic cells" (cells that burst open).
The Skin Test: The authors tested the pipeline on a held-out dataset it had not seen before. The AI matched the "ground truth" (labels the scientists already had from physically sorting the cells by type) about 91% of the time. It even spotted a subtle group of immune cells that were eating skin cells, which another AI model missed.

4. Why This Matters

Think of this pipeline as giving every biology lab a universal translator and a quality control inspector rolled into one.

It's Automated: You don't need a PhD in computer science to run it.
It's Honest: If the AI isn't sure, it says so. It reports how confident it is and what evidence it is missing, and it records any contradictions it finds. It doesn't guess blindly.
It Understands Context: It knows that a "brick" in a construction site is different from a "brick" in a museum.

In summary: This paper presents a new, automated tool that uses "context-aware" AI to correctly identify what cells are doing, even when the data is messy, incomplete, or tricky. It turns a chaotic pile of protein data into a clear, reliable map of the cell's identity, making single-cell proteomics accessible to everyone, not just the experts.

1. Problem Statement

Single-cell proteomics (SCP) by mass spectrometry has advanced to the point where hundreds to thousands of proteins can be quantified per cell. However, the field lacks standardized, automated analytical pipelines capable of handling the unique characteristics of proteomic data.

Limitations of Current Workflows: Existing tools are largely adapted from single-cell RNA-seq (scRNA-seq) and fail to account for:
- Informative Missingness: In SCP, a non-detected protein may indicate biological absence, technical dropout, or ambient contamination, requiring distinct analytical handling.
- Ambient Contamination: Pervasive carryover of proteins (e.g., from lysed cells) complicates cell identity assignment.
- Limited Feature Space: SCP has fewer features (proteins) than scRNA-seq (transcripts), making reference-based classifiers ineffective.
- Batch Effects: Multi-batch experiments require robust correction strategies that are often treated as one-step procedures in current tools.

Annotation Bottleneck: Cell type annotation remains a manual, subjective, and non-scalable process. While Large Language Models (LLMs) offer promise, they often fail due to hallucinations, context-insensitivity (e.g., misidentifying developmental stages), and an inability to distinguish between biological uptake (phagocytosis) and technical contamination.

2. Methodology: The CASPA Pipeline

The authors present CASPA (Context-Aware Single-Cell Proteomics Analysis), an end-to-end, fully automated Snakemake workflow designed to convert raw protein-group matrices into annotated datasets. The pipeline consists of four core modules:

A. Adaptive Quality Control (QC)

Instead of fixed thresholds, CASPA uses a dataset-specific adaptive cutoff derived from the lower tail of the protein-count distribution: the median protein count of the N lowest-count cells, multiplied by 1.7, with a hard floor of 400 detected proteins.
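The cutoff rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and `bottom_n=50` is a placeholder because the summary does not state the value of N.

```python
from statistics import median


def adaptive_protein_cutoff(proteins_per_cell, bottom_n=50, multiplier=1.7, hard_floor=400):
    """Minimum detected-protein count a cell needs to pass QC.

    bottom_n is an assumed default; the summary does not give N.
    """
    if not proteins_per_cell:
        raise ValueError("need at least one cell")
    # Lower tail: the bottom_n cells with the fewest detected proteins.
    lowest = sorted(proteins_per_cell)[:bottom_n]
    tail_based = median(lowest) * multiplier
    # The hard floor wins whenever the tail-based value falls below it.
    return max(hard_floor, tail_based)


def passing_cells(proteins_per_cell, **kwargs):
    """Return the cutoff and the indices of cells at or above it."""
    cutoff = adaptive_protein_cutoff(proteins_per_cell, **kwargs)
    keep = [i for i, n in enumerate(proteins_per_cell) if n >= cutoff]
    return cutoff, keep
```

With counts `[300, 400, 500, 1200]` and `bottom_n=3`, the tail median is 400, so the cutoff is about 680 and only the last cell is kept. With a sparser tail, the floor of 400 takes over.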
Cluster-per-Batch Diagnostic: A novel metric that flags batches where cluster distributions diverge from the global pattern, identifying technically compromised batches that might pass standard cell-level QC.
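One way to picture the cluster-per-batch diagnostic is to compare each batch's cluster composition with the composition of the whole dataset. The summary does not say which divergence measure CASPA uses, so the sketch below assumes Jensen-Shannon divergence and an arbitrary threshold of 0.3; both choices, and all function names, are hypothetical.

```python
from collections import Counter
from math import log2


def cluster_fractions(labels, clusters):
    """Fraction of cells falling in each cluster, in a fixed cluster order."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in clusters]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing; y > 0 wherever x > 0.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def flag_divergent_batches(cluster_labels, batch_labels, threshold=0.3):
    """Return {batch: divergence} for batches whose cluster mix departs from the global mix."""
    clusters = sorted(set(cluster_labels))
    global_frac = cluster_fractions(cluster_labels, clusters)
    flagged = {}
    for batch in sorted(set(batch_labels)):
        in_batch = [c for c, b in zip(cluster_labels, batch_labels) if b == batch]
        score = js_divergence(cluster_fractions(in_batch, clusters), global_frac)
        if score > threshold:  # assumed threshold, not from the paper
            flagged[batch] = score
    return flagged
```

A batch whose cells all land in a single cluster that no other batch populates scores high and is flagged, even if every cell in it passes the per-cell protein-count cutoff.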

B. Iterative Batch Correction

Dual-Modality Embedding: Constructs cell embeddings using a joint Principal Component Analysis (PCA) of:
1. Log-transformed protein intensities.
2. Binary detection patterns (presence/absence), leveraging the fact that detectability is as informative as abundance in SCP.
Adaptive Harmony Loop: Uses the Harmony algorithm with an iterative feedback mechanism. After each iteration, batch mixing is quantified using weighted Shannon entropy across Leiden clusters. The diversity penalty ($\theta$) is automatically increased until a target mixing threshold (default 0.6) is reached, ensuring correction strength adapts to dataset complexity.
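The feedback loop can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the mixing score here is the cluster-size-weighted mean of each cluster's batch entropy, normalized by the maximum possible entropy (an assumed normalization), and `correct_and_cluster` is a hypothetical callable standing in for "run Harmony at this $\theta$, then recluster with Leiden". The starting value, step size, and upper bound for $\theta$ are also assumptions.

```python
from collections import Counter, defaultdict
from math import log


def batch_mixing_score(cluster_labels, batch_labels):
    """Size-weighted mean of normalized batch entropy per cluster, in [0, 1]."""
    n_batches = len(set(batch_labels))
    if n_batches < 2:
        return 1.0  # a single batch is trivially mixed
    by_cluster = defaultdict(list)
    for cluster, batch in zip(cluster_labels, batch_labels):
        by_cluster[cluster].append(batch)
    total = len(cluster_labels)
    score = 0.0
    for batches in by_cluster.values():
        size = len(batches)
        counts = Counter(batches)
        entropy = -sum((k / size) * log(k / size) for k in counts.values())
        # Weight each cluster by its share of cells; normalize by max entropy.
        score += (size / total) * (entropy / log(n_batches))
    return score


def adaptive_correction(correct_and_cluster, batch_labels,
                        theta=2.0, theta_step=1.0, theta_max=10.0, target=0.6):
    """Raise theta until the mixing target is met or theta_max is reached.

    correct_and_cluster(theta) is a hypothetical callable returning one
    cluster label per cell after batch correction at that theta.
    """
    while True:
        clusters = correct_and_cluster(theta)
        score = batch_mixing_score(clusters, batch_labels)
        if score >= target or theta >= theta_max:
            return theta, score
        theta += theta_step
```

The upper bound on `theta` is a safety stop so the loop terminates even when the target cannot be reached, for example when batches contain genuinely different cell populations.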

C. Multi-Modal Marker Discovery

To overcome the limitations of single-statistic approaches, CASPA integrates four complementary modalities for marker identification:

Detection Specificity: Fisher's exact test on binary presence/absence.
Intensity Differences: Detected-only Mann-Whitney U test (avoiding zero-inflation bias).
Model-Based Effects: scplainer linear mixed-effects models to separate biological cluster effects from technical factors (batch, injection order).
Pathway Activity: AUCell scoring on MSigDB Hallmark gene sets.
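To make the first modality concrete, here is a small standard-library sketch of a one-sided Fisher's exact test on detection counts: is a protein detected in more cells of one cluster than chance would predict? The function names are hypothetical, and whether CASPA uses a one-sided or two-sided test is not stated in the summary, so the one-sided form is an assumption.

```python
from math import comb


def fisher_enrichment_p(detected_in, n_in, detected_out, n_out):
    """One-sided Fisher's exact p-value for enrichment inside the cluster.

    Probability, under the hypergeometric null, of seeing at least
    detected_in detections among the n_in cluster cells.
    """
    total = n_in + n_out
    detected = detected_in + detected_out
    upper = min(n_in, detected)
    tail = sum(
        comb(detected, k) * comb(total - detected, n_in - k)
        for k in range(detected_in, upper + 1)
    )
    return tail / comb(total, n_in)


def detection_specificity(detected, cluster_labels, cluster):
    """detected: one boolean per cell (protein seen or not) for a single protein."""
    inside = [d for d, c in zip(detected, cluster_labels) if c == cluster]
    outside = [d for d, c in zip(detected, cluster_labels) if c != cluster]
    return fisher_enrichment_p(sum(inside), len(inside), sum(outside), len(outside))
```

Because the test works on presence/absence only, it is unaffected by how missing intensities are encoded, which is why it pairs naturally with the detected-only intensity test.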

Consensus Ranking: Proteins supported by $\ge 2$ modalities are prioritized via a Borda rank score.
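A Borda-style consensus can be sketched as below. The exact scoring CASPA uses is not given in the summary; this version assumes each modality supplies a best-first list of its supported proteins and awards points equal to the number of proteins ranked at or below each position. The function name and the example protein lists are illustrative only.

```python
def borda_consensus(rankings, min_support=2):
    """Combine per-modality rankings into one consensus list.

    rankings: dict mapping modality name -> list of proteins, best first.
    Returns (protein, borda_score, n_supporting_modalities) tuples,
    highest score first, keeping proteins with enough support.
    """
    scores, support = {}, {}
    for ranked in rankings.values():
        n = len(ranked)
        for position, protein in enumerate(ranked):
            # Top of an n-item list earns n points, the last item earns 1.
            scores[protein] = scores.get(protein, 0) + (n - position)
            support[protein] = support.get(protein, 0) + 1
    kept = [(p, scores[p], support[p]) for p in scores if support[p] >= min_support]
    # Sort by score (descending), then name, so ties are deterministic.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


example = {
    "detection": ["CD68", "LYZ", "GFAP"],
    "intensity": ["LYZ", "CD68"],
    "model": ["CD68"],
    "pathway": ["VIM"],
}
consensus = borda_consensus(example)
# consensus == [("CD68", 5, 3), ("LYZ", 4, 2)]
```

GFAP and VIM each appear in only one modality, so they drop out before ranking.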

D. Three-Round Context-Aware Annotation

This is the core innovation for cell type labeling, designed to mitigate LLM failure modes:

Round 0 (Context Reasoning): The LLM receives only the experimental context (species, tissue, developmental stage, sample prep) and generates dataset-specific constraints (e.g., "mature astrocytes are absent in fetal cortex," "phagocytosis is a likely mechanism for non-self proteins"). It sees no marker data.
Round 1 (Annotation): Clusters are annotated using the Round 0 constraints, generic biological principles, and full marker summaries.
Round 2 (Refinement): Low/medium-confidence clusters trigger an automated query for supplemental markers nominated by the model in Round 1, followed by re-annotation.
Validation: Includes orthogonal checks against PanglaoDB and researcher-defined panels, outputting confidence tiers and explicit contradiction records.
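The control flow of the three rounds can be sketched as follows. Everything here is a hypothetical skeleton: `ask_llm` and `fetch_markers` are caller-supplied stand-ins (no real LLM API is assumed), and the prompt wording and response fields (`constraints`, `confidence`, `requested_markers`) are invented for illustration. The point is the ordering: constraints are produced before any marker data is shown, and Round 2 runs only for low- or medium-confidence clusters.

```python
def annotate_clusters(context, marker_summaries, ask_llm, fetch_markers):
    """Three-round annotation skeleton.

    ask_llm(prompt) -> dict and fetch_markers(cluster, names) -> dict are
    hypothetical callables supplied by the caller.
    """
    # Round 0: the model sees only the experimental context, never marker data.
    round0 = ask_llm(
        "ROUND 0\nExperimental context: " + context
        + "\nList constraints on plausible cell types and states."
    )
    constraints = round0["constraints"]

    annotations = {}
    for cluster, markers in marker_summaries.items():
        # Round 1: annotate using the Round 0 constraints plus the markers.
        prompt = (
            "ROUND 1\nConstraints: " + "; ".join(constraints)
            + "\nCluster " + str(cluster) + " markers: " + ", ".join(markers)
        )
        answer = ask_llm(prompt)

        # Round 2: only for uncertain clusters that nominated extra markers.
        wanted = answer.get("requested_markers", [])
        if answer["confidence"] in ("low", "medium") and wanted:
            extra = fetch_markers(cluster, wanted)
            answer = ask_llm(
                prompt.replace("ROUND 1", "ROUND 2", 1)
                + "\nSupplemental markers: " + str(extra)
            )
        annotations[cluster] = answer
    return constraints, annotations
```

Passing the two callables in as arguments keeps the skeleton testable with stubs and independent of any particular model provider.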

3. Key Contributions

First End-to-End SCP Pipeline: A fully automated workflow specifically designed for the statistical and biological nuances of mass spectrometry-based single-cell data.
Multi-Modal Integration: Moves beyond simple intensity or detection thresholds by combining detection patterns, intensity fold-changes, statistical modeling, and pathway analysis.
LLM Failure Mode Mitigation: Identifies two classes of LLM errors (vocabulary/threshold and prior-override) and addresses them via a three-round architecture that forces the model to reason about experimental constraints before seeing data.
Handling "Non-Self" Proteins: Explicitly addresses the challenge of distinguishing contamination from biological phenomena like phagocytosis, lytic cell death (NETosis), and trogocytosis.

4. Results & Benchmarking

The pipeline was validated across four diverse datasets:

Developing Human Brain (Wu et al.):
- Successfully recovered 6 of 8 major cell lineages.
- Correction: Fixed LLM errors where "astrocyte" was incorrectly assigned to fetal progenitors (corrected to "astroglial-progenitor") and where sparse signals were over-interpreted as stress states.
- Identified technical artifacts in specific batches without discarding entire datasets.
Glioblastoma-Associated Neutrophils (Sadiku et al.):
- Challenge: Distinguishing functional states within a single lineage where granule proteins are ubiquitous.
- Correction: Correctly identified phagocytic neutrophils (carrying epithelial cargo) and neutrophils undergoing lytic NETosis (depleted proteomes) as biological states rather than "contaminants" or "debris," a failure point for standard LLMs and previous analyses.
Held-Out Validation: Skin Tumour (CYLD Syndrome):
- Tested on a dataset from a different instrument (Bruker timsTOF HT) and workflow.
- Accuracy: Achieved 90.8% concordance with FACS-sorted ground truth labels.
- Key Insight: Correctly identified a cluster of macrophages carrying keratinocyte cargo (80% of its cells were immune-sorted) as "phagocytic macrophages," whereas a competing model (GPT-5.2), run without the context-aware constraints, mislabeled them as keratinocytes.
Caerulein-Injured Pancreas (Mouse):
- Validated via orthogonal immunohistochemistry (IHC) and immunofluorescence (IF).
- Findings: Confirmed macrophage infiltration, a stellate activation continuum, and ductal-like populations.
- Visual Proof: IF co-staining confirmed that Reg3 protein detected in macrophages was intracellular (phagocytosed), validating the pipeline's interpretation over the "ambient contamination" hypothesis.

5. Significance

Reproducibility & Automation: CASPA provides a reproducible, fixed-setting workflow for SCP laboratories, reducing reliance on manual expert intervention.
Contextual Intelligence: The three-round annotation framework demonstrates that LLMs can be effectively constrained by experimental design logic, transforming them from "black box" predictors into transparent, confidence-quantified analytical tools.
Biological Insight: By correctly interpreting non-standard proteomic signatures (e.g., phagocytosis, cell death), the pipeline unlocks biological insights that are often missed by standard "contamination removal" heuristics.
Scalability: The pipeline is designed to scale with the growing volume of SCP data, offering a foundation for future large-scale proteomic atlases.

The authors conclude that while challenges remain (e.g., model dependency, cost of LLM APIs), the integration of adaptive QC, multi-modal evidence, and context-aware reasoning represents a significant leap forward in making single-cell proteomics a routine, interpretable tool for biological discovery.