Identification and classification of all Cytochrome P450 deposits in the Protein Data Bank

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the Protein Data Bank (PDB) as the world's largest, most chaotic library of 3D blueprints for biological machines. Among these millions of blueprints, there is a very special, incredibly useful, but notoriously difficult-to-find group of machines called Cytochrome P450s.

These machines are the "Swiss Army Knives" of biology. They help our bodies process drugs, help plants fight off pests, and help bacteria eat oil spills. Because they are so useful, scientists have been building and photographing them for decades.

However, there is a massive problem: The library is a mess.

The Problem: A Library with No Catalog

Imagine walking into a library where some books are labeled "The Great Gatsby," others are labeled "Book 1," some are just called "The Green Light," and others have no title at all. If you asked the librarian to find "The Great Gatsby," they might miss the ones labeled "The Green Light" or the ones with no title.

This is exactly what happened with P450 enzymes in the scientific database:

Inconsistent Names: Some scientists called them by their official ID (like CYP101A1), while others used old nicknames (like P450cam or P450BM3).
Missing Labels: Many entries didn't say which "family" or "subfamily" the enzyme belonged to.
The Search Nightmare: Because the names were so messy, if a researcher tried to search for "all P450s," they would miss hundreds of them or find things that weren't P450s at all. It was like trying to find all the "red cars" in a parking lot when some are labeled "red," some "crimson," some "maroon," and some have no color tag at all.

The Solution: A Smart Detective Team

The authors of this paper decided to clean up this mess. They acted like a team of super-detectives with a new strategy. Instead of just reading the labels (which were often wrong or missing), they looked at the shape of the machines.

Here is how they did it:

The Keyword Sweep: First, they scanned the library for any book that mentioned "P450" or "heme" (the engine part of the machine). This found most of them.
The Shape-Shifter Test: They knew that even if P450s look very different on the outside (like a sedan vs. a truck), they all share the same internal engine structure. So, they took a few "perfect" P450 blueprints and compared the 3D shape of every single machine in the library against them.
- Analogy: Imagine you are looking for all the "Suzuki Swift" cars. You can't just look for the word "Swift" on the license plate because some people write "Suzuki," some write "Swift," and some write nothing. Instead, you look at the shape of the car. If it has the same wheelbase, door shape, and engine layout as a Swift, it's a Swift, even if the name tag is wrong.
The Human Review: Once the computer found the candidates, the human experts double-checked them. They fixed the labels, assigned the correct official ID (the CYPid), and even identified five new subfamilies of these enzymes that had not been classified before.

The Results: A Clean, Organized Library

By the end of their work, they found 1,513 P450 structures.

They realized that while there were 1,513 blueprints, many were just copies of the same 674 unique machines.
They fixed the labels for almost everything.
They found that the most popular machines were P450-BM3 (a fatty acid cleaner) and P450-CAM (a camphor cleaner), which makes sense because both are long-standing, well-studied model systems.
They also found that some machines had "fake engines" (different metal atoms instead of iron) used for special experiments, and they cataloged those too.

Why Does This Matter?

Before this paper, if a scientist wanted to study how P450s break down drugs, they had to waste weeks guessing which blueprints were real and which were mislabeled.

Now, thanks to this work:

The Library is Organized: There is a single, up-to-date list where every P450 has its correct ID card.
The Search is Easy: Researchers can now find every single P450 structure instantly.
The Future is Automated: The authors built a robot (an automated pipeline) that will check the library every three months. If a new P450 blueprint is added tomorrow, the robot will find it, label it, and add it to the list automatically.

In short: This paper took a chaotic, confusing pile of biological blueprints and turned it into an organized, easy-to-search registry, so that scientists can find the structures they need for drug research and enzyme engineering.

1. Problem Statement

Cytochrome P450 monooxygenases (CYPs) are a highly diverse enzyme superfamily critical to biotechnology, pharmacology, and environmental science. Despite the existence of over 1,500 structures in the Protein Data Bank (PDB), researchers face significant challenges in retrieving and analyzing them due to:

Inconsistent Nomenclature: Many entries lack the standardized CYP identifier (CYPid, e.g., CYP101A1). Instead, they rely on legacy common names (e.g., P450cam, P450BM-3) or author-defined aliases that vary in formatting, specificity, and accuracy.
Annotation Errors: Deposits often contain missing family/subfamily data, incorrect classifications, or ambiguous labels (e.g., listing a remote homolog instead of the actual enzyme).
Search Limitations: Standard keyword searches in the PDB yield high false-positive rates or miss entries entirely due to the lack of standardized tags. Sequence-based methods (like BLAST) are also insufficient, because sequences within the P450 superfamily are extremely divergent (often <20% identity) even though structural conservation remains high.
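To make the nomenclature problem concrete, here is a minimal Python sketch of alias normalization. It is an illustration only, not the authors' method: the lookup table contains just the alias pairings named in this summary, the helper names (`normalize_alias`, `lookup_cypids`) are hypothetical, and the CYPid pattern (family number, subfamily letters, isoform number) is an assumed simplification.

```python
import re
from typing import Dict, List

# Illustrative alias table, limited to pairings named in this summary.
# Keys are normalized aliases; values are lists because one alias can
# refer to more than one enzyme.
ALIAS_TO_CYPID: Dict[str, List[str]] = {
    "p450cam": ["CYP101A1"],
    "p450bm3": ["CYP102A1"],
    "p450scc": ["CYP11A1", "CYP204A1"],
}

# Assumed simplified CYPid shape, e.g. CYP101A1 or CYP3A4.
CYPID_PATTERN = re.compile(r"^CYP\d+[A-Z]+\d+$")


def normalize_alias(name: str) -> str:
    """Lowercase the name and drop spaces, hyphens, and underscores."""
    return re.sub(r"[\s\-_]+", "", name).lower()


def lookup_cypids(name: str) -> List[str]:
    """Return candidate CYPids for a free-text enzyme name."""
    compact = re.sub(r"[\s\-_]+", "", name).upper()
    if CYPID_PATTERN.match(compact):
        return [compact]  # already a standardized identifier
    return ALIAS_TO_CYPID.get(normalize_alias(name), [])


for raw in ["P450cam", "P450 CAM", "P450BM-3", "P450-BM3", "P450scc"]:
    print(raw, "->", lookup_cypids(raw))
```

Even this toy version shows why string matching alone is fragile: formatting variants collapse to one key, but an ambiguous alias such as P450scc still returns two candidates that need expert review.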

2. Methodology

The authors developed a structure-guided discovery and validation workflow to systematically identify, classify, and re-annotate all P450 structures in the PDB. The pipeline consists of the following steps:

Data Retrieval & Pre-processing:
- Retrieved all polymer entities from the PDB (accessed July 28, 2024).
- Filtered for the longest chain per deposit and excluded chains shorter than 200 amino acids or with fewer than five $\alpha$-helices (based on the known structural constraints of P450s).
Two-Stage Identification:
1. Keyword & Heme Search: Searched for deposits containing "CYP" or "P450" keywords and heme cofactors. Sequences were submitted to the P450atlas server for initial classification.
2. Structural Similarity Search: To catch missed entries (e.g., those with non-standard heme variants or missing keywords), the authors performed structural alignments (using TM-align) of the filtered PDB chains against three representative P450 templates (3EL3, 7WEX, 7TLO). Chains with a TM-score > 0.6 were retained.
Classification & Validation:
- Automated Assignment: All identified sequences were processed through the P450atlas server, which uses Hidden Markov Models (HMM) and sequence alignment to assign families and subfamilies.
- Manual Curation: Experts manually verified assignments, resolved ambiguities (e.g., multiple aliases for one enzyme), and corrected misclassifications found in original deposits.
- New Subfamily Discovery: Sequences that fell below the standard subfamily identity threshold (55%) but showed strong evolutionary context were evaluated for potential new subfamilies.
Automation: An automated pipeline was established to periodically scan new PDB releases (quarterly) to maintain an up-to-date registry.
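The filtering and structural-retention steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' pipeline: the `Chain` record and function names are hypothetical, the TM-scores are assumed to be precomputed (for example by running TM-align separately), and a chain is assumed to be retained if it exceeds the cutoff against any one of the three templates, which the summary does not specify. The deposit IDs and scores in the demo are invented toy values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

MIN_LENGTH = 200   # minimum chain length in residues (from the text)
MIN_HELICES = 5    # minimum number of alpha-helices (from the text)
TM_CUTOFF = 0.6    # TM-score retention threshold (from the text)


@dataclass
class Chain:
    pdb_id: str
    chain_id: str
    length: int
    n_helices: int
    tm_scores: Dict[str, float]  # hypothetical: template ID -> TM-score


def longest_chain_per_deposit(chains: Iterable[Chain]) -> List[Chain]:
    """Keep only the longest chain for each deposit."""
    best: Dict[str, Chain] = {}
    for c in chains:
        if c.pdb_id not in best or c.length > best[c.pdb_id].length:
            best[c.pdb_id] = c
    return list(best.values())


def passes_prefilter(c: Chain) -> bool:
    """Apply the length and helix-count constraints."""
    return c.length >= MIN_LENGTH and c.n_helices >= MIN_HELICES


def structural_candidates(chains: Iterable[Chain]) -> List[Chain]:
    """Retain chains whose best template TM-score exceeds the cutoff.

    Assumption: a match against any single template is sufficient.
    """
    return [
        c
        for c in longest_chain_per_deposit(chains)
        if passes_prefilter(c)
        and max(c.tm_scores.values(), default=0.0) > TM_CUTOFF
    ]


# Toy data: deposit IDs and scores are invented for illustration.
demo = [
    Chain("DEP1", "A", 410, 14, {"3EL3": 0.81, "7WEX": 0.78, "7TLO": 0.74}),
    Chain("DEP1", "B", 120, 4, {}),
    Chain("DEP2", "A", 150, 6, {"3EL3": 0.70}),
    Chain("DEP3", "A", 395, 12, {"3EL3": 0.42, "7WEX": 0.45, "7TLO": 0.40}),
    Chain("DEP4", "A", 450, 15, {"3EL3": 0.55, "7WEX": 0.66, "7TLO": 0.58}),
]
kept = [c.pdb_id for c in structural_candidates(demo)]
print(kept)  # -> ['DEP1', 'DEP4']
```

In the demo, DEP2 is dropped for being too short and DEP3 for low structural similarity, while DEP4 is kept because one template scores above 0.6. Candidates that survive this step would still go through the classification and manual curation stages described above.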

3. Key Contributions

Comprehensive Registry: Creation of the first rigorously curated, structure-linked registry of P450 enzymes, mapping 1,513 PDB deposits to 674 unique sequences.
Standardized Annotation: Every identified deposit was assigned a correct, standardized CYPid (Family and Subfamily), replacing inconsistent legacy names.
Discovery of New Subfamilies: The study identified five new CYP subfamilies (CYP165F, CYP152AX, CYP255D, CYP1251G, CYP107PW) that were previously unclassified or misclassified by automated tools alone.
Resource Integration: The dataset is integrated into the publicly accessible P450atlas.org website, enhancing the server's database with structural information and improving future assignment accuracy.
Analysis of Heme Variants: A detailed catalog of alternative heme cofactors (e.g., HEC, MI9, metal-substituted hemes) found in PDB deposits, clarifying cases where standard HEM codes were missing or erroneous.

4. Key Results

Dataset Statistics:
- Total Deposits: 1,513 (as of Jan 1, 2026).
- Unique Sequences: 674.
- Families: 86 different families are represented, though the distribution is highly skewed; 62.39% of deposits belong to the top 8 families.
- Top Families: CYP102 (197 deposits, mostly CYP102A1/P450-BM3), CYP101 (146 deposits, mostly CYP101A1/P450-CAM), and CYP3 (dominated by CYP3A4).
Annotation Quality:
- Only 905 of 1,513 deposits had correct family and subfamily information in the original PDB entry.
- 287 had correct families but missing subfamilies.
- 284 had no family data, relying solely on common names.
- 21 deposits contained incorrect family assignments or no P450 indication at all.
Structural vs. Sequence Divergence:
- While sequence identity between P450 pairs can drop below 20%, structural similarity (TM-score) remains consistently high (>0.7).
- The most probable pair in the dataset showed 22% sequence identity but a TM-score of 0.82, confirming that structural alignment is a robust method for identification where sequence methods fail.
Common Names: The study cataloged frequent aliases (e.g., P450 BM3, P450 CAM, P450-PCN1), noting that single aliases often map to multiple enzymes (e.g., P450scc referring to both CYP11A1 and CYP204A1) and vice versa.
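The counts reported above can be turned into proportions with a short Python calculation. This is only arithmetic on the figures quoted in this summary; the category labels are paraphrased. Note that the four annotation categories as listed here add up to 1,497, so 16 of the 1,513 deposits are not covered by this breakdown, and the sketch makes no assumption about where they belong.

```python
total_deposits = 1513
unique_sequences = 674

# Annotation-quality counts as quoted in this summary.
annotation_counts = {
    "correct family and subfamily": 905,
    "correct family, missing subfamily": 287,
    "no family data (common name only)": 284,
    "incorrect family or no P450 indication": 21,
}

# Share of all deposits falling into each category.
shares = {label: n / total_deposits for label, n in annotation_counts.items()}
for label, share in shares.items():
    print(f"{label}: {annotation_counts[label]} ({share:.1%})")

accounted = sum(annotation_counts.values())
print("accounted for:", accounted)                         # -> 1497
print("not in breakdown:", total_deposits - accounted)     # -> 16

# Average redundancy: deposits per unique sequence.
print(f"deposits per unique sequence: {total_deposits / unique_sequences:.2f}")

# Share of deposits in the two largest families named in the text.
for family, n in {"CYP102": 197, "CYP101": 146}.items():
    print(f"{family}: {n / total_deposits:.1%} of deposits")
```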

5. Significance

This work establishes a reliable framework for the accurate retrieval, comparison, and large-scale analysis of P450 enzymes. By unifying structurally validated identification with standardized nomenclature, it:

Removes Ambiguity: Eliminates the confusion caused by inconsistent legacy names, allowing for precise cross-referencing in literature and databases.
Enables Automation: Provides a foundation for automated literature mining and bioinformatic pipelines that were previously hindered by data inconsistency.
Supports Drug Discovery & Biotechnology: Given the role of P450s in drug metabolism (e.g., CYP3A4) and biocatalysis (e.g., CYP102A1), a clean, annotated dataset accelerates the design of inhibitors and engineered enzymes.
Future-Proofing: The automated, quarterly-updating pipeline ensures the registry remains current as new structures are deposited, preventing the accumulation of future annotation errors.

