StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

The paper introduces StethoLM, the first specialized audio-language model for instruction-driven cardiopulmonary auscultation analysis. Trained on a comprehensive benchmark of 77,027 instruction-response pairs spanning diverse clinical tasks, it significantly improves performance and interpretability over existing deep learning methods.

Yishan Wang, Tsai-Ning Wang, Mathias Funk, Aaqib Saeed

Published 2026-03-03

Imagine you are a doctor holding a stethoscope. For centuries, listening to a patient's heart and lungs has been one of the most important, yet difficult, skills in medicine. It's like trying to identify a specific instrument in a complex orchestra just by listening to a few seconds of music. You need years of training to hear the difference between a healthy rhythm and a subtle warning sign of trouble.

For a long time, computers tried to help with this, but they were like very strict, one-note robots. If you asked them, "Is there a murmur?" they could say "Yes" or "No." But if you asked, "What does this sound like, and why is it happening?" they would freeze. They couldn't explain their reasoning or compare today's sound to last week's.

Enter StethoLM, a new kind of AI that acts less like a robot and more like a medical apprentice with a super-powered ear.

What is StethoLM?

Think of StethoLM as a bilingual translator who speaks two languages fluently:

  1. The Language of Sound: It can hear the tiny, split-second differences in a heartbeat or a wheeze (like the difference between a fine crackle and a coarse one).
  2. The Language of Doctors: It can write reports, explain why a sound is bad, and even suggest what disease might be causing it, just like a human doctor would.

Instead of just giving a "Yes/No" answer, StethoLM can have a conversation. You can ask it, "Compare this recording to the one from last month," or "What are the top three possibilities for this cough?" and it will give you a detailed, written answer.

How Did They Teach It? (The "StethoBench" Library)

To teach a computer to listen like a doctor, you need a massive library of examples. The researchers built something called StethoBench.

Imagine a giant library with 77,000 flashcards. On one side of the card is a recording of a heart or lung sound. On the other side is a question a doctor might ask (like "Is this normal?") and the perfect answer a senior doctor would give.

They didn't just write these cards by hand; they used a smart AI to generate them from over 16,000 real patient recordings. This created a "school" where StethoLM learned to:

  • Detect problems (like hearing a sneeze vs. a cough).
  • Report findings (writing a summary for a patient's file).
  • Reason (explaining why a sound suggests asthma).
  • Compare (spotting the difference between a healthy lung and a sick one).
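In code, one of these "flashcards" might look like a simple record pairing an audio clip with a doctor-style question and a reference answer. This is a hypothetical sketch for illustration only — the field names and values are assumptions, not the paper's actual StethoBench schema:

```python
from dataclasses import dataclass


@dataclass
class InstructionPair:
    """One hypothetical StethoBench-style 'flashcard'."""
    audio_path: str   # recording of a heart or lung sound (one side of the card)
    task: str         # e.g. "detection", "reporting", "reasoning", "comparison"
    instruction: str  # the question a doctor might ask
    response: str     # the reference answer a senior doctor would give


# Hypothetical example entry (file name and wording are invented).
example = InstructionPair(
    audio_path="recordings/lung_0042.wav",
    task="detection",
    instruction="Is this lung sound normal, or do you hear adventitious sounds?",
    response="Coarse crackles are audible on inspiration, suggesting airway secretions.",
)

print(example.task, "->", example.instruction)
```

Tens of thousands of such pairs, generated with the help of a smart AI from real recordings, form the training "school" described above.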

How Does It Work?

StethoLM is built like a three-part team:

  1. The Ear (Audio Encoder): This part listens to the raw sound and turns it into a digital map of frequencies, spotting the tiny details humans might miss.
  2. The Bridge (Projection): This acts like a translator, turning that sound map into a format the "brain" can understand.
  3. The Brain (Medical Language Model): This is the part that knows medical facts. It takes the sound information and the doctor's question, then writes a response.

The magic happens because they trained this team specifically on medical sounds, not just general noises like music or traffic. It's the difference between teaching a dog to sit (general training) and teaching a dog to detect a specific scent of a disease (specialized training).
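The three-part team can be sketched in a few lines of Python. Everything below is a toy stand-in — the function names, dimensions, and random-projection "encoder" are assumptions to show how the pieces connect, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)


def audio_encoder(waveform: np.ndarray, dim: int = 64) -> np.ndarray:
    """The 'Ear': turn raw audio into a sequence of embeddings.
    (Toy stand-in: chop into fixed-size frames, apply a random projection.)"""
    frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
    W = rng.standard_normal((160, dim)) * 0.01
    return frames @ W  # shape: (num_frames, dim)


def projection(audio_embeds: np.ndarray, llm_dim: int = 128) -> np.ndarray:
    """The 'Bridge': translate sound embeddings into the brain's format."""
    W = rng.standard_normal((audio_embeds.shape[1], llm_dim)) * 0.01
    return audio_embeds @ W  # shape: (num_frames, llm_dim)


def medical_llm(audio_tokens: np.ndarray, question: str) -> str:
    """The 'Brain': take the sound tokens plus the doctor's question and
    write a response. (Stand-in for a real medical language model.)"""
    return f"Given {audio_tokens.shape[0]} audio tokens, answering: {question!r}"


# One second of fake 16 kHz audio standing in for a stethoscope recording.
waveform = rng.standard_normal(16000)
tokens = projection(audio_encoder(waveform))
print(medical_llm(tokens, "Is there a murmur?"))
```

The key design point the sketch illustrates: only the bridge and brain see the doctor's question; the ear's job is purely to turn sound into a representation the language model can reason over.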

Did It Work?

The results were impressive. When tested against other smart AI models:

  • General AI models (trained on all kinds of sounds) were like a student who studied for a general exam but failed the specific medical test. They could guess, but they often got the details wrong.
  • StethoLM was like the top of the class. It understood the specific "dialect" of heart and lung sounds much better.

However, it's not perfect yet.

  • The "Hallucination" Risk: In one test, when they removed the audio but kept the text, the AI still tried to give a medical diagnosis! This is like a student guessing the answer even when the test question is missing. This is dangerous in real life, so doctors must always double-check the AI's work.
  • The "New Noise" Problem: When the AI heard sounds it had never seen before (like a cough recorded in a noisy, crowded room), it struggled a bit more. It's like a musician who is great in a concert hall but gets confused by a noisy street.

Why Does This Matter?

StethoLM isn't here to replace doctors. Instead, think of it as a super-powered assistant.

  • For Rural Clinics: A nurse in a remote village with no specialist nearby could use StethoLM to get a "second opinion" on a heart sound, helping them decide if a patient needs to travel to a big city.
  • For Busy Hospitals: It can listen to hundreds of recordings a day, flagging the ones that sound suspicious so the doctor can focus their time on the most critical cases.
  • For Learning: It can act as a tutor, explaining to medical students why a specific sound indicates a specific disease.

The Bottom Line

StethoLM is a giant leap forward. It moves medical AI from being a simple "Yes/No" checker to a thinking partner that can listen, reason, and explain. While it still needs human doctors to supervise it (especially to catch its occasional mistakes), it promises to make the ancient art of listening to the body more accessible, accurate, and scalable for everyone.
