Identifying Anomalous DESI Galaxy Spectra with a Variational Autoencoder

C. Nicolaou, R. P. Nathan, O. Lahav, A. Palmese, A. Saintonge, J. Aguilar, S. Ahlen, C. Allende Prieto, S. Bailey, S. BenZvi, D. Bianchi, A. Brodzeller, D. Brooks, T. Claybaugh, A. de la Macorra, J. Della Costa, Arjun Dey, P. Doel, J. E. Forero-Romero, E. Gaztañaga, S. Gontcho A Gontcho, G. Gutierrez, K. Honscheid, C. Howlett, M. Ishak, R. Kehoe, D. Kirkby, T. Kisner, A. Kremin, A. Lambert, M. Landriau, L. Le Guillou, A. Meisner, R. Miquel, J. Moustakas, S. Nadathur, F. Prada, I. Pérez-Ràfols, G. Rossi, E. Sanchez, M. Schubnell, M. Siudek, D. Sprayberry, G. Tarlé, B. A. Weaver, H. Zou

Published Thu, 12 Ma


Imagine you are a librarian in charge of a library that is growing so fast it's impossible to read every single book. This library is the Dark Energy Spectroscopic Instrument (DESI), and instead of books, it's collecting millions of "light fingerprints" (spectra) from stars, galaxies, and quasars across the universe.

The problem? With tens of millions of these fingerprints, some are messy, some are broken, and some might be completely new types of objects we've never seen before. If you tried to look at them all by eye, you'd go crazy. You need a smart assistant to help you find the weird ones.

This paper introduces that assistant: a Variational Autoencoder (VAE), which is a type of Artificial Intelligence (AI). Here's how it works, explained simply:

1. The "Compression" Trick

Imagine you have a massive, detailed painting of a galaxy. It has thousands of tiny brushstrokes (data points).

The Old Way: You try to memorize every single brushstroke.
The VAE Way: The AI acts like a super-smart artist who looks at the painting and says, "I can describe this entire scene using just 10 numbers." It compresses the massive painting into a tiny "ID card" of 10 numbers (called the latent space).
The Test: The AI then tries to rebuild the painting from just those 10 numbers. If it can rebuild a near-perfect copy, it has captured what matters in the data. If the rebuilt copy of a weird or broken painting comes out a mess, that mismatch is a sign that something is unusual.

2. Finding the "Weirdos" (Anomalies)

The AI uses two main tricks to spot the oddballs in the crowd:

Trick A: The "Reconstruction Error" (The Copycat Test)
The AI tries to copy the spectrum. If the spectrum is normal, the copy looks great. If the spectrum has a weird glitch (like a broken camera sensor) or a strange physical feature (like a galaxy with an unusually bright flash), the AI struggles to copy it. The "messier" the copy, the more suspicious the object is.
- Analogy: Imagine a photocopier. If you put in a normal document, it comes out perfect. If you put in a document with a coffee stain or a torn edge, the copy looks terrible. The AI flags the ones that look terrible.
Trick B: The "Isolation" Test (The Party Test)
The AI organizes all the spectra into a giant, invisible map (the latent space). Normal galaxies cluster together in a big group, like people at a party who all like the same music.
- If a spectrum lands way out in the middle of an empty field, far away from the main group, the AI flags it as an outlier.
- Analogy: If you walk into a room full of people wearing t-shirts, and you see one person wearing a tuxedo, they stand out immediately. The AI spots the "tuxedo" spectra.

3. What Did They Find?

The AI found two main types of "weirdos":

The Broken Ones: These are spectra with errors, like bad camera calibration, cosmic rays hitting the sensor, or the wrong distance (redshift) assigned to them. Finding these helps the scientists fix their equipment and software.
The New Discoveries: These are spectra with unique physical features, like a galaxy with an incredibly bright burst of star formation or a star that looks nothing like the others. These could be the "unknown unknowns"—new physics waiting to be discovered.

4. The "Human-in-the-Loop" (Astronomaly)

The AI found too many weird things to check one by one. So, the scientists used a tool called Astronomaly.

How it works: Think of it as a smart filter. You tell the AI, "I'm only interested in finding new types of stars, not broken cameras." The AI learns from your feedback and re-ranks the list, putting the most interesting "new stars" at the top and hiding the "broken cameras" at the bottom.
Analogy: It's like a music streaming service. At first, it guesses what you like. But once you start skipping songs you hate and loving the ones you like, it gets better at curating a playlist just for you.

5. The "Secret Map" (Interpretability)

One of the coolest parts of this paper is that the AI didn't just find weird things; it organized the data in a way that makes sense to humans, even though it was never taught the names of the objects.

The AI naturally separated Stars, Galaxies, and Quasars into different neighborhoods on its map.
It even found "tracks" or paths. If you walk along a path in the AI's map, you can see a galaxy slowly changing from "old and red" to "young and blue," or a star changing from "cool" to "hot."
Analogy: It's like the AI built a map of the universe where the "latitude" represents how old a star is, and the "longitude" represents how hot it is, all without anyone telling it to do that.

The Bottom Line

This paper shows that by using a smart AI "compression" tool, astronomers can sift through millions of data points to find the needles in the haystack. It helps clean up bad data so the pipeline works better, and it highlights the most exciting, unusual objects for human scientists to study, potentially leading to new discoveries about how the universe works.

Technical summary of the paper "Identifying Anomalous DESI Galaxy Spectra with a Variational Autoencoder."

1. Problem Statement

The Dark Energy Spectroscopic Instrument (DESI) is generating an unprecedented volume of astronomical data, with approximately 40 million spectra collected (aiming for 35 million unique spectra). Traditional visual inspection is infeasible at this scale.

The Challenge: Astronomical datasets contain "anomalies" which fall into two distinct categories:
1. Instrumental/Artifactual: Errors due to calibration issues, bad pixels, incorrect redshift assignments, or fiber placement errors.
2. Astrophysical: Rare or novel objects (e.g., unusual AGNs, extreme star-forming galaxies, white dwarfs) that differ significantly from the norm.
The Goal: Develop an automated, unsupervised machine learning framework to detect these anomalies, curate them for human inspection, and interpret the underlying physical or instrumental causes without relying on pre-existing labels.

2. Methodology

The authors employ a Variational Autoencoder (VAE), a probabilistic generative model, to learn a low-dimensional representation of the spectral data.

Data Preprocessing

Dataset: ~208,000 spectra from the DESI Early Data Release (EDR) Bright Galaxy Survey (BGS).
Selection: Objects with redshifts $0 \le z \le 0.3$.
Transformation: Spectra are converted to the rest-frame to align physical features regardless of redshift.
Resampling: Resampled to 1,000 wavelength bins (4 Å resolution) to reduce computational cost while preserving key line separations.
Cleaning: Bad pixels (CCD defects, cosmic rays) are masked and infilled via iterative PCA. Spectra with median Signal-to-Noise (S/N) < 5 are discarded.
Normalization: Individual spectra are normalized to unit norm.
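
The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the function names, the grid start of 3600 Å, the use of linear interpolation, and the choice of the L2 norm for "unit norm" are all assumptions made here, and the PCA infilling of masked pixels is left out.

```python
import numpy as np

def median_snr(flux, ivar):
    """Median per-pixel signal-to-noise, taking ivar = 1 / sigma^2."""
    return float(np.median(flux * np.sqrt(ivar)))

def preprocess_spectrum(wave_obs, flux, ivar, z,
                        wave_min=3600.0, n_bins=1000, bin_width=4.0):
    """Shift to the rest frame, resample to a fixed grid, scale to unit L2 norm.

    wave_min is a hypothetical grid start; the summary only states
    1,000 bins at 4 Angstrom resolution.
    """
    # Observed wavelengths -> rest-frame wavelengths.
    wave_rest = wave_obs / (1.0 + z)
    # One rest-frame grid shared by every spectrum.
    grid = wave_min + bin_width * np.arange(n_bins)
    # Linear interpolation; pixels outside the observed range get zero
    # flux and zero weight. Interpolating ivar this way is a crude shortcut.
    flux_rs = np.interp(grid, wave_rest, flux, left=0.0, right=0.0)
    ivar_rs = np.interp(grid, wave_rest, ivar, left=0.0, right=0.0)
    norm = np.linalg.norm(flux_rs)
    if norm == 0.0:
        raise ValueError("spectrum has no flux on the target grid")
    # Dividing flux by `norm` multiplies the inverse variance by norm**2,
    # so the per-pixel signal-to-noise is unchanged by the scaling.
    return grid, flux_rs / norm, ivar_rs * norm**2
```

A spectrum would be dropped before this step if `median_snr(flux, ivar) < 5`, following the S/N cut described above.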

VAE Architecture & Training

Architecture:
- Encoder: Input (1000 nodes) $\to$ 4 hidden layers (800, 600, 500, 300) $\to$ Latent Space (10 dimensions).
- Decoder: Mirrors the encoder structure.
- Activation: ReLU for hidden layers; Linear for output.
- Regularization: Dropout (0.2) and KL Divergence to enforce a continuous, Gaussian latent space.
Loss Function: Evidence Lower Bound (ELBO), comprising:
1. Reconstruction Likelihood: Weighted Gaussian log-likelihood (using inverse variance as weights to handle heteroscedastic noise).
2. KL Divergence: Regularizes the latent distribution to match a standard normal prior.
Training: Adam optimizer, batch size 512, 50 epochs.
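
To make the loss concrete, the negative ELBO for a diagonal-Gaussian encoder can be written as $\tfrac{1}{2}\sum_i w_i (x_i - \hat{x}_i)^2 - \tfrac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$, where $w_i$ is the inverse variance of pixel $i$. The sketch below evaluates that expression in NumPy. The function names are hypothetical, constant terms of the log-likelihood are dropped, and any relative weighting between the two terms used in the paper is not stated in this summary, so none is applied.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def negative_elbo(x, x_hat, ivar, mu, log_var):
    """Batch-averaged negative ELBO.

    x, x_hat, ivar: arrays of shape (batch, n_pixels)
    mu, log_var:    arrays of shape (batch, n_latent)
    """
    # Inverse-variance weighted Gaussian negative log-likelihood,
    # up to an additive constant that does not depend on the model.
    recon = 0.5 * np.sum(ivar * (x - x_hat) ** 2, axis=1)
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + kl))
```

With a perfect reconstruction and a posterior equal to the prior (`mu = 0`, `log_var = 0`), both terms vanish and the loss is zero; noisy pixels with small `ivar` contribute little to the reconstruction term.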

Anomaly Detection Strategies

The paper utilizes two complementary approaches to identify outliers:

Deviation-Based (Reconstruction Error): Spectra with high Weighted Mean Squared Error (MSE) between the original and reconstructed spectrum are flagged. This captures "off-manifold" anomalies (features the model cannot reproduce).
Proximity-Based (Latent Space Isolation): Spectra located in low-density regions of the 10D latent space are identified using the Local Outlier Factor (LOF) algorithm. This captures "on-manifold" anomalies (spectra that look normal but are statistically rare in the distribution).
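
Both scores can be illustrated with a short NumPy sketch. The weighted MSE below assumes the weights are the per-pixel inverse variances, and the LOF function is a brute-force implementation of the standard Local Outlier Factor definition written for illustration only. The neighbour count `k` is an arbitrary choice here, and a real analysis would more likely use a library implementation such as scikit-learn's `LocalOutlierFactor`.

```python
import numpy as np

def weighted_mse(x, x_hat, ivar):
    """Inverse-variance weighted mean squared error per spectrum."""
    return np.mean(ivar * (x - x_hat) ** 2, axis=-1)

def local_outlier_factor(latent, k=20):
    """Brute-force LOF; values well above 1 mark points in sparse regions."""
    # Pairwise Euclidean distances, with self-distances excluded.
    d = np.linalg.norm(latent[:, None, :] - latent[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    # Indices of, and distances to, the k nearest neighbours of each point.
    nn = np.argsort(d, axis=1)[:, :k]
    nn_dist = np.take_along_axis(d, nn, axis=1)
    k_dist = nn_dist[:, -1]
    # Reachability distance: reach(p, o) = max(k_distance(o), d(p, o)).
    reach = np.maximum(k_dist[nn], nn_dist)
    # Local reachability density, then LOF as the ratio of the
    # neighbours' mean density to the point's own density.
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)
    return lrd[nn].mean(axis=1) / lrd
```

Because the residuals are divided by the noise variance, a weighted MSE close to 1 means the model misses the data by roughly the noise level, while values far above 1 indicate structure the model could not reproduce.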

Curation & Active Learning

To address the subjectivity of anomaly relevance (e.g., a data quality engineer cares about artifacts, while a physicist cares about novel objects), the authors integrate Astronomaly.

Astronomaly uses Active Learning to allow human experts to label a small subset of outliers.
A Random Forest regressor learns the user's relevance criteria and re-ranks the remaining dataset, prioritizing anomalies that match the user's specific scientific goals.
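
The re-ranking idea can be sketched with scikit-learn's `RandomForestRegressor`. This is not Astronomaly's API: the function name, the choice of features, and the way predictions are turned into a ranking are assumptions for illustration, and Astronomaly itself may combine the predicted relevance with the raw anomaly score in ways not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rerank_by_relevance(features, labelled_idx, user_scores,
                        n_trees=100, seed=0):
    """Rank all objects by predicted user relevance.

    features:     (n_objects, n_features) array, e.g. latent coordinates
    labelled_idx: indices of the objects a human has scored
    user_scores:  the human relevance scores for those objects
    """
    model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    # Learn the user's notion of "interesting" from the labelled subset.
    model.fit(features[labelled_idx], user_scores)
    # Predict relevance for every object and sort from most to least relevant.
    predicted = model.predict(features)
    order = np.argsort(-predicted)
    return order, predicted
```

In use, a scientist would score a small batch of top outliers, call the function, inspect the new top of the list, and repeat with the enlarged set of labels.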

3. Key Contributions

Dimensionality Reduction: Demonstrated that a VAE can compress 7,800-dimensional spectral data (DESI native) down to 10 dimensions while retaining sufficient information to accurately reconstruct complex spectral features (continua, absorption, and broad emission lines).
Dual-Strategy Detection: Showed that combining reconstruction error (MSE) and latent space isolation (LOF) provides broader coverage than either method alone. The overlap between the top 1% of outliers from both methods is only ~10%, indicating they detect different types of anomalies.
Unsupervised Interpretability: Found that the VAE latent space naturally separates object classes (Galaxies, Stars, Quasars) and sub-classes (e.g., M vs. K stars, Seyfert 1 vs. Seyfert 2) without any labeled training data.
Synthetic Anomaly Injection: Validated the model's sensitivity by injecting controlled synthetic anomalies (e.g., amplified lines, noise, continuum removal) and tracking their displacement in the latent space and reconstruction error.
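
The overlap statistic quoted for the two detection methods is simple to compute. The sketch below assumes both score arrays follow the convention that a higher value means more anomalous; the function names are hypothetical.

```python
import numpy as np

def top_fraction(scores, fraction=0.01):
    """Indices of the highest-scoring `fraction` of objects, as a set."""
    n_top = max(1, int(round(fraction * len(scores))))
    return set(np.argsort(scores)[-n_top:].tolist())

def overlap_fraction(scores_a, scores_b, fraction=0.01):
    """Share of method A's top outliers that also appear in method B's."""
    top_a = top_fraction(scores_a, fraction)
    top_b = top_fraction(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)
```

Passing the weighted MSE and LOF scores for the same spectra would give the overlap directly; a value near 0.1 corresponds to the ~10% reported above.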

4. Results

Reconstruction Accuracy

The VAE achieved a mean weighted MSE of 1.12 on the validation set.
Performance by Class: Galaxies (MSE 1.09) were reconstructed best because they dominate the training set. Stars (1.87) and Quasars (2.57) had higher errors because they are scarce in the training set, yet the model still captured their key features (e.g., broad emission lines in AGNs).

Anomaly Identification

Reconstruction Error (MSE) Outliers: Primarily identified spectra with extreme physical features (e.g., extreme H$\alpha$ emission) or significant instrumental errors (e.g., incorrect redshifts, bad calibration between camera channels).
- Example: A spectrum with a massive H$\alpha$ emission line (MSE 46.47) was flagged; visual inspection revealed it was a fiber placed on a bright region of a nearby galaxy.
- Example: A high-redshift quasar misclassified as a low-redshift galaxy was flagged due to poor reconstruction of UV features.
Latent Space (LOF) Outliers: Identified spectra with low S/N or subtle artifacts that did not cause massive reconstruction errors but were isolated in the latent space.
- Example: A galaxy with a large flux dip (bad sky subtraction) was the most extreme LOF outlier.
Synthetic Tests:
- Amplifying emission lines moved spectra into underdense regions.
- Adding new, unseen emission lines caused high reconstruction error but only small latent shifts (highlighting the need for both metrics).
- Broadening lines moved spectra toward the Quasar cluster in latent space, demonstrating physical interpretability.
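
Two of the synthetic tests above can be sketched as simple array operations. The functions below are illustrative assumptions rather than the authors' injection code: the continuum is approximated by the median flux outside the line window, and the injected line is a plain Gaussian profile.

```python
import numpy as np

def amplify_line(wave, flux, center, half_width, factor):
    """Scale the flux above the continuum by `factor` inside a window around `center`."""
    out = flux.copy()
    in_line = np.abs(wave - center) <= half_width
    # Crude continuum estimate: median of the flux outside the window.
    continuum = np.median(flux[~in_line])
    out[in_line] = continuum + factor * (flux[in_line] - continuum)
    return out

def add_gaussian_line(wave, flux, center, sigma, amplitude):
    """Add a Gaussian emission line the model has not seen at that wavelength."""
    return flux + amplitude * np.exp(-0.5 * ((wave - center) / sigma) ** 2)
```

Each modified spectrum would then be passed through the trained VAE, and the change in its latent position and reconstruction error compared against the unmodified original.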

Latent Space Interpretation

Class Separation: The 10D latent space clearly separated Galaxies, Stars, and Quasars.
Spectral Tracks: By traversing specific paths in the latent space, the authors identified continuous "tracks" corresponding to physical changes:
- Galaxies: Tracks from "Blue" (young, star-forming) to "Red" (old, metal-rich) continua.
- AGNs: Tracks separating Broad-line (Seyfert 1) and Narrow-line (Seyfert 2) objects.
- Stars: Tracks separating M-type (cool) from K-type stars, and identifying White Dwarfs and hot B/A stars.
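
A track through the latent space can be explored by decoding a sequence of points between two latent positions. The sketch below assumes straight-line interpolation and a `decoder` callable that maps a 10-dimensional latent vector to a spectrum; both are assumptions for illustration, and the paths used in the paper may have been chosen differently.

```python
import numpy as np

def latent_track(z_start, z_end, n_steps=8):
    """Evenly spaced points on the straight line from z_start to z_end."""
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

def decode_track(decoder, z_start, z_end, n_steps=8):
    """Decode each point on the track; returns an (n_steps, n_pixels) array."""
    return np.stack([decoder(z) for z in latent_track(z_start, z_end, n_steps)])
```

Plotting the decoded rows in order shows how the reconstructed spectrum changes along the path, for example from a blue, star-forming continuum toward a red one.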

5. Significance and Future Work

Pipeline Improvement: The method effectively identifies systematic errors (redshift failures, calibration issues) that can be fed back to improve the DESI spectroscopic pipeline.
Discovery Potential: By curating the list of outliers, astronomers can efficiently search for "unknown unknowns" (novel astrophysical objects) without sifting through millions of normal spectra.
Scalability: The approach is designed to scale to the full DESI dataset (millions of spectra) and is applicable to other surveys (Euclid, Roman, PFS).
Future Directions: The authors plan to apply these methods to the full DESI dataset, explore modified loss functions, and compare VAEs with other dimensionality reduction techniques (e.g., UMAP, t-SNE) for visualization.

In conclusion, this work establishes a robust, unsupervised framework for anomaly detection in large-scale spectroscopic surveys, successfully balancing the detection of data quality issues with the discovery of rare astrophysical phenomena.