Imagine you are trying to teach a robot artist how to paint.
The Old Way: The "Filter-First" Paradigm
Traditionally, when researchers built these robots, they acted like extremely picky art critics. They would gather a massive pile of photos from the internet—millions of them. But before showing them to the robot, they would aggressively throw away anything that looked "bad."
- If a photo was blurry? Trash it.
- If it had a watermark (like a logo)? Trash it.
- If the colors were weird or the lighting was poor? Trash it.
They believed that only the "perfect" photos would teach the robot to be good. It's like trying to teach a chef to cook by only showing them Michelin-star meals and throwing away every burnt toast or slightly over-salted soup. The robot learns what "good" looks like, but it never learns what "bad" looks like. It has no idea how to avoid making mistakes because it's never seen a mistake.
The New Way: LACON (Labeling-and-Conditioning)
The authors of the LACON paper asked a simple question: "What if the 'bad' photos aren't actually trash? What if they are just... different?"
Instead of throwing away the blurry photos or the ones with watermarks, LACON says: "Let's keep them all, but let's label them."
Think of LACON as a smart librarian instead of a trash collector.
- The Library: They take the entire messy library of 110 million images (the "uncurated" data).
- The Labels: Instead of deleting a blurry photo, they put a tag on it that says, "This is a blurry photo." Instead of deleting a photo with a watermark, they tag it: "This has a watermark."
- The Lesson: They teach the robot artist: "Here is a beautiful, sharp, watermark-free photo. Here is a blurry one. Here is one with a logo. Now, I want you to learn the difference between all of them."
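To make the contrast concrete, here is a minimal sketch of "label, don't filter." The function names, tag set, and photo fields are illustrative assumptions, not the paper's actual pipeline:

```python
def detect_tags(photo):
    """Return descriptive quality tags instead of a keep/discard verdict."""
    tags = []
    if photo.get("sharpness", 1.0) < 0.5:
        tags.append("blurry")
    if photo.get("has_watermark"):
        tags.append("watermark")
    if photo.get("exposure_ok") is False:
        tags.append("poor_lighting")
    return tags or ["clean"]

# Filter-first: anything imperfect is thrown away.
def filter_first(photos):
    return [p for p in photos if detect_tags(p) == ["clean"]]

# LACON-style: keep everything, attach the tags as training labels.
def label_and_keep(photos):
    return [{"image": p, "quality_tags": detect_tags(p)} for p in photos]

photos = [
    {"id": 1, "sharpness": 0.9, "has_watermark": False, "exposure_ok": True},
    {"id": 2, "sharpness": 0.2, "has_watermark": False, "exposure_ok": True},
    {"id": 3, "sharpness": 0.8, "has_watermark": True, "exposure_ok": True},
]

print(len(filter_first(photos)))    # only 1 photo survives filtering
print(len(label_and_keep(photos)))  # all 3 are kept, each with tags
```

The key difference: the filter returns a smaller dataset, while the labeler returns the same-sized dataset enriched with extra information the model can learn from.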
The Superpower: The "Quality Dial"
Because the robot learned the entire spectrum of quality (from terrible to amazing), it gains a superpower the old robots don't have: Control.
Imagine the robot has a volume knob or a slider for quality.
- If you want a photo that looks like a high-end magazine cover, you slide the dial to "High Quality." The robot knows exactly what that looks like because it studied the good photos.
- If you want a photo that looks like a grainy, old security camera feed, you slide the dial to "Low Quality." The robot knows exactly how to make it look grainy and blurry because it studied those photos too!
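At generation time, the "dial" is just the same quality labels used in reverse: you hand the model the tags you want, and it conditions its output on them. This tiny sketch assumes a prompt-prefix format for the conditioning signal; the real interface in the paper may differ:

```python
# Hypothetical quality dial: prepend the desired quality tags to the
# prompt as a conditioning signal. Tag names and format are assumptions.
QUALITY_TAGS = {
    "high": "sharp, watermark-free",
    "low": "blurry, grainy",
}

def build_conditioned_prompt(prompt, quality):
    return f"[{QUALITY_TAGS[quality]}] {prompt}"

print(build_conditioned_prompt("a cat on a sofa", "high"))
# [sharp, watermark-free] a cat on a sofa
print(build_conditioned_prompt("a cat on a sofa", "low"))
# [blurry, grainy] a cat on a sofa
```

Because the model saw both kinds of photos during training, either setting of the dial points at a region of data it genuinely knows.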
The old robots (trained only on "good" data) get confused if you ask for a "bad" photo. They might try to make it look good anyway, or they might glitch out. But the LACON robot understands the full range of human visual experience.
Why is this a big deal?
- Efficiency: The old way wasted over 50% of the data. LACON uses 100% of it. It's like using every ingredient in the fridge instead of throwing half away.
- Better Results: Surprisingly, the robot trained on everything (with labels) actually makes better high-quality images than the robot trained only on the "perfect" subset. By understanding what "bad" looks like, it knows exactly how to avoid those mistakes when asked to make "good" art.
- Knowledge: The "bad" photos often contain rare things (like weird animals or obscure objects) that get filtered out in the "perfect" datasets. By keeping them, the robot learns more about the world.
In a Nutshell
LACON is like teaching a student not just by showing them the "A+ essays," but by showing them the "A+ essays," the "C- essays," and the "failed drafts" all at once, and explaining why they are different. The result is a student who is smarter, more versatile, and can produce exactly what you ask for, whether it's a masterpiece or a rough sketch.