MEMO: Human-like Crisp Edge Detection Using Masked Edge Prediction

Imagine tracing the outline of a cat in a photograph. A computer using standard methods often gets a bit nervous and draws a thick, fuzzy line around the cat, like a fuzzy caterpillar. It doesn't know exactly where the edge stops and the background begins, so it just shades a wide area to be safe.

But humans? We draw crisp, single-pixel lines. We know exactly where the cat's ear ends and the air begins.

This paper introduces a new AI model called MEMO (Masked Edge Prediction MOdel) that learns to draw these crisp, human-like lines without needing specialized loss functions or complex new architectures. Here is how it works, explained simply:

1. The Problem: The "Fuzzy Caterpillar"

Most AI edge detectors are trained to guess "Is this pixel an edge?" or "Is it background?" using a standard scoring system. The problem is that this system is too polite. It says, "Well, this pixel is probably an edge, and the one next to it is also probably an edge." So, it highlights a whole row of pixels, creating a thick, blurry line instead of a sharp one.

2. The Solution: The "Confidence Game"

The authors realized that when an edge detector is unsure, it errs in a predictable way: it is most confident in the middle of the thick line and less confident toward its sides.

Think of it like a group of people trying to guess the location of a hidden treasure.

Old AI: Everyone shouts "It's here!" and "It's there!" and "It's over there!" creating a big, messy crowd.
MEMO's Strategy: MEMO plays a game of "Hot and Cold." It looks at the crowd and says, "Okay, the person in the very center is shouting the loudest. Let's lock in their spot as the true location. Everyone else, be quiet and wait."

3. How MEMO Works (The Three Magic Tricks)

Trick A: The "Blindfolded Practice" (Masked Training)

To teach MEMO how to be decisive, the researchers didn't show it the whole picture at once. Instead, they masked (hid) parts of the edge map during training.

Analogy: Imagine a teacher giving a student a puzzle but covering some of the pieces, sometimes only a few and sometimes most of them. The student has to guess what the missing pieces look like based on the ones they can see.
The Result: MEMO learns to say, "I see a curve here, so I know exactly where the line must go, even if I can't see the whole thing." This forces it to be precise rather than guessing broadly.

Trick B: The "Local King" (Confidence-Ordered Inference)

When MEMO actually draws the line, it doesn't just pick the "best" pixels globally. It uses a rule called LocMax (Local Maximum).

Analogy: Imagine a neighborhood election. In a normal election, you might pick the person with the most votes in the whole city. But in MEMO's election, a candidate only wins if they have the most votes in their immediate block (a 3x3 neighborhood).
Why this helps: This prevents "clumping." If you have a thick fuzzy line, the pixels in the middle are all high-confidence. If you pick them all at once, you get a thick line. But if you only pick the "king" of each tiny block, you end up with a single, thin, perfect line running through the center of the crowd.

Trick C: The "Synthetic Gym" (Pre-training)

Real-world photos with human-drawn edges are rare and expensive to get. To get enough practice, MEMO first trained on a massive synthetic dataset created by a computer program.

Analogy: Before playing in the big leagues (real photos), MEMO went to a gym where it practiced on perfect, computer-generated shapes. It learned the concept of a sharp edge perfectly. Then, when it moved to real photos, it just needed a tiny bit of fine-tuning (like a warm-up) to adapt to the messy real world.

4. The Bonus Feature: "Zoomable" Edges

One of the coolest things about MEMO is that you can control how detailed the drawing is just by turning a knob (a parameter called $s$).

Low setting: It draws only the most important, big outlines (like a sketch).
High setting: It draws every tiny detail, like the texture of a leaf or a hair strand.
Why it's special: Other models need to be retrained to do this. MEMO just changes its "mindset" at the moment of drawing, no extra training needed.

The Bottom Line

MEMO suggests that you don't need to invent complicated new math to get crisp results. You just need to:

Practice on synthetic data to learn the rules.
Hide parts of the edge map during training to force the AI to think harder.
Draw slowly, locking in the most confident pixels first, and letting the neighbors settle down.

The result? An AI that draws edges as cleanly and precisely as a human artist, without any fuzzy caterpillars.

1. Problem Statement

Edge detection is a fundamental computer vision task, typically formulated as a binary classification problem optimized with cross-entropy loss. However, models trained with standard cross-entropy loss often produce thick, ambiguous edge regions rather than the crisp, single-pixel-wide contours provided by human annotators.

Existing solutions to achieve "crispness" usually involve:

Designing specialized loss functions (e.g., sparsity-promoting losses).
Modifying network architectures.
Using generative models (e.g., diffusion backbones).
Refined supervision labels.

Despite these efforts, many methods still fail to achieve high crispness scores on benchmarks like BSDS and Multicue, often falling below 50%. The authors argue that the root cause is not the model architecture or loss function, but rather the training and inference strategy. Specifically, they observe that thick predictions exhibit a confidence gradient (high confidence in the center, lower at boundaries), which standard inference fails to exploit.

2. Methodology: MEMO

The authors propose MEMO (Masked Edge Prediction MOdel), a framework that achieves human-like crisp edges using only standard cross-entropy loss. The crispness comes from a novel training and inference pipeline rather than from a specialized loss or architecture.

A. Core Architecture

MEMO consists of three components:

Image Encoder ($F_I$): A pre-trained vision backbone (DINOv2) that remains frozen.
Masked Edge Encoder ($F_E$): Encodes the current state of the partially revealed edge map.
Edge Decoder ($D$): Predicts edge probabilities for masked regions.

The prediction is formulated as $p(E \mid I, E_r) = \text{Sigmoid}(D(F_I(I), F_E(I, E_r, r), r))$, where $E_r$ is the masked edge input at ratio $r$.

B. Masked Edge Training

To enable the model to handle partially revealed maps during inference, MEMO is trained using a masked edge training scheme:

Process: Ground-truth edge pixels are randomly hidden (masked) at varying ratios ($r$) using Bernoulli masking.
Objective: The model learns to reconstruct the masked pixels given the image and the partially visible edge map.
Embedding: The masking ratio $r$ is embedded via sinusoidal positional encoding and injected into the encoder and decoder features to inform the model of the current visibility level.
Loss: Binary cross-entropy is applied only to the masked pixels.
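
The training steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the embedding size, the base frequency of 10000, and the use of -1 to mark hidden pixels are all assumptions made for the example, and a constant array stands in for the network output.

```python
import numpy as np

def sinusoidal_embedding(r, dim=8):
    """Embed a scalar masking ratio r as a sin/cos vector of length dim (dim even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # assumed frequency schedule
    angles = r * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def bernoulli_mask(shape, r, rng):
    """Hide each pixel independently with probability r (True = masked)."""
    return rng.random(shape) < r

def masked_bce(probs, target, mask, eps=1e-7):
    """Binary cross-entropy averaged over the masked pixels only."""
    p = np.clip(probs, eps, 1.0 - eps)
    per_pixel = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float((per_pixel * mask).sum() / max(mask.sum(), 1))

rng = np.random.default_rng(0)
edge = np.zeros((6, 6))
edge[:, 3] = 1.0                              # toy ground truth: one vertical edge
r = 0.5                                       # masking ratio for this training sample
mask = bernoulli_mask(edge.shape, r, rng)
visible = np.where(mask, -1.0, edge)          # hypothetical encoding: -1 marks hidden pixels
r_embedding = sinusoidal_embedding(r)         # would be injected into encoder/decoder features
probs = np.full(edge.shape, 0.5)              # stand-in for the model's predicted probabilities
loss = masked_bce(probs, edge, mask)
```

Because the loss is multiplied by the mask before averaging, visible pixels contribute nothing to the gradient, which matches the "masked pixels only" objective described above.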

C. Confidence-Ordered Inference (LocMax)

Instead of a single forward pass, MEMO uses a recursive, multi-step inference process:

Initialization: Start with a fully masked edge map.
Iteration: Predict probabilities for all masked pixels.
Selection (LocMax Strategy): Instead of selecting the top $K$ global confidence scores (which leads to thick clusters), MEMO selects pixels for finalization only if they are local maxima within their $3 \times 3$ neighborhood.
- If a pixel has the highest confidence in its local neighborhood, it is finalized (unmasked).
- Otherwise, it remains masked for the next iteration.
Termination: The process repeats until convergence or a fixed number of steps (early stopping).

This strategy ensures that edges are finalized from the "center" (highest confidence) outward, naturally suppressing redundant activations in the neighborhood and resulting in thin, crisp contours.
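
The loop described above can be sketched as follows. Several details are assumptions made for illustration rather than facts from the paper: confidence is taken to be the predicted edge probability, a finalized pixel is labeled an edge when its probability is at least 0.5, pixels still masked at the end are left as non-edge, and `make_toy_predictor` is a hypothetical stand-in for the trained network that simply lowers the probability of pixels next to an already finalized edge.

```python
import numpy as np

def window_max_3x3(a):
    """Maximum over each pixel's 3x3 neighborhood (borders padded with -inf)."""
    h, w = a.shape
    padded = np.pad(a, 1, constant_values=-np.inf)
    out = np.full((h, w), -np.inf)
    for dy in range(3):
        for dx in range(3):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out

def locmax_inference(predict, shape, steps=10, threshold=0.5):
    """Iteratively finalize masked pixels that are local confidence maxima."""
    masked = np.ones(shape, dtype=bool)   # start with a fully masked edge map
    edges = np.zeros(shape)               # finalized labels: 1 = edge, 0 = non-edge
    for _ in range(steps):
        if not masked.any():
            break                         # converged: nothing left to finalize
        probs = predict(edges, masked)
        conf = np.where(masked, probs, -np.inf)   # finalized pixels do not compete
        pick = masked & (conf >= window_max_3x3(conf))
        edges[pick] = (probs[pick] >= threshold).astype(float)
        masked &= ~pick                   # everything else waits for the next step
    return edges                          # pixels still masked stay non-edge

def make_toy_predictor(base):
    """Hypothetical network stand-in: damp probabilities next to finalized edges."""
    def predict(edges, masked):
        near_edge = window_max_3x3(edges) > 0
        return np.where(near_edge, 0.2 * base, base)
    return predict

# Toy "thick" prediction: a 3-pixel-wide band that is most confident in its center.
base = np.full((7, 7), 0.05)
base[:, 2:5] = [0.6, 0.9, 0.6]
result = locmax_inference(make_toy_predictor(base), base.shape)
print(result.sum(axis=1))                 # number of edge pixels in each row
```

On this toy input the center column is finalized first, its two neighbors are then resolved as non-edge, and every row ends up with exactly one edge pixel. Exact ties are all accepted in this sketch, which is why the whole center column is picked in a single step.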

D. Multi-Granularity Prediction

MEMO supports generating edges at different levels of detail (sparsity) without retraining:

Technique: Adaptation of Classifier-Free Guidance.
Mechanism: The model is trained both with image conditioning and without it (the image features are replaced by a zero tensor). During inference, the conditional and unconditional predictions are combined by extrapolating from one toward the other, with the amount controlled by a user-defined scale parameter $s$.
Result: Increasing $s$ shifts the output from sparse, high-level contours to dense, fine-grained details.
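
As a rough illustration of the mechanism, the standard classifier-free guidance combination is `uncond + s * (cond - uncond)`. The sketch below applies that formula to edge logits, which is an assumption about where MEMO applies the guidance; the function name and the toy logit values are invented for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def guided_probs(logits_cond, logits_uncond, s):
    """Classifier-free-guidance style mix of conditional and unconditional logits.

    s = 0 gives the unconditional prediction, s = 1 the conditional one,
    and s > 1 extrapolates past the conditional prediction.
    """
    return sigmoid(logits_uncond + s * (logits_cond - logits_uncond))

# Toy logits for three pixels: strong contour, fine detail, background (made-up values).
logits_cond = np.array([2.0, 0.2, -3.0])      # with image conditioning
logits_uncond = np.array([-1.0, -1.0, -1.0])  # with the image replaced by zeros

for s in (0.5, 1.0, 2.0):
    probs = guided_probs(logits_cond, logits_uncond, s)
    print(s, int((probs >= 0.5).sum()))       # number of pixels kept as edges
```

With these toy numbers only the strong contour survives at `s = 0.5`, while the fine-detail pixel is also kept at `s = 1.0` and above, which mirrors the sparse-to-dense behavior described for increasing $s$.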

E. Pre-training Strategy

To prevent overfitting on small real-world datasets and improve generalization:

Synthetic Dataset: A large-scale dataset (400k image-edge pairs) is generated using the Segment Anything Model (SAM) to extract object boundaries.
Two-Stage Training:
1. Pre-training: Train on the synthetic dataset.
2. Fine-tuning: Use lightweight LoRA adapters (1.2% additional parameters) to fine-tune on specific real-world datasets (BSDS, BIPED, Multicue) while keeping the main backbone frozen.
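
To make the fine-tuning stage concrete, here is a minimal NumPy sketch of a generic LoRA-adapted linear layer: the frozen weight is left untouched and a low-rank product is added on top. The class name, layer size, rank, and scaling value are illustrative assumptions; the source only states that the adapters add about 1.2% parameters to the whole model, and this toy layer does not reproduce that figure.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA form)."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                              # frozen pre-trained weight
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))    # trainable, small random init
        self.B = np.zeros((out_dim, rank))                # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the scaled low-rank correction B @ A.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def extra_param_fraction(self):
        """Adapter parameters relative to the frozen weight's parameters."""
        return (self.A.size + self.B.size) / self.weight.size

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))                           # stand-in for a pre-trained weight
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)
x = rng.normal(size=(2, 256))
print(layer.extra_param_fraction())                       # 0.03125 for this toy layer
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained output exactly, so fine-tuning starts from the synthetic pre-training result and only the small `A` and `B` matrices need gradients.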

3. Key Contributions

MEMO Framework: A novel edge detection model that produces crisp, human-like edges using only cross-entropy loss, eliminating the need for specialized loss functions or complex architectures.
Masked Edge Training & Inference: Introduction of a training scheme that teaches the model to recover edges from partial inputs, coupled with a LocMax inference strategy that iteratively finalizes edges based on local confidence, ensuring single-pixel width.
Synthetic Pre-training: Construction of a large-scale synthetic edge dataset that significantly boosts generalization and prevents edge duplication artifacts.
Inference-Time Multi-Granularity: The ability to control edge density and detail purely at inference time via classifier-free guidance, without requiring paired annotations for different granularities.

4. Experimental Results

The authors evaluated MEMO on BSDS, BIPED, and Multicue datasets.

Crispness-Aware Evaluation (CEval): MEMO achieves State-of-the-Art (SOTA) results.
- On BSDS, MEMO achieves an ODS of 0.836 (vs. 0.749 for DiffEdge) and an Average Crispness (AC) of 0.663 (vs. 0.476 for DiffEdge).
- On BIPED, the quoted figures are 0.841 for MEMO and 0.849 for DiffEdge, so MEMO is roughly on par with DiffEdge on that metric rather than ahead of it; its reported advantage lies in crispness alignment.
Visual Similarity: MEMO achieves the lowest Fréchet Inception Distance (FID) and LPIPS scores, indicating its outputs are perceptually closest to human annotations.
Ablation Studies:
- LocMax vs. TopK/Random: LocMax is crucial; TopK produces thick edges, while Random produces fragmented edges.
- Pre-training: Pre-training on synthetic data prevents "edge duplication" (multi-line artifacts) seen in models trained only on real data.
- Inference Steps: 10 steps are sufficient for visually crisp results, balancing speed and quality.
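
The LocMax versus TopK contrast can be seen on a toy confidence map with a single selection step. This is only an illustration of the two selection rules, not a reproduction of the ablation: the helper names, the `floor` cutoff, and the confidence values are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def select_topk(conf, k):
    """Select the k highest-confidence pixels anywhere in the map."""
    flat_idx = np.argsort(conf, axis=None)[-k:]
    picked = np.zeros(conf.shape, dtype=bool)
    picked[np.unravel_index(flat_idx, conf.shape)] = True
    return picked

def select_locmax(conf, floor=0.5):
    """Select pixels that are the maximum of their 3x3 window and at least `floor`."""
    padded = np.pad(conf, 1, constant_values=-np.inf)
    window_max = sliding_window_view(padded, (3, 3)).max(axis=(2, 3))
    return (conf >= window_max) & (conf >= floor)

# A 3-pixel-wide band that is most confident in its center column.
conf = np.full((5, 7), 0.05)
conf[:, 2:5] = [0.6, 0.9, 0.6]

topk = select_topk(conf, k=15)
locmax = select_locmax(conf)
print(topk.sum(axis=1))     # [3 3 3 3 3]: the whole band is selected
print(locmax.sum(axis=1))   # [1 1 1 1 1]: only the center column is selected
```

Global TopK takes every pixel in the band because they all outrank the background, whereas the local rule keeps only the pixel that dominates its own neighborhood.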

5. Significance

Paradigm Shift: The paper demonstrates that training and inference strategies are more critical than architectural complexity or specialized losses for achieving human-like edge crispness.
Efficiency: MEMO avoids heavy post-processing (like Non-Maximum Suppression) and complex generative backbones, relying on a lightweight, iterative refinement process.
Flexibility: The multi-granularity capability allows a single model to serve diverse applications (from coarse object boundaries to fine texture details) simply by adjusting a scalar parameter at inference time.
Generalization: The use of synthetic pre-training with LoRA fine-tuning offers a scalable path for adapting edge detectors to new domains with minimal computational cost.

In conclusion, MEMO sets a new benchmark for edge detection by aligning machine predictions with human visual perception, achieving superior crispness and structural coherence without compromising standard detection accuracy.