Imagine you are trying to digitize a whiteboard from a photo. You want a computer to draw a perfect outline around every marker stroke so you can copy it into a digital note-taking app. Sounds easy, right?
The problem is that whiteboard markers are tiny compared to the giant white space of the board. In fact, the ink only takes up about 2% of the picture. The rest is just empty white background.
This paper is like a detective story about how to teach a computer to find those tiny, fragile lines without getting distracted by the massive empty space. Here is the breakdown using simple analogies.
1. The Problem: The "Needle in a Haystack"
Imagine you are looking for a few dark pebbles scattered across a white sand beach. If you ask a computer, "Is this pixel pebble or sand?" and it just guesses "sand" for every single pixel, it would be right 98% of the time!
- The Trap: Standard computer training methods (called "Cross-Entropy") are like that lazy computer. They get a high score for being right about the background, but they completely miss the tiny strokes. It's like a student who gets an 'A' for knowing the alphabet but fails to read the actual words.
- The Thin Strokes: Some marker lines are so thin they are barely visible. Standard methods often erase these completely because they are too small to "feel" during training.
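To see the trap in action, here is a tiny sketch (not from the paper, just a toy illustration) of the "lazy" predictor on a synthetic mask where ink covers 2% of the pixels:

```python
import numpy as np

# Hypothetical 100x100 whiteboard mask: 1 = ink, 0 = background.
# Ink covers ~2% of the pixels, mirroring the imbalance described above.
rng = np.random.default_rng(0)
truth = np.zeros((100, 100), dtype=int)
truth.flat[rng.choice(truth.size, size=200, replace=False)] = 1

# The "lazy" predictor: label every single pixel as background.
pred = np.zeros_like(truth)

accuracy = (pred == truth).mean()           # fraction of pixels labeled correctly
ink_recall = (pred[truth == 1] == 1).mean() # fraction of ink pixels found

print(f"accuracy:   {accuracy:.2f}")   # 0.98 -- looks like an 'A'
print(f"ink recall: {ink_recall:.2f}") # 0.00 -- every stroke missed
```

A 98% score while finding zero ink: that is exactly the failure mode the paper is fighting.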
2. The Solution: Changing the Rules of the Game
The researchers tried five different ways to "grade" the computer's performance (called Loss Functions). Think of these as different teachers with different grading styles:
- The Old Teacher (Cross-Entropy): Counts every correct pixel equally. Since there are so many background pixels, the teacher ignores the ink.
- The New Teachers (Dice, Tversky, Focal): These teachers care more about the ink than the background. They say, "I don't care if you got the background right; if you missed the ink, you fail."
- The Result: Switching to these "New Teachers" improved the computer's ability to find the ink by 20 points. It went from barely finding anything to actually seeing the lines.
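For the curious, here is a minimal NumPy sketch of how the "new teachers" grade. The formulas are the standard Dice and Tversky losses; the alpha/beta weights shown are illustrative defaults, not necessarily the paper's exact settings (Focal loss, the third option, works differently: it is a pixel-wise loss that down-weights easy background pixels):

```python
import numpy as np

def dice_loss(pred, truth, eps=1e-6):
    """1 - Dice coefficient: scores overlap with the ink, so getting the
    background right earns nothing on its own."""
    inter = (pred * truth).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + truth.sum() + eps)

def tversky_loss(pred, truth, alpha=0.3, beta=0.7, eps=1e-6):
    """Generalized Dice: beta > alpha penalizes missed ink (false negatives)
    more heavily than stray ink (false positives)."""
    tp = (pred * truth).sum()
    fp = (pred * (1.0 - truth)).sum()
    fn = ((1.0 - pred) * truth).sum()
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)

# The "lazy" all-background prediction against a mask with 4 ink pixels:
truth = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)
lazy = np.zeros_like(truth)

print(round(dice_loss(lazy, truth), 3))     # ~1.0: maximal penalty
print(round(tversky_loss(lazy, truth), 3))  # ~1.0: maximal penalty
print(round(dice_loss(truth, truth), 3))    # 0.0: perfect overlap
```

Notice that ignoring the ink now earns the worst possible grade, no matter how much background was labeled correctly.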
3. The New Scorecard: Looking at the Edges
The paper argues that just measuring "how much ink did you find?" (F1 score) isn't enough. You also need to know "how smooth is the line?"
- The Analogy: Imagine two students draw a circle.
- Student A draws a circle that is slightly too big but perfectly round.
- Student B draws a circle that is the right size but looks like a jagged, scribbled mess.
- Standard metrics might say they are equal because they both "covered the area."
- The Paper's New Metric (Boundary Metrics): This is like a judge with a magnifying glass looking only at the edge of the circle. It reveals that Student B's line is messy and inaccurate. The paper shows that the "New Teachers" (Dice/Tversky) not only found more ink but drew much smoother, cleaner lines.
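Here is one simple way such a boundary metric can be implemented (a sketch of the general idea; the paper's exact definition may differ). The key contrast: a mask with a slightly ragged edge keeps a high area-overlap F1 but a visibly lower boundary F1:

```python
import numpy as np

def boundary(mask):
    """Coordinates of mask pixels with at least one background 4-neighbor."""
    padded = np.pad(mask, 1)
    core = padded[1:-1, 1:-1]
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(core & ~interior)

def pixel_f1(pred, truth):
    """Classic area-overlap F1 over all pixels."""
    tp = (pred & truth).sum()
    fp = (pred & ~truth).sum()
    fn = (~pred & truth).sum()
    return 2 * tp / (2 * tp + fp + fn + 1e-9)

def boundary_f1(pred, truth, tol=0.0):
    """F1 over boundary pixels only; a point counts as a hit if it lies
    within `tol` pixels of the other mask's boundary."""
    bp, bt = boundary(pred), boundary(truth)
    if len(bp) == 0 or len(bt) == 0:
        return 0.0
    d = np.linalg.norm(bp[:, None, :] - bt[None, :, :], axis=-1)
    precision = (d.min(axis=1) <= tol).mean()
    recall = (d.min(axis=0) <= tol).mean()
    return 2 * precision * recall / (precision + recall + 1e-9)

# A clean 6x6 square vs. a version of it with a ragged edge.
truth = np.zeros((12, 12), dtype=bool)
truth[3:9, 3:9] = True
jagged = truth.copy()
jagged[3, 5] = jagged[8, 6] = False  # notches in the outline
jagged[5, 2] = jagged[6, 9] = True   # bumps outside the outline

print(round(pixel_f1(jagged, truth), 2))     # ~0.94: area overlap barely notices
print(round(boundary_f1(jagged, truth), 2))  # ~0.80: the edge metric exposes the mess
```

Only four pixels differ, so the area score stays high, but a fifth of the outline is in the wrong place, and the boundary metric is the one that says so.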
4. The "Fairness" Test: Thick vs. Thin
The researchers split the test images into two groups:
- The "Core" Group: Thick, easy-to-see marker lines.
- The "Thin" Group: Faint, hair-thin lines.
They found that the old methods were unfair. They did okay on thick lines but completely failed on thin ones. The new methods (specifically Tversky Loss) were like a fair referee: they treated the thick and thin lines equally, ensuring the computer didn't ignore the delicate details.
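Stratified evaluation of this kind is easy to sketch. In this hypothetical example, a model that "erodes" every stroke by one pixel looks mediocre on the pooled score, while the per-group scores reveal that the thin stroke was wiped out entirely:

```python
import numpy as np

def f1(pred, truth):
    """Pixel-level F1 on boolean masks."""
    tp = (pred & truth).sum()
    fp = (pred & ~truth).sum()
    fn = (~pred & truth).sum()
    return 2 * tp / (2 * tp + fp + fn + 1e-9)

# Hypothetical ground truth: one thick (3 px) and one thin (1 px) stroke.
truth = np.zeros((10, 20), dtype=bool)
truth[2:5, :] = True   # thick stroke
truth[8, :] = True     # thin stroke

# A model that erodes every stroke by one pixel on each side:
pred = np.zeros_like(truth)
pred[3, :] = True      # only the core of the thick stroke survives;
                       # the 1-px thin stroke is erased entirely

print(round(f1(pred, truth), 2))          # pooled score: 0.40, looks "mediocre"
print(round(f1(pred[:6], truth[:6]), 2))  # thick stratum: 0.50
print(round(f1(pred[6:], truth[6:]), 2))  # thin stratum:  0.00 -- total failure
```

The pooled number hides the catastrophe; splitting the test set is what makes the unfairness visible.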
5. The "Consistency vs. Accuracy" Trade-off
The researchers compared their smart computer model against old-school, non-AI math tricks (like Sauvola Thresholding).
- The Old-School Math: On average, these math tricks actually got a higher score than the AI. They were great on easy, well-lit photos.
- The Catch: The math tricks were unreliable. If the photo had a shadow or uneven lighting, they would break down and produce garbage results.
- The AI Model: It had a slightly lower average score, but it was rock solid. It never failed catastrophically. Even in the worst lighting, it still found the lines.
- The Lesson: If you are archiving clean, well-lit photos, use the old math. If you are building a real-time app that can't afford to fall apart on a bad photo, use the AI.
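To make the baseline concrete, here is a small self-contained sketch of Sauvola thresholding: each pixel gets its own threshold T = m * (1 + k * (s / R - 1)) from the local mean m and standard deviation s (k = 0.2 and R = 128 are the commonly used constants; the window size and synthetic board below are illustrative, and in practice you would reach for a library routine such as scikit-image's `threshold_sauvola`):

```python
import numpy as np

def sauvola_mask(img, w=5, k=0.2, R=128.0):
    """Sauvola threshold T = m * (1 + k * (s / R - 1)), computed per pixel
    from the local mean m and std s over a (2w+1)x(2w+1) window.
    Returns True where the pixel is darker than T (i.e., ink)."""
    win = 2 * w + 1
    def window_mean(a):
        # Local means via an integral image (summed-area table).
        pad = np.pad(a, w, mode="edge")
        ii = np.zeros((pad.shape[0] + 1, pad.shape[1] + 1))
        ii[1:, 1:] = pad.cumsum(0).cumsum(1)
        return (ii[win:, win:] - ii[:-win, win:]
                - ii[win:, :-win] + ii[:-win, :-win]) / win ** 2
    img = img.astype(float)
    m = window_mean(img)
    s = np.sqrt(np.maximum(window_mean(img ** 2) - m ** 2, 0.0))
    return img < m * (1.0 + k * (s / R - 1.0))

# Synthetic board: bright background with a lighting gradient plus one stroke.
board = np.zeros((40, 60)) + (220.0 - 0.8 * np.arange(60.0))
board[20, 10:50] = 60.0  # a dark marker stroke

mask = sauvola_mask(board)
print(mask[20, 10:50].all())  # the stroke is recovered despite the gradient
```

Because the threshold adapts to the local neighborhood, a smooth gradient is handled gracefully; the failures the paper observed come from harsher conditions (hard shadows, glare) where the local statistics themselves are misleading.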
6. The Resolution Bottleneck
Finally, they found that the computer was being asked to see too much detail at once.
- The Analogy: Trying to read a tiny font on a billboard from 100 feet away.
- The Fix: When they zoomed in (increased the image resolution), the computer's performance jumped significantly. The thin lines became thick enough to be seen clearly.
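A toy demonstration of why resolution matters so much for thin strokes (real image resizing interpolates rather than dropping pixels, so a thin stroke fades to faint gray instead of vanishing outright, but the effect is similar: sub-pixel strokes get lost):

```python
import numpy as np

hi = np.zeros((16, 16), dtype=int)
hi[4:7, :] = 1   # thick stroke: 3 pixels wide
hi[11, :] = 1    # thin stroke: 1 pixel wide

# Naive 2x downsampling by dropping every other row and column,
# mimicking what happens when a photo is shrunk to fit a network's input.
lo = hi[::2, ::2]

print(int(lo[2].sum() + lo[3].sum()))  # thick stroke: 16 surviving pixels
print(int(lo[5].sum() + lo[6].sum()))  # thin stroke: 0 -- it vanished
```

At double the input resolution, the same physical stroke spans two or more pixels and can no longer fall between the samples, which is exactly the jump in performance the researchers observed.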
Summary: What Did We Learn?
- Don't use the old grading system: Standard methods ignore tiny details. Use "Overlap-based" methods (like Dice or Tversky) to force the computer to care about the ink.
- Check the edges: Don't just measure how much ink you found; measure how clean the lines are.
- Consistency wins: An AI that is "good enough" all the time is better than a math trick that is "perfect" sometimes and "broken" other times.
- Zoom in: Higher resolution images make a huge difference for finding thin lines.
The paper provides a new "rulebook" for testing these systems, ensuring that future whiteboard apps won't just work on perfect photos, but will actually work for real people in real classrooms.