Partial Weakly-Supervised Oriented Object Detection

Imagine you are trying to teach a robot to spot airplanes, ships, and cars in a massive collection of satellite photos. The problem? The robot needs to know not just where these objects are, but exactly which way they are facing (their orientation) and how big they are (their scale).

In the world of AI, teaching a robot usually requires a human teacher to draw a perfect, rotated box around every single object in every single photo. This is like hiring a team of artists to meticulously outline every bird in a flock. It's incredibly accurate, but it's also slow, expensive, and exhausting.

This paper introduces a new, smarter way to teach the robot called PWOOD (Partial Weakly-Supervised Oriented Object Detection). Here is how it works, explained through simple analogies:

1. The Problem: The "Perfect Box" is Too Expensive

Traditionally, to train the robot, humans had to draw Rotated Boxes (OBBs, called RBoxes later in this summary): boxes that tilt to match the object.

The Cost: Imagine paying a worker $86 to label 1,000 images with these perfect tilted boxes.
The Alternative: Some researchers tried using Horizontal Boxes (an upright rectangle around the object, ignoring the tilt) or even just Single Points (a dot in the center). This is cheaper ($17 or even free!), but the robot often gets confused about the angle or size because the teacher didn't give enough detail.

2. The Solution: The "Apprentice and the Master" (Teacher-Student)

The authors created a system where a Master Teacher helps train an Apprentice Student.

The Setup: They give the Apprentice a tiny amount of cheap, imperfect data (like horizontal boxes or dots) to start with.
The Magic: Once the Apprentice has learned the basics from this small, cheap dataset, its knowledge is copied into a second model, the "Master." The Master then looks at thousands of unlabeled photos (where no one drew anything) and guesses where the objects are. These guesses are called Pseudo-Labels.
The Loop: The Apprentice then studies these guesses to get even better, and the Master is gradually updated with what the Apprentice has learned. It's a self-improving cycle.

3. The Secret Sauce: Three Smart Tricks

To make this work without the expensive "perfect boxes," the authors added three special tools:

A. The "Mirror and Spin" Trick (Orientation Learning)

Since the cheap data (horizontal boxes) doesn't tell the robot the angle, the robot needs to figure it out itself.

Analogy: Imagine you are learning to recognize a car. If you see a picture of a car, and then you flip the picture upside down or rotate it, you still know it's a car. The robot does the same thing. It takes an image, flips or rotates it, and forces itself to predict that the car's angle changes in the exact same way. By playing this "mirror game," the robot learns to guess the correct angle even without being told.
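
To make the "mirror game" concrete, here is a minimal sketch of the consistency check it relies on. It assumes angles are in radians, that a box looks the same after a half turn (so angles repeat every pi), that a vertical flip negates the angle, and that rotating the image by some amount shifts the angle by the same amount. The function names are illustrative and are not taken from the paper.

```python
import math


def wrap_angle(theta):
    """Wrap an angle into [-pi/2, pi/2), because a box repeats every half turn."""
    return (theta + math.pi / 2) % math.pi - math.pi / 2


def flip_consistency_error(theta_orig, theta_flipped):
    """After a vertical flip the predicted angle should be the negative of the original."""
    return abs(wrap_angle(theta_orig + theta_flipped))


def rotate_consistency_error(theta_orig, theta_rotated, rot):
    """After rotating the image by `rot` the predicted angle should shift by `rot`."""
    return abs(wrap_angle(theta_rotated - theta_orig - rot))


# A prediction pair that obeys the flip rule gives zero error;
# a pair that ignores the flip is penalised.
good = flip_consistency_error(0.3, -0.3)
bad = flip_consistency_error(0.3, 0.3)
```

A training loss built on this idea would push both errors toward zero, which is how the robot can learn an angle that no annotation ever stated.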

B. The "Size Guessing" Trick (Scale Learning)

Sometimes the data is just a single dot. The robot has no idea if the object is a tiny toy car or a giant truck.

Analogy: Imagine you are in a dark room and all you can feel is a single point on a table. To guess the size of the object, you check how close the neighbouring objects are. The robot draws a mathematical "fence" (a Voronoi diagram) around each dot to see how much space it has. If the neighbours are packed in tightly, the object cannot be very big; if there is plenty of room around the dot, the object may be larger. This helps the robot learn size without a box.
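
The "fence" idea can be sketched in a few lines. The example below builds a discrete Voronoi diagram by assigning every pixel of a small grid to its nearest annotated dot, then measures how wide and tall each dot's cell is. It illustrates only the spatial-layout idea and is not the paper's Voronoi Watershed Loss; the grid size, the point coordinates, and the function names are made up for the example.

```python
def voronoi_cells(points, width, height):
    """Assign each pixel to its nearest annotated point (a discrete Voronoi diagram)."""
    cells = {i: [] for i in range(len(points))}
    for y in range(height):
        for x in range(width):
            nearest = min(
                range(len(points)),
                key=lambda i: (x - points[i][0]) ** 2 + (y - points[i][1]) ** 2,
            )
            cells[nearest].append((x, y))
    return cells


def cell_extent(pixels):
    """Width and height of the axis-aligned extent of one Voronoi cell."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return max(xs) - min(xs) + 1, max(ys) - min(ys) + 1


# Two dots close together and one dot on its own (hypothetical coordinates).
points = [(5, 10), (9, 10), (30, 10)]
cells = voronoi_cells(points, width=40, height=20)
extents = [cell_extent(cells[i]) for i in range(len(points))]
```

In this toy layout the two neighbouring dots end up with narrower cells than the isolated one, so the cell extents give a rough measure of how much space surrounds each dot.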

C. The "Smart Filter" (Class-Agnostic Pseudo-Label Filtering)

This is the most crucial part. When the Master Teacher guesses labels for unlabeled photos, it sometimes makes mistakes. If the robot learns from bad guesses, it gets confused.

The Old Way: Previous methods used a static rule, like "Only trust guesses that are 80% sure." This is like a strict bouncer who lets everyone in who looks 80% like a VIP, regardless of the party's mood. Sometimes the bouncer is too strict; sometimes too loose.
The New Way (CPF): The authors built a dynamic bouncer that uses a "Gaussian Mixture Model" (a fancy way of saying it looks at the shape of the distribution of confidence scores, which tends to split into a low-confidence group and a high-confidence group).
- Analogy: Instead of a fixed rule, this bouncer looks at the crowd. If the party is quiet and everyone is unsure, it lowers the bar. If the party is loud and confident, it raises the bar. It constantly adjusts the "trust level" based on how well the teacher is doing right now. This prevents the robot from learning from bad guesses.

4. The Result: High Quality, Low Cost

The team tested this on massive datasets of satellite images (DOTA and DIOR).

The Outcome: Their system, using only cheap, partial data (like horizontal boxes or dots) plus unlabeled photos, performed just as well as, or even better than, systems trained with the same share of expensive, perfect rotated boxes.
The Savings: They achieved the same level of intelligence for a fraction of the price. It's like getting a Ferrari engine but only paying for a bicycle frame.

Summary

PWOOD is a clever way to teach AI to spot angled objects. Instead of paying humans to draw perfect, tilted boxes for every image, it uses a "Master-Apprentice" system that learns from a few cheap hints and a large pool of unlabeled photos. It uses mirror tricks to learn angles, space-fencing to learn sizes, and a smart, adjusting filter to ignore bad guesses. The result is a detector that saves time and money with little or no loss in accuracy.

1. Problem Statement

Oriented Object Detection (OOD) is critical for applications involving aerial and remote sensing imagery, where objects often appear at arbitrary angles. However, the dominant Fully Supervised approach requires expensive, labor-intensive Rotated Bounding Box (RBox) annotations.

Current Limitations:
- Semi-Supervised OOD (SOOD): Reduces labeled data requirements but still relies on a subset of expensive RBox annotations.
- Weakly Supervised OOD (WOOD): Uses cheaper annotations (Horizontal Boxes or Points) but often suffers from performance degradation because models struggle to learn precise orientation and scale information without explicit RBox supervision.
- The Gap: Existing methods either incur high annotation costs (RBox) or sacrifice significant accuracy when using weak annotations. There is a lack of a framework that effectively combines partial weak annotations (e.g., only 20% of data has weak labels) with unlabeled data to achieve high performance at low cost.

2. Methodology: The PWOOD Framework

The authors propose PWOOD, a framework based on a Teacher-Student paradigm that utilizes a small portion of weakly annotated data (Horizontal Boxes or Single Points) and a large amount of unlabeled data.

A. Core Architecture

Teacher-Student Paradigm: A teacher model generates pseudo-labels for unlabeled data, which are used to train the student model. The student updates the teacher via Exponential Moving Average (EMA).
OS-Student (Orientation-and-Scale-aware Student): The student model is specifically designed to overcome the lack of explicit orientation/scale data in weak annotations through two specialized modules:
1. Symmetry-aware Orientation Learning:
  - Uses data augmentation (vertical flipping and random rotation) to create transformed views.
  - Enforces a deterministic mapping between predictions on original and transformed images.
  - Introduces an Angle Loss ($L_{Ang}$) to ensure the model learns consistent orientation information even when trained on horizontal boxes.
2. Self-supervised Scale Learning:
  - Designed to handle even weaker annotations (e.g., single points) that lack scale.
  - Upper Bound: Models bounding boxes as Gaussian distributions and minimizes the Bhattacharyya coefficient (Gaussian overlap) between predicted boxes to prevent overlap and infer scale limits.
  - Lower Bound: Uses Voronoi diagrams and Watershed algorithms to segment foreground (points) from background, calculating a Voronoi Watershed Loss to estimate object width and height.
  - Combines these with standard classification, centerness, and box regression losses.
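
The Gaussian overlap term in the upper bound can be illustrated with a short sketch. It assumes a rotated box (cx, cy, w, h, theta) is converted to a 2-D Gaussian with mean (cx, cy) and covariance R diag(w²/4, h²/4) Rᵀ, a common convention that the paper may or may not follow exactly, and it uses the standard closed-form Bhattacharyya coefficient between two Gaussians. The function names are illustrative.

```python
import math


def box_to_gaussian(cx, cy, w, h, theta):
    """Return the mean and the covariance entries (a, b, c) of [[a, b], [b, c]]."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    sw, sh = (w / 2) ** 2, (h / 2) ** 2  # assumed variance along each box axis
    a = cos_t ** 2 * sw + sin_t ** 2 * sh
    b = cos_t * sin_t * (sw - sh)
    c = sin_t ** 2 * sw + cos_t ** 2 * sh
    return (cx, cy), (a, b, c)


def bhattacharyya_coefficient(box1, box2):
    """Overlap of two box Gaussians: 1.0 for identical boxes, near 0 when far apart."""
    m1, (a1, b1, c1) = box_to_gaussian(*box1)
    m2, (a2, b2, c2) = box_to_gaussian(*box2)
    # Averaged covariance and the three determinants used by the closed form
    a, b, c = (a1 + a2) / 2, (b1 + b2) / 2, (c1 + c2) / 2
    det = a * c - b * b
    det1 = a1 * c1 - b1 * b1
    det2 = a2 * c2 - b2 * b2
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # Mahalanobis term d^T S^-1 d, using the 2x2 inverse written out by hand
    maha = (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    distance = maha / 8 + 0.5 * math.log(det / math.sqrt(det1 * det2))
    return math.exp(-distance)


near = bhattacharyya_coefficient((0, 0, 4, 2, 0.0), (2, 0, 4, 2, 0.0))
far = bhattacharyya_coefficient((0, 0, 4, 2, 0.0), (10, 0, 4, 2, 0.0))
```

Minimizing this coefficient between neighbouring predictions penalizes boxes that grow into each other, which is what makes it usable as an upper limit on scale.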

B. Class-Agnostic Pseudo-Label Filtering (CPF)

A major bottleneck in semi-supervised learning is the reliance on static thresholds for filtering pseudo-labels, which causes instability across different training stages and datasets.

Solution: The authors propose CPF, which models the distribution of the teacher's confidence scores using a Gaussian Mixture Model (GMM) consisting of positive and negative sample distributions.
Mechanism: It fits the GMM with the Expectation-Maximization (EM) algorithm and derives a dynamic threshold ($T_d$) from the fitted mixture, so that detections scoring above it are more likely to be true positives than false ones. This makes the filtering adaptive to the data distribution and training progress, reducing sensitivity to manual threshold tuning.
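
A minimal sketch of this kind of filtering is shown below. It fits a two-component one-dimensional GMM to confidence scores with EM and takes as threshold the first score, scanning from the lower mean to the higher mean, at which the higher-mean component becomes the more likely one. That threshold rule, the initial values, and the synthetic scores are assumptions made for illustration; the paper's exact rule for $T_d$ may differ.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(scores, iters=50):
    """Fit a two-component 1-D GMM to confidence scores with EM."""
    n = len(scores)
    weights = [0.5, 0.5]
    means = [min(scores), max(scores)]  # start one component low, one high
    variances = [0.05, 0.05]
    for _ in range(iters):
        # E-step: probability that each score belongs to component 1
        resp = []
        for s in scores:
            p0 = weights[0] * gaussian_pdf(s, means[0], variances[0])
            p1 = weights[1] * gaussian_pdf(s, means[1], variances[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: re-estimate weights, means, and variances
        n1 = max(sum(resp), 1e-9)
        n0 = max(n - sum(resp), 1e-9)
        means = [
            sum((1 - r) * s for r, s in zip(resp, scores)) / n0,
            sum(r * s for r, s in zip(resp, scores)) / n1,
        ]
        variances = [
            max(sum((1 - r) * (s - means[0]) ** 2 for r, s in zip(resp, scores)) / n0, 1e-4),
            max(sum(r * (s - means[1]) ** 2 for r, s in zip(resp, scores)) / n1, 1e-4),
        ]
        weights = [n0 / n, n1 / n]
    return weights, means, variances


def dynamic_threshold(scores, steps=1000):
    """Score at which the higher-mean (positive) component starts to dominate."""
    weights, means, variances = fit_two_gaussians(scores)
    neg, pos = sorted(range(2), key=lambda k: means[k])
    for i in range(steps + 1):
        t = means[neg] + (means[pos] - means[neg]) * i / steps
        p_neg = weights[neg] * gaussian_pdf(t, means[neg], variances[neg])
        p_pos = weights[pos] * gaussian_pdf(t, means[pos], variances[pos])
        if p_pos >= p_neg:
            return t
    return means[pos]


rng = random.Random(0)


def sample_scores(neg_mean, pos_mean, n_neg=300, n_pos=100):
    """Synthetic teacher scores: a low-confidence group and a high-confidence group."""
    raw = [rng.gauss(neg_mean, 0.05) for _ in range(n_neg)]
    raw += [rng.gauss(pos_mean, 0.05) for _ in range(n_pos)]
    return [min(max(s, 0.0), 1.0) for s in raw]


t_a = dynamic_threshold(sample_scores(0.2, 0.8))
t_b = dynamic_threshold(sample_scores(0.4, 0.9))  # scores sit higher overall
```

Because the threshold comes from the fitted mixture rather than a fixed constant, it moves with the score distribution: the second batch, whose scores sit higher overall, yields a higher threshold than the first.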

C. Training Strategy

Pre-training: Train the OS-Student on the small subset of weakly annotated data.
Burn-in: Copy the pre-trained Student's weights to the Teacher.
Main Training: Feed unlabeled data (with weak augmentations) to the Teacher to generate pseudo-labels. Apply CPF to filter these labels. Train the Student on both the weakly labeled data and the filtered pseudo-labels.
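
The weight hand-off and the EMA update mentioned under Core Architecture can be sketched with plain dictionaries standing in for model weights. The momentum value, the dictionary layout, and the detection format are assumptions for illustration; a real implementation would operate on the detector's parameter tensors.

```python
def ema_update(teacher, student, momentum=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


def filter_pseudo_labels(detections, threshold):
    """Keep teacher detections whose confidence reaches the threshold, whatever their class."""
    return [d for d in detections if d["score"] >= threshold]


# Burn-in: the teacher starts as a copy of the pre-trained student.
student = {"w": 1.0}
teacher = dict(student)

# Main training, one step: the teacher predicts, the predictions are filtered,
# the student trains on them (stood in for here by a changed weight),
# and the teacher then follows the student via EMA.
detections = [{"score": 0.9, "cls": "plane"}, {"score": 0.3, "cls": "ship"}]
kept = filter_pseudo_labels(detections, threshold=0.5)
student["w"] = 2.0
ema_update(teacher, student, momentum=0.9)
```

A momentum close to 1 keeps the Teacher stable between steps, which is the usual reason for updating it by averaging rather than copying the Student outright.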

3. Key Contributions

First PWOOD Framework: The first work to address Oriented Object Detection using partial weak annotations (a mix of weak labels and unlabeled data), significantly lowering annotation costs while maintaining high performance.
OS-Student Model: A novel student architecture capable of learning precise orientation and scale information solely from weak annotations (Horizontal Boxes or Points) via symmetry and spatial layout learning.
Class-Agnostic Pseudo-Label Filtering (CPF): A dynamic filtering mechanism based on GMM and EM that eliminates the need for static thresholds, improving model robustness and generalization.
Versatility: The framework is generalized to support various annotation forms (HBox, Point) and can even handle joint training with mixed annotation types.

4. Experimental Results

The framework was evaluated on DOTA-v1.0/v1.5/v2.0 and DIOR datasets.

Performance vs. Semi-Supervised (SOOD):
- On DOTA-v1.5, PWOOD (using 20% Partial HBox) achieved 59.36 mAP, outperforming the Vanilla SOOD baseline (trained with 20% RBox) which scored 58.28 mAP.
- On DIOR, PWOOD with 20% HBox achieved 57.89 mAP, slightly above the baseline using 20% RBox (57.07 mAP).
- Conclusion: PWOOD achieves comparable or superior performance to methods using expensive RBox annotations, but at a fraction of the annotation cost.
Performance vs. Weakly Supervised (WOOD):
- PWOOD significantly outperforms pure weakly supervised methods (like H2RBox-v2 and Point2RBox-v2).
- On DOTA-v1.5 with 20% HBox, PWOOD improved mAP by 10.35% over H2RBox-v2.
- On DOTA-v1.5 with 20% Points, PWOOD improved mAP by 5.51% over Point2RBox-v2.
Robustness:
- Noise Resistance: PWOOD showed significantly less performance degradation than WOOD methods when noise was added to annotations.
- Threshold Sensitivity: The CPF mechanism reduced the sensitivity to threshold selection, stabilizing performance where static thresholds caused large drops in mAP.

5. Significance

Cost-Efficiency: PWOOD offers a practical solution for real-world scenarios where obtaining RBox annotations is prohibitively expensive. It demonstrates that high-accuracy OOD is achievable using cheap annotations (HBox/Points) combined with unlabeled data.
Bridging the Gap: It effectively bridges the performance gap between fully supervised and weakly supervised learning, making oriented object detection more accessible for large-scale remote sensing applications.
Generalizability: The proposed CPF and OS-Student strategies are not limited to OOD and could potentially be adapted to other detection tasks requiring orientation or scale inference from weak labels.

In summary, PWOOD represents a significant advancement in efficient oriented object detection, proving that strategic use of unlabeled data and specialized learning modules can overcome the limitations of weak supervision without sacrificing accuracy.