From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications

Imagine you are teaching a robot to play a game of "Pin the Tail on the Donkey," but instead of a donkey, it's an airplane, and instead of a tail, it has to find 23 specific spots on the plane (like the wingtips, the nose, and the landing gear) to figure out exactly where the plane is and how it's facing.

This is what Keypoint Detection does. It's the robot's way of saying, "I see the plane, and here are the exact coordinates of its important parts."

The Problem: The Robot is Easily Fooled

The problem is that modern robots (neural networks) are like nervous students. If you slightly dim the lights, add a little static noise, or if a person walks in front of the camera, the robot might get confused. It might think the wingtip is two inches to the left, or the nose is slightly higher.

In the real world (like self-driving cars or drones), being "a little bit wrong" can be dangerous. We need to know: Is this robot reliable enough to trust?

The Old Way: Checking One Dot at a Time

Previously, researchers tried to verify the robot's safety by checking each of the 23 dots individually.

Analogy: Imagine a teacher grading a test with 23 questions. The old method checks Question 1, then Question 2, then Question 3, completely ignoring how they relate to each other.
The Flaw: This approach is too strict and too pessimistic. It assumes that if any single dot moves even a tiny bit, the whole system has failed. In reality, if the nose moves slightly left, the wingtip might naturally move slightly left too. The robot is still doing a good job, but the old "checklist" method says, "Fail!" because it didn't account for the dots moving together.

The New Solution: The "Group Hug" Approach

This paper proposes a new way to check the robot: Coupled Verification. Instead of checking dots one by one, they check the entire group of dots as a team.

The Analogy: Think of a dance troupe. If you check if every dancer is standing perfectly still, you might fail them if they all take a small step to the left in unison. But if you check the formation, you see they are still dancing perfectly together.
The Innovation: The authors created a mathematical framework that understands that the 23 dots are connected. They ask: "Even if the image is blurry or someone walks by, do the 23 dots stay in a formation that is still good enough to calculate the plane's position?"

How They Did It: The "Impossible Puzzle" Trick

To prove the robot is safe, they turned the problem into a giant logic puzzle (a Mixed-Integer Linear Program, or MILP).

The Reachable Set (The "Fog of War"): First, they calculate every possible way the robot's internal "heat map" (a blurry picture showing where the dots might be) could look if the image is slightly changed. Imagine a foggy window where the dots could be anywhere within a certain area.
The Polytope (The "Safe Zone"): They draw a 3D shape (a polytope) that represents all the "safe" positions for the 23 dots working together.
The Test: They ask a super-computer: "Is there any possible scenario where the dots land outside the Safe Zone?"
- If the answer is "No" (Infeasible): The robot is certified as Robust. No matter how the image is tweaked, the dots stay in the safe zone.
- If the answer is "Yes" (Feasible): The computer finds a specific "trick" image that breaks the robot, showing us exactly where it fails.

The Results: Why It Matters

The researchers tested this on images of airplanes with people and vehicles walking in front of them (real-world chaos).

The Old Method: When the rules got strict (meaning the dots had to be very precise), the old method gave up immediately, saying "I can't prove it's safe" for almost every image. It was too scared to give a green light.
The New Method: It successfully proved that the robot was safe in 99% of the cases, even when the rules were strict. It realized that the dots were moving together, so the robot was still doing its job.

The Bottom Line

This paper is like upgrading a security guard's checklist.

Before: The guard checked if every single person in a crowd was standing perfectly still. If one person shifted, he sounded the alarm.
Now: The guard checks if the crowd is moving in a safe, organized way. Even if everyone shifts a little bit together, the guard knows the crowd is safe.

This allows us to trust AI vision systems more, especially in critical situations like flying drones or driving cars, where we need to know the system won't panic just because the lighting changed or a bird flew by.

Here is a detailed technical summary of the paper "From Decoupled to Coupled: Robustness Verification for Learning-based Keypoint Detection with Joint Specifications."

1. Problem Statement

Keypoint detection is a critical computer vision task underpinning applications like pose estimation, 3D reconstruction, and autonomous driving. While modern deep neural networks (DNNs) perform well, they are vulnerable to adversarial perturbations and distribution shifts.

The Core Challenge:
Existing formal robustness verification methods for keypoint detection are largely decoupled. They verify each keypoint independently, treating them as separate classification problems. This approach has two major limitations:

Conservatism: It ignores the interdependencies between keypoints. A small deviation in one keypoint might be acceptable if others compensate, but independent verification often flags this as a failure.
Task Misalignment: Downstream tasks (e.g., pose estimation via PnP algorithms) rely on the joint configuration of all keypoints, not just individual coordinates. Independent checks fail to capture the actual safety requirements of these tasks.

The paper addresses the need for a coupled robustness verification framework that bounds the joint deviation of all keypoints simultaneously, ensuring the model remains robust under perturbations that respect the collective geometric constraints of the task.

2. Methodology

The authors propose a framework that formulates robustness verification as a Mixed-Integer Linear Programming (MILP) falsification problem. The method combines reachability analysis with a polytope encoding of joint deviation constraints.

Key Components:

Architecture: The method targets heatmap-based keypoint detectors (e.g., backbone $\to$ heatmap generation $\to$ argmax for coordinate extraction).
Input Representation: Perturbations are modeled as a convex hull $\mathcal{X}$ formed by a seed image and $n$ perturbed images (covering local occlusions or global brightness/contrast changes).
Reachability Analysis: The backbone network's output (heatmaps) is over-approximated as a zonotope $\mathcal{Z}$ . This provides a compact, convex representation of all possible heatmap outputs for the given input set.
Coupled Specification: Instead of checking individual errors, the method defines a polytope $\delta\mathcal{V} = \{ \delta v \mid P_v \delta v \leq b_v \}$ that represents the allowable joint deviation vector for all $K$ keypoints. This captures task-level requirements (e.g., pose error thresholds).

The Verification Process (Falsification):

The goal is to prove that no reachable heatmap yields a keypoint configuration outside the allowable polytope. The authors formulate this as an MILP to find a counterexample:

Reachable Set Constraints: Encode the zonotope $\mathcal{Z}$ using linear constraints involving generator coefficients $\alpha_k \in [-1, 1]$ .
Dynamic Indexing: Since the keypoint location is determined by the argmax of the heatmap, the method must dynamically select pixel values based on the perturbed coordinates ( $v^* + \delta v$ ). This is handled using binary indicator variables and Big-M constraints to extract the correct pixel values from the heatmap matrix.
Maximality Check: The MILP checks if the extracted pixel values at the perturbed locations are indeed the maximum values in their respective channels (validating the argmax operation).
Violation Encoding: The method encodes the condition that the resulting deviation $\delta v$ $δ v$ lies outside the allowable polytope $\delta\mathcal{V}$ $δ V$ .
- Infeasibility: If the MILP has no solution, it proves that no reachable heatmap produces an out-of-bounds joint deviation. The model is certified as robust.
- Feasibility: If a solution exists, a counterexample (a specific perturbation causing failure) is identified.

Optimization:

To improve computational efficiency, the authors introduce a pruning strategy. They eliminate "in-bound" indices (pixel locations) that cannot possibly be the maximum compared to other candidates, significantly reducing the number of binary variables and constraints in the MILP.

3. Key Contributions

First Coupled Verification Framework: The paper introduces the first formal verification method for keypoint detectors that verifies joint behavior rather than independent keypoints, directly addressing the limitations of prior decoupled approaches.
MILP Formulation for Continuous Outputs: It successfully bridges the gap between continuous coordinate outputs and discrete verification by modeling the argmax operation and joint constraints within a single MILP.
Soundness Guarantee: The authors prove the method is sound: if the MILP is infeasible, the model is mathematically guaranteed to be robust against the specified perturbations.
Handling Semantic Perturbations: The framework supports realistic semantic perturbations (e.g., local occlusions by people/vehicles) modeled via convex hulls, rather than just pixel-level noise.

4. Experimental Results

The method was evaluated on a keypoint-based airplane pose estimation task (7,320 images, 23 keypoints) under various perturbations:

Local Occlusions: Random patches of objects (overlapping and non-overlapping with the airplane).
Global Perturbations: Brightness and contrast variations.
Baselines: Compared against a state-of-the-art decoupled verification method (Luo et al., 2025).

Key Findings:

Superior Verified Rates: The coupled approach achieved significantly higher verified rates than the decoupled baseline.
- Under strict error thresholds ( $\alpha = 0.1$ ), the baseline failed to verify 0% of images, whereas the proposed method maintained verified rates between 5% and 10% (and up to 70%+ for looser thresholds).
- For non-overlapping occlusions, the proposed method achieved verified rates close to the empirical testing upper bound (e.g., ~99% vs. 100% empirical), while the baseline dropped to ~47%.
Robustness to Perturbation Types: The method demonstrated consistent performance across local occlusions and global brightness/contrast changes.
Scalability & Efficiency:
- The pruning strategy reduced the MILP size by orders of magnitude (up to $10^3$ reduction for non-overlapping cases).
- Verification time was generally competitive, though it increased with the complexity of overlapping perturbations due to larger reachable sets.
- Interestingly, for the proposed method, verification time decreased as error thresholds became tighter (smaller $\alpha$ ) in non-overlapping cases, because the problem became easier to prove infeasible (finding a counterexample was harder).

5. Significance and Conclusion

This work represents a significant leap in the formal verification of vision-based perception systems. By shifting from decoupled to coupled verification, the authors provide a more realistic and less conservative assessment of safety for critical applications like autonomous driving and robotics.

Practical Impact: The ability to certify robustness under strict error thresholds where previous methods fail is crucial for safety-critical deployment.
Limitations: A gap remains between the verified rate and the empirical testing rate, primarily due to the conservativeness of the zonotope over-approximation of the reachable set.
Future Directions: The authors plan to develop tighter reachability approximations and scalable strategies to further close this gap and handle larger, more complex networks.

In summary, this paper establishes a new standard for verifying learning-based keypoint detectors by rigorously accounting for the geometric interdependencies of keypoints, offering a powerful tool for ensuring the safety of vision systems in dynamic environments.