Markerless 6D Pose Estimation and Position-Based Visual Servoing for Endoscopic Continuum Manipulators

Imagine a surgeon trying to perform delicate surgery inside a patient's body using a long, flexible, snake-like robot arm. This arm, called a continuum manipulator, can twist and bend like an octopus tentacle to reach tricky spots.

The problem? These robots are tricky to control. They are made of soft materials and long cables, so they bend, stretch, and "remember" their previous shapes (a bit like a rubber band that doesn't snap back perfectly). Because of this, the computer controlling the robot often doesn't know exactly where the tip of the robot is.

Usually, to fix this, engineers stick physical stickers (markers) on the robot or put tiny sensors inside it. But in surgery, you can't always stick things on the robot, and adding sensors makes the robot bulky and expensive.

This paper presents a solution: Teaching the robot to "see" and "know" where it is using only its eyes (cameras), without any stickers or internal sensors.

Here is how they did it, explained with some everyday analogies:

1. The "Video Game" Training Ground (Simulation)

You can't teach a robot to drive by letting it crash real cars. You use a simulator.

The Analogy: The researchers built a hyper-realistic video game of the surgical robot.
The Magic: In this game, they can generate thousands of hours of training data instantly. They know exactly where the robot is in the game because they programmed it. They use this to teach an AI brain how to recognize the robot's shape in a camera image.
The Twist: They made the game look exactly like a real surgery (lighting, textures, background) so the AI doesn't get confused when it switches from the game to the real world.

2. The "Super-Detective" AI (Multi-Feature Fusion)

Most old methods tried to guess the robot's position by looking at just one thing, like "where are the dots?" or "what is the outline?"

The Analogy: Imagine trying to find a friend in a crowd.
- Old Way: You only look for their hat. If they take it off, you lose them.
- This Paper's Way: The AI acts like a super-detective. It looks at the outline (silhouette), the key joints (like elbows and knees), the bounding box (a rectangle around the robot), and heatmaps (a heat map showing where the robot is most likely to be).
The Result: By combining all these clues at once, the AI gets a much clearer picture of where the robot is, even if parts of it are hidden or the lighting is weird.

3. The "Instant Reality Check" (Feed-Forward Refinement)

Usually, when a computer guesses a position, it checks its work by drawing a picture of what it thinks it sees, comparing it to the real photo, and adjusting. It does this over and over until it's right.

The Analogy: This is like a student taking a test, checking their answer, erasing it, rewriting it, checking again, and repeating this 10 times before handing in the paper. It's accurate but slow.
This Paper's Way: They created a "magic shortcut." The AI guesses the position, then instantly "dreams" what the image should look like, compares the two, and calculates the correction in one single step.
The Result: It's like the student taking the test, instantly knowing exactly what to fix, and handing it in immediately. It's fast enough to control the robot in real-time.

4. The "Self-Correcting" Lesson (Sim-to-Real Adaptation)

Even with a perfect video game, the real world is messy. The camera might be slightly tilted, or the robot might be a different color.

The Analogy: Imagine practicing piano on a keyboard that has slightly different keys than the real one. You practice perfectly, but when you sit at the real piano, you hit the wrong notes.
The Solution: The researchers let the AI practice on a few real photos (about 150) without needing a human to tell it the right answers. The AI uses its "dreaming" ability to figure out the difference between the game and reality, and it teaches itself to adjust.
The Result: The robot goes from being "okay" to being "expert" just by looking at a few pictures of itself in the real world.

The Grand Finale: The Robot Controls Itself

Once the AI knows exactly where the robot tip is (within less than 1 millimeter of error!), they used it to perform Visual Servoing.

What is that? It's like a self-driving car that sees a target and steers itself to it without a human touching the wheel.
The Test: They made the robot trace a square path and touch specific points.
- Without this system (Open Loop): The robot was like a drunk sailor; it missed the target by over 13 millimeters.
- With this system: The robot was like a surgeon; it missed by only 2 millimeters.
- Comparison: This is almost as good as if they had stuck physical stickers on the robot, but without the stickers!

Why Does This Matter?

This is a huge step forward for minimally invasive surgery.

No Stickers Needed: Surgeons don't have to worry about attaching markers that could fall off or interfere with the surgery.
Cheaper & Safer: No need for expensive internal sensors.
Real-Time: The system is fast enough to control the robot while it's moving, making autonomous surgery a real possibility.

In short, the authors taught a flexible robot to look in a mirror, realize where it is, and steer itself to a target with incredible precision, all without any physical help.

1. Problem Statement

Continuum manipulators used in flexible endoscopic surgery offer high dexterity for minimally invasive procedures but suffer from significant control challenges:

Control Difficulties: They exhibit strong hysteresis, compliance, and non-linearities due to long tendon-sheath transmission paths, making open-loop control inaccurate.
Sensing Limitations: Traditional methods rely on embedded sensors (e.g., Fiber Bragg Grating, Electromagnetic tracking) which increase hardware complexity, cost, and susceptibility to interference.
Vision-Based Gaps: Existing vision-based approaches often rely on monocular imagery and single-modality cues (e.g., segmentation masks or keypoints alone), leading to insufficient geometric observability and depth ambiguity. Furthermore, many methods require iterative optimization (render-and-compare) for refinement, which incurs high computational overhead, preventing real-time closed-loop control.
Data Scarcity: Acquiring large-scale, pixel-accurate 6D pose ground truth for continuum manipulators in real surgical environments is inherently difficult and expensive.

2. Methodology

The authors propose a unified framework comprising four main components:

A. Physics-Grounded Synthetic Data Generation

To overcome data scarcity, the authors developed a high-fidelity simulation pipeline using NVIDIA Isaac Sim:

Modeling: The continuum manipulator is discretized into a Pseudo-Rigid-Body (PRB) model using a custom URDF with slit-type revolute joints to approximate continuous bending.
Annotation: The pipeline automatically generates synchronized stereo RGB images with pixel-accurate ground truth for 6D poses, segmentation masks, keypoints (65 points), and bounding boxes.
Domain Randomization: To bridge the sim-to-real gap, the system randomizes background textures, backbone colors, and jaw material properties (metallic reflectance).

B. Stereo-Aware Multi-Feature Fusion Network (MFFN)

Instead of regressing pose directly from raw images, the network jointly learns multiple geometric cues to enhance observability:

Architecture: A shared ResNet-50 encoder feeds into branches for:
1. Segmentation: Four-class semantic masks (background, jaw 1, jaw 2, hinge).
2. Keypoints: 65 geometric landmarks with visibility confidence.
3. Heatmaps & Bounding Boxes: Refined spatial extents.
Stereo Fusion: A Multi-Head Attention mechanism fuses features from left and right camera views, implicitly learning epipolar constraints without explicit triangulation.
Output: An initial 6D pose estimate for the tool center point (TCP).

C. Feed-Forward Rendering-Based Refinement

To enforce geometric consistency without the latency of iterative optimization:

Mechanism: The initial pose is rendered using a differentiable renderer to generate synthetic keypoints and masks.
Correction: A lightweight network compares the rendered features against the observed features (from the MFFN) in a single forward pass.
Result: It predicts residual pose corrections ( $\Delta t, \Delta q$ ) to refine the initial estimate, eliminating the need for slow, iterative gradient descent.

D. Self-Supervised Sim-to-Real Adaptation

To adapt the simulation-trained model to the real world without manual labeling:

Pseudo Ground Truth: The system uses the initial pose prediction to render geometric features, then iteratively minimizes the alignment loss between rendered and real features to generate a "pseudo" ground truth pose.
Fine-Tuning: Only the pose regression heads and refinement module are fine-tuned using 150 unlabeled real images and these pseudo labels, correcting extrinsic calibration biases while preserving the robust visual features learned from synthetic data.

E. Position-Based Visual Servoing (PBVS)

The estimated 6D pose is fed into a closed-loop controller:

Control Law: A Jacobian-based PBVS approach computes joint updates using the pseudoinverse of the geometric Jacobian.
Thresholding: Convergence thresholds are relaxed based on the known estimation error statistics to ensure stability despite pose uncertainty.

3. Key Contributions

First Fully Markerless PBVS Framework: The first system to achieve closed-loop position-based visual servoing for continuum manipulators using purely vision-based 6D pose estimation, eliminating the need for physical markers or embedded sensors.
Unified Multi-Feature Stereo Network: A novel architecture that fuses segmentation, keypoints, heatmaps, and bounding boxes with stereo attention, significantly improving geometric observability over monocular single-cue methods.
Efficient Feed-Forward Refinement: A rendering-based refinement module that achieves geometric consistency in a single forward pass (87 ms), reducing computational overhead by an order of magnitude compared to iterative render-and-compare methods.
Self-Supervised Adaptation: A strategy that reduces real-world pose error by ~50% using only 150 unlabeled images, effectively bridging the sim-to-real gap without manual annotation.

4. Experimental Results

Pose Estimation Accuracy

Synthetic Data: The full framework (Stereo + Refinement) achieved 0.14 mm translation error and 0.44° rotation error.
Real-World Validation (1,000 samples):
- Mean Translation Error: 0.83 mm (Standard Deviation: 0.70 mm).
- Mean Rotation Error: 2.76° (Standard Deviation: 1.74°).
- Improvement: This represents a 34.6% improvement in translation and 13.8% in rotation over the previous state-of-the-art (Zhou et al.), while reducing inference time from 849 ms to 180 ms.
- Adaptation Impact: Self-supervised adaptation reduced real-world errors by approximately 50% compared to the non-adapted model.

Closed-Loop Control Performance

Trajectory Tracking: In square-shaped trajectory tracking tasks, the markerless controller achieved a mean translation error of 2.07 mm and rotation error of 7.41°.
Comparison:
- vs. Open-Loop: 85% reduction in translation error and 59% reduction in rotation error.
- vs. Marker-Based Servoing: The markerless approach achieved performance very close to the marker-based baseline (1.06 mm / 2.00°), demonstrating high clinical viability.
Repeatability: In repeated point-reaching tasks, the system showed high repeatability with standard deviations of 0.17 mm (translation) and 0.49° (rotation).

5. Significance

This work addresses a critical bottleneck in surgical robotics: the lack of accurate, low-cost, and real-time state estimation for flexible continuum robots.

Clinical Relevance: The achieved accuracy (~2 mm) is sufficient for manipulating small endoscopic lesions (e.g., colorectal polyps), enabling precise minimally invasive procedures.
Scalability: By removing the dependency on specialized sensors and physical markers, the system simplifies hardware integration and reduces costs, making advanced robotic assistance more accessible.
Real-Time Feasibility: The shift from iterative optimization to feed-forward refinement makes high-frequency closed-loop control feasible, a prerequisite for safe and responsive surgical automation.

The paper concludes that combining physics-grounded simulation, multi-feature stereo fusion, and efficient geometric refinement enables a robust, markerless closed-loop control system for continuum manipulators, paving the way for more autonomous and dexterous surgical robots.