Imagine you are trying to teach a robot how to walk, or a computer how to sort a list of names. To do this, the computer uses a powerful tool called Automatic Differentiation. Think of this tool as a "learning guide" that tells the computer, "If you move your foot a tiny bit to the left, you get closer to the goal. If you move it right, you get farther away." This guide relies on gradients (mathematical slopes) to know which direction to push.
However, the real world is full of "hard" decisions that break this guide.
- The Problem: Imagine a light switch. It's either ON or OFF. There is no "halfway." If you try to nudge the switch slightly, nothing happens until you hit the exact moment it flips. In math terms, the "slope" is zero. The learning guide gets confused and says, "I can't tell you which way to go because the slope is flat."
- The Consequence: Many useful computer operations—like sorting a list, picking the top 3 items, or making a true/false decision—are like these light switches. They are "hard" and "discrete." When a computer tries to learn through them, the learning guide stops working because the gradients (the instructions) disappear.
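The flat-slope problem is easy to check numerically with a toy example (this is not code from the paper; `step`, `sigmoid`, and `slope` are illustrative names). The finite-difference slope of a hard switch is zero almost everywhere, while a smooth stand-in always offers a direction:

```python
import math

def step(x):
    # Hard switch: exactly 0 (OFF) or 1 (ON), nothing in between.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    # Soft "dimmer": smoothly rises from 0 to 1.
    return 1.0 / (1.0 + math.exp(-x))

def slope(f, x, eps=1e-4):
    # Numerical derivative: how much does f change if we nudge x?
    return (f(x + eps) - f(x - eps)) / (2 * eps)

print(slope(step, 0.5))     # 0.0 -- the learning guide is stuck
print(slope(sigmoid, 0.5))  # ~0.235 -- a usable direction
```

Everywhere except the exact flipping point, nudging the input to `step` changes nothing, so the slope is zero and learning stalls.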
The Solution: SoftJAX and SoftTorch
The authors of this paper created two new toolkits called SoftJAX and SoftTorch. Their goal was to replace those "hard" light switches with "soft" dimmer switches.
Instead of a switch that is strictly ON or OFF, a dimmer switch allows you to be 90% ON or 10% ON. This creates a smooth slope. Now, the learning guide can see the direction and say, "Okay, if you turn the knob just a tiny bit more, you'll get closer to the goal!"
Here is how they did it, using some creative analogies:
1. The "Soft" Surrogate (The Dimmer Switch)
The paper introduces "soft" versions of hard functions.
- Hard: `Sign(x)` says "Positive" or "Negative."
- Soft: `SoftSign(x)` says "Mostly Positive" or "Slightly Negative."
- Analogy: Imagine you are judging a race. A hard judge says, "Runner A won." A soft judge says, "Runner A is 95% likely to have won, but there's a 5% chance Runner B was faster." That 5% of uncertainty gives the learning algorithm a little information to work with, rather than a dead end.
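One common way to build such a soft sign is a scaled `tanh` with a temperature knob (a sketch assuming this construction; the library's actual `SoftSign` may be defined differently):

```python
import math

def soft_sign(x, temperature=1.0):
    # tanh(x / t) smoothly interpolates between -1 and +1.
    # As the temperature shrinks, it approaches the hard Sign function.
    return math.tanh(x / temperature)

print(soft_sign(0.1))        # ~0.0997: "slightly positive"
print(soft_sign(2.0))        # ~0.964: "mostly positive"
print(soft_sign(0.1, 0.01))  # ~1.0: nearly a hard decision
```

The slope of `tanh` is never exactly zero, so the learning guide always gets some signal, even for inputs close to the boundary.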
2. The "Straight-Through" Trick (The Ghost Guide)
Sometimes, you need the computer to make a hard decision in the real world (like a robot actually turning a physical switch ON), but you still want the learning guide to work.
- The Trick: The authors use a clever sleight of hand called Straight-Through Estimation: act hard on the way forward, but pretend to be soft when computing gradients.
- The Analogy: Imagine you are driving a car with a strict rule: "You must stay in the lane."
- Forward Pass (Driving): You drive exactly in the lane (the hard decision).
- Backward Pass (Learning): When you look in the rearview mirror to see how to improve, you pretend the lane lines are actually soft, fuzzy clouds that you can drift through slightly.
- Result: The car stays safe and follows the rules, but the driver learns how to steer better because they imagined a smoother path.
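The two passes can be sketched with a hand-written forward and backward step (illustrative only; real implementations hide this inside the autodiff framework, for example via stop-gradient identities):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ste_forward(x):
    # Forward pass: the hard decision the outside world sees.
    return 1.0 if x > 0 else 0.0

def ste_backward(x, upstream_grad):
    # Backward pass: pretend the forward step was the soft sigmoid,
    # so a gradient flows through the "fuzzy lane" instead of vanishing.
    s = sigmoid(x)
    return upstream_grad * s * (1.0 - s)

x = 0.3
print(ste_forward(x))        # 1.0 -- a hard ON decision
print(ste_backward(x, 1.0))  # ~0.244 -- yet a real learning signal
```

The hard function's true derivative at x = 0.3 is zero; substituting the sigmoid's derivative on the backward pass is what keeps learning alive.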
3. Sorting and Ranking (The Traffic Jam vs. The Flow)
Sorting a list of numbers is a classic "hard" problem. If you have 100 cars and need to rank them by speed, nudging one car's speed slightly usually changes nothing about the final order (a flat slope), until suddenly two cars swap places (a jump). Either way, the learning guide gets no useful direction.
The paper offers several ways to "soften" this:
- Optimal Transport (The Moving Truck): Imagine you have a pile of sand (your unsorted numbers) and a set of holes (the sorted positions). Instead of picking one grain of sand for one hole, you imagine the sand flowing like water into the holes. You pay a "cost" to move the sand. This creates a smooth flow where every grain of sand contributes a little bit to every hole, making the math smooth and learnable.
- Sorting Networks (The Assembly Line): Imagine a factory line where pairs of items swap places if they are in the wrong order. The authors replaced the "hard swap" (if A > B, swap) with a "soft swap" (if A > B, move A 90% of the way to the other side). This turns a rigid assembly line into a fluid conveyor belt.
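One stage of such a soft sorting network can be sketched with a sigmoid "swap gate" (an illustrative construction; the paper's soft swap may use different smooth min/max operators):

```python
import math

def soft_swap(a, b, temperature=1.0):
    # w is ~1 when a > b (the pair "wants" to swap), ~0 otherwise.
    w = 1.0 / (1.0 + math.exp(-(a - b) / temperature))
    # Blend instead of swapping outright: each output mixes both inputs,
    # and the total a + b is preserved.
    low = (1 - w) * a + w * b
    high = w * a + (1 - w) * b
    return low, high

print(soft_swap(5.0, 1.0, temperature=0.1))   # almost a hard swap: (~1.0, ~5.0)
print(soft_swap(5.0, 1.0, temperature=10.0))  # a gentle, partial swap
```

Chaining these gates in the pattern of a classic sorting network yields a fully differentiable approximate sort, and lowering the temperature makes it converge to the hard result.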
Why Does This Matter?
Before this paper, if a researcher wanted to use these "soft" tricks, they had to hunt for different code snippets scattered across the internet. Some were in one project, some in another, and they didn't always work well together.
SoftJAX and SoftTorch are like a universal toolbox.
- They work with the two most popular AI frameworks (JAX and PyTorch).
- They provide a "drop-in" replacement. You don't have to rewrite your whole program; you just swap `hard_sort` for `soft_sort`.
- They offer different "modes" of softness. Sometimes you want a very smooth, fuzzy guess (high softness). Sometimes you want a sharp decision that is almost hard (low softness). The user can dial this up or down like a volume knob.
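That "volume knob" is typically a temperature parameter. A minimal sketch with a temperature-controlled softmax (illustrative names, not the libraries' actual API) shows both extremes of a soft "pick the best item":

```python
import math

def soft_pick(scores, temperature):
    # Softmax: turns scores into positive weights that sum to 1.
    # High temperature -> fuzzy blend; low temperature -> almost one-hot.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]
print(soft_pick(scores, 5.0))  # fuzzy: roughly [0.27, 0.33, 0.40]
print(soft_pick(scores, 0.1))  # sharp: roughly [0.00, 0.00, 1.00]
```

The same function serves both moods: during early training you might keep the temperature high for smooth gradients, then lower it so the behavior approaches the hard decision.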
The Real-World Test
The authors tested this on a robot collision detection system.
- The Hard Way: The robot checks if two objects are touching. If they are, it picks specific contact points to calculate the bounce. If the objects move slightly, those points jump wildly, so the gradients become useless and the learning algorithm breaks down.
- The Soft Way: Using SoftJAX, the robot calculates a "probability" of touching. The points it picks move smoothly as the objects move. The robot can now learn to avoid collisions much faster and more efficiently because the "learning guide" never gets lost.
Summary
In short, SoftJAX and SoftTorch take rigid "hard" computer decisions and turn them into smooth, flowing, learnable processes. They let AI systems and scientific simulations learn from problems that were previously out of reach for gradient-based methods, acting as a bridge between the rigid digital world and the smooth, continuous world of learning.