JANUS: Structured Bidirectional Generation for Guaranteed Constraints and Analytical Uncertainty

JANUS is a novel synthetic data generation framework that unifies high-fidelity distribution modeling, guaranteed logical constraint satisfaction, and efficient analytical uncertainty estimation. By leveraging a DAG of Bayesian Decision Trees and a Reverse-Topological Back-filling algorithm, it resolves the fundamental trade-offs in high-stakes data synthesis.

Taha Racicot

Published 2026-03-05

Imagine you are a chef trying to create a perfect replica of a complex, multi-layered cake (the real world data) for a party. You need to do three things at once:

  1. Taste exactly like the original (Fidelity).
  2. Follow strict rules (e.g., "The frosting must be above the cake," or "No chocolate in the vanilla layer") (Control).
  3. Know exactly how confident you are that your cake won't collapse, and do it quickly (Reliability & Efficiency).

For a long time, chefs (AI researchers) had to choose between these. If they used a fancy, high-tech 3D printer (Deep Learning models like CTGAN), the cake looked amazing, but they couldn't guarantee the frosting stayed on top without throwing away 99% of the failed cakes. If they used a strict recipe book (Causal Models), they could follow the rules perfectly, but the cake often tasted bland or didn't look like the original.

Enter JANUS, a new "smart kitchen" that solves this problem.

The Core Idea: The Two-Way Street

Most AI generators work like a one-way street: they start with the ingredients (parents) and try to guess the final dish (children). If the result breaks a rule, they throw it away and try again. This is called Rejection Sampling, and it's like trying to hit a bullseye by throwing darts blindly until you get lucky. It's slow and wasteful.
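To see just how wasteful that blind dart-throwing is, here is a minimal sketch of rejection sampling against a narrow constraint. The numbers are invented for illustration; they are not from the paper:

```python
import random

def rejection_sample(generate, constraint, max_tries=100_000):
    """Naive rejection sampling: generate blindly, discard anything
    that violates the constraint, and count the waste."""
    tries = 0
    while tries < max_tries:
        tries += 1
        sample = generate()
        if constraint(sample):
            return sample, tries
    raise RuntimeError("constraint never satisfied")

# Toy constraint: the value must land in a narrow band,
# so only ~0.2% of blind draws are accepted.
gen = lambda: random.uniform(0, 100)
ok = lambda x: 9.9 <= x <= 10.1

sample, wasted = rejection_sample(gen, ok)
```

With a 0.2% acceptance rate, you expect to throw away roughly 500 samples for every one you keep, and the waste grows exponentially as constraints stack up.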

JANUS is different. It uses a Two-Way Street approach:

  1. Looking Forward: It predicts the dish from the ingredients (standard AI).
  2. Looking Backward: It asks, "If I must have this specific topping, what ingredients must I have started with?"

The Secret Sauce: The "Back-Fill" Algorithm

Imagine you are building a house, but you have a strict rule: "The roof must be exactly 10 feet high."

  • Old Way (Rejection): You build a random house, measure the roof, and if it's 9 feet or 11 feet, you demolish it and start over. You might have to demolish 100 houses to get one right.
  • JANUS Way (Reverse-Topological Back-filling): You start with the rule. You say, "Okay, the roof must be 10 feet." You then work backward to figure out exactly what size of walls and foundation are required to support a 10-foot roof. You build only the parts that fit the rule.

This is Reverse-Topological Back-filling. Instead of guessing and hoping, JANUS calculates the valid path backward through the "family tree" of data. It guarantees that every single piece of data it creates follows the rules, with zero waste.
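Here is a minimal sketch of the back-filling idea on an invented three-node chain A → B → C with hand-written invertible relations. The real JANUS learns these conditionals as Bayesian Decision Trees over a full DAG; this toy only shows the direction of the walk:

```python
# Forward conditionals: how each node is produced from its parent.
forward = {
    "B": lambda a: a + 2,      # B = A + 2
    "C": lambda b: b * 3,      # C = 3 * B
}
# Inverse conditionals used for back-filling.
backward = {
    "B": lambda c: c / 3,      # given C, what must B have been?
    "A": lambda b: b - 2,      # given B, what must A have been?
}

def backfill(c_required):
    """Start from the constrained node C and walk the chain in
    reverse-topological order, deriving the values that guarantee C."""
    values = {"C": c_required}
    values["B"] = backward["B"](values["C"])
    values["A"] = backward["A"](values["B"])
    return values

sample = backfill(30.0)   # demand C == 30 up front
# Replaying the forward conditionals reproduces C exactly: zero waste.
assert forward["C"](forward["B"](sample["A"])) == 30.0
```

Every generated record satisfies the constraint by construction, so nothing is ever demolished and rebuilt.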

The "Smart Tree" Architecture

JANUS doesn't use a black-box neural network. Instead, it builds a Decision Tree (like a flowchart) for every piece of data.

  • The Hybrid Split: Usually, these trees only learn "If A, then B." JANUS teaches the tree to learn both "If A, then B" AND "If B, then A."
  • Why? This lets the tree act like a two-way dictionary. If you tell it "I need a high salary," it can instantly look up the specific combinations of education and experience that lead to that salary, ensuring the logic holds up.
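As a toy illustration of that two-way lookup (the feature names and rows are made up; JANUS learns its hybrid splits from data rather than storing a table), here is a structure queryable in both directions:

```python
from collections import defaultdict

# A toy "two-way" table standing in for the paper's hybrid-split trees:
# alongside the usual forward map (features -> outcome), we keep a
# reverse index (outcome -> compatible feature combinations).
training_rows = [
    (("PhD", "10y"), "high"),
    (("BSc", "2y"),  "low"),
    (("MSc", "6y"),  "high"),
]

forward = {}
reverse = defaultdict(list)
for features, outcome in training_rows:
    forward[features] = outcome
    reverse[outcome].append(features)

# Looking forward: predict salary from background.
assert forward[("PhD", "10y")] == "high"

# Looking backward: which backgrounds are consistent with "high"?
assert ("MSc", "6y") in reverse["high"]
```

The point is that the reverse index is built at training time, so answering "what leads to a high salary?" is an instant lookup, not a search.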

The "Crystal Ball" (Uncertainty)

When you ask a standard AI, "How sure are you?" it usually has to run the simulation 100 times and average the results to guess. This is slow.

JANUS has a built-in Crystal Ball. Because it uses a specific mathematical trick (Bayesian statistics), it can calculate its own confidence level instantly in a single step.

  • Aleatoric Uncertainty: "The data is just noisy and messy." (Unavoidable).
  • Epistemic Uncertainty: "I haven't seen enough examples of this type of data." (Fixable by learning more).

JANUS can tell you, "I'm confident here because I've seen this before," or "I'm unsure here because this is a weird edge case," 128 times faster than other methods.
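To show how a Bayesian model can split uncertainty analytically, here is a sketch using a Beta-Bernoulli posterior, a standard conjugate example chosen for illustration only, not the paper's exact formulas:

```python
def beta_uncertainty(alpha, beta):
    """Closed-form uncertainty split for a Beta(alpha, beta) posterior
    over a Bernoulli rate p -- one formula, no Monte Carlo runs.
    Total predictive variance mean*(1-mean) decomposes exactly into
    aleatoric E[p(1-p)] plus epistemic Var[p]."""
    n = alpha + beta
    mean = alpha / n
    epistemic = (alpha * beta) / (n**2 * (n + 1))   # Var[p]: shrinks with data
    aleatoric = mean * (1 - mean) - epistemic       # E[p(1-p)]: noise floor
    return mean, aleatoric, epistemic

# Few observations: epistemic ("haven't seen enough") uncertainty is large.
_, _, ep_small_data = beta_uncertainty(2, 2)
# Many observations, same ratio: epistemic uncertainty collapses,
# while the aleatoric noise floor stays put.
_, _, ep_big_data = beta_uncertainty(200, 200)
assert ep_big_data < ep_small_data
```

Because the split is a closed-form expression of the posterior's parameters, it costs one arithmetic evaluation instead of hundreds of repeated simulations, which is where the claimed speedup over sampling-based methods comes from.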

Why This Matters: The "Fairness" Test

The paper highlights a huge problem in AI fairness. We often say, "This AI is fair because it hires men and women at the same rate." But what if the AI is secretly cheating?

JANUS allows researchers to inject known biases into the data to test fairness tools.

  • Example: Imagine a rule: "Salary Offered must be greater than or equal to Salary Requested."
  • Old AI models often fail this, offering people less than they asked for, or having to throw away millions of samples to find a few that work.
  • JANUS guarantees this rule is followed 100% of the time, ensuring that if you ask for $50k, you never get an offer for $40k. It enforces individual fairness (treating the specific person right) rather than just group averages.
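A sketch of enforcing that rule by construction rather than by rejection. The exponential "gap" distribution below is an invented stand-in; JANUS derives the valid conditional support from its learned trees:

```python
import random

def sample_offer(requested, cap=10_000):
    """Sample an offer satisfying offered >= requested by construction:
    draw the *gap* above the request from a non-negative distribution,
    instead of drawing the offer blindly and rejecting violations."""
    gap = random.expovariate(1 / 5_000)   # non-negative by definition
    return requested + min(gap, cap)      # cap keeps the toy offers bounded

offers = [sample_offer(50_000) for _ in range(1_000)]
# Every sample respects the rule -- zero rejections needed.
assert all(o >= 50_000 for o in offers)
```

Sampling in the constraint's own coordinates (the non-negative gap) is what makes the 100% guarantee free: there is no invalid region to fall into.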

The Bottom Line

JANUS is like a master architect who can:

  1. Build a perfect replica of a city (High Fidelity).
  2. Guarantee that no building violates the zoning laws (100% Control).
  3. Tell you exactly how likely a building is to collapse before you even lay the first brick (Instant Reliability).
  4. Do it all without wasting a single brick (High Efficiency).

It bridges the gap between "cool AI that looks good" and "trustworthy AI that actually works in the real world."
