The Big Problem: The "Square Peg in a Round Hole"
Imagine you have a very smart, flexible robot (a Transformer) designed to understand stories, sentences, and music. It's great at seeing patterns in a flowing stream of data.
Now, imagine you try to teach this robot to predict how fast a marathon runner will finish a race based on their past history, the weather, and their age. This data is "tabular"—it's like a spreadsheet with rows and columns.
For years, the best tool for this job has been XGBoost (a type of Gradient Boosting). Think of XGBoost as a lumberjack. It chops the data into neat, straight lines (splits). If it's hot, the runner is slow; if it's cold, they are fast. It creates a "step-ladder" of rules. This works perfectly for messy, real-world data.
The problem? The smart robot (Transformer) is used to smooth, flowing curves. It tries to draw a smooth line through the data, but real life is full of sharp steps and sudden jumps. The robot kept losing to the lumberjack because it was trying to be too smooth.
The Solution: Teaching the Robot to "Chunk"
The authors of this paper realized the robot wasn't failing because it wasn't smart enough; it was failing because it was looking at the data the wrong way. They decided to teach the robot to speak the same language as the lumberjack: Discrete Chunks.
Here is how they did it, using three main tricks:
1. The "Pixelated" Map (Discretization)
Instead of asking the robot, "What is the exact temperature?" (which could be 72.43 degrees), they turned the data into pixels.
- They grouped temperatures into buckets: "Cool," "Warm," "Hot."
- They grouped running speeds into buckets: "Slow," "Medium," "Fast."
- The Analogy: Imagine looking at a high-definition photo. If you zoom in too far, you see individual pixels. The robot was trying to see the whole smooth photo, but the authors forced it to look at the pixels. By treating the data as distinct "tokens" (like words in a sentence), the robot could finally use its superpower: Attention. It could now say, "Ah, when the runner is in the 'Hot' pixel and the 'Wind' pixel, they usually land in the 'Slow' pixel."
2. The "Soft Landing" (Gaussian Smoothing)
Discretization creates a new problem: hard edges. If you tell the robot, "The answer is exactly the '5 minutes' bucket," it gets punished just as harshly for guessing the neighboring bucket as for guessing one on the far end of the scale. A value that lands just over a bucket boundary is treated as a completely different world.
The authors gave the robot a soft landing.
- Instead of saying "the answer is exactly this bucket," they said, "the answer is mostly this bucket, with a little probability spilling into the neighboring buckets too."
- The Analogy: Think of throwing a dart at a target. A "hard" target says, "If you miss the bullseye by a millimeter, you get zero points." A "soft" target (Gaussian smoothing) says, "If you miss by a millimeter, you still get most of the points." This helps the robot understand that the world isn't black and white; it's a gradient.
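A minimal sketch of the "soft landing": instead of a one-hot label, the training target becomes a small Gaussian bump centered on the true bucket. The `sigma` width here is an assumed hyperparameter for illustration, not a value from the paper.

```python
import numpy as np

def soft_target(true_bin, n_bins, sigma=1.0):
    """Replace a hard one-hot label with a Gaussian-smoothed
    distribution over buckets, so near-misses still score points."""
    bins = np.arange(n_bins)
    weights = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return weights / weights.sum()  # normalize to a probability distribution

target = soft_target(true_bin=5, n_bins=10, sigma=1.0)
# Most of the mass sits on bucket 5; buckets 4 and 6 get equal,
# smaller shares; distant buckets get almost nothing.
```

Training against `target` with a standard cross-entropy loss then rewards "close" guesses instead of scoring them zero, which is exactly the dartboard behavior described above.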
3. The "Timekeeper" (Time-Delta Tokens)
In a story, the time between events matters. If a runner runs a race today, and then runs another one tomorrow, that's different from running one next year.
The authors added special "Time Tokens" to the robot's vocabulary.
- The Analogy: Imagine reading a diary. If the entry says "I ran a race," you need to know when the last race was to understand the context. The authors gave the robot a special "Time Delta" token that acts like a timestamp, saying, "It has been 4 weeks since the last race." This helps the robot understand the runner's "cadence" or rhythm.
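One way to picture the timekeeper: interleave a coarse "gap" token between consecutive events. The `<delta:...>` token names and week buckets below are hypothetical, chosen only to make the idea concrete.

```python
from datetime import date

def add_time_deltas(events, bucket_weeks=(1, 2, 4, 8)):
    """Interleave hypothetical <delta> tokens between event tokens.

    `events` is a list of (date, token) pairs; the gap since the
    previous event is bucketed into a coarse time-delta token.
    """
    sequence = []
    prev = None
    for when, token in events:
        if prev is not None:
            weeks = (when - prev).days / 7
            # pick the smallest bucket that covers the gap
            bucket = next((b for b in bucket_weeks if weeks <= b), bucket_weeks[-1])
            sequence.append(f"<delta:{bucket}w>")
        sequence.append(token)
        prev = when
    return sequence

races = [(date(2024, 1, 6), "race:fast"), (date(2024, 2, 3), "race:slow")]
print(add_time_deltas(races))
# ['race:fast', '<delta:4w>', 'race:slow']
```

Now "ran a race 4 weeks after the last one" and "ran a race a year later" become visibly different sentences to the model.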
The Results: The Robot Wins
When they put all these pieces together, the result was surprising:
- The Robot (Transformer) beat the Lumberjack (XGBoost) by a significant margin (about 10% better accuracy).
- The Calibration: The robot didn't just guess a number; it gave a Probability Distribution.
- Old way: "The runner will finish in 3 hours."
- New way: "There is a 60% chance they finish in 3 hours, a 30% chance in 3 hours 10 mins, and a 10% chance in 2 hours 50 mins."
- This is like a weather forecast saying "60% chance of rain" instead of just "It will rain." This makes the prediction much more useful for planning.
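Because the model outputs a distribution over finish-time buckets, a point estimate can still be recovered when one is needed. Here is a small sketch using the example percentages above (the bucket centers in minutes are illustrative, not from the paper):

```python
import numpy as np

def summarize(probs, bin_centers):
    """Collapse a predicted distribution over buckets into a point
    estimate (expected value) and the single most likely bucket."""
    probs = np.asarray(probs, dtype=float)
    expected = float(probs @ bin_centers)          # probability-weighted mean
    mode = float(bin_centers[int(np.argmax(probs))])  # most likely bucket
    return expected, mode

# 10% -> 2h50m (170 min), 60% -> 3h (180 min), 30% -> 3h10m (190 min)
probs = [0.10, 0.60, 0.30]
centers = np.array([170.0, 180.0, 190.0])
expected, mode = summarize(probs, centers)
print(expected, mode)  # 182.0 180.0
```

The distribution itself is the more useful artifact, though: it is what lets a downstream user plan around "60% chance" rather than a single flat answer.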
Why This Matters
This paper proves that you don't need a bigger, more complex robot to solve tabular problems. You just need to simplify the vocabulary (discretize) and teach it to be flexible (smoothing).
- For the General Public: It means that in the future, AI might be better at predicting things like insurance risks, loan approvals, or sports outcomes, not by being a "super-brain," but by learning to think in simple, manageable chunks, just like we do.
- The "Secret Sauce": The key wasn't making the AI smarter; it was making the data easier for the AI to digest by turning smooth numbers into distinct "words" and teaching the AI to respect the time gaps between events.
In a nutshell: They took a fancy, smooth-thinking AI, taught it to speak in "chunks" and "time-stamps," and suddenly, it became the best predictor in the room, beating the old-school experts at their own game.