Imagine you have a brilliant, world-famous chef (the Large Language Model or LLM) who has spent years mastering the art of cooking complex French cuisine. Now, you want to use this chef to bake a simple, perfect loaf of bread (a Time Series Forecast, like predicting tomorrow's stock price or weather).
The current trend in the tech world is to say, "Let's just give this French chef some bread dough and see what happens!" But there's a catch: the chef doesn't speak "dough." So, researchers built a translator (the Tokenizer, a small trainable layer that turns raw time-series numbers into inputs the model understands) and a reverse translator (the Detokenizer, which turns the chef's "French" outputs back into numbers, i.e., the forecast).
This paper, titled "From Tokenizer Bias to Backbone Capability," is a controlled experiment to answer one big question: Is the chef actually doing the cooking, or is the translator just doing all the work?
Here is the breakdown of their findings using simple analogies:
1. The Problem: The "Over-Adapting" Translator
The authors noticed that in previous studies, researchers tested these chefs on very small batches of dough (small datasets).
- What happened: The translators (Tokenizer/Detokenizer) were so good at adapting to that specific small batch of dough that they memorized the recipe perfectly. They learned to translate "dough" into "French" so well that even if you swapped the French chef for a random person, the bread still turned out great.
- The Flaw: Because the translators were so specialized to the small data, they masked the chef's actual talent. It looked like the chef was a genius, but really, the translators were just doing all the heavy lifting. The researchers call this "Tokenizer Bias."
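This over-adaptation effect is easy to reproduce in miniature. The sketch below is illustrative only (the sizes and names are made up, not taken from the paper): it freezes a randomly initialized "backbone" and fits only a linear output head on a tiny dataset. Because the trained head has more parameters than there are training samples, it memorizes the data almost perfectly, and the frozen backbone deserves none of the credit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny "dataset": 20 windows of a noisy sine wave; predict the next value.
t = np.arange(0, 20, 0.1)
series = np.sin(t) + 0.05 * rng.standard_normal(t.size)
X = np.stack([series[i:i + 16] for i in range(20)])   # (20, 16) input windows
y = np.array([series[i + 16] for i in range(20)])     # (20,)   next-step targets

# "Backbone": a frozen, randomly initialized projection (stand-in for the LLM).
W_frozen = rng.standard_normal((16, 64))
H = np.tanh(X @ W_frozen)                             # frozen hidden features

# "Detokenizer": the only trained part, fit by least squares on the tiny set.
w_head, *_ = np.linalg.lstsq(H, y, rcond=None)
train_mse = float(np.mean((H @ w_head - y) ** 2))
print(f"train MSE with a frozen random backbone: {train_mse:.2e}")
```

This mirrors the Tokenizer Bias claim: on small data, the trained adapters alone can drive the error to near zero, so a good benchmark score says very little about the backbone in the middle.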
2. The Experiment: The Three "Chef" Scenarios
To find out the truth, the researchers set up a fair test using three different scenarios, all using the exact same kitchen setup (architecture) but different training histories:
- Scenario A (The Real Chef): They took the famous French chef (pre-trained LLM) and kept them frozen (didn't let them learn new recipes). They only trained the translators on a massive amount of dough data (100 million samples) so the translators wouldn't be biased toward a small batch.
- Result: The bread was okay, but not amazing. The famous chef's years of French training didn't seem to add much magic.
- Scenario B (The Blank Slate Chef): They took a random person with no cooking experience (randomly initialized weights) and trained them on the same massive dough data, but kept the translators frozen (using the ones from Scenario A).
- Result: Surprisingly, this random person performed almost as well as the famous French chef.
- Scenario C (The Full Team): They trained the random person and the translators together from scratch on the massive data.
- Result: This team did the best. But the key takeaway is that the "famous chef" (the pre-trained LLM) didn't seem to provide a significant advantage over the random person who just learned the specific task.
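In training terms, the three scenarios differ only in which components receive gradient updates. Here is a minimal sketch of that freeze schedule (the component names are my shorthand, not identifiers from the paper's code):

```python
# Which parts are trained (True) vs frozen (False) in each scenario.
SCENARIOS = {
    "A_pretrained_backbone_frozen":      {"tokenizer": True,  "backbone": False, "detokenizer": True},
    "B_random_backbone_adapters_frozen": {"tokenizer": False, "backbone": True,  "detokenizer": False},
    "C_joint_from_scratch":              {"tokenizer": True,  "backbone": True,  "detokenizer": True},
}

def trainable_parts(scenario: str) -> list[str]:
    """Names of the components the optimizer should update."""
    return [name for name, trains in SCENARIOS[scenario].items() if trains]

for name in SCENARIOS:
    print(name, "->", trainable_parts(name))
```

Because Scenario B reuses the adapters trained in Scenario A, a clear win for A over B would isolate the value of the pre-trained backbone; the paper reports that this gap is small.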
3. The Big Revelations
The study uncovered a few surprising truths:
- The "Vocabulary" Mismatch: The researchers tried to force the dough to look like words the chef already knew (matching time series to the LLM's vocabulary). It was like trying to explain baking a loaf of bread using only words about painting a portrait. It didn't help; in fact, it made the results worse. The chef's "language brain" doesn't naturally understand "time patterns."
- Size Doesn't Matter (Much): They tested bigger, smarter chefs (larger LLMs like LLaMA-8B). You'd think a bigger brain would be better at predicting the future, but no. The bigger chefs performed slightly worse or the same as the smaller ones. Their massive knowledge of human language didn't translate to predicting weather or traffic.
- The "Small Data" Trap: When you test on small datasets, the translators can cheat by memorizing the data. You need to test on "Zero-Shot" scenarios (predicting a brand new type of dough the chef has never seen) to see if the chef actually has the skill. Even then, the pre-trained knowledge didn't help much.
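"Zero-shot" evaluation simply means scoring forecasts on a series that contributed nothing to training, so the adapters cannot have memorized it. Below is a minimal sketch of that protocol, using a naive last-value forecaster as a stand-in model (all names here are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_mse(forecast, unseen_series, context=16):
    """Next-step forecast error on a series never used in training."""
    errs = [
        (forecast(unseen_series[i:i + context]) - unseen_series[i + context]) ** 2
        for i in range(len(unseen_series) - context)
    ]
    return float(np.mean(errs))

naive = lambda window: window[-1]              # predict "tomorrow = today"
rng = np.random.default_rng(1)
unseen = np.cumsum(rng.standard_normal(300))   # a random walk the model never saw
print(f"zero-shot MSE of the naive baseline: {zero_shot_mse(naive, unseen):.3f}")
```

A pre-trained backbone that truly transferred knowledge should clearly beat such trivial baselines on unseen series; the paper finds the advantage is modest at best.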
The Bottom Line
The paper concludes that using a pre-trained Large Language Model as the "brain" for time series forecasting is currently overhyped.
Think of it like this: You wouldn't hire a Nobel Prize-winning physicist to fix your car engine just because they are smart. They might be brilliant, but they haven't learned the specific mechanics of engines. Similarly, LLMs are brilliant at understanding human language, but they haven't learned the specific "mechanics" of time series data.
The researchers found that a model trained specifically on time series data (even a simpler one) often outperforms a massive, pre-trained language model. The "magic" of the LLM isn't transferring to this new job because the job requires a different kind of intelligence.
In short: Don't just throw a giant language model at a time series problem and hope for the best. The translators are doing the work, not the giant brain. To get good results, you need a model specifically trained on the data, not just a generalist with a big vocabulary.