IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings

The paper introduces IntSeqBERT, a dual-stream Transformer that combines continuous log-scale magnitude embeddings with modulo-spectrum embeddings to learn the arithmetic structure of OEIS integer sequences. It significantly outperforms standard tokenized baselines in both sequence-modeling accuracy and next-term prediction, the latter via a probabilistic Chinese Remainder Theorem solver.

Kazuhisa Nakasho

Published Mon, 09 Ma

Here is an explanation of the paper IntSeqBERT, translated into simple language with creative analogies.

The Big Picture: Teaching a Robot to Count Like a Mathematician

Imagine you have a giant library called the OEIS (The On-Line Encyclopedia of Integer Sequences). It contains hundreds of thousands of number patterns, from simple ones like "1, 2, 3, 4" to incredibly complex ones involving massive factorials and astronomical numbers.

The goal of this paper is to teach an AI to look at a sequence of numbers, hide some of them, and guess what the missing numbers are. This is like a "fill-in-the-blanks" game for math.

However, standard AI models (like the ones that power chatbots) are terrible at this for two reasons:

  1. They run out of words: Standard models treat numbers like words in a dictionary. If a number is too big (like a number with 50 zeros), the model has never seen it before and just says, "I don't know."
  2. They miss the rhythm: Math isn't just about size; it's about patterns. For example, every second number in a sequence might be even, or every third number might end in a 5. Standard models struggle to "hear" these rhythmic patterns when numbers get huge.

The Solution: IntSeqBERT (The Dual-Brain Robot)

The authors built a new model called IntSeqBERT. Instead of treating numbers as single words, they gave the model a "dual-brain" approach to understand numbers in two different ways simultaneously.

Think of it like describing a person. You wouldn't just say their name (which might be unique and hard to remember); you would describe their height and their clothing style.

1. The Magnitude Stream (The "Height" Sensor)

This part of the model looks at how big the number is.

  • The Analogy: Imagine a ruler that measures the "loudness" of a number. Instead of counting every single digit (which is hard for huge numbers), it measures the volume of the number on a logarithmic scale.
  • What it does: It tells the model, "This number is roughly as big as a mountain," or "This number is as big as a grain of sand." This helps the model handle numbers that are too big to write down.
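As a rough sketch of the idea (not the paper's exact embedding; `magnitude_feature` is a hypothetical helper name), the magnitude stream can be thought of as a signed log transform that squashes even 100-digit numbers into a small, well-behaved range:

```python
import math

def magnitude_feature(n: int) -> float:
    """Compress a (possibly huge) integer to a log-scale 'loudness' value.

    Signed log transform: sign(n) * log(1 + |n|). Illustrative sketch only;
    the paper's continuous magnitude embedding may differ in detail.
    """
    if n == 0:
        return 0.0
    sign = 1.0 if n > 0 else -1.0
    return sign * math.log1p(abs(n))
```

A number like $10^{100}$, impossible to spell out as a dictionary token, collapses to a value around 230, which a neural network can handle comfortably.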

2. The Modulo Stream (The "Rhythm" Sensor)

This is the paper's secret sauce. This part looks at the remainders when numbers are divided by small numbers (like 2, 3, 4... up to 101).

  • The Analogy: Think of a clock. No matter how many hours pass, the face only ever shows 1 through 12. Similarly, if you divide any number by 7, the remainder is always between 0 and 6.
  • The Magic: Even if a number is astronomically huge (like $10^{100}$), its "remainder" when divided by 7 follows a simple, repeating pattern. By analyzing these remainders for 100 different "clocks" (moduli), the model learns the hidden rhythm of the sequence.
  • Why it works: It's like knowing that a song always has a drumbeat on the 4th count. Even if the song gets louder and louder (the number gets bigger), the drumbeat pattern stays the same.
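The "100 clocks" idea is easy to make concrete. A minimal sketch (using a small subset of moduli; the paper's spectrum uses many more, up to 101):

```python
def modulo_spectrum(n: int, moduli=(2, 3, 5, 7, 11)) -> list[int]:
    """Residues of n under each small 'clock' (modulus).

    Python's % operator returns a nonnegative residue, so each entry
    is bounded by its modulus no matter how large n is.
    """
    return [n % m for m in moduli]

# Even an astronomically large number has a tiny, bounded spectrum:
huge = 10**100
spectrum = modulo_spectrum(huge)  # five small integers, one per clock
```

The key point: while `huge` has 101 digits, its spectrum is just a handful of small integers that a Transformer can embed like ordinary tokens.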

3. The Fusion (The "Conductor")

The model uses a technique called FiLM (Feature-wise Linear Modulation) to combine these two streams.

  • The Analogy: Imagine the "Magnitude" stream is the singer, and the "Modulo" stream is the conductor. The conductor tells the singer, "You are singing a very loud note (big number), but remember to keep the rhythm of the drumbeat (modulo pattern)."
  • This allows the model to predict the size of the number while strictly adhering to the mathematical rules of the sequence.
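FiLM itself is a simple operation: one stream predicts a per-feature scale and shift that is applied to the other stream. A stripped-down sketch (in the actual model, `gamma` and `beta` would be produced from the modulo-stream features by a small learned network, which is omitted here):

```python
def film(x, gamma, beta):
    """Feature-wise Linear Modulation: per-feature scale and shift.

    x:     magnitude-stream feature vector (the 'singer')
    gamma: multiplicative modulation from the modulo stream (the 'conductor')
    beta:  additive modulation from the modulo stream
    """
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]
```

So the modulo stream never replaces the magnitude features; it rescales and nudges them, feature by feature, to keep the "singer" on beat.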

The Results: Beating the Competition

The researchers tested this new robot against a standard "dictionary-based" AI (a vanilla Transformer that tokenizes numbers) and an ablated version of their own model that only looked at size (the magnitude stream alone).

  • The "Dictionary" AI: When numbers got too big, it failed completely. It was like trying to read a book where half the words were replaced with "UNKNOWN."
  • IntSeqBERT: It crushed the competition.
    • It predicted the size of numbers with 95.8% accuracy.
    • It correctly guessed the mathematical "rhythm" (modulo) 50% of the time (which is huge for such complex math).
    • The "Solver" Trick: The model doesn't just guess a number; it uses a mathematical tool called the Chinese Remainder Theorem (think of it as a super-smart puzzle solver) to combine all its small guesses (remainders) into one giant, correct number.
    • The Win: When asked to predict the next number in a sequence, IntSeqBERT was 7.4 times better than the standard AI.
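The "super-smart puzzle solver" step rests on classic number theory: given remainders under pairwise-coprime clocks, the Chinese Remainder Theorem pins down a unique number modulo the product of the clocks. A minimal deterministic sketch (the paper's solver is probabilistic, weighing the model's per-modulus confidence, which is not shown here):

```python
from math import prod

def crt(residues, moduli):
    """Combine residues under pairwise-coprime moduli into one integer.

    Returns the unique x in [0, prod(moduli)) with x % m == r for each
    (r, m) pair. Sketch of the solver's core idea only.
    """
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m) = modular inverse of Mi mod m
    return x % M
```

For example, `crt([1, 2, 3], [2, 3, 5])` recovers 23: the only number below 30 that leaves remainder 1 on the 2-clock, 2 on the 3-clock, and 3 on the 5-clock. The model's many small remainder guesses get fused into one large candidate the same way.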

The Big Discovery: Composite Numbers are Superheroes

The paper found something fascinating about the "rhythm" part.

  • They tested 100 different "clocks" (moduli).
  • They discovered that composite numbers (numbers made of smaller factors, like 60 or 96) were much better at capturing the sequence's structure than prime numbers.
  • The Analogy: Imagine trying to guess a secret code. If you only check if a number is even (divisible by 2), you get some info. But if you check if it's divisible by 60, you are simultaneously checking if it's divisible by 2, 3, 4, 5, 6, 10, 12, 15, 20, and 30. It's like checking 10 clues at once with a single question. The model learned that these "multi-clue" clocks were the most efficient way to understand the math.
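The "10 clues at once" claim is a plain number-theoretic fact: if d divides m, then n mod d is fully determined by n mod m, via (n mod m) mod d. A tiny demonstration (`residues_from_composite` is a hypothetical helper name, not from the paper):

```python
def residues_from_composite(n_mod_60: int) -> dict[int, int]:
    """Knowing n mod 60 determines n mod every divisor of 60.

    For each proper divisor d of 60 (besides 1), the residue of n mod d
    is just (n mod 60) mod d -- ten clues from a single clock.
    """
    divisors = [2, 3, 4, 5, 6, 10, 12, 15, 20, 30]
    return {d: n_mod_60 % d for d in divisors}
```

This is why a composite "clock" like 60 is so information-dense for the model: one embedded residue implicitly carries the answer to ten smaller divisibility questions.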

Summary

IntSeqBERT is a new AI that learns math not by memorizing a dictionary of numbers, but by understanding how big a number is and what pattern it follows. By listening to the "rhythm" of numbers (remainders) and combining it with their "size," it can solve math puzzles that stump standard AI, especially when the numbers get astronomically large. It proves that to understand the universe of numbers, you need to listen to the beat, not just count the notes.