CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics

Imagine you are trying to teach a very smart robot how to understand the world. You show it a library full of books, and it learns to understand words like "apple," "run," and "happy" perfectly. It knows that "apple" and "pear" are similar because they are both fruits, and that "run" and "sprint" are related actions.

But then, you show it a spreadsheet full of numbers. You ask it to compare "50 years" (a person's age) with "50 kilograms" (a person's weight).

To a standard robot (like the ones we use today), these look almost identical. It sees the number "50" in both cases. It might think, "Oh, 50 is 50, so these two things must be the same!" It fails to realize that 50 years is a lifetime, while 50 kilograms is a heavy backpack. It treats numbers like regular words, ignoring their true meaning, their units, and how they relate to each other.

This is the problem the paper CONE (Complex Numerical Data Embeddings) is trying to solve.

The Problem: The Robot's "Number Blindness"

Current AI models are like a chef who can taste a soup perfectly but can't tell the difference between a cup of sugar and a cup of salt because they are both "white powder."

The Tokenization Trap: When a standard AI sees the number "28,600," it might chop it up into tiny pieces like "28" and "600," losing the meaning of the whole number.
The Unit Confusion: It doesn't understand that "10 miles" and "10 kilograms" are completely different concepts, even though the number "10" is the same.
The Range Blindness: If you tell it a patient's age is "60–70," it struggles to understand that this is a range of time, not just two random numbers.

The Solution: CONE (The "Smart Label" System)

The authors created a new system called CONE. Think of CONE as a super-organized filing cabinet for numbers. Instead of just filing a number under "50," CONE files it under a specific, detailed label that includes three things:

The Value: The actual number (e.g., 50).
The Unit: What it measures (e.g., years, kg, miles).
The Attribute: What it describes (e.g., Age, Weight, Distance).

The Creative Analogy: The "ID Card" for Numbers

Imagine every number in a database gets an ID Card.

Old System (BioBERT): The ID card just says "50." It's like a person walking around with a name tag that only says "John." You don't know if this John is a doctor, a baker, or a teacher.
CONE System: The ID card says: "50 - Years - Age."
- Now, if you see another card that says "50 - Kilograms - Weight," the robot immediately knows these are different people, even though they share the number "50."
- If you see a card that says "50 - Miles - Distance," it knows that's a third, totally different person.

CONE builds these ID cards by combining the number, the unit, and the context into a single, complex "fingerprint" (an embedding vector). This allows the AI to understand that Age: 50 and Weight: 50 are as different as apples and oranges, even though the number is the same.

Handling Complex Shapes: Ranges and Clouds

The paper also tackles tricky number shapes that standard models hate:

Ranges (e.g., "10–20 years"):
- Analogy: Imagine a rubber band. Standard models see the two ends (10 and 20) but forget the stretch in between. CONE sees the whole rubber band. It understands that "10–20" is a specific span of time, distinct from "15–25."
Gaussians (e.g., "1302 ± 0.25"):
- Analogy: Imagine a cloud of dust. The center is the main number (1302), and the "± 0.25" tells you how spread out the cloud is. Standard models see the center but ignore the cloud's size. CONE sees the whole cloud, understanding that a "tight" cloud (small error) is different from a "loose" cloud (big error).

Why Does This Matter? (The Results)

The researchers tested CONE on massive datasets from medicine, finance, and government records. They asked the AI to find similar columns in tables or answer questions that required math.

The "Age vs. Follow-up" Test: In a medical table, there was a column for "Age" and a column for "Follow-up time." They both had numbers like 30, 40, 50. An older model (BioBERT) rated these columns as almost identical (similarity 0.9998) because the numbers looked the same. CONE lowered that similarity to 0.82, separating them correctly.
The Score: On a tough math quiz for AI (called the DROP dataset), CONE reached an F1 score of 87.28%, a 9.37% improvement over the previous best baseline. It's like going from a B+ student to an A+ student in a math class.

The Takeaway

CONE is a new way of teaching AI to respect numbers. It stops treating numbers like random words and starts treating them like the precise, meaningful tools they are. By giving every number a full "ID card" that includes its unit and context, CONE allows AI to finally understand the difference between 50 years and 50 kilograms, making it much smarter at reading medical records, financial reports, and scientific data.

In short: CONE teaches the robot that numbers have a story, a unit, and a context, not just a value.

Detailed Technical Summary of "CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics"

1. Problem Statement

Large Language Models (LLMs) and pre-trained Language Models (LMs) excel at capturing textual semantics but struggle significantly with numerical reasoning and structured data. The core limitations identified are:

Tokenization Issues: Standard tokenizers (e.g., BERT's subword approach) often split numbers into arbitrary parts (e.g., "28,600" becoming "28" and "600"), distorting the original magnitude and semantics.
Lack of Unit/Attribute Awareness: Models treat numbers as generic tokens, failing to distinguish between semantically distinct values that share the same magnitude (e.g., "50 years" vs. "50 kg" or "5 mg dosage" vs. "5 mg weight").
Inability to Handle Complex Structures: Existing models struggle with numerical ranges (e.g., "10–20 years") and Gaussian distributions (e.g., "1302 ± 0.25 nm"), often losing the structural semantics of these data types.
Contextual Blindness: In tabular data, columns with similar numerical distributions but different attributes (e.g., "Age" vs. "Follow-up duration") are often embedded too closely, leading to poor retrieval and reasoning performance.

2. Methodology: The CONE Model

The authors propose CONE, a hybrid pre-trained transformer encoder that embeds numbers, ranges, and Gaussians in a vector space preserving numerical distance, units, and variable semantics.

A. Architecture Overview

CONE builds upon standard transformer encoders (specifically BioBERT) but introduces a composite embedding structure and a numerical fusion mechanism.

Serialization:
- Column-level: Serialized as [CLS] AttributeName [SEP] Value1 [SEP] Value2 ...
- Tuple-level: Serialized as [CLS] Attr1 Val1 [SEP] Attr2 Val2 ...
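
The two serialization patterns above can be sketched as plain string builders. This is a minimal illustration, not the authors' code: the function names `serialize_column` and `serialize_tuple` are hypothetical, and whitespace-joined tokens are an assumption about how the special tokens are laid out.

```python
def serialize_column(attribute, values):
    """Column-level: [CLS] AttributeName [SEP] Value1 [SEP] Value2 ..."""
    parts = ["[CLS]", attribute]
    for v in values:
        parts += ["[SEP]", str(v)]
    return " ".join(parts)


def serialize_tuple(row):
    """Tuple-level: [CLS] Attr1 Val1 [SEP] Attr2 Val2 ..."""
    cells = [f"{attr} {val}" for attr, val in row.items()]
    return "[CLS] " + " [SEP] ".join(cells)


print(serialize_column("Age", [30, 40, 50]))
# [CLS] Age [SEP] 30 [SEP] 40 [SEP] 50
print(serialize_tuple({"Age": 50, "Weight": "70 kg"}))
# [CLS] Age 50 [SEP] Weight 70 kg
```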
Numerical Value Embeddings (Fusion):
- Instead of relying solely on token embeddings, CONE extracts numerical tokens and generates specific numerical value embeddings ($M_N$) using a method like DICE (Deterministic Independent-of-Corpus Embeddings) to capture magnitude.
- These are fused with the contextual encoder embeddings ($M_E$) via element-wise summation.
- The fused representation passes through a lightweight transformer block for number-specific reasoning, producing contextualized numerical embeddings ($M_O$).
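
A rough sketch of the fusion step, under stated assumptions: `magnitude_embedding` is a hypothetical stand-in loosely inspired by DICE (it maps a value to an angle and fills a unit vector with polar-coordinate components), not the exact DICE formulation or the authors' implementation. The value bounds `lo` and `hi` are arbitrary, and the lightweight transformer block that would produce $M_O$ is omitted.

```python
import numpy as np


def magnitude_embedding(x, dim, lo=0.0, hi=1000.0):
    """Hypothetical DICE-like stand-in: map x to an angle, return a unit vector."""
    theta = np.pi * (np.clip(x, lo, hi) - lo) / (hi - lo)
    v = np.empty(dim)
    # Components sin^k(theta) * cos(theta) for k = 0 .. dim-2 ...
    v[:-1] = np.sin(theta) ** np.arange(dim - 1) * np.cos(theta)
    # ... and sin^(dim-1)(theta) last, which makes the vector unit-norm.
    v[-1] = np.sin(theta) ** (dim - 1)
    return v


def fuse(M_E, tokens):
    """Element-wise sum of contextual embeddings M_E and numerical embeddings M_N."""
    M_N = np.zeros_like(M_E)
    for i, tok in enumerate(tokens):
        try:
            value = float(tok)  # only tokens that parse as numbers get a value embedding
        except ValueError:
            continue
        M_N[i] = magnitude_embedding(value, M_E.shape[1])
    return M_E + M_N


tokens = ["age", "50", "years"]
M_E = np.random.default_rng(0).normal(size=(len(tokens), 8))  # placeholder encoder output
fused = fuse(M_E, tokens)
print(fused.shape)  # (3, 8)
```

Rows for non-numeric tokens pass through unchanged, so only the numeric positions carry the extra magnitude signal.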
Composite Embedding Construction:
- The final embedding for a data point is a concatenation of three distinct components: Attribute ($E_a$), Value ($E_v$), and Unit ($E_u$).
- Handling Ranges: A range $[a, b]$ is decomposed into its center ($\frac{a+b}{2}$) and length ($|b-a|$) to provide an orthogonal representation of location and scale.
- Handling Gaussians: A Gaussian ($\mu \pm \sigma$) is decomposed into mean-minus-SD, mean, and mean-plus-SD components.
- Dimensionality Reduction: To handle variable numbers of components, a slot-based concatenation scheme with zero-padding and a binary mask is used, followed by a linear autoencoder to project the result into a fixed-size vector space ($d=768$).
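
The decomposition and slot-padding steps can be illustrated as follows. This is a simplified sketch under assumptions: the tagged-tuple input encoding and the names `decompose` and `to_slots` are hypothetical, each slot holds a single scalar rather than a full component embedding, and the linear autoencoder projection to $d=768$ is left out.

```python
import numpy as np


def decompose(value):
    """Return numeric components for a scalar, a range, or a Gaussian.

    Input encoding (an assumption of this sketch):
      50.0                     -> scalar
      ("range", a, b)          -> range [a, b]
      ("gaussian", mu, sigma)  -> mu +/- sigma
    """
    if isinstance(value, tuple) and value[0] == "range":
        _, a, b = value
        return [(a + b) / 2, abs(b - a)]        # center, length
    if isinstance(value, tuple) and value[0] == "gaussian":
        _, mu, sigma = value
        return [mu - sigma, mu, mu + sigma]     # mean-minus-SD, mean, mean-plus-SD
    return [float(value)]


def to_slots(components, n_slots=3):
    """Zero-pad components to a fixed number of slots and build a binary mask."""
    slots = np.zeros(n_slots)
    mask = np.zeros(n_slots, dtype=int)
    slots[:len(components)] = components
    mask[:len(components)] = 1
    return slots, mask


for v in [50, ("range", 10, 20), ("gaussian", 1302, 0.25)]:
    slots, mask = to_slots(decompose(v))
    print(v, slots, mask)
```

The mask lets a downstream layer tell a padded zero apart from a genuine component whose value happens to be zero.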

B. Training Objective

The model is trained using a Masked Numeral Prediction task. The loss function ($L_{num}$) combines:

Magnitude Regression: Predicting the log-magnitude of masked numbers.
Classification Loss: Predicting the class of the masked token.
This ensures the model learns both the quantitative value and the semantic context of the number.
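
One way such a combined loss might look for a single masked numeral is sketched below. The summary does not give the exact form of $L_{num}$, so several choices here are assumptions: the log-magnitude target `log(1 + |x|)`, squared error for the regression term, softmax cross-entropy for the classification term, and the weighting factor `lam`.

```python
import numpy as np


def masked_numeral_loss(pred_log_mag, true_value, class_logits, true_class, lam=1.0):
    """Regression on log-magnitude plus cross-entropy on the token class (assumed form)."""
    target = np.log1p(abs(true_value))              # assumed log-magnitude target
    regression = (pred_log_mag - target) ** 2

    shifted = class_logits - np.max(class_logits)   # numerically stable log-softmax
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    classification = -log_probs[true_class]

    return regression + lam * classification


# A perfect magnitude prediction leaves only the classification term.
loss = masked_numeral_loss(np.log1p(50.0), 50.0, np.array([0.0, 0.0]), true_class=0)
print(loss)
```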

3. Key Contributions

Novel Composite Embedding Structure: A method that jointly encodes the attribute name, numerical value (scalar, range, or Gaussian), and unit. This ensures that identical numbers with different contexts (e.g., 5 km vs. 5 kg) are represented distinctly.
Specialized Embeddings for Complex Data: The introduction of specific algorithms to encode ranges (via center/length) and Gaussians (via mean/SD decomposition), preserving their structural semantics.
Two Novel Algorithms:
- Algorithm 1: Training via Masked Numeral Prediction.
- Algorithm 2: Composite Embedding Construction (handling scalar, range, and Gaussian inputs).
Extensive Evaluation: Validation across diverse domains (Medical, Finance, Government, Web) and tasks (Numerical Reasoning, Schema Matching, Column/Tuple Retrieval).

4. Experimental Results

The paper evaluates CONE against State-of-the-Art (SOTA) baselines including BERT, BioBERT, NumNet, TAPAS, NC-BERT, and general-purpose embedding models (BGE-M3, Stella, etc.).

Numerical Reasoning (DROP Benchmark):
- CONE achieved an F1 score of 87.28% on the DROP dataset.
- This represents a 9.37% improvement in F1 over the previous best baseline (NC-BERT) and outperforms NumNet and AeNER.
- It showed significant gains in tasks requiring counting, sorting, and addition.
Distance Preservation Analysis:
- Scalars: CONE achieved a Pearson correlation ($r$) of 0.989 between embedding distances and actual numerical differences, compared to ~0.07 for BioBERT.
- Ranges: Achieved $r=0.997$ for Euclidean distance correlation, demonstrating accurate preservation of range proximity.
- Gaussians: Achieved $r=0.689$ for Wasserstein distance correlation, significantly outperforming baselines.
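
A distance-preservation check of this kind can be sketched as a Pearson correlation over all pairs of values. The pairing scheme and the helper names are assumptions, since the summary does not describe the exact protocol. For Gaussians, the sketch uses the closed-form 2-Wasserstein distance between one-dimensional normal distributions; which Wasserstein order the paper uses is not stated here.

```python
from itertools import combinations

import numpy as np


def distance_correlation(values, embeddings):
    """Pearson r between |x_i - x_j| and the Euclidean distance of the embeddings."""
    pairs = list(combinations(range(len(values)), 2))
    numeric = [abs(values[i] - values[j]) for i, j in pairs]
    embedded = [np.linalg.norm(embeddings[i] - embeddings[j]) for i, j in pairs]
    return np.corrcoef(numeric, embedded)[0, 1]


def gaussian_w2(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    return np.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


values = [1.0, 5.0, 20.0, 100.0]
identity_embeddings = np.array(values).reshape(-1, 1)  # trivially distance-preserving
print(distance_correlation(values, identity_embeddings))
print(gaussian_w2(0.0, 1.0, 3.0, 5.0))
```

An embedding that preserves numerical distance scores near 1 on this check, while one that ignores magnitude scores near 0.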
Downstream Tasks (Column & Tuple Matching):
- Recall@10: CONE improved Recall@10 by up to 25% over SOTA baselines (e.g., 95% recall on WebTables vs. 70–80% for others).
- Semantic Separation: In a case study, BioBERT assigned a similarity of 0.9998 between "Age" and "Follow-up (months)" (semantically distinct but numerically similar). CONE reduced this to 0.82, correctly separating them while keeping semantically similar attributes close.
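
For readers unfamiliar with the metric, a minimal sketch of Recall@k over cosine similarity follows. The exact retrieval setup in the paper is not given in this summary, so the definition used here (fraction of relevant items that appear among the top k candidates) and the function names are assumptions.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def recall_at_k(query, candidates, relevant_ids, k=10):
    """Fraction of relevant candidates ranked in the top k by cosine similarity."""
    scores = [cosine_similarity(query, c) for c in candidates]
    top_k = np.argsort(scores)[::-1][:k]             # indices of the k highest scores
    hits = len(set(top_k.tolist()) & set(relevant_ids))
    return hits / len(relevant_ids)


query = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(recall_at_k(query, candidates, relevant_ids={0, 2}, k=2))  # 1.0
print(recall_at_k(query, candidates, relevant_ids={0, 2}, k=1))  # 0.5
```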
Ablation Studies:
- Removing the numerical module (CONE1) or the composite structure (CONE2) caused significant performance drops (up to 16.7% Recall decrease), confirming the necessity of both components.

5. Significance

Bridging the Gap: CONE addresses a critical gap in LLMs: the inability to handle structured numerical data with units and complex distributions.
Unit and Attribute Disambiguation: By explicitly encoding units and attributes, the model solves the "polysemy" problem of numbers (e.g., distinguishing "5 years" from "5 kg"), which is crucial for scientific and medical data analysis.
Scalability and Efficiency: The model uses a standard transformer architecture with a lightweight addition, making it compatible with existing pipelines while offering superior performance without requiring LLM-based re-ranking (unlike some SOTA schema matching tools).
Generalizability: The approach is domain-agnostic, validated across medical (CancerKG, CovidKG), government (SAUS, CIUS), and web (WebTables) datasets, proving its utility for real-world structured data management.

In conclusion, CONE demonstrates that treating numbers as mere text tokens is insufficient. By decomposing numerical data into its semantic constituents (attribute, value, unit) and encoding them with magnitude-aware mechanisms, the model achieves state-of-the-art performance in numerical reasoning and structured data retrieval.