Imagine you are trying to teach a robot how to do math. But not just simple addition; you want it to handle the messy, imperfect math that computers actually use in the real world (like calculating the trajectory of a rocket or the price of a stock). This is called floating-point arithmetic.
The problem is that this kind of math is notoriously tricky. Computers can't store every decimal perfectly, so they round things off, which introduces tiny errors. These errors can pile up and cause disasters.
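Those tiny rounding errors are easy to see in any language that uses standard 64-bit floats. A minimal Python illustration (not from the paper, just the classic textbook examples):

```python
# Illustrative only: classic floating-point rounding surprises.

# Decimal 0.1 has no exact binary representation, so the computer
# stores the nearest value it can. The error shows up immediately:
print(0.1 + 0.2)         # 0.30000000000000004, not 0.3
print(0.1 + 0.2 == 0.3)  # False

# And tiny errors pile up: adding 0.1 ten times
# does not give exactly 1.0.
total = 0.0
for _ in range(10):
    total += 0.1
print(total)         # 0.9999999999999999
print(total == 1.0)  # False
```

Each individual error here is around one part in ten quadrillion; the danger is that loops can repeat the addition millions of times, letting those errors accumulate.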
For years, researchers have been building tools to check if this math is safe. But to test their tools, they've been using a very specific, tiny set of practice problems (called benchmarks). It's like trying to teach a pilot to fly a commercial airliner by only practicing in a small, empty hangar with a toy plane. The researchers in this paper asked: "Is our practice hangar actually like the real sky?"
Here is the story of their investigation, broken down simply:

1. The Great GitHub Treasure Hunt
The researchers decided to stop guessing and start looking at the real thing. They went to GitHub, which is like a massive, public library containing millions of software projects written by people all over the world.
- The Challenge: There are too many projects to read them all. It's like trying to read every book in a library the size of a city.
- The Strategy: They built a clever "fishing net" of their own: a program called Scyros. Instead of grabbing only the most popular projects (which might be a biased sample), it drew a random sample of millions of projects to get a fair picture.
- The Filter: They only looked at languages where the "rules" of the game are strict (statically typed languages). Why? Because in these languages, the source code itself tells you exactly what kind of numbers you are using (like "this is a floating-point number" or "this is an integer"). In dynamically typed languages, the type of a variable isn't known until the program actually runs, which makes it impractical to scan millions of projects automatically.
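That typing distinction is the whole reason the filter works. A hedged sketch (hypothetical function, not from the paper's dataset) of why a dynamically typed language defeats this kind of scanning:

```python
# Illustrative only: in a dynamically typed language like Python,
# a tool reading this source text cannot tell whether `scale`
# does floating-point math at all.

def scale(x, factor):
    return x * factor

# The same function works on integers, floats, even lists;
# the actual types are only known when the program runs:
print(scale(3, 2))      # 6         (integer math)
print(scale(3.0, 2.5))  # 7.5       (floating-point math)
print(scale([0], 3))    # [0, 0, 0] (no arithmetic at all)

# In a statically typed language, the signature itself would say so,
# e.g. in C: double scale(double x, double factor);
```

A scanner looking at a statically typed codebase can read the declared types straight off the page, which is what makes analyzing hundreds of thousands of projects feasible.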
2. What They Found: The "Real World" vs. The "Practice Field"
After sifting through 447,000 projects and extracting 10 million functions (little chunks of code), they found some surprising things:
- Floating-point math is everywhere: They confirmed that about 62% of all software projects use this tricky math. It's not rare; it's the backbone of modern computing.
- The "Toy" Benchmarks are too simple: The practice problems researchers usually use (like the FPBench suite) are very clean. They are like isolated math equations on a whiteboard.
- Real Code: In the real world, these math functions are messy. They are wrapped in loops (repeating actions), if-statements (decisions like "if the temperature is too high, stop"), and they call other functions constantly.
- The Mismatch: The practice benchmarks rarely have these messy features. It's like testing a pilot's ability to handle turbulence by only having them fly in perfectly calm weather. The tools built to test the math are great at solving the "whiteboard equations" but might fail when faced with the "messy cockpit" of real software.
- The "Special" Libraries are rare: Researchers often use functions from the GNU Scientific Library (GSL) to test their tools. The study found that in real-world code, people almost never use these specific library functions. They use their own custom math or standard math libraries instead.
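To make the mismatch concrete, here is a hedged sketch (hypothetical functions, not taken from the paper's dataset) contrasting a clean "whiteboard" kernel with the loop-and-branch-heavy style the study found in the wild:

```python
import math

# "Whiteboard" style, like a typical FPBench kernel: one straight-line
# arithmetic expression. No loops, no branches, no calls into other code.
def benchmark_style(a, b, c):
    # One root of the quadratic a*x^2 + b*x + c
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

# "Real-world" style: similar arithmetic, but wrapped in the messiness
# the study found everywhere: a loop, if-statements, and calls
# to other functions.
def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def average_reading(readings, limit):
    if not readings:          # decision: guard against empty input
        return 0.0
    total = 0.0
    for r in readings:        # loop over the data
        if r < 0.0:           # decision: skip invalid sensor values
            continue
        total += clamp(r, 0.0, limit)  # call into another function
    return total / len(readings)

print(benchmark_style(1.0, -3.0, 2.0))          # 2.0, a root of x^2 - 3x + 2
print(average_reading([1.5, -2.0, 4.0], 3.0))   # 1.5
```

A tool that can bound the rounding error of `benchmark_style` may still have nothing to say about `average_reading`, because the error now depends on how many times the loop runs and which branches the data takes.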
3. The Analogy: The "Cooking Class"
Imagine you are a cooking instructor trying to teach students how to make a perfect soufflé (the floating-point math).
- The Old Way: You give them a recipe card with just the ingredients and the mixing instructions. You test them on this card. They pass! But then, you ask them to make a soufflé in a busy, noisy kitchen with a broken oven and a customer yelling at them (the real world). They fail miserably because the test didn't prepare them for the chaos.
- The New Way (This Paper): The researchers went into thousands of real kitchens (GitHub), took photos of how chefs actually cook, and realized: "Hey, real chefs deal with broken ovens and shouting customers all the time!"
- The Result: They created a new set of practice problems (the 59 Challenge Benchmarks) that include the broken ovens and the shouting. These are harder, messier, and much better at preparing tools for the real world.
4. Why This Matters
The paper concludes that we need to stop building tools that only work on "perfect" code. We need tools that can handle the messy, decision-heavy, looping code that developers actually write.
- The Dataset: They released a massive dataset of 10 million real-world math functions for other researchers to use.
- The Goal: To help build the next generation of tools that can actually keep our software safe, whether it's controlling a self-driving car, managing a bank account, or launching a satellite.
In a nutshell: The researchers looked at the real world to realize our practice tests were too easy. They built a new, harder, more realistic set of tests so that the tools we use to keep our software safe are actually ready for the job.