Formally Verified Linear-Time Invertible Lexing

This paper presents ZipLex, a formally verified framework in Scala that achieves linear-time invertible lexical analysis with longest-match semantics by combining a novel token sequence abstraction with verified data structures like zippers and memoization, demonstrating performance comparable to unverified tools while providing rigorous proofs of correctness and invertibility.

Samuel Chassot, Viktor Kunčak

Published Mon, 09 Ma
📖 5 min read🧠 Deep dive

Imagine you are a master translator working in a bustling library. Your job is to take a long, messy stream of raw text (like a novel or a code file) and break it down into neat, labeled cards called tokens. These tokens are the building blocks that computers use to understand the text. This process is called lexing.

Usually, once the computer has these cards, it can rearrange them, sort them, or edit them. But here's the problem: if you take those cards, glue them back together into a sentence, and try to read them again, the computer might get confused. It might merge two words into one, or miss a space, changing the meaning forever.

ZipLex is a new, super-smart tool created by researchers Sam and Viktor. It solves this problem by guaranteeing that the translation process is perfectly reversible. You can break the text down, shuffle the cards, glue them back together, and the computer will read it exactly the same way as before. No information is ever lost.

Here is how they did it, explained with some everyday analogies:

1. The "Longest Match" Puzzle

Imagine you are reading a sentence and you see the letters a, a, a, b.

  • Rule A: "Match any letter a."
  • Rule B: "Match aaab."

A standard translator might grab the first a and stop. But a "smart" translator (using Longest Match) knows it should grab the whole aaab because that's the biggest chunk it can find.

The Problem: Sometimes, if you glue tokens back together without being careful, the "smart" translator gets tricked.

  • Example: You have a token for val and a token for x. If you glue them to make valx, the computer might think that's one single word (an identifier) instead of two separate words.
  • ZipLex's Solution: ZipLex checks a special "separability" rule before gluing anything. It asks: "If I glue these two cards together, will the computer still recognize them as two separate cards?" If the answer is no, it refuses to glue them or adds a tiny invisible spacer to keep them safe. This ensures that Lexing → Printing → Lexing always brings you back to the exact same starting point.

2. The "Magic Memo-Book" (Linear Time)

Usually, checking if a text matches a pattern is like searching a library for a specific book. If you do it the slow way, you might check every single book on every single shelf. If the library gets huge, this takes forever (quadratic time).

The researchers added a Magic Memo-Book (a verified hash table).

  • How it works: Every time the computer figures out a pattern for a specific chunk of text, it writes the answer in the book. Next time it sees that same chunk, it just flips to the page and reads the answer instantly.
  • The Result: Instead of taking hours to process a massive document, ZipLex does it in a time that grows perfectly in step with the document's size (linear time). It's like having a librarian who remembers every book you've ever asked for, so you never have to wait.

3. The "Zipper" Trick

To make the pattern matching fast, they used a concept called Zippers (named after the thing on your jacket).

  • The Analogy: Imagine a long train of train cars (the text). A normal way to look at the train is to walk from the front to the back every time you want to check a car.
  • The Zipper Way: A zipper lets you "focus" on one specific car while keeping the rest of the train organized around it. You can slide your focus forward or backward instantly without rebuilding the whole train. This makes the computer incredibly fast at scanning text.

4. Why "Verified" Matters

Most software is tested by running it a million times to see if it crashes. If it doesn't crash, we assume it's good.

  • ZipLex is different: The researchers didn't just test it; they used a mathematical proof assistant (called Stainless) to prove that the code is 100% correct.
  • Think of it like building a bridge. Instead of just driving a truck over it to see if it holds, they used math to prove that no matter what, the bridge will never collapse. This is crucial for things like compilers (which build other software) or security tools where a single mistake could be disastrous.

The Bottom Line

ZipLex is a super-accurate, super-fast, and mathematically proven tool for breaking text into pieces and putting it back together.

  • It's Invertible: You can edit the pieces and put them back together without losing any meaning.
  • It's Fast: It uses a "memory book" to avoid doing the same work twice, making it fast enough for real-world use (like processing JSON files or programming languages).
  • It's Safe: It has been mathematically proven to work correctly, unlike most other tools that just "seem" to work.

In short, ZipLex is the first tool that lets you play with the building blocks of code or text with the confidence that you can always rebuild the original structure perfectly, and do it lightning-fast.