On Solving String Equations via Powers and Parikh Images

Imagine you are a detective trying to solve a mystery where the clues are written in a secret code made of strings of letters. Your job is to figure out if there is a way to fill in the blanks (the variables) so that two long, complicated sentences become identical.

This paper introduces a new detective toolkit, a solver the authors call ZIPT, that cracks these "string equations" in cases where previous tools get stuck.

Here is how the paper works, explained through simple analogies:

The Problem: The "Infinite Loop" Trap

Imagine you are given a puzzle like this:

"The word x followed by b followed by x followed by a is the same as a followed by x followed by b followed by x."

If you try to solve this by just guessing what x is (like "Is x 'a'?", "Is x 'ab'?"), you might get stuck in an infinite loop. The older methods would keep breaking the word down into smaller and smaller pieces, never realizing that x is actually a repeating pattern. It's like trying to count the grains of sand on a beach one by one instead of realizing they are all part of a single pile.

The authors' new approach uses three "superpowers" to avoid these traps.

Superpower 1: The "Power Button" (String Powers)

The Analogy: Imagine you have a photocopier. Instead of writing out "AAAAA" (five A's) letter by letter, you just write "A⁵" (A to the power of 5).

How it helps:
In old solvers, if a variable x had to be a very long repeating string (like "ababab..."), the computer would try to write out every single "ab" one by one. This can take forever or exhaust the available memory.
The new method introduces a Power Operator. It recognizes that x is actually just a pattern repeated $m$ times. Instead of expanding the whole string, it keeps it compressed as x = (pattern)ᵐ.

Real-world impact: This allows the solver to handle equations where the answer is a string that would be billions of characters long, without actually writing them all down.
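
To make the photocopier idea concrete, here is a minimal sketch of a compressed repetition. It is not ZIPT's actual data structure; the `Power` class and its method names are invented for this illustration. The point is only that length and individual characters can be computed from the pattern and the exponent, without ever building the full string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Power:
    """A string written as `base` repeated `exponent` times (hypothetical type)."""
    base: str
    exponent: int

    def length(self) -> int:
        # The length is computed arithmetically, never by building the string.
        return len(self.base) * self.exponent

    def char_at(self, index: int) -> str:
        # Any character can be read by reducing the index modulo the base length.
        if not 0 <= index < self.length():
            raise IndexError(index)
        return self.base[index % len(self.base)]


x = Power("ab", 10**9)          # "ab" repeated a billion times
print(x.length())               # 2000000000
print(x.char_at(10**9 + 1))     # the index is odd, so this prints "b"
```

The object `x` stands for a two-billion-character string but stores only the two-letter pattern and one integer.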

Superpower 2: The "Scissors" (Equality Decomposition)

The Analogy: Imagine you have two long ribbons tied together, and you need to see if they match. If you know the first 5 inches of Ribbon A are the same as the first 5 inches of Ribbon B, you can just cut them off and focus on the rest.

How it helps:
Sometimes, the two sides of the equation are so long and messy that you can't see the pattern. The "Equality Decomposition" technique looks at the lengths of the different parts. If it knows that one part is exactly 3 characters longer than the other, it can "pad" the shorter side with a placeholder (like a blank space) and then cut the equation into two smaller, easier puzzles.

Real-world impact: It breaks a giant, scary problem into two small, manageable problems that are easy to solve.

Superpower 3: The "Inventory Counter" (Parikh Images)

The Analogy: Imagine you are checking if two bags of groceries are the same. Instead of looking at the order of the items (did the milk come before the eggs?), you just count the total number of apples, bananas, and oranges in each bag. If Bag A has 5 apples and Bag B has 3, you know immediately they aren't the same, even without looking at the order.

How it helps:
This is the "Parikh Image." It ignores the order of the letters and just counts how many times each letter appears.

The Twist: The authors improved this. They don't just count single letters (like 'a' or 'b'); they count patterns (like "abc").
Real-world impact: If one side of the equation has the pattern "abc" appearing 3 times, and the other side only has it appearing 2 times, the solver instantly knows the equation is impossible (unsatisfiable). It's a quick "sanity check" that catches impossible puzzles before the computer wastes time trying to solve them.
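
The basic single-letter version of this inventory check can be sketched in a few lines. This is an illustration under simple assumptions, not the paper's generalized technique: equations are written as plain strings, uppercase letters stand for variables, lowercase letters for constants, and the function names are made up here. When every variable occurs equally often on both sides, whatever letters it contributes cancel out, so the constant letters alone must balance.

```python
from collections import Counter


def split_counts(side: str) -> tuple[Counter, Counter]:
    """Count constant letters (lowercase) and variables (uppercase) separately."""
    constants = Counter(ch for ch in side if ch.islower())
    variables = Counter(ch for ch in side if ch.isupper())
    return constants, variables


def refuted_by_counting(lhs: str, rhs: str) -> bool:
    """Return True only when letter counting alone proves lhs = rhs impossible."""
    lhs_consts, lhs_vars = split_counts(lhs)
    rhs_consts, rhs_vars = split_counts(rhs)
    if lhs_vars == rhs_vars:
        # Variable contributions cancel, so the constants must match exactly.
        return lhs_consts != rhs_consts
    # Otherwise the check is inconclusive without reasoning about variable contents.
    return False


print(refuted_by_counting("XaXb", "XbbX"))  # True: one 'a' can never match zero 'a's
print(refuted_by_counting("Xa", "aX"))      # False: the counts agree, so no verdict
```

A `False` result does not mean the equation is solvable; it only means that counting letters was not enough to rule it out.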

How They Work Together: The "Nielsen Graph"

The authors combine these three tools inside a search structure known as a Nielsen graph, which they extend with the new rules. Think of this as a decision tree in a "Choose Your Own Adventure" book.

Start: You have a messy equation.
Check Inventory (Parikh): Can we prove it's impossible just by counting? If yes, stop! (Unsatisfiable).
Compress (Power): Can we turn a long repeating string into a "Power" token? If yes, do it to save space.
Cut (Decomposition): Can we split this into two smaller equations? If yes, do it.
Branch: If none of the above work, the computer tries different guesses (like "What if x is empty?"). It creates branches in the tree.

If they find a path where the equation works, they shout "Solved!" If they try every possible path and find a contradiction everywhere, they shout "Impossible!"

The Results

The authors built a prototype tool called ZIPT and tested it against the world's best existing solvers (like Z3 and cvc5).

The Verdict: ZIPT solved more of the difficult puzzles than the other tools, especially those involving long, repetitive strings (which are common in security analysis and software verification).
Why it matters: In the real world, this helps verify that software code doesn't have hidden bugs, that passwords are secure, and that data processing systems work correctly, even when dealing with massive amounts of text data.

Summary

The paper is about teaching computers to stop counting every single letter in a long string and start thinking like a smart human:

Group repeating patterns (Powers).
Cut the problem into smaller pieces (Decomposition).
Count the ingredients to spot impossible matches (Parikh Images).

This makes solving complex string mysteries faster, smarter, and capable of handling problems that were previously impossible.

Here is a detailed technical summary of the paper "On Solving String Equations via Powers and Parikh Images" by Eisenhofer et al.

1. Problem Statement

The paper addresses the challenge of solving string equations (word equations) within the context of Satisfiability Modulo Theories (SMT). While modern SMT solvers (e.g., Z3, cvc5) handle many string constraints, they struggle with equations involving:

Long repeated subsequences: Strings with high repetition that cause exponential blowup in standard solvers.
Mutually dependent variables: Cases where string variables depend on themselves or each other in complex cycles (e.g., $x \simeq axb$ ).
Complex interactions: Equations where standard decomposition fails to detect unsatisfiability or requires infinite expansion.

The authors aim to extend the capabilities of Nielsen transformations (a classic method for solving word equations) to handle these difficult cases efficiently.

2. Methodology

The proposed approach, implemented in a prototype solver named ZIPT, extends the standard Nielsen transformation framework with three core techniques:

A. Workflow: Extended Nielsen Graphs

The solver operates on Nielsen graphs, where nodes represent sets of string equations and integer constraints. The process involves:

Expansion: Applying rewriting and generating rules to decompose equations.
Simplification: Using lemma rules and term rewriting to simplify constraints.
Termination Check: If a node reduces to an empty set of equations with satisfiable integer constraints, the system is SAT. If all paths lead to contradictions ( $\bot$ ), it is UNSAT.
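
The workflow above can be sketched with the classical Nielsen rules only, without any of the paper's extensions. The code below is a minimal illustration under stated assumptions: a single equation is a pair of strings, uppercase letters are variables, lowercase letters are constants, there are no integer constraints, and all function names are invented here. It shows the three outcomes (SAT, UNSAT, and running out of depth) and why plain unwinding can loop on equations such as aX = Xb.

```python
def substitute(side: str, var: str, value: str) -> str:
    """Replace every occurrence of variable `var` in `side` by `value`."""
    return side.replace(var, value)


def successors(lhs: str, rhs: str):
    """Yield the equations reachable from lhs = rhs by one Nielsen step."""
    if lhs and rhs and lhs[0] == rhs[0]:
        yield lhs[1:], rhs[1:]          # identical leading symbols cancel
        return
    for this, other in ((lhs, rhs), (rhs, lhs)):
        if this and this[0].isupper():  # this side starts with a variable x
            x = this[0]
            yield substitute(lhs, x, ""), substitute(rhs, x, "")  # guess: x is empty
            if other:                   # guess: x starts with the symbol facing it
                value = other[0] + x
                yield substitute(lhs, x, value), substitute(rhs, x, value)


def search(lhs: str, rhs: str, depth: int) -> str:
    """Depth-bounded search returning "sat", "unsat", or "unknown"."""
    if not lhs and not rhs:
        return "sat"                    # no equation left to satisfy
    if depth == 0:
        return "unknown"                # bound reached before a verdict
    verdict = "unsat"                   # holds only if every branch is a dead end
    for new_lhs, new_rhs in successors(lhs, rhs):
        result = search(new_lhs, new_rhs, depth - 1)
        if result == "sat":
            return "sat"
        if result == "unknown":
            verdict = "unknown"
    return verdict


def iterative_deepening(lhs: str, rhs: str, max_depth: int) -> str:
    """Retry with growing depth bounds until a definite answer or the cap."""
    for depth in range(1, max_depth + 1):
        result = search(lhs, rhs, depth)
        if result != "unknown":
            return result
    return "unknown"


print(iterative_deepening("Xa", "aX", 10))  # sat (for example, X empty)
print(iterative_deepening("Xa", "Xb", 10))  # unsat
print(iterative_deepening("aX", "Xb", 10))  # unknown: unwinding X never terminates
```

The last call is the kind of case the extensions target: the equation has no solution, but the classical rules keep regenerating the same equation, so a depth bound alone never yields a definite answer.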

B. Key Technical Contributions

The paper introduces three specific extensions to the standard Nielsen rules:

1. Equality Decomposition

Concept: Standard decomposition splits equations $u_1u_2 \simeq v_1v_2$ only if the lengths of $u_1$ and $v_1$ are equal.
Innovation: The authors introduce padding using fresh symbolic characters. If the length difference $d = |u_1| - |v_1|$ is known, they decompose the equation into:
- $u_1 \simeq v_1 \bar{o}_d$
- $\bar{o}_d u_2 \simeq v_2$
- (Where $\bar{o}_d$ represents $d$ symbolic characters).
Benefit: This allows decomposition at arbitrary positions, not just the start/end, enabling the solver to access internal variables for further rewriting.
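
The split can be checked on concrete strings. The sketch below is only an illustration: the function name `decompose` is invented here, it assumes $d \ge 0$, and it reads the padding directly off $u_1$ because all strings are known, whereas the solver described above keeps the padding as $d$ symbolic characters.

```python
def decompose(u1: str, u2: str, v1: str, v2: str):
    """Split u1+u2 = v1+v2 at a known length offset d = len(u1) - len(v1) >= 0."""
    d = len(u1) - len(v1)
    if d < 0:
        raise ValueError("swap the sides so that u1 is the longer prefix")
    pad = u1[len(v1):]            # the d characters by which u1 overshoots v1
    first = (u1, v1 + pad)        # corresponds to u1 = v1 o_d
    second = (pad + u2, v2)       # corresponds to o_d u2 = v2
    return d, first, second


# "abcab" + "c" and "ab" + "cabc" both spell "abcabc".
d, first, second = decompose("abcab", "c", "ab", "cabc")
print(d)        # 3
print(first)    # ('abcab', 'abcab')
print(second)   # ('cabc', 'cabc')
```

Both smaller equations hold exactly when the original one does, which is what makes the cut safe.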

2. Explicit Power Representation (Ground Power Introduction)

Concept: To handle repetitive structures (e.g., $xa \simeq ax$ forces $x = a^m$ for some $m \ge 0$), the solver introduces power tokens ( $u^m$ ).
Innovation: Instead of unwinding strings infinitely (e.g., $x \to ax' \to aax'' \dots$ ), the solver detects patterns where a variable $x$ must be a repetition of a base string. It replaces $x$ with $w^m$ (where $w$ is a ground term and $m$ is an integer variable).
Mechanism:
- Uses prefix analysis ( $pre(w)$ ) to determine valid decompositions.
- Handles "crossing" occurrences where a pattern spans across variable boundaries.
- Introduces integer constraints (e.g., $0 \le m' < m$) to manage the branching logic efficiently.
Benefit: Compresses potentially infinite chains of substitutions into finite power terms, so that solutions whose length grows exponentially can be represented without writing them out in full.
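
A standard example of repetition that a power term captures is the commutation equation $xa \simeq ax$, whose solutions are exactly the powers of $a$. The brute-force check below is a sanity test of that fact over a small alphabet, not part of the paper's method; the function name and the length bound are choices made for this sketch.

```python
from itertools import product


def commuting_solutions(alphabet: str, max_len: int) -> list[str]:
    """All x over `alphabet` with len(x) <= max_len such that x + "a" == "a" + x."""
    found = []
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if x + "a" == "a" + x:
                found.append(x)
    return found


solutions = commuting_solutions("ab", 6)
print(solutions)  # ['', 'a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa']
```

Enumeration finds one solution per length and would never finish describing all of them, while the single term $a^m$ with an integer variable $m$ covers the whole family at once.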

3. Generalized Parikh Images

Concept: Parikh images count character occurrences to detect unsatisfiability (e.g., if LHS has 3 'a's and RHS has 2, it's UNSAT). Standard Parikh images only count single characters.
Innovation: The authors define Multi-sequence Parikh Images based on unbordered patterns (patterns in which no nonempty proper prefix is also a suffix, so two occurrences can never overlap).
- They define over-approximations ( $\alpha^\uparrow$ ) and under-approximations ( $\alpha^\downarrow$ ) for pattern occurrences.
- They handle "crossing" occurrences within gaps (segments between variables) to ensure the bounds are tight.
Benefit: Detects unsatisfiability in cases where standard character counting fails (e.g., $x_1x_1acx_2x_2b \simeq x_2x_2abcx_1x_1$ ). Both sides contain the same letters and the same variables, so single-character counting is inconclusive. By analyzing the pattern "abc", the solver can prove the equation is impossible: the constant segment of the RHS contains "abc" once, while the constant segments of the LHS ("ac" and "b") do not contain it at all.
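
The two ingredients named above, unbordered patterns and crossing occurrences, can be illustrated on concrete strings. This is a sketch with invented function names; it does not implement the paper's approximations $\alpha^\uparrow$ and $\alpha^\downarrow$, it only shows what is being counted.

```python
def is_unbordered(pattern: str) -> bool:
    """True if no nonempty proper prefix of `pattern` is also a suffix."""
    return not any(pattern[:k] == pattern[-k:] for k in range(1, len(pattern)))


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of a nonempty `pattern` in `text`, overlaps included."""
    last_start = len(text) - len(pattern)
    return sum(text.startswith(pattern, i) for i in range(last_start + 1))


def crossing_occurrences(left: str, right: str, pattern: str) -> int:
    """Occurrences in left + right that lie in neither part alone."""
    whole = count_occurrences(left + right, pattern)
    return whole - count_occurrences(left, pattern) - count_occurrences(right, pattern)


print(is_unbordered("abc"))                              # True
print(is_unbordered("aba"))                              # False: "a" is a border
print(count_occurrences("ac", "abc") + count_occurrences("b", "abc"))  # 0
print(count_occurrences("abc", "abc"))                   # 1
print(crossing_occurrences("xab", "cab", "abc"))         # 1
```

The last line shows why crossings matter: neither "xab" nor "cab" contains "abc", yet their concatenation does, so a sound count has to bound what can happen at the seams between constants and variables.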

3. Implementation

Solver: ZIPT, built on top of the Z3 SMT solver.
Framework: Uses the user-propagation framework to integrate the custom string solving logic with Z3's CDCL(T) engine and integer solver.
Heuristics:
- Prioritizes rules that eliminate string variables.
- Uses look-ahead heuristics to choose branches that lead to immediate conflicts.
- Employs iterative deepening for graph traversal to ensure robust termination on satisfiable instances.

4. Experimental Results

The authors evaluated ZIPT on the Woorpje benchmark set (SMT-LIB QF_S tracks), consisting of 609 problems containing only string equations.

Performance: ZIPT matched or outperformed state-of-the-art solvers (Z3, cvc5, OSTRICH, Z3-Noodler, Z3str3) on every track.
- Track 01: Solved 200/200 (Tied with best).
- Track 02: Solved 9/9 (Significantly better than competitors who solved 1–6). This track contains problems with exponential models, where ZIPT's power introduction was crucial.
- Track 03: Solved 195/200 (Best among all).
- Track 04: Solved 200/200 (Tied with best).
Key Finding: The ability to use nested power tokens allowed ZIPT to solve Track 02 problems without the exponential blowup that stalled other solvers. The generalized Parikh images were effective in pruning unsatisfiable branches early.

5. Significance and Conclusion

Theoretical Advancement: The paper bridges the gap between classical word equation solving (Nielsen transformations) and modern SMT requirements by introducing power operators and generalized Parikh images.
Practical Impact: It demonstrates that handling self-dependencies and repetitive strings does not require abandoning the Nielsen framework but rather extending it with algebraic abstractions (powers) and combinatorial abstractions (Parikh).
Future Work: The authors plan to refine Parikh approximations, introduce non-ground power terms, and extend support to regular expressions and other SMT-LIB string functions.

In summary, this work provides an efficient method for solving complex string equations that existing SMT solvers often fail to solve, primarily by compressing repetitive structures into power terms and using pattern-based counting to detect contradictions.