Categorical Calculus and Algebra for Multi-Model Data

This paper establishes a theoretical foundation for querying multi-model databases by proposing and proving the equivalence of categorical calculus and categorical algebra, while also providing transformation rules for optimization and analyzing their expressive power and computational complexity.

Jiaheng Lu (University of Helsinki)

Published Thu, 12 Ma
📖 6 min read🧠 Deep dive

Imagine you are the manager of a massive, chaotic library. But this isn't a normal library. It has three distinct wings:

  1. The Spreadsheet Wing: Rows and columns of data (like customer names and credit limits).
  2. The Family Tree Wing: Hierarchical documents where items are nested inside other items (like an order containing a list of products).
  3. The Social Network Wing: A web of connections where people are linked to other people (like "who knows whom").

In the past, if you wanted to ask a question that crossed these wings—like "Find all customers who know a person named John, and who also ordered a specific product"—you would need three different languages and three different sets of tools. It was like trying to speak English, French, and Morse code simultaneously just to order a coffee.

This paper, "Categorical Calculus and Algebra for Multi-Model Data," proposes a solution: A Universal Translator and Toolkit.

Here is the breakdown of their idea using simple analogies:

1. The Big Idea: The "Category" as a Universal Map

The authors suggest viewing all these different data types (spreadsheets, trees, webs) as just different shapes of the same thing: Categories.

  • The Analogy: Think of a "Category" as a map of a city.
    • In a spreadsheet, the "streets" are rows connecting to columns.
    • In a family tree, the "streets" are parent-child links.
    • In a social network, the "streets" are friendships.
  • The Magic: Instead of treating these as totally different cities, the authors say, "Let's just draw them all on one giant map." In this map, every piece of data is a Place (Object), and every relationship is a Road (Morphism/Function).

2. The Two Languages: The "Recipe" vs. The "Shopping List"

To ask questions (queries) on this giant map, the authors invent two new languages. They are equivalent, meaning you can translate one into the other, but they feel different.

A. Categorical Calculus (The "Descriptive Recipe")

  • What it is: A declarative language. You describe what you want to find, not how to find it.
  • The Analogy: Imagine you are ordering a custom cake. You tell the baker: "I want a cake that has chocolate frosting, is shaped like a car, and has a red ribbon." You don't tell them how to mix the batter or bake it; you just describe the final result.
  • In the paper: This language lets you say things like, "Find me a person who is an ancestor of John" or "Find a path from A to B." It uses logic symbols (AND, OR, IF) to describe the properties of the data you want.

B. Categorical Algebra (The "Step-by-Step Toolkit")

  • What it is: A procedural language. It gives you a set of tools to physically manipulate the data to get the answer.
  • The Analogy: This is the baker's instruction manual. It says: "Take the chocolate flour, sift it (Select), mix it with the eggs (Map), cut out the car shape (Project), and glue the ribbon on (Join)."
  • In the paper: They introduce special tools (operators) that work on all data types:
    • Map: Follow a road from one place to another.
    • Select: Filter out the places that don't match your criteria (e.g., "Only keep customers over 18").
    • Project: Look at only specific details (e.g., "Show me just the names, not the addresses").
    • Limit (The "Join" Tool): This is the super-power. It takes two separate maps and snaps them together where the roads connect. It's like merging the "Customer" list with the "Order" list because they share a "Customer ID."
    • GetReach: A special tool for the Social Network wing that finds everyone you can reach by walking down a certain number of roads (e.g., "Find everyone John is connected to, even if it takes 5 friends to get there").

3. The "Aha!" Moment: They Are the Same

The paper proves a major theorem: The Recipe and the Toolkit are interchangeable.
If you can describe a result using the "Recipe" (Calculus), you can build it using the "Toolkit" (Algebra), and vice versa. This is huge because it means database engineers can write a query in the easy, descriptive way, and the computer can automatically translate it into the efficient, step-by-step way to run it.

4. Making it Faster: The "Optimization" Rules

Once you have a toolkit, you want to use it efficiently. The paper provides a set of rules for rearranging your tools to make the work faster.

  • The Analogy: Imagine you are cleaning a house.
    • Bad way: Pick up every sock in the house, put them in a pile, then go back and throw away the dirty ones.
    • Good way (Optimization): Go to the room, pick up only the dirty socks, and throw them away immediately.
  • In the paper: They show rules like "Pushing the filter down." If you want to find "Male students who took Math," the computer shouldn't first find all students, then all math classes, and then filter. It should filter for "Male" first, then find their math classes. This saves massive amounts of time and computer memory.

5. Why Does This Matter?

We live in an era of "Multi-Model Data." Our apps use graphs for social media, trees for JSON files, and tables for billing. Currently, building systems that handle all of this is a nightmare of complexity.

This paper provides the theoretical foundation to build a single, unified database engine that understands all these shapes at once. It's like inventing a universal remote control that works on your TV, your stereo, and your lights, rather than needing three different remotes.

In summary:
The authors took the abstract math of Category Theory (which usually deals with very high-level shapes) and turned it into a practical instruction manual for querying modern, messy, multi-format data. They gave us a way to speak about data relationships in a single, unified language, ensuring that no matter how complex your data structure is, you can ask it a question and get an answer efficiently.