Deterministic Preprocessing and Interpretable Fuzzy Banding for Cost-per-Student Reporting from Extracted Records

Imagine you are the head chef of a massive, chaotic kitchen (a university). Every week, you get a giant, messy box of receipts from your suppliers (the "Casual Academic Database"). These receipts tell you how much you spent on temporary cooks and how many students they fed.

The problem? The receipts are messy. Some are torn, some have blank prices, and some are just notes saying "Total" or "Sum," which aren't actual receipts. If you try to calculate the "cost per student" just by glancing at this box, you might make a mistake, or worse, you might make a different mistake than your colleague did last week.

This paper describes a robotic kitchen assistant (a computer script) that solves this problem. Here is how it works, broken down into simple parts:

1. The "Perfect Memory" Robot (Deterministic Preprocessing)

The authors built a robot named cad_processor.py. Its most important rule is: "If I see the exact same box of receipts, I will always produce the exact same report."

The Fingerprint: Before the robot even starts cooking, it takes a digital "fingerprint" (a SHA-256 hash) of the entire box of receipts. This is like taking a photo of the receipt box so that if anyone tries to swap a receipt later, the photo won't match.
The Cleaning Crew: The robot goes through the receipts one by one:
- If a receipt has no price, it treats it as $0 (but counts it as a missing receipt).
- If a receipt says "Total" or "Sum," it throws it away because that's just a summary, not a real transaction.
- If a receipt says you fed "-5 students," it throws that receipt in the trash (you can't have negative students!).
The Result: It creates a clean, organized ledger. Because the robot follows strict rules and never "guesses," you can run the same box of receipts through it a thousand times, and you will get the exact same answer every time. This makes the result easy to audit: anyone can rerun the robot on the same box and check that the report matches.
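
For readers who want to see the "fingerprint" step concretely, here is a minimal Python sketch of hashing a file's raw bytes with SHA-256. The function name `sha256_of_file` and the chunk size are illustrative assumptions; the source does not show how cad_processor.py computes its hash.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's raw bytes.

    Hypothetical helper for illustration; not taken from cad_processor.py.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so a large workbook need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes give the same 64-character digest, and changing even one byte gives a different one, which is what lets a later reader confirm which box of receipts a report came from.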

2. The "Color-Coded Map" (Trend Analysis & Reporting)

Once the robot has cleaned the data, it creates a new report book with four specific pages:

The Receipt Log: A summary of what happened (e.g., "We threw away 5 bad receipts, and 3 had missing prices").
The Heat Map: A colorful chart showing which schools (departments) are spending the most per student.
The Detailed List: A long list of every single subject and its specific costs.
The "Fuzzy" Labels: The most creative part.

3. The "Traffic Light" System (Interpretable Fuzzy Banding)

Looking at a spreadsheet full of numbers like "$12,450.32" or "$14,200.10" is hard for humans to understand quickly. Is that expensive? Is that cheap?

The robot adds a Traffic Light System to help you understand the numbers relative to that specific year.

The Anchors (The Traffic Lights): For each year, the robot looks at all the costs and picks three special numbers:
- The Minimum (Green Light): The cheapest school.
- The Median (Yellow Light): The "middle" school (not the average, but the one right in the middle of the pack).
- The Maximum (Red Light): The most expensive school.
The "Fuzzy" Logic: Instead of saying "This is exactly $12,000," the robot asks: "How close is this number to the Green, Yellow, or Red light?"
- If a school's cost is very close to the cheapest, it gets a "Low" label (Green).
- If it's right in the middle, it gets a "Medium" label (Yellow).
- If it's near the most expensive, it gets a "High" label (Red).
The "Fuzzy" Part: What if a school is exactly halfway between "Low" and "Medium"? In normal math, you have to pick one. But this robot uses Fuzzy Logic. It says, "You are 50% Low and 50% Medium." It gives you both numbers so you can see the nuance.
The Tie-Breaker: If the robot must pick a single color for a label (like for a quick summary) and two colors are tied, it has a strict rule: prefer "Medium" first, then "Low," then "High." It's like a referee who always favors the middle ground when the call is too close to see.

Why Does This Matter?

In the real world, university budgets are huge, and people argue about them.

Without this robot: "I think School A is too expensive!" "No, I think School B is!" (Everyone is arguing over messy spreadsheets).
With this robot: "Here is the report. We used the exact same box of receipts as last time (here is the fingerprint). Here is the clean data. And here is the Traffic Light map: School A is 'Medium' this year, but School B is 'High'."

The Big Picture

This paper is about trust.

Trust in the Math: Because the robot is "deterministic," you know the math isn't changing based on who is running it.
Trust in the Meaning: Because of the "Fuzzy Banding," you don't just see a scary number; you see a clear, color-coded label that tells you where you stand compared to your peers, while still keeping the exact number visible if you want to check the details.

It turns a messy pile of receipts into a clear, fair, and checkable story about how money is being spent.

Here is a detailed technical summary of the paper "Deterministic Preprocessing and Interpretable Fuzzy Banding for Cost-per-Student Reporting from Extracted Records" by Shane Lee and Stella Ng.

1. Problem Statement

Administrative data in higher education institutions is often exchanged as spreadsheet extracts (e.g., from a Casual Academic Database or CAD). These spreadsheets are frequently used directly as reports for budgeting, workload reviews, and governance. However, relying on static spreadsheets creates significant risks regarding:

Auditability and Reproducibility: It is difficult to verify if a derived table (e.g., cost-per-student) was generated correctly from the source data without access to the exact transformation logic and the specific input snapshot used.
Interpretability: Raw numeric ratios (cost-per-student) are difficult to interpret in isolation. Stakeholders need contextual labels (e.g., "Low," "Medium," "High") to understand relative performance, but standard binning methods often lack transparency or fail to account for year-specific data distributions.
Data Quality: Extracts often contain missing values, negative counts, or summary rows that require specific handling rules to avoid skewing aggregates.

2. Methodology

The authors propose a deterministic, rule-governed, file-based workflow implemented in a Python script (cad_processor.py). The system transforms a raw CAD export workbook into a processed, multi-sheet workbook designed for inspection and decision support.

A. Deterministic Preprocessing Pipeline

The pipeline is stream-based: it processes rows sequentially, which keeps memory use low and applies the same rules to every row.

Input Handling: Reads a single Excel workbook treated as an authoritative snapshot. It computes a SHA-256 hash of the input file bytes to ensure the output is cryptographically linked to the specific input version.
Table Detection: Automatically scans sheets to locate the header row containing required fields: School, Subject No., Subject, Teaching Session, Incl Oncosts, and Student Count.
Row Filtering & Cleaning:
- Dropped Rows: Rows missing key identifiers, rows where the year cannot be extracted from Teaching Session, rows labeled as summaries (e.g., "Total", "Sum"), and rows with negative student counts.
- Missing Value Handling: Missing costs are treated as 0.0; missing student counts are treated as 0. These are tracked via counters.
Aggregation: Data is aggregated into two levels:
- Subject-Year: Specific subject costs and counts per year.
- School-Year: Totals per school per year.
Ratio Calculation:
- OK: If Student Count > 0, Ratio = Total Costs / Total Students.
- No Activity: If Costs = 0 and Students = 0, Ratio = 0.0.
- Undefined: If Costs > 0 and Students = 0, Ratio is left blank (to highlight the anomaly).
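
The row rules and ratio cases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of cad_processor.py: the function names, the counter names, the order in which the drop rules are checked, the year-extraction pattern, and the choice to look for summary labels in the School and Subject No. cells are all hypothetical. For brevity it aggregates only at the school-year level.

```python
import re
from collections import defaultdict

SUMMARY_LABELS = {"total", "sum"}  # assumption: only the two examples named in the text


def extract_year(session):
    """Assumed rule: first 19xx/20xx year found in the Teaching Session text."""
    match = re.search(r"(?:19|20)\d{2}", str(session or ""))
    return int(match.group(0)) if match else None


def clean_and_aggregate(rows):
    """Apply the drop and fill rules, then total cost and students per (school, year)."""
    counters = defaultdict(int)
    totals = defaultdict(lambda: [0.0, 0])  # (school, year) -> [cost, students]
    for row in rows:
        school = str(row.get("School") or "").strip()
        subject_no = str(row.get("Subject No.") or "").strip()
        if school.lower() in SUMMARY_LABELS or subject_no.lower() in SUMMARY_LABELS:
            counters["dropped_summary"] += 1
            continue
        if not school or not subject_no:
            counters["dropped_missing_key"] += 1
            continue
        year = extract_year(row.get("Teaching Session"))
        if year is None:
            counters["dropped_no_year"] += 1
            continue
        students = row.get("Student Count")
        if students is None:
            counters["missing_student_count"] += 1
            students = 0  # missing count is treated as 0 and counted
        if students < 0:
            counters["dropped_negative_students"] += 1
            continue
        cost = row.get("Incl Oncosts")
        if cost is None:
            counters["missing_cost"] += 1
            cost = 0.0  # missing cost is treated as 0.0 and counted
        totals[(school, year)][0] += float(cost)
        totals[(school, year)][1] += int(students)
    return dict(totals), dict(counters)


def ratio_status(cost, students):
    """Return (status, ratio) following the three cases in the text."""
    if students > 0:
        return "OK", cost / students
    if cost == 0:
        return "No Activity", 0.0
    # The text specifies Costs > 0 here; treating any other nonzero cost
    # the same way is an assumption of this sketch.
    return "Undefined", None
```

Because every branch is an explicit rule with its own counter, the same input rows always yield the same totals and the same counts of dropped or filled rows.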

B. Interpretable Fuzzy Banding

To provide context without losing numerical precision, the workflow adds a fuzzy banding layer for within-year interpretation.

Anchors: For each year, three anchors are computed from the finite, positive school-year ratios:
- $a$ (Minimum)
- $b$ (Median)
- $c$ (Maximum)
Membership Functions: Three piecewise-linear functions map a ratio $x$ to membership weights $\mu \in [0, 1]$:
- Low: Left-shoulder function (peaks at $a$ , drops to 0 at $b$ ).
- Medium: Triangular function (peaks at $b$ , 0 at $a$ and $c$ ).
- High: Right-shoulder function (0 at $b$ , peaks at $c$ ).
Label Assignment: The label is assigned based on the maximum membership weight.
- Tie-Breaking: Deterministic priority is applied if weights are equal: Medium > Low > High. This ensures consistent labeling for boundary values.
Output: The system reports the raw ratio, the three membership weights, the assigned label, and a numeric score (0.0 for Low, 0.5 for Medium, 1.0 for High).
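
The banding layer described above can be sketched in Python as follows. This is one reading of the verbal description, not the authors' code: the function names, the clamping of values outside $[a, c]$, and the handling of degenerate anchors (for example $a = b$) are assumptions of this sketch.

```python
import math
from statistics import median

PRIORITY = ("Medium", "Low", "High")  # tie-break order given in the text
SCORES = {"Low": 0.0, "Medium": 0.5, "High": 1.0}


def year_anchors(ratios):
    """Return (min, median, max) of the finite, positive ratios, or None if there are none."""
    usable = sorted(r for r in ratios if r is not None and math.isfinite(r) and r > 0)
    if not usable:
        return None
    return usable[0], median(usable), usable[-1]


def memberships(x, a, b, c):
    """Left-shoulder / triangle / right-shoulder weights for ratio x."""
    if x <= a:
        low = 1.0
    elif x >= b:
        low = 0.0
    else:
        low = (b - x) / (b - a)  # only reached when a < x < b

    if x >= c:
        high = 1.0
    elif x <= b:
        high = 0.0
    else:
        high = (x - b) / (c - b)  # only reached when b < x < c

    if x == b:
        medium = 1.0
    elif a < x < b:
        medium = (x - a) / (b - a)
    elif b < x < c:
        medium = (c - x) / (c - b)
    else:
        medium = 0.0
    return {"Low": low, "Medium": medium, "High": high}


def band(x, a, b, c):
    """Return (label, score, weights); ties resolve as Medium > Low > High."""
    weights = memberships(x, a, b, c)
    # max() returns the first maximal item, so iterating in PRIORITY
    # order implements the deterministic tie-break.
    label = max(PRIORITY, key=lambda name: weights[name])
    return label, SCORES[label], weights
```

For instance, with anchors 10,000, 15,000 and 30,000, a ratio of 12,500 has equal Low and Medium weights, and `band(12500, 10000, 15000, 30000)` returns the label "Medium" because of the priority order.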

3. Key Contributions

Auditability via Cryptographic Hashing: The workflow embeds the SHA-256 hash of the input file into the output and logs. This allows any stakeholder to verify that a specific output table was derived from a specific input snapshot, preventing "silent" data drift.
Transparent Fuzzy Logic: Unlike "black box" categorization, the fuzzy banding layer explicitly reports the anchors (min, median, max) and the membership weights used to generate labels. This allows users to see why a value was labeled "Medium" (e.g., its Low and Medium weights were equal and the tie-breaking rule selected Medium).
Comprehensive Metadata & Counters: The Processing Summary sheet records detailed counters for dropped rows, missing values, and boundary cases. This transforms data cleaning from a hidden step into an auditable process.
FAIR Principles Alignment: The workflow produces well-described, reusable artifacts (code, logs, workbooks) that support the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles.

4. Results

The paper presents a synthetic example run demonstrating the system's capabilities:

Output Structure: The processed workbook contains four sheets:
1. Processing Summary: Contains the input hash, detected sheet/header, row-handling counters, and per-year anchors.
2. Trend Analysis: A school-by-year matrix with conditional formatting (heat maps) anchored to the specific year's min/median/max.
3. Report: A wide table showing subject-level details (costs, counts, ratios) allowing for manual re-aggregation.
4. Fuzzy Bands: A table linking each school-year ratio to its membership weights, band label, and score.
Reproducibility: The authors demonstrate that given identical input bytes and the same code version, the system produces byte-identical outputs.
Worked Example: A calculation is provided where a ratio of 12,000 (with anchors $a$ = 10,000, $b$ = 15,000, $c$ = 30,000) yields $\mu_{Low}=0.6$, $\mu_{Medium}=0.4$, and $\mu_{High}=0.0$, correctly assigning the "Low" label.
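
The arithmetic behind the worked example can be written out explicitly. The closed forms below are one reading of the verbal shape descriptions (left shoulder, triangle, right shoulder), not formulas quoted from the paper; they assume $a < b < c$ and $a \le x \le c$.

$$
\mu_{Low}(x) = \begin{cases} \dfrac{b - x}{b - a}, & a \le x \le b \\ 0, & b < x \le c \end{cases}
\qquad
\mu_{Medium}(x) = \begin{cases} \dfrac{x - a}{b - a}, & a \le x \le b \\ \dfrac{c - x}{c - b}, & b < x \le c \end{cases}
\qquad
\mu_{High}(x) = \begin{cases} 0, & a \le x \le b \\ \dfrac{x - b}{c - b}, & b < x \le c \end{cases}
$$

With $a = 10{,}000$, $b = 15{,}000$, $c = 30{,}000$ and $x = 12{,}000$, the ratio lies between $a$ and $b$, so:

$$
\mu_{Low} = \frac{15{,}000 - 12{,}000}{15{,}000 - 10{,}000} = \frac{3{,}000}{5{,}000} = 0.6,
\qquad
\mu_{Medium} = \frac{12{,}000 - 10{,}000}{15{,}000 - 10{,}000} = \frac{2{,}000}{5{,}000} = 0.4,
\qquad
\mu_{High} = 0
$$

The largest weight is $\mu_{Low}$, so no tie-break is needed and the label is "Low" with score 0.0.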

5. Significance

This work bridges the gap between raw administrative data extraction and high-level governance decision-making.

Trust in Data: By making the transformation rules explicit and the input snapshot verifiable, it reduces the risk of errors in budgeting and resource allocation.
Human-Centric Analytics: The fuzzy banding approach provides a "human-readable" summary (Low/Medium/High) while keeping the exact ratio and membership weights visible. It avoids the pitfalls of arbitrary thresholds by using data-driven anchors (min/median/max) for each specific year.
Scalability and Robustness: The stream-based processing and strict error handling make the system suitable for large institutional datasets, while the deterministic nature ensures that audits can be performed years later with confidence.

In conclusion, the paper proposes a robust framework for converting administrative spreadsheets into auditable, interpretable, and reproducible decision-support tools, addressing critical needs in institutional governance and data stewardship.

