Degrees of Freedom and Information Criteria for the Synthetic Control Method

This paper provides an analytical characterization of the Synthetic Control Method's degrees of freedom to derive estimable information criteria that improve model selection over cross-validation, particularly in settings with noisy donors and numerous candidates such as the Tianjin car license rationing case.

Guillaume Allaire Pouliot, Zhen Xie, Ziyi Liu

Published Thu, 12 Ma

Imagine you are trying to predict how a specific car model (let's call it the "Highlander") will sell in a city called Tianjin after the government suddenly starts rationing license plates. You want to know: How many fewer cars would have been sold if the rationing hadn't happened?

To answer this, you need to build a "Ghost Car." This Ghost Car represents what the Highlander's sales would have looked like in a parallel universe where no rationing existed.

In the past, economists built this Ghost Car by finding one perfect twin city (like Shijiazhuang) that looked exactly like Tianjin in every way. But what if that twin city is noisy? What if its sales data is jittery and unreliable?

This is where the Synthetic Control Method (SCM) comes in. Instead of finding one perfect twin, SCM builds a "Frankenstein" Ghost Car by mixing together parts from many different cities. It takes a little bit of City A, a dash of City B, and a pinch of City C to create a weighted average that mimics Tianjin's pre-rationing sales. The weights are never negative and always add up to 100%, so the Ghost Car is a genuine blend of real cities.
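The blending step above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the function names (`scm_weights`, `project_to_simplex`) are made up here, and in practice SCM weights are usually found with a quadratic-programming solver rather than the simple projected gradient descent used below.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def scm_weights(donors_pre, treated_pre, n_iter=5000, lr=0.01):
    """Find nonnegative weights summing to one so that a blend of
    donor cities tracks the treated city's pre-treatment sales."""
    T, J = donors_pre.shape
    w = np.full(J, 1.0 / J)                      # start from an equal blend
    for _ in range(n_iter):
        grad = donors_pre.T @ (donors_pre @ w - treated_pre) / T
        w = project_to_simplex(w - lr * grad)
    return w

# Toy example: the "treated" city is really 70% City A + 30% City B.
rng = np.random.default_rng(42)
donors_pre = rng.standard_normal((20, 3))        # 20 pre-periods, 3 donor cities
treated_pre = 0.7 * donors_pre[:, 0] + 0.3 * donors_pre[:, 1]
w = scm_weights(donors_pre, treated_pre)
# w recovers roughly [0.7, 0.3, 0.0]: City C gets (essentially) zero weight
```

Note that the simplex constraint is doing real work here: City C, which contributes nothing to the treated series, is pushed to a weight of essentially zero rather than picking up a small negative coefficient.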

However, the authors of this paper discovered a problem with this method when there are too many cities to choose from.

The Problem: The "Overfitting" Trap

Imagine you are trying to draw a line through a scatter of dots on a piece of paper.

  • The Good Way: You draw a smooth line that captures the general trend.
  • The Bad Way (Overfitting): You have so many dots that you can draw a squiggly, crazy line that hits every single dot perfectly.

The problem is, that crazy line is just memorizing the noise (the random jitters) rather than learning the real trend. If you use that crazy line to predict the future, you will be wrong.

In the world of Synthetic Controls, if you have 100 cities to choose from but only 10 years of data, the computer can find a weird combination of cities that fits the past data too perfectly. It's like cheating on a test by memorizing the answers to the practice questions but failing the real exam because you didn't learn the concepts.
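The "100 cities, 10 years" trap is easy to demonstrate numerically. The sketch below uses *unconstrained* least squares to make the mechanics vivid (SCM's nonnegativity and sum-to-one constraints soften, but do not eliminate, this problem): with 100 donors and only 10 pre-treatment periods, the fit to the past is numerically perfect, while the prediction for the post-period is poor.

```python
import numpy as np

rng = np.random.default_rng(0)
T_pre, T_post, J = 10, 10, 100                   # short history, many donors
donors = rng.standard_normal((T_pre + T_post, J))
truth = donors[:, 0]                             # the real trend is just donor 0
y = truth + 0.5 * rng.standard_normal(T_pre + T_post)  # noisy treated series

# Fit weights on the short pre-period using all 100 donors at once.
# With more donors than periods, least squares can hit every dot exactly.
w, *_ = np.linalg.lstsq(donors[:T_pre], y[:T_pre], rcond=None)

in_sample = np.mean((donors[:T_pre] @ w - y[:T_pre]) ** 2)
out_sample = np.mean((donors[T_pre:] @ w - y[T_pre:]) ** 2)
# in_sample is numerically zero (the "crazy squiggly line"),
# while out_sample is much larger: the fit memorized the noise
```

This is exactly the practice-test analogy in numbers: zero error on the questions you studied, large error on the exam.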

The Solution: Degrees of Freedom (The "Flexibility Meter")

The authors wanted a way to measure exactly how much the method is "cheating" or flexing its muscles to fit the noise. They derived an exact, analytical characterization of a classic statistical yardstick, Degrees of Freedom, for the Synthetic Control setting.

Think of Degrees of Freedom as a "Flexibility Score."

  • If your model is simple (like a straight line), it has a low score. It's rigid and honest.
  • If your model is complex (like that crazy squiggly line), it has a high score. It's flexible and suspicious.

The paper proves a surprising mathematical fact: For the standard Synthetic Control method, the Flexibility Score is roughly equal to the number of cities that actually receive weight, minus one.

  • If you use 5 cities to build your Ghost Car, your Flexibility Score is 4.
  • This gives researchers a clear warning light: "Hey, you are using too many cities for the amount of data you have. You are probably overfitting!"

The Tool: Information Criteria (The "Smart Judge")

Once you have a Flexibility Score, you need a way to pick the best model. Usually, researchers use a method called Cross-Validation.

The Old Way (Cross-Validation):
Imagine you are a teacher testing a student. You give them half the homework to study (training) and the other half to take a test (validation).

  • The Flaw: In this specific economic problem, the "homework" period is very short. Splitting a short homework assignment in half leaves the student with almost nothing to study. It's like trying to learn a language by studying for 2 days and then taking a test on the remaining 2 days. The results are unreliable.

The New Way (Information Criteria):
The authors propose a new "Smart Judge" called an Information Criterion.

  • Instead of splitting the data, this judge looks at the entire homework assignment.
  • It calculates the score using a formula: How well did you fit the past? + (Penalty for being too flexible).
  • If your model is too complex (too flexible), the judge adds a heavy penalty. If it's too simple, it doesn't get an extra penalty; instead, its poor fit to the past drives the score up on its own.
  • The goal is to find the "Goldilocks" model: not too simple, not too complex, just right.
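The judge's formula can be sketched as follows. This is a generic AIC-style criterion offered only as an illustration; the paper derives its own SCM-specific penalty built on the degrees-of-freedom result, and the function name here is invented:

```python
import numpy as np

def information_criterion(treated_pre, synthetic_pre, weights, tol=1e-6):
    """AIC-style score: reward a tight fit to the pre-treatment past,
    penalize flexibility. Lower is better."""
    treated_pre = np.asarray(treated_pre, dtype=float)
    synthetic_pre = np.asarray(synthetic_pre, dtype=float)
    T = len(treated_pre)
    rss = float(np.sum((treated_pre - synthetic_pre) ** 2))   # fit term
    df = int(np.count_nonzero(np.asarray(weights) > tol)) - 1  # flexibility
    return T * np.log(rss / T) + 2 * df

# Two candidates with identical fit: the one using fewer cities wins.
y = np.array([1.0, 2.0, 3.0, 4.0])
fit = y + 0.1                                  # same residuals for both
simple = information_criterion(y, fit, [0.5, 0.5])          # 2 cities, df = 1
complex_ = information_criterion(y, fit, [0.2] * 5)         # 5 cities, df = 4
# simple < complex_: the judge prefers the rigid model when fits are equal
```

Crucially, no data splitting happens anywhere in this computation: the entire (short) pre-treatment history goes into the fit term, which is exactly what makes the approach attractive when the "homework" period is tiny.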

The Real-World Test: Tianjin's Car Market

The authors tested this new "Smart Judge" on the Tianjin car market.

  • The Situation: Tianjin introduced a lottery/auction for car licenses. This changed who could buy cars. Wealthier people could afford the auction, so they bought different cars than before.
  • The Challenge: They had 76 different car models to analyze, but the sales data for each was noisy.
  • The Result:
    • When they used the old "Split the Data" method (Cross-Validation), it picked a model that was too simple and missed the real impact.
    • When they used the new "Smart Judge" (Information Criteria), it found the perfect balance.
    • The Finding: The rationing didn't just lower sales; it changed the mix of cars. Mid-range and luxury cars (like the Toyota Highlander) actually saw their market share increase relative to cheap cars. The "rich" buyers who won the auctions preferred nicer cars.

Summary

This paper is like giving economists a new ruler and a new judge.

  1. The Ruler (Degrees of Freedom): Tells you exactly how "flexible" your model is, so you know if it's cheating by memorizing noise.
  2. The Judge (Information Criteria): Helps you pick the best model without needing to split your tiny dataset in half, which usually leads to bad decisions.

By using these tools, researchers can finally trust their "Ghost Car" predictions, even when they are working with messy data and a huge number of options. It turns a guessing game into a precise science.