Plotting correlated data

This paper addresses the limitations of standard error bar plots when data uncertainties are correlated and proposes enhanced visualization techniques, such as displaying the first principal component and conditional uncertainties, to enable more accurate assessment of model-data agreement.

Original authors: Lukas Koch

Published 2026-04-03

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a detective trying to solve a mystery. You have a list of clues (data points), and each clue comes with a "margin of error" (how unsure you are about that specific clue). Usually, scientists draw these clues on a graph with little vertical lines (error bars) showing how much wiggle room they have.

The Problem: The "Lone Wolf" Illusion
In the old way of doing things, scientists treated every clue as if it were a "lone wolf." They assumed that if Clue A was wrong, it had nothing to do with Clue B. They drew error bars for each clue independently.

But in the real world, clues often talk to each other. If Clue A is wrong, Clue B is likely wrong in the exact same way. This is called correlation.

The paper argues that when you ignore these connections, your graph becomes a liar.

  • The Analogy: Imagine you are guessing the weather in three neighboring towns. If it rains in Town A, it almost certainly rains in Town B and Town C. If you draw three separate "maybe it rains" signs for each town, a model predicting "sunny everywhere" looks like it failed three separate times. But if you realize the rain is a system-wide event (a storm front), those three misses are really one miss: the model is wrong in a single, predictable way, not in three independent ways.
  • The Paper's Example: The author shows a graph where a model (M2) looks perfect because it stays inside the error bars of all the points. But because the points are "linked" (correlated), the model is actually a terrible fit. It's like a student who stays within partial credit on every single question, but whose pattern of answers contradicts the logic connecting the questions; graded one at a time the answers pass, graded as a whole they fail.
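This effect is easy to reproduce numerically. The sketch below uses hypothetical numbers (not taken from the paper): three data points whose uncertainties are strongly correlated, and a model that sits inside every individual error bar yet is strongly excluded once the correlations enter the chi-square.

```python
import numpy as np

# Three measurements with unit standard deviations and strong positive
# correlation (rho = 0.95) between every pair (hypothetical numbers).
data = np.array([1.0, -1.0, 1.0])
rho = 0.95
cov = np.full((3, 3), rho) + np.eye(3) * (1 - rho)

model = np.zeros(3)   # predicts 0 everywhere: within 1 sigma of each point
resid = data - model

# Naive chi-square, ignoring correlations (diagonal of the covariance only):
chi2_naive = np.sum(resid**2 / np.diag(cov))

# Full chi-square, using the inverse covariance matrix:
chi2_full = resid @ np.linalg.solve(cov, resid)

print(f"naive chi2 = {chi2_naive:.1f}")  # 3.0 for 3 points: looks fine
print(f"full  chi2 = {chi2_full:.1f}")   # far larger: a terrible fit
```

The zigzag residuals (+1, -1, +1) are exactly what strongly positively correlated data should almost never do, which is why the full chi-square explodes while the naive one looks healthy.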

The Solution: New Ways to Draw the Picture
The author, Lukas Koch, suggests three new ways to draw these graphs so we can see the "invisible links" between the data points.

1. The "Hinton" Map (The Weighted Dice)

Instead of just showing the error bars, we need to show a map of how the clues are connected.

  • The Old Way: A colorful heat map. If you print it in black and white or if you are colorblind, it looks like a blurry gray mess. You can't tell if two points are "friends" (positive correlation) or "enemies" (negative correlation).
  • The New Way (Hinton Diagram): Imagine a grid of squares. Instead of using color, we use size.
    • A big square means a strong connection.
    • A tiny square means a weak connection.
    • The color (black or white) tells you if they are friends or enemies.
    • Why it's great: Even in black and white, or for someone who can't see colors, the size difference is obvious. It's like seeing a giant handshake vs. a tiny wave.
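To make the size-instead-of-color idea concrete, here is a minimal text-mode sketch of a Hinton-style rendering (my own assumed encoding, not the paper's code): symbol size stands in for the magnitude of each correlation, and the symbol family stands in for the sign, so nothing is lost in grayscale.

```python
import numpy as np

def hinton_text(corr):
    """Render a correlation matrix with size-coded symbols.

    Positive entries use the ramp ' .oO@' (bigger symbol = stronger link),
    negative entries use ' -xX#'. This mimics a Hinton diagram in text.
    """
    pos_ramp, neg_ramp = " .oO@", " -xX#"
    rows = []
    for row in corr:
        cells = []
        for c in row:
            ramp = pos_ramp if c >= 0 else neg_ramp
            # Map |c| in [0, 1] to the nearest symbol in the ramp.
            idx = min(int(abs(c) * (len(ramp) - 1) + 0.5), len(ramp) - 1)
            cells.append(ramp[idx])
        rows.append(" ".join(cells))
    return "\n".join(rows)

# A hypothetical correlation matrix: points 1 and 2 are "friends",
# point 3 is mildly an "enemy" of point 1.
corr = np.array([[ 1.0,  0.8, -0.5],
                 [ 0.8,  1.0, -0.1],
                 [-0.5, -0.1,  1.0]])
print(hinton_text(corr))
```

Even stripped of color, strong links jump out as big symbols and weak ones fade to almost nothing, which is the whole point of the Hinton encoding.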

2. The "Rope" Method (Correlation Lines)

This is for showing how neighbors affect each other.

  • The Analogy: Imagine the data points are people standing in a line, each holding a balloon (their error bar).
    • If they are friends (positive correlation), they hold their balloons on the same side (both left or both right). If you draw a rope between them, it goes straight across.
    • If they are enemies (negative correlation), one holds their balloon on the left, and the other on the right. If you draw a rope between them, it crosses over like an "X".
  • What it tells you: If you see a rope crossing over, you know that if one person moves up, their neighbor is likely to move down. This helps you see if a model is following the "dance" of the data or fighting against it.
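The rope intuition has a precise counterpart: for jointly Gaussian uncertainties, the expected shift of point B given a shift in point A follows directly from the correlation coefficient. A small sketch with hypothetical numbers:

```python
def expected_neighbor_shift(rho, sigma_a, sigma_b, shift_a):
    """Conditional mean shift of B given a shift in A, for a bivariate
    Gaussian: E[dB | dA] = rho * (sigma_b / sigma_a) * dA."""
    return rho * (sigma_b / sigma_a) * shift_a

# Friends (rho = +0.9): B follows A upward; the rope runs straight across.
print(expected_neighbor_shift(0.9, 1.0, 1.0, 1.0))   # 0.9

# Enemies (rho = -0.9): B moves down when A moves up; the rope crosses.
print(expected_neighbor_shift(-0.9, 1.0, 1.0, 1.0))  # -0.9
```

So a crossing rope on the plot literally encodes a negative slope in this conditional expectation.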

3. The "Shadow" Method (Principal Components)

Sometimes, the biggest problem isn't just neighbors; it's a giant force pushing all the data in one direction at once.

  • The Analogy: Imagine a group of people trying to walk in a straight line, but a giant wind is blowing them all sideways.
    • The Outer Box (the big error bar) shows the total uncertainty, including the wind.
    • The Inner Triangle shows what the uncertainty would be if the wind stopped (the "intrinsic" uncertainty).
    • The Hatched Area (the shadow between the box and the triangle) shows the "wind" itself.
  • The Trick: If a model prediction falls into the "windy" shadow area, it might actually be a good fit! It's just that the whole group was blown off course together. If the model tries to fight the wind (goes against the hatching), it's a bad fit.
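The box/triangle/shadow split can be sketched with an eigendecomposition of the covariance matrix. This is an assumed construction (not necessarily the paper's exact recipe): the first principal component plays the role of the "wind", and whatever variance remains per point is the "intrinsic" spread.

```python
import numpy as np

# Hypothetical covariance: unit variances, strong common correlation.
rho = 0.9
cov = np.full((3, 3), rho) + np.eye(3) * (1 - rho)

# Eigendecomposition of the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)
k = np.argmax(eigvals)                       # first principal component
wind_var = eigvals[k] * eigvecs[:, k] ** 2   # per-point variance along the PC
intrinsic_var = np.diag(cov) - wind_var      # what is left if the wind stops

print("total     sigma:", np.sqrt(np.diag(cov)))   # the outer box
print("wind      sigma:", np.sqrt(wind_var))       # the hatched shadow
print("intrinsic sigma:", np.sqrt(intrinsic_var))  # the inner bar
```

With correlations this strong, almost all of each point's uncertainty is "wind": a model that drifts coherently with the whole dataset can still be a good fit, while one that fights the common direction cannot hide in the shadow.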

The Big Takeaway

The paper is essentially saying: "Don't just look at the dots; look at the invisible strings tying them together."

By adding these visual cues (size-based maps, crossing ropes, and hatched shadows), scientists can stop being fooled by graphs that look good but are actually wrong. It makes the data more honest, more accessible (even for colorblind readers), and helps everyone understand why a model fits or fails, rather than just guessing.

In short: Stop looking at the data points in isolation. Look at how they dance together, and you'll see the truth.
