`pandemonium`: High Dimensional Analysis in Linked… — Plain-Language Explanation

Imagine you are trying to solve a giant, complex puzzle where you have two different sets of clues. One set of clues describes what you put in (like ingredients in a recipe or settings on a machine), and the other set describes what comes out (like the taste of the cake or the machine's output).

The problem is that there are so many ingredients and so many possible tastes that it's impossible to see the pattern just by looking at a spreadsheet. You need a way to see how the ingredients together create specific tastes.

This is exactly what the pandemonium R package does. It's a digital "magic window" that helps researchers connect the dots between two high-dimensional worlds.

Here is how it works, using simple analogies:

1. The Two Rooms (Linked Spaces)

Think of your data as two separate rooms:

Room A (The Clustering Space): This is where you group things based on how similar they are. Imagine sorting a pile of mixed-up socks by color and pattern.
Room B (The Linked Space): This is where you look at the original details. Imagine looking at the same socks to see what fabric they are made of or where they were bought.

Usually, researchers look at Room A, then walk over to Room B and try to guess how they relate. pandemonium puts a giant, two-way mirror between the rooms. When you point at a group of socks in Room A, the mirror instantly highlights those exact same socks in Room B, showing you their fabric and origin.

2. The Magic Lens (Clustering)

The tool starts by organizing the data in Room A. It uses a method called hierarchical clustering, which is like folding a map. You can zoom out to see a few big regions (like continents) or zoom in to see tiny neighborhoods (like streets).

You can say, "Show me 3 big groups," or "Show me 10 small groups."
As you change the number of groups, the tool instantly updates the view in both rooms.
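
For readers who like to see the idea in code, here is a minimal sketch of the "folding map" in plain Python. It is not the pandemonium API and not the package's implementation. The function names `agglomerate` and `labels_for`, the toy points, and the choice of complete linkage are all assumptions made for illustration.

```python
import math

def agglomerate(points):
    """Complete-linkage agglomerative clustering.

    Returns a dict mapping each possible number of clusters (n down to 1)
    to the partition of row numbers at that level.
    """
    clusters = [[i] for i in range(len(points))]
    history = {len(clusters): [list(c) for c in clusters]}
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        history[len(clusters)] = [list(c) for c in clusters]
    return history

def labels_for(history, k):
    """Turn the partition with k clusters into one label per row."""
    n = max(history)  # the finest level has one cluster per row
    labels = [None] * n
    for label, members in enumerate(sorted(history[k])):
        for i in members:
            labels[i] = label
    return labels

# Toy data, invented for illustration: three tight pairs of points.
points = [(0, 0), (0, 1), (4, 5), (4, 6), (10, 0), (10, 1)]
history = agglomerate(points)
three_groups = labels_for(history, 3)
two_groups = labels_for(history, 2)
```

Because the same merge history answers both requests, asking for 2 groups instead of 3 never reshuffles points: two of the three groups simply fuse. That nesting is what makes zooming in and out behave predictably.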

3. The Moving Camera (Tours and Projections)

Since the data has too many dimensions to draw on a flat piece of paper, the tool uses two special camera tricks to flatten the 3D (or 100D) world onto a 2D screen:

The Non-Linear Lens (UMAP/t-SNE): This is like a funhouse mirror that squishes and stretches the data so that points that are neighbors in the full high-dimensional data stay close together on the flat screen, even though larger distances get distorted along the way.
The Animated Tour: This is like a drone flying through a cloud of data points. Instead of a static photo, you get a video that slowly rotates the cloud, letting you see hidden shapes and gaps that you would miss if you just looked at one angle.
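
To make the "drone" idea concrete, the sketch below computes a single camera angle: two perpendicular directions in a 5-dimensional space, and the 2D screen position of one point seen from that angle. It is plain Python written for illustration, not the code used by pandemonium or the tour packages. The names `random_frame` and `project` and the sample point are assumptions.

```python
import math
import random

def random_frame(p, rng):
    """Return two orthonormal direction vectors in p dimensions (Gram-Schmidt)."""
    def normalise(v):
        length = math.sqrt(sum(x * x for x in v))
        return [x / length for x in v]

    a = normalise([rng.gauss(0, 1) for _ in range(p)])
    b = [rng.gauss(0, 1) for _ in range(p)]
    # Remove the part of b that points along a, so the two axes are perpendicular.
    overlap = sum(x * y for x, y in zip(a, b))
    b = normalise([y - overlap * x for x, y in zip(a, b)])
    return a, b

def project(point, frame):
    """Flatten one p-dimensional point onto the 2D screen defined by the frame."""
    return tuple(sum(x * d for x, d in zip(point, direction))
                 for direction in frame)

rng = random.Random(1)                    # fixed seed so the view is repeatable
frame = random_frame(5, rng)
point = [1.0, 2.0, 3.0, 4.0, 5.0]         # made-up 5-dimensional observation
screen_xy = project(point, frame)
```

An animated tour can be thought of as a long sequence of such frames, with smooth steps in between so the picture appears to rotate rather than jump.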

4. The "Brush" (Interactive Selection)

This is the most powerful feature. Imagine you have a paintbrush.

You paint a specific cluster of points in the "drone video" (Room A).
Instantly, those same points light up in the "static map" (Room B).
This lets you ask questions like: "Why do all these points that look similar in the output (Room A) have such different temperatures and humidity levels in the input (Room B)?"
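
The trick behind the brush is simple bookkeeping: both rooms list the same observations in the same order, so a selection is just a set of row numbers. The following is a minimal sketch in plain Python, with invented toy values and a hypothetical `brush` helper; it does not reproduce how the package implements brushing.

```python
# Toy data, invented for illustration: row i in both tables is the same observation.
room_a_view = [(0.2, 0.3), (0.4, 0.1), (3.1, 2.8), (2.9, 3.3)]  # 2D plot positions
room_b_data = [
    {"temperature": 4, "humidity": 85},
    {"temperature": 27, "humidity": 30},
    {"temperature": 15, "humidity": 60},
    {"temperature": 16, "humidity": 55},
]

def brush(view, x_range, y_range):
    """Return the row numbers of all points inside a rectangular brush."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {i for i, (x, y) in enumerate(view)
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi}

# Paint the lower-left corner of the Room A plot...
selected = brush(room_a_view, (0.0, 1.0), (0.0, 1.0))
# ...and look up the very same rows in Room B.
highlighted = [room_b_data[i] for i in sorted(selected)]
```

In this toy case the two brushed points sit next to each other in Room A but have very different temperature and humidity values in Room B, which is exactly the kind of contrast the question above is about.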

Real-World Examples from the Paper

The authors tested this tool on two very different problems to show how it works:

Example 1: The Bike Rental Machine (Machine Learning)

The Setup: They had a computer model that predicts how many bikes people will rent based on weather (temperature, wind, rain).
The Problem: They wanted to know which weather combinations make the model act strangely or predict well.
The Solution: They grouped the model's internal "thoughts" (activations) into clusters. Then, they used the mirror to look at the weather data for those groups. They discovered that specific combinations of temperature and humidity were the main drivers for separating the groups. They also checked the "mistakes" (residuals) the model made and saw that the model was actually doing a good job everywhere, with no weird blind spots.

Example 2: The Particle Physics Puzzle (Physics)

The Setup: Physicists have a complex model with 150 knobs (parameters) that they turn to match experimental data about subatomic particles.
The Problem: With 150 knobs, it's impossible to know which ones actually matter.
The Solution: They took a smaller set of 6 knobs and 16 measurements. They grouped the measurements that looked similar. Then, they looked at the "knobs" for those groups. The tool revealed that two specific knobs (out of the six) were responsible for creating the distinct groups, while at least one of the other knobs showed no visible effect on the grouping.

Why This Matters

Before tools like pandemonium, figuring out these connections was like trying to find a needle in a haystack while wearing a blindfold. You might guess, but you couldn't see the pattern.

This package doesn't just crunch numbers; it lets you explore. It allows you to:

Group data by similarity.
Instantly see what those groups look like in the original data.
Rotate and zoom through the data to find hidden structures.

It is designed to be easy enough for a beginner to use with a mouse and screen, but flexible enough for experts to plug in their own custom math formulas. It turns a confusing mess of high-dimensional data into a clear, interactive story.

Technical Summary: pandemonium: High Dimensional Analysis in Linked Spaces

Problem Statement
Data analysis frequently encounters scenarios involving large numbers of predictors and responses, creating two intrinsically linked high-dimensional spaces (input and output). While visual approaches are effective for low-dimensional data, traditional techniques often fail to reveal relationships spanning both domains simultaneously. Existing tools typically focus on a single space or the interactive exploration of clustering results within one space, making it difficult to reason about how structures in a predictor space relate to patterns in a response space, or vice versa.

Methodology
The paper introduces pandemonium, an R package designed to explore linked high-dimensional spaces by combining hierarchical cluster analysis with interactive, linked visualizations. The methodology operates on a dataset of $n$ observations distributed across two spaces: a clustering space (variables $Y$) and a linked space (variables $X$), with optional additional information ($Z$).

The core workflow involves:

Coordinate Transformation: Raw data is converted into coordinate representations ($\tilde{Y}$, $\tilde{X}$) using user-defined or predefined functions (e.g., standardization, or transformations utilizing variance-covariance matrices).
Hierarchical Clustering: Observations are clustered within the clustering space using hierarchical clustering. The package supports repeatable results via nested cluster selection, allowing users to adjust the number of clusters, distance metrics, and linkage methods.
Linked Visualization: The resulting clusters are simultaneously visualized in both the clustering and linked spaces. The visualization framework employs:
- Non-linear Dimension Reduction (NLDR): Techniques such as t-SNE and UMAP to project high-dimensional data into 2D.
- Animated Tours: Linear projections (e.g., grand tours, guided tours, slice tours) generated via the tourr and detourr packages.
- Linked Brushing: Implemented using the crosstalk package, allowing selections (brushing) in one view (e.g., a UMAP plot of the clustering space) to immediately highlight corresponding points in all other views (e.g., a tour of the linked space).
Statistical Guidance: The package provides cluster statistics (e.g., Calinski-Harabasz index, within/between ratios, cluster radii, and benchmark distances) to assist in selecting the optimal number of clusters.
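
As an illustration of the statistical guidance step, the Calinski-Harabasz index of a partition is the between-cluster dispersion divided by the within-cluster dispersion, each scaled by its degrees of freedom. The following is a minimal Python sketch of that standard definition. The function name `calinski_harabasz` and the toy data are assumptions, and the source does not say how the package computes its statistics internally.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)).

    B is the between-cluster sum of squares, W the within-cluster sum of
    squares, n the number of observations and k the number of clusters.
    """
    n = len(points)
    dims = len(points[0])
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)

    def centroid(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for rows in groups.values():
        c = centroid(rows)
        between += len(rows) * sq_dist(c, overall)
        within += sum(sq_dist(r, c) for r in rows)
    return (between / (k - 1)) / (within / (n - k))

# Toy data, invented for illustration: two well-separated pairs.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good_split = calinski_harabasz(points, [0, 0, 1, 1])  # pairs kept together
poor_split = calinski_harabasz(points, [0, 1, 0, 1])  # pairs torn apart
```

A higher value indicates compact, well-separated clusters, so comparing the index across candidate cluster counts is one way to guide the choice of that number.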

Key Contributions

Generic Framework for Linked Spaces: Unlike previous tools that focus on refining clustering within a single domain, pandemonium defines a generic framework for exploring two connected spaces while interactively changing clustering settings.
Modular Architecture: Built on shiny, the package allows users to inject custom functions for coordinate transformations, score calculations, and dimension reduction methods, extending its applicability beyond the default implementations.
Integrated Visual Analytics: It uniquely integrates hierarchical clustering, NLDR, and animated tours in a single interface, enabling the comparison of cluster structures against the geometry of the linked space.
Reproducibility: The package includes makePlots() and writeResults() functions to reproduce GUI-based analyses and export results programmatically outside the interactive session.

Results and Case Studies
The paper validates the package through two distinct case studies:

Machine Learning Interpretation: The package was used to analyze a neural network model predicting bike rental counts. By clustering latent activations (clustering space) and mapping them to input variables (linked space), the authors identified that specific input combinations (temperature and humidity) drove distinct activation patterns. The linked views revealed that while the model residuals were well-distributed, the activation space contained linear structures corresponding to the ReLU activation function, which were not immediately obvious in the input space alone.
High-Dimensional Physics Modeling: The package analyzed a complex particle physics model with 150 parameters, restricted for this study to a subset of six predictors and sixteen responses. Using a coordinate transformation based on experimental covariance matrices, the authors clustered the response space. Linked visualizations identified specific predictors ($X_1$ and $X_3$) as responsible for separating the clusters, while others (such as $X_6$) showed no dependence. This demonstrated the tool's ability to isolate relevant predictors in high-dimensional parameter spaces.
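
One common way to build coordinates from a covariance matrix is sketched below. It is an assumption for illustration, not necessarily the transformation used in the paper. The symbols $\Sigma$ (experimental covariance of the responses), $L$ (its Cholesky factor), and $y_0$ (a reference point, for example the measured values) are introduced here and do not appear in the source.

$$
\Sigma = L L^{\top}, \qquad \tilde{y}_i = L^{-1}\,(y_i - y_0)
$$

$$
\lVert \tilde{y}_i - \tilde{y}_j \rVert^2
= (y_i - y_j)^{\top} L^{-\top} L^{-1} (y_i - y_j)
= (y_i - y_j)^{\top} \Sigma^{-1} (y_i - y_j)
$$

Under this choice, plain Euclidean distance in the transformed coordinates $\tilde{Y}$ equals the covariance-weighted (Mahalanobis) distance between the original responses, so a clustering of $\tilde{Y}$ automatically accounts for measurement uncertainties and their correlations.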

Significance and Limitations
The paper positions pandemonium as an exploratory tool that bridges the gap between statistical clustering and visual analytics in linked domains. Its significance lies in enabling analysts to formulate intuitive hypotheses about how structures in one space (e.g., model predictions or latent variables) relate to structures in another (e.g., raw inputs or experimental observables).

The authors note modest limitations:

Scalability: The tool is limited to mid-sized applications due to computing time constraints for tours and the visual clutter inherent in high-dimensional data. For very large datasets, variable selection or linear dimension reduction is recommended prior to exploration.
Flexibility vs. Simplicity: While the package offers modular inputs for advanced users, some visual options are fixed to maintain simplicity for novice users.
Future Work: The authors suggest that further development is needed to identify limitations through broader application testing and to potentially extend modularity for more complex use cases.

The paper concludes that pandemonium provides a valuable, accessible interface for investigating the interdependence of high-dimensional spaces, applicable across diverse fields from machine learning to theoretical physics.
