Original authors: Feline Lindeboom, Martijn Brehm, Davide Grossi, Pradeep Murukannaiah

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to organize a community festival committee. You have a huge list of potential activities (candidates) and a large crowd of neighbors (voters). Your goal is to pick a small group of activities (a committee) that makes the most people happy. In the world of computer science, this is called an "approval-based committee election."

The catch? You can't ask every single neighbor about every single activity. That would take forever, and people would get tired and start guessing or lying just to get it over with. This paper tackles the problem of how to pick the best committee when you have incomplete information (you haven't asked everyone everything) or inaccurate information (people are making mistakes or lying).

Here is the breakdown of their solution using simple analogies:

1. The Goal: Maximum Coverage

Think of the goal as trying to cover as many people as possible with a few umbrellas.

The Umbrellas: The activities you choose for the committee.
The People: The voters.
The Rule: A person is "covered" (happy) if they like at least one of the chosen activities.
The Challenge: You want to pick $k$ activities so that the total number of happy people is maximized. This is a classic computer science puzzle known as the Maximum Coverage Problem. Solving it exactly is NP-hard, but a simple greedy algorithm is guaranteed to reach at least a $1 - 1/e$ fraction (about 63%) of the best possible coverage.
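
The greedy strategy mentioned above can be sketched in a few lines of Python. This is a minimal illustration for the complete-information case, not the paper's code: the function names (`coverage`, `greedy_committee`) and the toy ballots are assumptions made for this example.

```python
from typing import List, Set


def coverage(ballots: List[Set[int]], committee: Set[int]) -> int:
    """Count voters who approve at least one member of the committee."""
    return sum(1 for approved in ballots if approved & committee)


def greedy_committee(ballots: List[Set[int]], m: int, k: int) -> List[int]:
    """Pick k of the m candidates, always adding the candidate that
    covers the most voters who are not covered yet."""
    chosen: List[int] = []
    uncovered = list(ballots)
    for _ in range(min(k, m)):
        remaining = [c for c in range(m) if c not in chosen]
        # Marginal gain of c = number of still-uncovered voters approving c.
        best = max(remaining, key=lambda c: sum(1 for b in uncovered if c in b))
        chosen.append(best)
        uncovered = [b for b in uncovered if best not in b]
    return chosen


# Hypothetical toy election: 6 voters, 4 activities, committee of size 2.
ballots = [{0, 1}, {0}, {0, 2}, {1}, {3}, {3}]
committee = greedy_committee(ballots, m=4, k=2)
print(committee, coverage(ballots, set(committee)))
```

On this toy input the first pick is activity 0 (three approvals), and the second is activity 3, because it covers two voters that activity 0 missed.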

2. The Problem: The "Blind" and the "Noisy"

The authors look at two specific problems that happen in real life (like on digital democracy platforms):

Incomplete Information (The Blindfolded Chef): Imagine a chef trying to taste a soup but can only take a few sips. They don't know the flavor of the whole pot. In our case, voters only see a tiny fraction of the activity list. If you try to guess the best committee without asking enough people, you might miss the most popular activities entirely.
- The Finding: If you ask questions randomly without changing your strategy based on what you hear (non-adaptive), you need to ask a massive number of questions (roughly the square of the number of activities) to get a good answer.
- The Fix: If you are adaptive, meaning you ask a question, listen to the answer, and then decide what to ask next, you can get the same good result with far fewer questions (roughly just the number of activities). It's like a detective who follows a clue rather than checking every house in town randomly.
Inaccurate Information (The Noisy Room): Imagine trying to hear a conversation in a loud room. Sometimes people shout the wrong answer, or you mishear them.
- The Finding: If voters make mistakes with a small probability, you have to ask the same question many times to figure out the truth. To get a reliable answer, the total number of questions you need is roughly the number of voters multiplied by the number of activities. It's like asking a noisy crowd the same question 30 times to be sure you heard the right answer.
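
To see why repetition helps in the noisy room, here is a small simulation sketch. It assumes each answer is flipped independently with probability `p`, and all function names are invented for this illustration. An odd number of repeats (31 rather than 30) is used only to avoid ties in the majority vote.

```python
import random


def noisy_answer(truth: bool, p: float, rng: random.Random) -> bool:
    """Return the true answer, flipped with probability p."""
    return (not truth) if rng.random() < p else truth


def majority_answer(truth: bool, p: float, repeats: int, rng: random.Random) -> bool:
    """Ask the same question `repeats` times and return the majority answer."""
    yes_votes = sum(noisy_answer(truth, p, rng) for _ in range(repeats))
    return yes_votes * 2 > repeats


def error_rate(p: float, repeats: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the majority answer is wrong (truth is 'yes')."""
    rng = random.Random(seed)
    wrong = sum(not majority_answer(True, p, repeats, rng) for _ in range(trials))
    return wrong / trials


single = error_rate(p=0.1, repeats=1, trials=2000)
repeated = error_rate(p=0.1, repeats=31, trials=2000)
print(f"one ask: {single:.3f} wrong, 31 asks: {repeated:.3f} wrong")
```

With a single ask the error rate stays near `p`, while the majority over many asks is wrong far less often. The price is that every question is now asked many times, which is where the large query counts come from.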

3. The Solution: Smart Sampling Algorithms

The authors propose two main algorithms to handle these messy situations:

The Greedy Algorithm (The "Pick the Best Next Step" approach):
- How it works: Instead of asking everyone about everything, the algorithm picks a small group of voters, asks them about a small batch of activities, and estimates which activity would make the most people happy right now. It picks that one, then repeats the process.
- The Magic: By using math to estimate the "true" popularity based on a small sample, they proved you can keep the greedy guarantee (about 63% of the best possible coverage in the worst case) while asking only a small fraction of the total possible questions.
The Local Search Algorithm (The "Swap and Improve" approach):
- How it works: This is for when you have extra rules, like "The committee must have at least 2 sports activities and 2 arts activities." This is called a Matroid Constraint (think of it as a rulebook for valid committees).
- The Strategy: Start with a random valid committee. Then, try swapping one activity for another to see if it makes more people happy. If it does, keep the swap. Repeat until you can't improve it anymore.
- The Result: Even with incomplete or noisy data, this method finds a very strong solution, though it requires slightly more questions than the greedy method.
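
The swap-and-improve strategy can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the plain coverage count as its objective and full ballots, whereas the paper's local search uses a modified auxiliary objective and sampled queries. The rulebook here (`one_per_group`, exactly one activity from each of two hypothetical groups) and all names are made up for the example.

```python
from itertools import product
from typing import Callable, List, Set


def coverage(ballots: List[Set[int]], committee: Set[int]) -> int:
    """Count voters who approve at least one member of the committee."""
    return sum(1 for approved in ballots if approved & committee)


def local_search(
    ballots: List[Set[int]],
    candidates: List[int],
    start: Set[int],
    is_valid: Callable[[Set[int]], bool],
) -> Set[int]:
    """Swap one member out and one candidate in while coverage strictly improves."""
    current = set(start)
    improved = True
    while improved:
        improved = False
        for out_c, in_c in product(sorted(current), candidates):
            if in_c in current:
                continue
            trial = (current - {out_c}) | {in_c}
            if is_valid(trial) and coverage(ballots, trial) > coverage(ballots, current):
                current = trial
                improved = True
                break  # restart the scan from the improved committee
    return current


# Hypothetical rulebook: exactly one sports activity and one arts activity.
GROUPS = {"sports": {0, 1}, "arts": {2, 3}}


def one_per_group(committee: Set[int]) -> bool:
    return all(len(committee & members) == 1 for members in GROUPS.values())


ballots = [{0}, {0}, {1}, {2}, {3}, {3}]
result = local_search(ballots, [0, 1, 2, 3], start={1, 2}, is_valid=one_per_group)
print(result, coverage(ballots, result))
```

The loop always terminates because every accepted swap strictly increases coverage, which cannot exceed the number of voters.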

4. The Real-World Test

The authors didn't just do math on paper; they tested their ideas using real data from Polis, a platform where thousands of people discuss issues online.

They found that in the real world, their "smart sampling" algorithms worked incredibly well.
Even though their worst-case math called for an enormous number of questions (on the order of $10^8$ for some parameter settings) to be confident of the guarantee, in practice they got excellent results with just a handful of questions per person.
They also tested it with "noisy" data (simulating people making mistakes) and found the algorithms still performed very well, far better than the worst-case math predicted.

Summary

This paper is about efficiency in democracy. It proves that you don't need to ask every single person about every single idea to build a diverse, representative committee. By using smart, adaptive questioning strategies (like a detective following clues rather than a random search), you can build a committee that represents the group's diversity accurately, even when people are busy, confused, or only answering a few questions.

Technical Summary: Diverse Committees with Incomplete or Inaccurate Approval Ballots

Problem Statement

This paper addresses the challenge of selecting diverse committees in approval-based elections when voter information is either incomplete or inaccurate. The context is motivated by digital democracy platforms (e.g., Polis) where citizens deliberate on numerous issues, but no single user can evaluate every candidate (statement), leading to sparse ballots. Furthermore, voters may provide inaccurate responses due to fatigue or noise.

The objective is to select a committee $W$ of size $k$ from $m$ candidates that maximizes the Chamberlin-Courant (CC) score, which corresponds to the Maximum Coverage problem. The goal is to maximize the number of voters who have at least one approved candidate in the selected committee. The paper investigates how many queries (votes) are required to achieve a $(1 - 1/e)$ -approximation of the optimal score with high probability (w.h.p.) under two distinct information models:

Incomplete Information: Voters only respond to a subset of candidates. The algorithm must adaptively query voters to reconstruct the necessary information.
Inaccurate Information: Voters provide complete ballots, but each response is flipped with a small probability $p$ .
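
The objective described above can be written out explicitly. The notation here is an assumption introduced for illustration and may differ from the paper's: $N$ is the set of $n$ voters, $C$ the set of $m$ candidates, and $A_i \subseteq C$ the set of candidates approved by voter $i$.

```latex
\[
  \mathrm{CC}(W) \;=\; \bigl|\{\, i \in N : A_i \cap W \neq \emptyset \,\}\bigr| ,
  \qquad W \subseteq C,\ |W| = k .
\]
% Marginal gain of adding candidate c to a partial committee W:
\[
  \Delta(c \mid W) \;=\; \mathrm{CC}(W \cup \{c\}) - \mathrm{CC}(W)
  \;=\; \bigl|\{\, i \in N : c \in A_i,\ A_i \cap W = \emptyset \,\}\bigr| .
\]
% Target guarantee, where W^* is an optimal committee of size k:
\[
  \mathrm{CC}(W) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)\, \mathrm{CC}(W^{*}) .
\]
```

In the inaccurate-information model, each reported answer to "is $c \in A_i$?" is the true answer flipped independently with probability $p$.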

Methodology

Theoretical Framework

The authors formalize the problem using the Maximum Coverage framework. They distinguish between non-adaptive algorithms (which must specify all queries beforehand) and adaptive algorithms (which can adjust queries based on previous outcomes).

Incomplete Information (Adaptive):
- The authors propose a Greedy Query Algorithm (Algorithm 2) that samples voters to estimate the marginal gain ( $\Delta$ ) of adding a candidate to the committee.
- They also adapt a Non-Oblivious Local Search algorithm (Algorithm 4) for settings with matroid constraints (e.g., quotas on demographic groups). This algorithm uses an auxiliary objective function that weights elements covered multiple times to avoid local optima.
- Both algorithms rely on Hoeffding's inequality to bound the error of estimates derived from sampling a subset of voters ( $\ell$ ) rather than the entire population ( $n$ ).
Incomplete Information (Non-Adaptive Lower Bound):
- The paper proves that for non-adaptive algorithms, achieving an approximation ratio arbitrarily close to optimal requires $\Omega(m^2)$ queries. This motivates the shift to adaptive strategies.
Inaccurate Information:
- The authors model noise as a Bernoulli flip with probability $p$ .
- They propose an algorithm that repeats each query multiple times and takes a majority vote to recover the true preference.
- They utilize results from multi-armed bandit theory to establish lower bounds on the necessary query complexity.
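
The sampling idea behind the greedy query algorithm can be sketched as below. This is an illustrative simplification, not the paper's Algorithm 2: the sample size uses the standard two-sided Hoeffding bound $\ell \ge \ln(2/\delta) / (2\varepsilon^2)$ for a single estimate, voters are sampled with replacement, and the sketch reads stored ballots directly instead of counting individual queries. The names `eps`, `delta`, and all function names are assumptions for this example.

```python
import math
import random
from typing import List, Set


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Samples needed so a [0, 1]-valued mean is within eps with prob. >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_gain(
    ballots: List[Set[int]],
    committee: Set[int],
    candidate: int,
    sample_size: int,
    rng: random.Random,
) -> float:
    """Estimate the fraction of voters newly covered by adding `candidate`."""
    hits = 0
    for _ in range(sample_size):
        voter = rng.choice(ballots)  # uniform sample with replacement
        if candidate in voter and not (voter & committee):
            hits += 1
    return hits / sample_size


def sampled_greedy(
    ballots: List[Set[int]], m: int, k: int, eps: float, delta: float, seed: int = 0
) -> Set[int]:
    """Greedy selection driven by sampled estimates of marginal gain."""
    rng = random.Random(seed)
    ell = hoeffding_sample_size(eps, delta)
    committee: Set[int] = set()
    for _ in range(min(k, m)):
        rest = [c for c in range(m) if c not in committee]
        best = max(rest, key=lambda c: estimate_gain(ballots, committee, c, ell, rng))
        committee.add(best)
    return committee


# Hypothetical electorate: 60 voters like activity 0, 30 like 1, 10 like 2.
ballots = [{0}] * 60 + [{1}] * 30 + [{2}] * 10
print(hoeffding_sample_size(0.1, 0.05), sampled_greedy(ballots, m=3, k=2, eps=0.05, delta=0.01))
```

The sample size depends only on the accuracy `eps` and failure probability `delta`, not on the number of voters $n$, which is why sampling $\ell$ voters instead of the whole population can save so many queries.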

Experimental Approach

The authors evaluate their algorithms using:

Real Data: 18 open-use datasets from Polis deliberations.
Synthetic Data: Generated using the $(q, \phi)$ -resampling model (Szufa et al.) calibrated to match the statistical properties of the Polis data (specifically approval density $q$ and spread $\phi$ ).
Baselines: Comparison against standard Approval Voting and a Local Search Proportional Approval Voting (PAV) algorithm.
Metrics: The Chamberlin-Courant score, measured both in absolute terms and relative to the complete-information optimal solution.

Key Contributions

Query Complexity Bounds for Incomplete Information:
- Non-Adaptive: Proved a lower bound of $\Omega(m^2)$ queries to achieve near-optimal approximation w.h.p.
- Adaptive: Demonstrated that adaptive algorithms can reduce this bound to $\tilde{\Theta}(m)$ (specifically $O(m \log m)$ ) for both unconstrained and matroid-constrained settings. This is achieved by the Greedy and Local Search query algorithms, which match the lower bound up to logarithmic factors.
Query Complexity Bounds for Inaccurate Information:
- Proved that recovering the optimal approximation ratio w.h.p. in the presence of noise requires $\tilde{\Theta}(nm)$ queries. This matches the lower bound derived from multi-armed bandit theory.
Matroid Constraints:
- Extended the results to generalized Maximum Coverage over matroid constraints, allowing for structural requirements like upper and lower quotas on candidate groups. The proposed local search algorithm maintains the $\tilde{\Theta}(m)$ query complexity in this setting.
Empirical Validation:
- Showed that the complete-information Greedy and Local Search algorithms outperform existing methods (Approval Voting, Local Search PAV) on real and synthetic data.
- Demonstrated that the querying algorithms perform significantly better in practice than worst-case theoretical bounds suggest. Even with very few queries ( $M=1$ to $5$ per voter) and noisy data ( $p=0.1$ ), the algorithms achieved scores close to the complete-information baseline, often exceeding the theoretical worst-case approximation ratio of $1-1/e$ .

Results

Theoretical Bounds:
- Incomplete (Adaptive): $\tilde{\Theta}(m)$ queries are sufficient.
- Incomplete (Non-Adaptive): $\Omega(m^2)$ queries are necessary.
- Inaccurate: $\tilde{\Theta}(nm)$ queries are necessary and sufficient.
- Note: While the theoretical bounds for the Local Search algorithm with matroid constraints involve large constant factors (making them potentially infeasible for very large $k$ in worst-case scenarios), the Greedy algorithm remains more efficient.
Experimental Findings:
- Complete Information: The proposed Greedy and Local Search algorithms consistently achieved higher CC-scores than Approval Voting and Local Search PAV across 118 datasets (18 real, 100 synthetic).
- Incomplete/Noisy Information:
  - With accurate responses ( $p=0$ ) and minimal querying ( $M=1$ ), the algorithms achieved 85–95% of the complete-information score.
  - With inaccurate responses ( $p=0.1$ ), performance dropped slightly (factor of 0.95–0.97) but remained well above the $1-1/e$ threshold.
  - Combining incomplete and inaccurate information resulted in a performance hit (factor of 0.7–0.8 for $M=1$ ), yet the algorithms still maintained robust diversity scores.
- Gap between Theory and Practice: The authors note that while theoretical bounds (e.g., requiring $10^8$ queries for specific parameters) suggest infeasibility for real-world instances, the empirical results show that far fewer queries are needed to achieve high-quality diverse committees.

Significance and Claims

The paper claims to be the first to address diversity in approval-based committee elections specifically within the context of online civic participation platforms where information is inherently sparse or noisy.

Practical Viability: The primary significance lies in demonstrating that diverse committees can be found by querying only a small fraction of voters, even when responses are inaccurate. This is crucial for the scalability of digital democracy tools like Polis.
Theoretical Tightness: The work establishes tight query complexity bounds (up to log-factors) for adaptive algorithms, showing that adaptivity is essential for efficiency in incomplete information settings.
Robustness: The empirical results suggest that the algorithms are robust to both data sparsity and noise, performing well even when the number of queries is far below the theoretical worst-case requirements.
Extensibility: By incorporating matroid constraints, the framework supports complex diversity requirements (e.g., quotas) without sacrificing asymptotic query efficiency.

The authors remain modest regarding the large constant factors in their theoretical bounds for the Local Search algorithm. They acknowledge that, although the bounds are asymptotically viable, these constants may hinder performance in specific practical instances; their experiments suggest this is less of an issue in practice than theory predicts. They also note that the combination of incomplete and inaccurate information was studied empirically but not theoretically, leaving that as a direction for future work.

Diverse Committees with Incomplete or Inaccurate Approval Ballots