Prompt Programming for Cultural Bias and Alignment of Large Language Models

This paper validates the persistence of cultural biases in open-weight large language models and demonstrates that using DSPy to optimize prompts as modular programs offers a more stable and transferable approach to achieving cultural alignment than manual prompt engineering.

Maksim Eren, Eric Michalak, Brian Cook, Johnny Seales Jr

Published 2026-03-18

Imagine you have a super-smart, all-knowing robot librarian named "LLM." This robot has read almost every book ever written, but there's a catch: it was mostly trained on books from the Western world (the US, the UK, and Europe). Because of this, when you ask the robot a question, it tends to answer with Western values, priorities, and ways of thinking, even if you ask it to speak for someone in a completely different culture, like a farmer in Kenya or a shopkeeper in Japan.

This paper is about a team of researchers at Los Alamos National Laboratory who wanted to fix this "cultural bias" so the robot can be a better, fairer assistant for people all over the world.

Here is the story of what they did, explained simply:

1. The Problem: The Robot's "Default Setting"

The researchers started by testing five different open-weight versions of this robot (models like Llama and Gemma). They asked each one a series of questions drawn from a global survey of human values (the World Values Survey).

The Analogy: Imagine the robot is a chameleon. When you don't tell it what color to be, it automatically turns "Western Blue." It doesn't matter if you ask it to describe life in Brazil or China; without a specific instruction, it defaults to its own "Western Blue" perspective.

The researchers found that, just like previous studies on expensive, closed robots, these freely available, open-weight robots also had this "Western Blue" default. On a map of human values, they clustered together, far away from many other cultures.
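
To make "far away on a map of values" concrete, here is one simple way such a gap could be scored (this summary does not give the paper's exact metric, so treat the function and numbers below as an illustrative sketch): average the model's numeric answers per survey question and measure the distance to real respondents' averages for each country.

```python
import numpy as np

def cultural_gap(model_answers: dict[str, float],
                 survey_averages: dict[str, float]) -> float:
    """Average absolute gap between a model's answers and real survey
    averages over the questions both sides answered (lower = closer)."""
    shared = model_answers.keys() & survey_averages.keys()
    return float(np.mean([abs(model_answers[q] - survey_averages[q])
                          for q in shared]))

# Hypothetical WVS-style items scored on a 1-10 scale (made-up numbers).
model  = {"life_satisfaction": 7.9, "trust_in_strangers": 6.3}
kenya  = {"life_satisfaction": 6.1, "trust_in_strangers": 4.4}
sweden = {"life_satisfaction": 7.7, "trust_in_strangers": 6.5}

print(cultural_gap(model, kenya))   # large gap: the "Western Blue" default
print(cultural_gap(model, sweden))  # small gap: already close to Western data
```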

2. The First Fix: "Manual Prompt Engineering" (The Sticky Note)

To fix this, the researchers tried a simple trick. They added a "sticky note" to the front of every question.

  • Without the note: "How happy are you?" (Robot answers with Western values).
  • With the note: "You are a citizen of Egypt. How happy are you?"

The Analogy: This is like telling the chameleon, "Okay, pretend you are a desert lizard." The robot does a better job! It shifts its answers closer to how real people in Egypt actually feel. This is called Prompt Engineering. It works, but it's a bit like manually writing a new sticky note for every single country you visit. It's tedious and might not be perfect.
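
In code, the "sticky note" is nothing more than a hand-written prefix glued onto every question. A minimal sketch (the exact persona wording the authors used is not shown here, so the phrasing is illustrative):

```python
def persona_prompt(country: str, question: str) -> str:
    """Prepend a hand-written 'sticky note' persona to a survey question."""
    return f"You are a citizen of {country}. Answer as a typical person there would. {question}"

# One sticky note per country: easy for one country, tedious for dozens.
print(persona_prompt("Egypt", "On a scale of 1 to 10, how happy are you?"))
```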

3. The Big Innovation: "Prompt Programming" (The Smart Auto-Pilot)

The researchers asked: Can we do better than writing sticky notes by hand?

They used a tool called DSPy. Think of DSPy as a "smart auto-pilot" for the robot's instructions. Instead of a human writing the sticky note, they let the computer write and test thousands of different versions of the instruction to see which one works best.

The Analogy:

  • Manual Engineering: You are a chef trying to make a dish taste like "Italy." You taste it, add a pinch of basil, taste it again, add a pinch of oregano. You are doing the work manually.
  • Prompt Programming (DSPy): You give the recipe to a super-fast robot chef. It instantly cooks 1,000 versions of the dish, tastes them all against a "perfect Italian flavor" target, and automatically picks the one that is closest. It then figures out the exact perfect recipe on its own.
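
Here is roughly what that auto-pilot looks like as a DSPy program. The paper's actual code, model choices, metric, and optimizer settings are not given in this summary, so every name below is an illustrative assumption (and DSPy's API shifts a little between versions). Note the prompt_model argument: that is where the "which brain writes the instructions" finding from the next section plugs in.

```python
import dspy

# Assumed models: a small "worker" answers the survey, while a larger
# "teacher" drafts candidate instructions during optimization.
task_lm = dspy.LM("ollama_chat/llama3")
prompt_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=task_lm)

class SurveyAnswer(dspy.Signature):
    """Answer the survey question as a typical resident of the country."""
    country: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="a single number on the survey's scale")

program = dspy.Predict(SurveyAnswer)

# Training examples pair WVS-style questions with real survey averages
# (values made up for illustration; MIPROv2 needs a larger trainset).
trainset = [
    dspy.Example(country="Egypt",
                 question="How satisfied are you with your life? (1-10)",
                 answer="6").with_inputs("country", "question"),
]

def alignment_metric(example, prediction, trace=None):
    # Hypothetical "taste test": reward answers close to the real survey
    # average, assuming the model returns a bare number on a 1-10 scale.
    try:
        gap = abs(float(prediction.answer) - float(example.answer))
    except ValueError:
        return 0.0
    return 1.0 - gap / 9.0

# MIPROv2 writes, scores, and selects candidate instructions automatically;
# prompt_model is the "robot chef" that drafts the recipes.
optimizer = dspy.MIPROv2(metric=alignment_metric,
                         prompt_model=prompt_lm,
                         auto="light")
optimized = optimizer.compile(program, trainset=trainset)

print(optimized(country="Egypt", question="How happy are you? (1-10)").answer)
```

In the chef analogy, alignment_metric is the taste test and MIPROv2 is the robot chef: it cooks many candidate instructions, scores each one against the survey data, and keeps the best.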

4. What They Found

The researchers compared the "Manual Sticky Note" method against the "Smart Auto-Pilot" (DSPy) method.

  • The Result: The Smart Auto-Pilot won. It didn't just nudge the robot toward the target culture; it often moved it much closer than the hand-written sticky notes did.
  • The Surprise: The auto-pilot was especially good at helping the robot understand cultures that were very different from its Western training data (like countries in Africa or the Middle East). For Western countries, the robot was already close, so the improvement was small. But for distant cultures, the auto-pilot made a huge difference.
  • The Secret Sauce: They found that the "brain" used to write the instructions mattered. Using a very smart, large robot to write the instructions for the smaller robot worked better than using a small robot to write them.

5. Why This Matters

Why should you care?
Because these robots are starting to be used for serious jobs: writing laws, summarizing news, helping governments make decisions, and auditing documents.

If a robot is making decisions for a country in the Middle East but thinks like a person from New York, it might suggest policies that don't make sense or feel unfair to the local people. This paper shows that by using Prompt Programming, we can "tune" these robots to respect and reflect the values of the specific people they are serving, making them more fair and useful tools for everyone, not just the West.

Summary

  • The Issue: AI robots naturally think like Westerners.
  • The Old Fix: Manually telling the robot, "Pretend you are from Country X." (Works okay).
  • The New Fix: Using a smart computer program (DSPy) to automatically write the best possible instructions to make the robot think like Country X. (Works much better).
  • The Goal: To make AI a fair partner in strategic decisions and daily life for people everywhere, not just in the West.
