How We Generate Census Weighted Synthetic Populations for 20 Countries

Jason Duke, Founder, Kronaxis

Tag: Technical

Most synthetic persona generators work like this: pick a name from a list, assign a random age, bolt on a personality type from a dropdown, and call it a persona. The result looks plausible at a glance. It falls apart the moment you ask a question that depends on internal consistency.

A 24 year old woman living in Burnley on £22,000 a year does not have the same media diet, political priors, or economic anxieties as a 58 year old man in Guildford on £95,000. If you generate both from the same uniform random distribution and slap on different names, their responses will be driven by prompt engineering rather than a coherent model of who they are.

We took a different approach. Every persona in Panel Studio starts as a demographic skeleton drawn from real census data. The personality layer, economic state, media diet, political orientation, and life history are generated in three passes, each constrained by everything that came before. The result is a synthetic population that matches the actual demographic distribution of the target country, with internal consistency enforced by 18 validation rules.

The Demographic Skeleton

The foundation is census data. For the United Kingdom, we use ONS Census 2021 with constituency level resolution across all 650 parliamentary constituencies. For the United States, we use Census Bureau data at state level. Canada uses StatCan data and Australia uses ABS data. We have 20 dedicated country builders (GB, US, DE, FR, NL, SA, AE, EG, TR, SD, IE, CA, AU, NZ, SE, NO, DK, BE, JP, KR), each encoding that country's demographic distributions: regions, age bands, gender balance, ethnic composition, religious affiliation, education levels, occupation categories with salary ranges, and political parties.

The country builder is not a lookup table. It is a structured data class that provides weighted sampling functions for every demographic variable. When you ask for a panel of 400 UK personas, the builder does not pick 400 random ages. It assigns ages proportionally from the ONS age band distribution. It assigns regions to match the census population weights. Gender, ethnicity, religion, and education follow the same logic.
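
To make the idea concrete, here is a minimal sketch of a builder that exposes weighted sampling over its demographic tables. It is an illustration, not the Panel Studio source: the class name CountryBuilder, its fields, and the sample method are hypothetical, and the weights are placeholders rather than real ONS figures.

```python
import random
from dataclasses import dataclass


@dataclass
class CountryBuilder:
    """Hypothetical builder: one weighted table per demographic variable."""
    code: str
    age_bands: dict[str, float]
    regions: dict[str, float]

    def sample(self, table: dict[str, float], n: int, rng: random.Random) -> list[str]:
        # Draw n labels with probability proportional to each label's weight.
        labels = list(table)
        weights = list(table.values())
        return rng.choices(labels, weights=weights, k=n)


# Placeholder weights for illustration only, NOT real census figures.
gb = CountryBuilder(
    code="GB",
    age_bands={"18-34": 0.28, "35-54": 0.33, "55+": 0.39},
    regions={"North East": 0.04, "London": 0.13, "Rest of UK": 0.83},
)

rng = random.Random(42)  # a fixed seed makes the draw repeatable
ages = gb.sample(gb.age_bands, 400, rng)
regions = gb.sample(gb.regions, 400, rng)
```

Because the weights drive the draw, a 400 persona panel tends towards the table's proportions instead of a flat spread across categories.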

Deterministic Preassignment

Before any language model is involved, the demographic sampler preassigns every structural variable. Region, age band, gender, ethnicity, and religion are all allocated proportionally from census data using weighted random sampling with a fixed seed. The demographic composition of your panel is locked before the personality layer exists.
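
One way to get proportions that are locked in advance is quota allocation followed by a seeded shuffle. The sketch below makes that assumption; it is not a description of the actual sampler. The function name allocate and the share figures are hypothetical. It uses the largest remainder method so the counts always sum to the panel size.

```python
import random


def allocate(shares: dict[str, float], n: int, seed: int) -> list[str]:
    """Split n slots across labels in proportion to shares, then shuffle."""
    total = sum(shares.values())
    exact = {label: n * share / total for label, share in shares.items()}
    counts = {label: int(value) for label, value in exact.items()}
    # Hand any leftover slots to the labels with the largest fractional part.
    leftover = n - sum(counts.values())
    by_remainder = sorted(exact, key=lambda label: exact[label] - counts[label], reverse=True)
    for label in by_remainder[:leftover]:
        counts[label] += 1
    slots = [label for label, count in counts.items() for _ in range(count)]
    # The fixed seed makes the ordering reproducible from run to run.
    random.Random(seed).shuffle(slots)
    return slots


# Placeholder percentages for illustration only, NOT real census figures.
panel_regions = allocate({"North East": 4, "London": 13, "Rest of UK": 83}, n=500, seed=2021)
```

With these placeholder shares, a 500 persona panel gets exactly 20 North East slots, 65 London slots, and 415 for the rest, whatever the seed. The seed only affects which persona lands in which slot.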

This is deliberate. If you let the language model decide demographics, you get whatever biases the model absorbed from its training data. American personas skew coastal and college educated. British personas skew London and professional class. Preassignment ensures that a 500 persona UK panel contains the right proportion from the North East, the right proportion of over 65s, the right proportion without university degrees. The population structure is correct before anyone writes a biography.

Five diversity rules enforce this at the dataset level: no single region exceeds its census share by more than a threshold, no gender ratio drifts beyond bounds, no age band is overrepresented, no ethnic group exceeds its census proportion, and the DYNAMICS-8 octant distribution (2^8 = 256 cells, from binarising each of the 8 dimensions) is spread widely enough to avoid clustering.
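
The cell arithmetic is easy to sketch. Binarising 8 dimensions gives 8 bits, and 8 bits index 256 cells. The code below assumes scores on a 0 to 1 scale with a midpoint of 0.5, which is our illustration and not a documented DYNAMICS-8 convention; the function names are hypothetical.

```python
from collections import Counter


def cell_index(scores: list[float], midpoint: float = 0.5) -> int:
    """Binarise 8 dimension scores and pack the bits into an index 0..255."""
    if len(scores) != 8:
        raise ValueError("expected 8 dimension scores")
    index = 0
    for score in scores:
        index = (index << 1) | (1 if score >= midpoint else 0)
    return index


def cell_spread(panel: list[list[float]]) -> tuple[int, float]:
    """Return (number of occupied cells, share held by the fullest cell)."""
    cells = Counter(cell_index(scores) for scores in panel)
    return len(cells), max(cells.values()) / len(panel)
```

A spread rule could then require a minimum number of occupied cells, a cap on the fullest cell's share, or both.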

The Three Pass Generation Pipeline

Once the skeleton exists, three passes of language model generation build the persona into a full synthetic human.

Pass 1: Biography. The model receives the demographic skeleton (age, gender, region, occupation, income band, education, ethnicity, religion) and generates a life narrative. Where were they born? What shaped their childhood? How did their career develop? What are the formative experiences that explain who they are today? This biography is not decorative flavour text. It is the grounding document that every subsequent generation pass references.

Pass 2: Structured fields. The model receives the skeleton plus the biography and generates 187 structured fields: DYNAMICS-8 personality scores (8 dimensions, 32 facets), economic state, media diet, political priors, emotional baseline, brand affiliations, and lifestyle markers. Each field is constrained by the biography and the demographics. A 22 year old retail worker in Middlesbrough does not get the media diet of a 45 year old solicitor in Edinburgh.

Pass 3: Questionnaire responses. The model receives the full persona and generates responses to ISSP (International Social Survey Programme) questionnaire items. These serve as consistency checks: if a persona's questionnaire responses contradict their stated values or personality scores, the validation engine flags it. They also provide attitudinal data that researchers can analyse directly.

Each pass builds on the previous one. The biography cannot contradict the demographics. The structured fields cannot contradict the biography. The questionnaire responses cannot contradict the structured fields. This layered constraint is what produces internal consistency.
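
The layering can be sketched as three calls in which each prompt carries everything produced so far. This is a simplified illustration under our own assumptions: the model is any callable that takes a prompt and returns text, the prompt wording is invented, and generate_persona is a hypothetical name.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def generate_persona(skeleton: dict[str, str], model: Model) -> dict[str, object]:
    facts = ", ".join(f"{key}={value}" for key, value in skeleton.items())
    # Pass 1: the biography sees only the demographic skeleton.
    biography = model(f"Skeleton: {facts}\nWrite a life narrative.")
    # Pass 2: structured fields see the skeleton and the biography.
    fields = model(f"Skeleton: {facts}\nBiography: {biography}\nFill the structured fields.")
    # Pass 3: questionnaire answers see everything generated so far.
    answers = model(
        f"Skeleton: {facts}\nBiography: {biography}\nFields: {fields}\nAnswer the ISSP items."
    )
    return {"skeleton": skeleton, "biography": biography, "fields": fields, "answers": answers}


def stub_model(prompt: str) -> str:
    # Stand-in for a real language model so the sketch runs offline.
    return f"[generated from {len(prompt)} characters of context]"


persona = generate_persona({"age": "24", "region": "Burnley"}, stub_model)
```

Passing earlier outputs forward is what gives each later pass something to stay consistent with.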

The 18 Rule Validation Engine

Every generated persona passes through 18 validation rules before it enters the dataset. Thirteen rules check internal consistency. Five check cross dataset diversity.

The internal rules cover what you would check by hand: does income fall within the plausible range for the occupation? Is the education level achievable at the persona's age? Does the media diet match the demographic and political profile? Are DYNAMICS-8 scores consistent with described behaviour? Do questionnaire responses align with stated personality?
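
Two of these checks are simple enough to sketch directly. The lookup tables, field names, and rule labels below are hypothetical, and the salary and age figures are placeholders, not the values the validation engine uses.

```python
# Placeholder reference tables for illustration only.
SALARY_RANGE = {
    "retail worker": (16_000, 28_000),
    "solicitor": (35_000, 150_000),
}
MIN_AGE_FOR_EDUCATION = {"secondary": 16, "bachelor": 21, "doctorate": 26}


def check_persona(persona: dict) -> list[str]:
    """Return the names of the rules this persona fails (empty means valid)."""
    failures = []
    low, high = SALARY_RANGE[persona["occupation"]]
    if not low <= persona["income"] <= high:
        failures.append("income_out_of_range")
    if persona["age"] < MIN_AGE_FOR_EDUCATION[persona["education"]]:
        failures.append("education_too_early")
    return failures


plausible = {"occupation": "retail worker", "income": 22_000, "age": 24, "education": "secondary"}
implausible = {"occupation": "retail worker", "income": 95_000, "age": 19, "education": "doctorate"}
```

Returning rule names instead of a bare pass or fail makes it possible to see which rules cause the most rejections.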

The cross dataset rules check population level properties: age distribution within tolerance of the census target, gender ratio, DYNAMICS-8 octant spread, regional distribution, and ethnic composition.
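
A tolerance check of this kind compares observed shares in the panel against target shares. The sketch below treats the tolerance as an absolute difference in share, which is our assumption; the function name and the default of 0.03 are hypothetical.

```python
from collections import Counter


def share_violations(
    values: list[str], target: dict[str, float], tolerance: float = 0.03
) -> dict[str, float]:
    """Return each label whose observed share misses its target by more than tolerance."""
    if not values:
        raise ValueError("panel is empty")
    counts = Counter(values)
    violations = {}
    for label, expected in target.items():
        observed = counts.get(label, 0) / len(values)
        if abs(observed - expected) > tolerance:
            # Positive means overrepresented, negative means underrepresented.
            violations[label] = round(observed - expected, 4)
    return violations


genders = ["female"] * 60 + ["male"] * 40
drift = share_violations(genders, {"female": 0.5, "male": 0.5}, tolerance=0.05)
```

The same function works for age bands, regions, or ethnic groups, since each is a list of labels checked against a table of target shares.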

Personas that fail are rejected and regenerated. The pipeline runs until the target count of valid personas is reached. In practice, the first attempt pass rate is around 85 to 90 percent, with most failures caused by salary/occupation mismatches or implausible education timelines.
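
The reject and regenerate loop can be sketched as follows. The generate and validate callables are hypothetical stand-ins for the real pipeline stages, and the max_attempts cap is a safety limit we added for the example, not something the pipeline is documented to have.

```python
from typing import Callable


def build_panel(
    target: int,
    generate: Callable[[int], dict],
    validate: Callable[[dict], list[str]],
    max_attempts: int = 10_000,
) -> tuple[list[dict], int]:
    """Generate personas until `target` of them pass validation."""
    accepted: list[dict] = []
    attempts = 0
    while len(accepted) < target:
        if attempts >= max_attempts:
            raise RuntimeError("too many rejections, check the rules or the generator")
        attempts += 1
        persona = generate(attempts)
        if not validate(persona):  # an empty failure list means the persona is valid
            accepted.append(persona)
    return accepted, attempts


# Toy stand-ins: every fifth persona fails validation.
panel, attempts = build_panel(
    target=8,
    generate=lambda i: {"id": i},
    validate=lambda p: ["bad"] if p["id"] % 5 == 0 else [],
)
```

Dividing the target by the attempt count gives the pass rate for a run.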

The DYNAMICS-8 Overlay

The personality scores assigned in Pass 2 are not drawn from a uniform distribution. They are constrained by known demographic correlations from the personality psychology literature. Openness to experience (mapped to Novelty in DYNAMICS-8) correlates positively with education level. Conscientiousness (mapped to Discipline) increases with age. Agreeableness (mapped to Yielding) shows gender differences across most populations. Neuroticism (mapped to Mercuriality) correlates with economic stress.

These are soft constraints, not hard rules. A highly educated persona is more likely to have a higher Novelty score, but not guaranteed. The constraints prevent absurdities without eliminating natural variation. The result is a population where personality distributions look like real populations, not uniform random noise.
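
A soft constraint of this kind can be modelled as a shift in the mean of a distribution, leaving the spread intact. The sketch below is one possible reading, not the actual scoring method: the 0 to 1 scale, the Gaussian draw, and every number in it are assumptions made for illustration.

```python
import random

# Placeholder mean shifts by education level, for illustration only.
EDUCATION_SHIFT = {"none": -0.05, "secondary": 0.0, "degree": 0.08}


def sample_novelty(education: str, rng: random.Random, spread: float = 0.15) -> float:
    """Draw a Novelty score whose mean depends on education, clamped to 0..1."""
    mean = 0.5 + EDUCATION_SHIFT[education]
    score = rng.gauss(mean, spread)
    return min(1.0, max(0.0, score))


rng = random.Random(7)
degree_scores = [sample_novelty("degree", rng) for _ in range(5000)]
no_qualification_scores = [sample_novelty("none", rng) for _ in range(5000)]
```

On average the degree group scores higher, but individual degree holders still land below the midpoint, which is the behaviour a soft constraint is meant to produce.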

The Open Dataset

We publish a 1,000 persona sample on Hugging Face: 500 UK personas and 500 US personas, all generated through the full pipeline with all 18 validation rules enforced. The dataset includes demographics, biographies, DYNAMICS-8 scores, structured fields, and questionnaire responses. It is free to download, free to use, and free to build on.

The full census weighted generation pipeline, covering all 20 countries, is available through Panel Studio. Describe your target audience in plain English, and the panel builder handles the census weighting, the three pass generation, and the validation automatically.

The demographic skeleton is the foundation. Get it wrong and no amount of prompt engineering will save you. Get it right and the personas behave like a population, not a collection of chatbot characters.

Try it yourself

Build a census weighted UK panel and run your own stimulus test.
