Modeling Public Opinion with Large Language Models

From surveys to subpopulation representative models.

Apr 22, 2024

Much of sociology is about understanding the emergent properties of groups. While economics often takes an “out of many, one” (E Pluribus Unum) approach to society, where the “one” is an economic theory, sociology takes a kind of “out of one, many” (Ex Uno, Plures) approach, which says there are stable properties of groups that affect the behavior of their members. In other words, in sociology, the group (race, gender, class, nationality, etc.) has agency or does stuff.

But why would we want to understand the stereotypical behavior of a group? Well, if you ask a sociologist, they’ll likely say something like, “So we can understand how inequality or racism disproportionately affects different groups of Americans, and so we can build policies to reduce those inequities.” If you ask an advertiser, they’ll say it’s so that companies can better understand group differences and tailor advertisements appropriately. If you ask someone from the 2016 Trump campaign, they’ll say it’s so they can tailor political messaging to the audience. But this didn’t start with Trump.

Over the years, the methods for understanding the behavior of a group have evolved. Census surveys and public opinion polling started in the early 1900s and really hit their stride in the 1960s. In the 1960 presidential election, the Kennedy campaign admitted to having a “secret weapon” they called “The People Machine,” which was the first real large-scale attempt to combine demographic data, public opinion polling, and regression analysis into a model of 480 distinct voter types throughout the country. They used it to predict the reactions different groups of Americans would have to various policy proposals.

For instance, how would a rural Midwestern white Protestant react to Kennedy taking a stronger stance on Civil Rights? Although the “People Machine” was far too computationally underpowered to have any real predictive power, at the time Harold Lasswell called it “the A-bomb of the social sciences,” implying that the moment public officials can accurately model the emergent opinion of their publics, the state (in this case, democracy) will change forever.

For the past two months, I’ve dived fully into trying to sift through the bullshit and really understand the technical details and moral implications of our ongoing “AI Revolution.” What makes this so-called revolution confusing is that a lot of the discourse around it is being propagated by the for-profit companies that, as a result of how our capitalist system functions, rely on overpromising and misinformation to fan the flames of progress. Yes, Meta and Google are enshrouding Africa and South America in undersea cables. Yes, the colonialism of the next hundred years will be informational colonialism. Yes, there are parts of this whole thing that Ted Kaczynski was right about. Sorry, it’s true. But I want to zero in on one component that I think is pretty clear and revealing: Subpopulation Representative Models (SRMs).

We all know how surveys work, right? You ask a lot of people questions, tally up the answers, and then you can look at, like, is there a difference between how men and women answered the question about guns, abortion, whatever? Obviously, survey methods are complicated, but the logic behind them is not. But why do a survey in the first place?

Well, if we’re the US Census Bureau, we might want to get a sense of who is in the United States in 2024. But of course, not just who, but like, who: of what origin, political inclinations, gender, race, income—demographics. If we’re political pollsters, we might want to understand who (demographically) is planning on voting in Michigan and who they intend to vote for. Then you can release the poll to CNN, and they can run a story with a headline like, “Biden is behind with Latino voters in Detroit!” This might alert the Biden campaign that they need to do more work with Latinos in the area.
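The basic survey logic described above (ask people questions, tally the answers, compare groups) can be sketched in a few lines. This is only a toy illustration: the responses are invented, the `crosstab` and `share` helpers are hypothetical names, and a real survey would also need sampling weights and margins of error.

```python
from collections import Counter, defaultdict

# Invented toy responses (NOT real survey data): (group, answer to one question).
responses = [
    ("woman", "support"),
    ("woman", "support"),
    ("woman", "oppose"),
    ("man", "support"),
    ("man", "oppose"),
    ("man", "oppose"),
]


def crosstab(rows):
    """Tally answers within each group: {group: Counter({answer: count})}."""
    table = defaultdict(Counter)
    for group, answer in rows:
        table[group][answer] += 1
    return table


def share(table, group, answer):
    """Fraction of a group's respondents who gave a particular answer."""
    total = sum(table[group].values())
    return table[group][answer] / total if total else 0.0


table = crosstab(responses)

# The "difference between how men and women answered" is just a gap in shares.
gap = share(table, "woman", "support") - share(table, "man", "support")

for group in sorted(table):
    print(f"{group}: {share(table, group, 'support'):.2f} support")
print(f"gap (women minus men): {gap:.2f}")
```

Everything a pollster reports about a demographic group is, at bottom, a table like this one with many more rows and a lot more care about who ended up in the sample.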

But as we know from the 2016 election, surveys are getting worse. From 1930 to 1960, response rates were really high because people had to go door to door to collect data. From 1960 to 1990, data collection got easier because of phone polls and shit, but response rates and quality started to tail off. And then, from 1990–2020, with the rise of internet polls and declining phone use, the quality of (political) surveys tanked. But during this time, everyone sort of started to use the internet.

People got thinking... So, no one wants to participate in polls anymore, but they post shit all the time. What if, rather than asking people questions (polling), we instead took all of the text information from the internet, fed it into a huge machine learning algorithm that could learn the patterns of association between word usage, attitudes, and behaviors, and used it to predict voting patterns or public opinion?

Subpopulation Representative Models are machine learning tools that approximate to some useful degree certain characteristics of a human subpopulation (Simmons and Hare 2023). You should remember this term.

What if, instead of asking everyone in Iowa who they intend to vote for, we gathered everything they posted online in the three months leading up to election day and built a predictor for voting outcomes? What if, rather than paying for focus groups for a new product, we gathered consumer taste information about a ton of people and then predicted which subpopulation (say, white, Democratic, male) would be the most likely to actually buy the new product? What if we then asked the same machine to generate an advertisement for said product that would better appeal to Black women? What if, rather than making one advertisement for the new Coke Zero, we made 480 advertisements for 480 different possible demographic combinations to optimize messaging based on our predictive SRM?
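To make the “predict opinion from posts instead of asking” idea concrete, here is a deliberately tiny sketch. It is not how Simmons and Hare or anyone else actually builds an SRM: real systems use large language models, and here a small multinomial naive Bayes classifier stands in for them. All posts, labels ("A", "B"), and group tags are invented, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Invented training posts (NOT real data): (text, stated candidate preference).
train = [
    ("lower taxes and secure the border", "A"),
    ("cut taxes for small business", "A"),
    ("expand healthcare and protect unions", "B"),
    ("healthcare is a right fund schools", "B"),
]

# Invented unlabeled posts tagged with the poster's subpopulation.
posts = [
    ("rural", "taxes are too high"),
    ("rural", "secure the border now"),
    ("urban", "fund schools and healthcare"),
    ("urban", "cut taxes please"),
]


def tokenize(text):
    return text.lower().split()


def fit(examples):
    """Count words per label and posts per label for multinomial naive Bayes."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    n_posts = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_posts)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


def subgroup_shares(model, tagged_posts):
    """Aggregate per-post predictions into predicted shares per subpopulation."""
    tallies = defaultdict(Counter)
    for group, text in tagged_posts:
        tallies[group][predict(model, text)] += 1
    return {
        group: {label: count / sum(tally.values()) for label, count in tally.items()}
        for group, tally in tallies.items()
    }


model = fit(train)
shares = subgroup_shares(model, posts)
print(shares)
```

The point of the sketch is the shape of the pipeline, not its accuracy: learn associations between words and stated attitudes, apply them to text nobody was polled about, and roll the predictions up by subpopulation. No one answered a survey question to produce `shares`.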

The “People Machine” that the Kennedy campaign used in the 1960s didn’t have enough horsepower under the hood. You just can’t get too far if all you’ve got is some people crunching regressions on graph paper. But here we are, 60 years later. Silicon is flowing into the Valley again. On the one hand, we can imagine the way that these technologies can be used to enhance the democratic process. On the other hand, what about Putin? Every country has access to the same internet data. Every country can build its own models. In other words, Russia too can perform its own kind of US Census. And so, in a sense, we are staring down at another arms race.

When Lasswell called the “People Machine” the “A-bomb of the social sciences,” perhaps he did not mean the actual thing that the Kennedy campaign had at the time. Perhaps what he was referring to was the idea of a machine that can accurately predict the opinions, beliefs, and actions of populations of humans or of individuals themselves. Perhaps what he was referring to was the day when opinion mining could lead to accurate behavior predictions, could forecast social trends based on a complex set of latent conditions, and could autonomously perform sociological analysis without the need for tiresome data collection.

I want to suggest that this is where we’re at. I also want to suggest that the really important thing about the A-bomb that transcended the violence at Hiroshima and Nagasaki is that it irreversibly changed geopolitics. I want to push you to think beyond the dumbass imaginaries of, like, AI robots killing all the people or whatever adolescent thing people often gravitate towards. I want to think about what Lasswell had in mind even back in the 1960s, which is how this kind of opinion modeling technology will change the political process within countries (who gets what, when, and how?) and between countries. I don’t have the answers yet, but this is the level we need to think about.

How does this change the relationship between officials and the public within various types of states? In a democracy? In an autocracy? In an oligarchy? Between democracies and theocracies?

Within Europe, how will AI differentially impact the “Nordic model” democracies of Scandinavia, which combine free-market capitalism with a comprehensive welfare state and collective bargaining at the national level, versus the “social market democracy” of Germany, which “combines free-market capitalism with social policies that establish both competitive economic environments and social welfare measures”?

This is what it means for AI to be “the A-bomb of the social sciences.”

All This Life Here
