Uncovering the Biases in State-Level Polling Data

In addition to clear polling bias against GOP presidential candidate Donald Trump at the national level, we see the same pro-liberal bias within the state data.  It's not just in places like Utah, but all across the nation.

The problem is so deep and widespread that only fools would attempt to average the datasets to assess the actual mood of the electorate.  All of this data needs to be corrected prior to averaging, and if that is done, it becomes abundantly clear that Trump is likely leading Hillary Clinton by at least several points at the national level, and statewide data is shaping up consistent with a possible Trump landslide.

One polling firm consistently reported on the RealClearPolitics poll tracker is Public Policy Polling (PPP), whose data could provide some useful insights into the American experience, but which appears to have an anti-Trump bias this cycle due to demographic deficiencies.

A few examples are in order at the state level, but we'll start with one of PPP's national polls.  On May 10, PPP released a national poll claiming that "Hillary Clinton leads Trump 42-38, with Libertarian Gary Johnson at 4% and Green Party candidate Jill Stein at 2%."  Into the demographic data we go.  When asked the question, "[i]n the last presidential election, did you vote for Barack Obama or Mitt Romney?" 49% of respondents said Obama, and just 40% said Romney.

This 9-point spread for Obama voters over Romney voters is far greater than the actual national spread in the 2012 election, which was 3.9 points, leaving an excess of about 5 points.  Since polling results from a wide range of sources show that voters are overwhelmingly likely to repeat their 2012 party choice for president during the 2016 election, this translates into a baked-in bias against Trump of at least 4-5 points on the poll in question.  Thus, rather than a 4-point Clinton lead, we likely have a slight Trump lead.
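The arithmetic behind that adjustment can be sketched in a few lines of Python.  This is only an illustration of the back-of-the-envelope reasoning above, not a formal reweighting: the function names are made up for this example, and the one-for-one subtraction assumes that every point of excess Obama 2012 representation costs Trump a full point of margin.

```python
def recalled_vote_skew(sample_dem, sample_rep, actual_dem_margin):
    """Points by which the sample's recalled Obama-minus-Romney spread
    exceeds the actual 2012 Obama-minus-Romney margin."""
    return (sample_dem - sample_rep) - actual_dem_margin


def adjusted_margin(topline_dem, topline_rep, skew):
    """Clinton-minus-Trump margin after subtracting the skew one-for-one.
    The one-for-one translation is an assumption, not a standard method."""
    return (topline_dem - topline_rep) - skew


# Figures quoted above for PPP's May 10 national poll.
skew = recalled_vote_skew(sample_dem=49, sample_rep=40, actual_dem_margin=3.9)
margin = adjusted_margin(topline_dem=42, topline_rep=38, skew=skew)

print(f"excess Obama 2012 skew: {skew:.1f} points")
print(f"adjusted Clinton-minus-Trump margin: {margin:+.1f} points")
```

A negative adjusted margin here means a Trump lead.  With these inputs the skew is about 5 points and the adjusted margin is roughly one point in Trump's favor.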

The 2012 Obama-Romney vote spread bias is diagnostic for potential problems in many polls, notably PPP's state-level data.

On May 17, PPP reported that "[t]he Presidential election is pretty competitive in Arizona at this point. Donald Trump leads Hillary Clinton just 40-38, with Gary Johnson at 6% and Jill Stein at 2%."  Except that 46% of respondents said they voted for Romney in 2012, and 42% said they voted for Obama, yielding only a 4% advantage to Romney 2012 voters.  But in 2012, Romney beat Obama by 9% in Arizona.  Yet again, a likely liberal bias is in the composition, suggesting that Trump's support is significantly underestimated and that perhaps the race is not that competitive after all.

On May 25, the pollster claimed that "PPP's new North Carolina poll finds Donald Trump leading Hillary Clinton 43-41 in the state, with Gary Johnston [sic] at 3% and Jill Stein at 2%."  Respondents were tied at 46% each when it came to whether they voted for Obama or Romney in 2012.  The actual margin in 2012 was a 2% victory for Romney in the state.  To correct this bias, likely add another 2% to Trump's polling numbers.

On June 1, another PPP poll reported Trump with a healthy lead in Georgia: "Trump leads Clinton 45/38, with Libertarian Gary Johnson at 6% and Green Party candidate Jill Stein at 2%."  In fact, Trump could be crushing Clinton by nearly double digits.  The 2012 Romney-Obama spread among survey respondents in the state is 6%, compared to 8% in the actual election.  Chalk up another couple of percentage points to Trump's lead.

On June 7, PPP reported that "[t]he Presidential race in Florida looks like a toss up. Donald Trump's at 41% to 40% for Hillary Clinton, with Gary Johnson at 4% and Jill Stein at 2%."  Or maybe it isn't a toss-up after all?  Among those surveyed, 4% more said they voted for Obama in 2012 (48%) than Romney (44%).  But Obama beat Romney by 0.88% in 2012, indicating a possible 3% liberal bias in the polling data.  Perhaps that should be revised to a 4% lead for Trump over Clinton in Florida, rather than just 1%?

Next, we have Pennsylvania.  A poll released June 8 states that "PPP's new Pennsylvania poll finds a close race between Hillary Clinton and Donald Trump in the state ... Clinton has 41% to 40% for Trump, with Gary Johnson at 6% and Jill Stein at 3%."  Readers know where this is headed.  When asked whom they voted for in 2012, 49% said Obama, and 41% indicated Romney, an 8% spread.  The actual margin of victory was 5.4%.  Correct for this nearly 3% bias, and it looks as though Trump has a small lead in the Keystone State.

One notes that in each case, the bias is against conservatives (i.e., 2012 Romney voters) and for liberals (aka 2012 Obama voters).  If the errors were random, we'd expect them distributed approximately equally for both sides – meaning some of the polls would have shown Romney 2012 supporters over-represented and others would show these voters under-represented.  Instead, all of them have Romney 2012 under-representation – or, in other words, conservative under-representation.
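To make the one-sidedness concrete, here is a short Python sketch that tabulates the five PPP state polls discussed above.  The spreads are the figures quoted in the text, expressed as Obama minus Romney, so a negative number is a Romney advantage.  The final probability is only a rough illustration and assumes each poll had an independent 50/50 chance of skewing either way, which real polls from a single firm may not satisfy.

```python
# state: (recalled 2012 spread in the sample, actual 2012 spread),
# both as Obama minus Romney in percentage points, as quoted in the text.
polls = {
    "Arizona": (-4.0, -9.0),
    "North Carolina": (0.0, -2.0),
    "Georgia": (-6.0, -8.0),
    "Florida": (4.0, 0.88),
    "Pennsylvania": (8.0, 5.4),
}

# Positive skew means Obama 2012 voters are over-represented in the sample.
skews = {state: sample - actual for state, (sample, actual) in polls.items()}

for state, skew in skews.items():
    print(f"{state:<15} {skew:+.1f} points toward Obama 2012 voters")

same_direction = all(skew > 0 for skew in skews.values())

# Chance that all five lean toward Obama voters if each were a fair coin flip
# (a simplifying assumption, not an estimate of how polls actually behave).
chance_if_random = 0.5 ** len(skews)

print(f"all skewed the same way: {same_direction}")
print(f"probability under the coin-flip assumption: {chance_if_random:.3f}")
```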

As a final example, the CBS News/YouGov poll in New Jersey conducted May 31-June 3 had Clinton with a massive 15% lead over Trump, 49% to 34%.  Just 35% of respondents leaned Republican, versus 54% who leaned Democratic.  According to Gallup, the actual Democratic-leaning advantage over Republicans in the state is only 10%, not almost 20%, as this CBS News poll assumes.  Of those surveyed, 51% were registered Democrats, 33% were registered Republicans, and 12% were independents.  Based on the state's voter registration statistics:

"About 48 percent of registered voters in New Jersey are not affiliated with any party. Out of 5.5 million voters, a little over 2.6 million identify as independent. Democrats follow, making up roughly 32.6 percent of registered voters, while Republicans claim 19.6 percent."

That would be a 13% spread for registered Democrats over Republicans in the state, not 18%, as the poll has it.  And independents – who the poll itself shows favor Trump over Clinton by 36% to 28% – are massively under-represented.  They appear to constitute almost half of the potential electorate but just 12% of the survey.  Correcting for the under-representation of independents and the over-representation of liberals/Democrats suggests that New Jersey may be tilting toward Trump, which is remarkable, since the state hasn't voted for the Republican candidate in a presidential election since George H.W. Bush in 1988.
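For readers who want to see the size of the mismatch, the following Python sketch computes the weight each party group would need in order to match the registration figures quoted above.  It is an illustration only: it treats registration shares as the target, which assumes the electorate looks like the registration rolls, and it stops short of a corrected topline because the text gives candidate support only for independents.

```python
# Shares of the CBS News/YouGov sample and of registered voters, in percent,
# as quoted in the text (neither set sums to exactly 100).
sample = {"Democrat": 51.0, "Republican": 33.0, "Independent": 12.0}
registered = {"Democrat": 32.6, "Republican": 19.6, "Independent": 48.0}

# Weight needed per respondent in each group: target share / sample share.
weights = {party: registered[party] / sample[party] for party in sample}

sample_spread = sample["Democrat"] - sample["Republican"]
registered_spread = registered["Democrat"] - registered["Republican"]

for party, weight in weights.items():
    print(f"{party:<12} weight {weight:.2f}")
print(f"Democrat-minus-Republican spread in sample: {sample_spread:.1f}")
print(f"Democrat-minus-Republican spread in registration: {registered_spread:.1f}")
```

A weight above 1 means the group is under-represented in the sample.  On these figures, each independent respondent would have to count four times over, while Democrats and Republicans would both be weighted down.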

Imagine that.  Trump possibly leading in New Jersey, which would be a nice set of electoral college votes to add to the "R" column.

We can pile more evidence of bias onto this New Jersey polling data.  The poll had nearly six times as many respondents between the ages of 18 and 64 as those aged 65+.  But the actual population ratio between these age groups is just 4:1, meaning that the 65+ cohort – which the poll itself shows is a dead heat between Clinton and Trump – is substantially under-represented.  Add on the fact that apparently 51% of respondents have completed a bachelor's degree or higher, but U.S. Census data shows that just 36% of state residents 25 years and older have this level of educational qualification, and the poll becomes even less representative of the New Jersey public, and not in Trump's favor.

Then, on Tuesday, Bloomberg released a national poll suggesting that Clinton leads Trump by 12% in the three-way with Gary Johnson, which is way outside the bounds of any other national polls.  Bloomberg national polls are no stranger to controversy on this site: as demonstrated previously, this media outlet released a poll purporting to show an 18% Clinton lead over Trump – until we looked at the demographic data and found that Obama 2012 voters outnumbered Romney 2012 voters among survey constituents by a terrifyingly large 50% margin, rather than effective parity as per the actual 2012 election results.

Translation: Here is an apparent massive liberal bias that could, almost on its own, explain nearly all of Clinton's supposed lead – never mind any other liberal biases hiding in the polling demographics (the details of which were not released).

In this latest Bloomberg poll, conducted by the same firm as the previous survey, no demographic data was provided, so I sent the technical contact given on the poll an e-mail requesting detailed demographic data.  Here is what was sent back to me:

"We appreciate your interest in our recent poll. The data are the property of Bloomberg Politics. They treat unpublished poll data the same way they would reporters' notes. What is unpublished today may be published tomorrow, next week, next month, or next year. They reserve the right to be the first to do so. Their policy is not to release specific findings, such as the ones you seek, if they have not been published in the newspaper."

Huh?

Reputable polls provide demographic data so readers can look for potential bias.  Any respectable media outlet would make such information available immediately – and many, if not most, do.  The Bloomberg media contact for the poll didn't respond to my request for further information.

Is this poll real?  Does Clinton really have a 12% lead over Trump at the national scale?  If the poll's previous bias still holds, the answer is almost certainly a resounding "no."  There are some odd numbers in the poll as well, such as the number of Clinton and Trump supporters given in regard to a question about voter enthusiasm.  The numbers are 332 and 333, respectively, which is unusual, since if Clinton has a massive lead over Trump, she should have more "supporters" in the polling composition than Trump, not fewer.  Perhaps there is an explanation for this discrepancy.  I asked the poll's technical contact, and received the following reply (emphasis added):

What you see are unweighted ns (sample sizes) used to calculate maximum margin of error. As you may have noted in the methodology document, the data are weighted (or adjusted, or rebalanced) by age and race to recent census data. Weighting is a common practice in polling, as even with random sampling certain groups are more likely than others to answer or to stay on the phone to answer the full poll. In order to be truly representative of the general population of the U.S., we need to weight to account for this. The demographic groups that support Trump ended up being weighted down and the groups supporting Hillary Clinton ended up being weighted up to conform to the general population.  We weight at that level to known population parameters, then pull likely voters out for the horserace question.  So, it is not the case we are knowingly adjusting Clinton or Trump voters.  It just works out that way given the way the demographics fall.

Readers can draw their own conclusions, but my concerns can be summed up with the following characters: !!!!????

These are but the tip of the iceberg of potential polling problems this year, and it seems as though all of them are tilted against Trump in various ways.  The devil is in the details, and polls that claim to weight their data only by age and race, while not providing detailed income, education, and historical/current political leaning information, are not discharging their duty to the public.  It is trivial to have a poll with no age and racial bias that still exhibits very large bias due to these other demographic descriptors.  Only by being shown how the polling sausage was made can readers come to their own conclusions about the validity of the poll.  Overall, any unusual poll that doesn't provide detailed compositional data should be discarded from the pile.
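A toy example shows how a poll can be weighted to the right age mix and still be skewed on another descriptor.  The respondent counts below are invented purely for illustration and come from no actual poll.  The 80/20 age split mirrors the 4:1 ratio cited earlier for New Jersey, and the 36% benchmark is the Census education figure quoted above.

```python
from collections import defaultdict

# HYPOTHETICAL respondent counts by (age group, has bachelor's degree).
sample_counts = {
    ("18-64", True): 450,
    ("18-64", False): 400,
    ("65+", True): 60,
    ("65+", False): 90,
}
# Target age shares (the 4:1 ratio cited in the text).
population_age_share = {"18-64": 0.80, "65+": 0.20}

total = sum(sample_counts.values())
age_totals = defaultdict(int)
for (age, _), n in sample_counts.items():
    age_totals[age] += n

# One weight per age group: population share divided by sample share.
age_weight = {
    age: population_age_share[age] / (age_totals[age] / total)
    for age in age_totals
}

# Apply the age weight to every cell; education is left unadjusted.
weighted = {cell: n * age_weight[cell[0]] for cell, n in sample_counts.items()}
weighted_total = sum(weighted.values())

raw_college = sum(n for (_, college), n in sample_counts.items() if college) / total
weighted_college = sum(w for (_, college), w in weighted.items() if college) / weighted_total
weighted_senior = sum(w for (age, _), w in weighted.items() if age == "65+") / weighted_total

print(f"65+ share after age weighting: {weighted_senior:.1%}")
print(f"college share before weighting: {raw_college:.1%}")
print(f"college share after age weighting: {weighted_college:.1%}")
```

In this made-up sample, the age mix lands exactly on target after weighting, yet the share of college graduates barely moves and stays far above the 36% benchmark.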

Looks as if this year's polling data is shaping up the same way as a lot of climate data does – namely, suspect at best.
