Hard to believe, but it’s been five years since the last census was taken. On January 21, 2020, the Census Bureau began counting the population in remote Alaska, and it kicked off self-response on March 12, when the first mailed notices went out. As we now know, a global pandemic derailed the timeline days later for both responses and data dissemination, and the release of the 2020 apportionment count was delayed until April 2021. We knew almost immediately that something had gone very wrong with the data, not because of the pandemic, but because of the implementation of the new census “disclosure avoidance” policies. Five years later, we still seem to stand alone as the only ones who care enough about our customers, and data users in general, to sound the alarm about what went wrong.

In July 2021, we published the first of what became a long series of blog articles discussing the issues with the 2020 Census. For those of you who missed it, here is the gist of the problem: the 2020 census data were adversely affected by the implementation of new, complex, and theoretically more stringent non-disclosure rules intended to protect the identity of census respondents. While such protection is important, it also presents challenges. The core concept is a “privacy budget” for any particular geographic entity, which in practice means that the more detailed the geographic area, the more the released data were distorted from the actual enumerated counts. Bluntly, the smaller the scale and population size of the geographic area, the greater the noise introduced into the data. At the census block level, where our models are built, the data were so distorted that we coined some unique ways to describe them: Mermaid blocks (entire communities of people living on water bodies), ghost blocks (with occupied dwellings but no people), and Lord of the Flies blocks (all children with no adult supervision).
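To make the scale effect concrete, here is a minimal sketch of why noise hurts small counts the most. It is a toy, not the Census Bureau's actual disclosure avoidance system: the Laplace noise, the `epsilon` value, the rounding and clamping step, and the three example counts are all assumptions chosen for illustration. It also applies the same noise scale to every count, so it shows only the arithmetic effect of small denominators; any extra noise assigned to finer geographies would widen the gap further.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed
    # with the given scale (illustrative noise model, an assumption).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_relative_error(true_count, epsilon, trials=2000, seed=0):
    """Average |published - true| / true for one count under the toy mechanism."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon  # a smaller privacy budget means larger noise
    total = 0.0
    for _ in range(trials):
        # Round to a whole number and forbid negative people (toy post-processing).
        published = max(0, round(true_count + laplace_noise(scale, rng)))
        total += abs(published - true_count) / true_count
    return total / trials


if __name__ == "__main__":
    # Hypothetical cell sizes at three levels of geography.
    for label, count in [("block cell", 5), ("tract cell", 50), ("county cell", 5000)]:
        err = mean_relative_error(count, epsilon=0.5)
        print(f"{label:12s} true={count:5d} mean relative error={err:.4f}")
```

In this sketch the absolute error is about the same for every count, so a cell of 5 people is off by a large fraction of its value while a cell of 5,000 is barely affected.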

But much to our surprise, none of the other data suppliers seemed to be talking about these issues; they kept releasing data as normal, all of which rely on the census data in some capacity. Well, so what? The census data are now five years out of date, and most users will rely on current-year estimates rather than aging census data. But much of this flawed census data is critical in analyzing trends in some key tables: age by sex, race and Hispanic origin, household size, and family structure, to name a few. If the published tables are used as the “baseline”, the results will increasingly diverge from reality because of the spatial patterns of the census-injected noise. The smaller any individual count (such as the female population age 25-29 for a census block), the greater the relative discrepancy between the actual and published values, since all attributes were pushed either to zero or to a number deemed large enough to avoid disclosure. For lack of a better way to put it, the released counts are “lumpy”, and using them as a basis for projections will necessarily result in polarization over time. So, it comes as no surprise that we invested considerable time and effort in the cleanup of the released data.
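A small worked example shows how a lumpy baseline polarizes over time. This is a sketch under stated assumptions, not our projection model: the four cell values, the 2% annual growth rate, and the simple compounding in `project` are all hypothetical. The published cells have the same total as the true cells, but each one has been pushed to either zero or a larger number.

```python
def project(baseline, annual_growth, years):
    """Compound a baseline count forward at a fixed annual growth rate."""
    return baseline * (1.0 + annual_growth) ** years


# Hypothetical counts for one attribute across four neighboring blocks.
true_cells = [3, 3, 3, 3]        # what was actually enumerated
published_cells = [0, 6, 0, 6]   # lumpy release with the same total of 12


def max_gap(years, growth=0.02):
    """Largest per-cell gap between projections from published and true baselines."""
    return max(
        abs(project(p, growth, years) - project(t, growth, years))
        for p, t in zip(published_cells, true_cells)
    )


if __name__ == "__main__":
    for years in (0, 5, 10):
        print(f"year +{years:2d}: largest per-cell gap = {max_gap(years):.2f}")
```

Cells published as zero stay at zero no matter how fast the area grows, while cells published as 6 grow twice as fast in absolute terms as they should, so the per-cell gap widens every year even though the total stays right.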

What has it taught us? We went back to basics, applying sophisticated spatial models and injecting other data sources (including the previous census) to rebalance the data and restore them to a usable state. We learned that we have the data team to get the data right, along with the passion for our work and the attention to detail that this project demands. Our clients also learned that we care deeply about them and about the quality of the reports and models they build from that data. Accurate and open data make for happy and successful users.

Now that we are at the midpoint of the decade, we are looking forward: we are enhancing our data with a broad range of new, detailed sources and our new parcel database, which will minimize our reliance on the Census Bureau, given that it seems to be doubling down on the current disclosure avoidance techniques. Our goal for the 2030 census? To use the results as just one component of our estimates and projections rather than as the critical element. Once bitten, twice shy, as they say.