After several revised publication dates over the past two years, the Census Bureau is threatening to finally release the detailed counts from the 2020 census starting in May with the DHC (Demographic and Housing Characteristics) file. We won’t pretend to be particularly excited about this.
Well before the 2020 census, we at AGS were paying careful attention to the potential impacts of the then ongoing disclosure avoidance discussions at the Census Bureau. We were increasingly concerned about two key issues – the impact of COVID-19 on the response quality and consistency, and the extent to which the data would be affected by the complex approach that was being discussed.
Within hours of the release of the data, our worst fears were realized. A quick look at the data revealed some disturbing impossibilities even at the block group level – which we quickly labeled with rather whimsical names that made it even to the New York Times (https://www.nytimes.com/2022/04/21/us/census-data-privacy-concerns.html) –
- Ghost blocks, where there were occupied households but no people
- Mermaid blocks, where there were people and no dwelling units that happened to be water blocks
- Baseball team blocks, where the average number of people per household was enough to field a team
- Lord of the Flies blocks, where there were apparently completely unsupervised children running free
Our publication of a series of articles on the subject didn’t make us overly popular amongst our peers, who we fear would have preferred a wall of silence. Eventually, some made statements which ranged from the absurd to the condescending. Others said nothing at all and charged ahead without possibly even realizing the extent of the problem.
Interestingly, while we were surveying the damage to the data and attempting to correct it, others were busy evaluating whether the attempt at disclosure avoidance even worked (Paul Francis, “A Note on the Misinterpretation of the US Census Re-Identification Attack”). It didn’t, apparently.
Over time, the tune has changed somewhat, and we note that our colleagues at ESRI have released what is likely a useful tool to evaluate the noise in the data so that users can evaluate the potential impact on their particular use cases (https://www.esri.com/arcgis-blog/products/esri-demographics/analytics/how-much-noise-is-in-census-2020-data-esri-can-show-you/). Claritas, on the other hand, and despite their special relationship with the census, only recently incorporated the PL-94 release into their data but still attempt to minimize the problem by referring to ‘logical inconsistencies’. To our knowledge, Synergos has recently stated only that the implementation of differential privacy methods has resulted in release delays without pointing out that the usability of the data has been significantly impaired. Indeed, in a recent post they state that ‘we ultimately trust the data reported in the Census as the ultimate truth to rebuild our models’, which given our findings should not provide much comfort to users.
Why do we point all this out? We could have easily kept our mouths shut and made oblique references in our methodology statements to potential source data problems. But we think trust and transparency matter and that data vendors have an obligation to inform and educate users on both the strengths and the weaknesses of the data they use to make business decisions. After all, if you are going to be making locational decisions where the cost of error is high, you should know how much uncertainty is in the data itself.
Rather than hide from the problem, we identified it, told users about it, and set out to disassemble and reassemble the census data to make it usable. It is not enough to point out that a relatively small number of block groups have logical inconsistencies, as those are merely the tip of the iceberg.
Oh, and why aren’t we excited about the upcoming release? The PL-94 release had a minimal number of basic elements, but even the Census Bureau eventually said that you shouldn’t divide the total population less group quarters population by the household count to estimate average household size. The coming release should include two key tables – population by age and sex and population by race and Hispanic origin – both of which are likely to have been modified even more than just the population and household counts. The Census Bureau has already stated that the geographic scale of the release will vary by table, and we expect that little data will be released at the block group level for detailed tables.
Don’t expect a good OOB (out-of-the-box) experience unless the Census Bureau has suddenly abandoned its current disclosure avoidance program. When the data is released, we will give you our honest assessment of it and tell you what we plan to do to make it useable.
You have a right to know, and we have an obligation to tell you.
Recent Comments