Over the past year or so, we have been warning that the 2020 census data would be affected by new and more stringent non-disclosure rules intended to protect the identity of census respondents. We recognize that such protection is important, even as we mutter concerns about exactly what the effects will be.
The central concept is that of a “privacy budget” for any particular geographic entity, which in practice means that the more detailed the geographic area, the more the actual data must be adjusted. Bluntly, the lower the population count, the more error has been introduced in the released data.
For the first release, the PL-94 redistricting file, at the block through the county levels, the only data we are 100% sure about is the count of housing units and the group quarters population. The released counts are also guaranteed to match the actual enumerated counts at the state level. The counts for most counties will be correct, except for the least populated counties such as Kalawao in Hawaii and Loving and King counties in Texas.
Further releases, which include detailed data such as population by sex and age, will have considerably more error introduced into them to meet the privacy budget rules. We expect that there will be some variables, such as those in the detailed ancestry tables, that are not correct even at the state level. Further, it is not yet clear how complex tables like age by sex will be handled, and whether the parent tables will be matched.
In our early work on the PL-94 release, we have been looking at the relationships between variables and have identified many cases where the data has been manipulated to the point where the relationships between variables are not possible on the ground –
- 2.72% of the census blocks have more households than household population
- 4.83% have people living in households but zero occupied housing units (that is, there are no occupied dwellings, but there are people living there)
- 1.12% have no population in households, yet have occupied housing units
- 1.25% of the blocks have no adult supervision (that is, the adult population is zero, but the total population is non-zero)
- 1.48% of the blocks have over 10 people per household, which while possible, is extremely rare
At the block group level, which is more customary for use, these numbers drop substantially. It should be noted that these impossibilities were found in areas with as many as one hundred residents.
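The checks listed above can be written as simple rules on block-level counts. The following is a minimal sketch, not our production code: the `Block` fields and flag names are hypothetical and are not actual PL-94 column names, and it assumes that a household is the same thing as an occupied housing unit.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical field names, for illustration only.
    total_pop: int       # total population of the block
    adult_pop: int       # population aged 18 and over
    household_pop: int   # population living in households
    occupied_units: int  # occupied housing units (households)


def flag_block(b: Block, max_pph: float = 10.0) -> list[str]:
    """Return the list of impossible or near-impossible conditions for a block."""
    flags = []
    # More households than people living in households.
    if b.occupied_units > b.household_pop:
        flags.append("more_households_than_household_pop")
    # People in households, but no occupied dwellings.
    if b.occupied_units == 0 and b.household_pop > 0:
        flags.append("household_pop_without_occupied_units")
    # Occupied dwellings, but nobody in households (a special case of the first rule).
    if b.household_pop == 0 and b.occupied_units > 0:
        flags.append("occupied_units_without_household_pop")
    # Population present, but no adults.
    if b.adult_pop == 0 and b.total_pop > 0:
        flags.append("no_adults")
    # Possible but extremely rare: very large average household size.
    if b.occupied_units > 0 and b.household_pop / b.occupied_units > max_pph:
        flags.append("over_max_persons_per_household")
    return flags


if __name__ == "__main__":
    # A block with 12 residents, no adults, and no occupied dwellings.
    print(flag_block(Block(total_pop=12, adult_pop=0, household_pop=12, occupied_units=0)))
```

Running the same rules over block groups rather than blocks is a quick way to see how far the flagged share drops with aggregation.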
Many users will never notice these impossible results as data is viewed for aggregated areas such as drive time polygons and ZIP codes. But what lurks underneath should be disturbing to users – the number of near-impossibilities will be much greater, and even a statistic as simple as average household size will be affected to an unknown extent for small geographic areas.
Most of the modeling and estimation AGS does to build its estimates and projections is undertaken at the census block level. This presents a significant challenge as we move forward. Many of the models rely on using ratios – such as average family size or the percent of population under age 18 – and minor discrepancies due to error injection can be magnified in models.
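To see why ratios are so sensitive, consider a small arithmetic sketch. The counts and error sizes below are invented for illustration and are not taken from the census release; the function name `relative_shift` is likewise hypothetical.

```python
def relative_shift(true_pop: float, true_hh: float, pop_err: float, hh_err: float) -> float:
    """Relative change in average household size when both counts are perturbed."""
    noisy_hh = true_hh + hh_err
    if true_hh <= 0 or noisy_hh <= 0:
        raise ValueError("household counts must stay positive to form a ratio")
    true_ratio = true_pop / true_hh
    noisy_ratio = (true_pop + pop_err) / noisy_hh
    return (noisy_ratio - true_ratio) / true_ratio


if __name__ == "__main__":
    # Same absolute errors (+2 people, -1 household) applied to a small
    # block and to a large area with the same true household size of 2.5.
    small = relative_shift(true_pop=10, true_hh=4, pop_err=2, hh_err=-1)
    large = relative_shift(true_pop=10_000, true_hh=4_000, pop_err=2, hh_err=-1)
    print(f"small block: {small:.1%}, large area: {large:.3%}")
```

In the small block, an error of two people and one household moves average household size from 2.5 to 4.0, while the same errors are invisible in the large area. Any model that multiplies or chains such ratios carries the distortion forward.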
We are currently engaged in the arduous process of cleaning the dataset to at least remove the impossibilities and smooth out some of the near impossibilities. Our approach includes table balancing at multiple levels of geography using entropy maximizing techniques. Please note that this does not remove the error; it simply results in data that is more usable, with usability defined crudely as “things add up”.
This is not without its issues, in that we will inevitably be asked “why doesn’t the count for males aged 35 to 39 for this census tract match the census web site count?”. We firmly believe that it is better to release more usable data than to match the census tables. Will there be less error in the results? Probably not. But tables will add up as expected and tables with the same universe total will match.
Techniques matter though, as balancing a series of related tables over multiple geographic levels is a daunting process. Our extensive experience over the years with multiple census integrations, and with error induction methods in several countries, puts us in a very good position to meet this challenge both effectively and quickly.
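As an illustration of what table balancing means, here is a minimal sketch of iterative proportional fitting (IPF), a classic entropy maximizing technique for a single two-way table. It is a textbook example under stated assumptions, not a description of our actual procedure, which works across several tables and levels of geography. The function name `ipf`, the seed table, and the target totals are all invented for the example.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a non-negative seed table until its margins match the targets.

    Among all tables with the given margins, the result stays as close as
    possible (in the relative entropy sense) to the seed table.
    """
    if abs(sum(row_targets) - sum(col_targets)) > 1e-9:
        raise ValueError("row and column targets must share the same grand total")
    table = [[float(v) for v in row] for row in seed]
    for _ in range(max_iter):
        # Step 1: scale each row to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            if row_sum > 0:
                table[i] = [v * target / row_sum for v in table[i]]
        # Step 2: scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                for row in table:
                    row[j] *= target / col_sum
        # Columns now match exactly; stop once the rows match as well.
        gap = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if gap < tol:
            break
    return table


if __name__ == "__main__":
    # Hypothetical age (rows) by sex (columns) counts that do not add up
    # to the totals published in the parent tables.
    seed = [[3, 5], [4, 2]]
    balanced = ipf(seed, row_targets=[9, 5], col_targets=[6, 8])
    for row in balanced:
        print([round(v, 3) for v in row])
```

The balanced cells are fractional and are no closer to the truth than the seed. What the procedure guarantees is only that the rows and columns now agree with their parent totals, which is the sense of “things add up” used above.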
While we expect to report further as more data is released, there are two key points that should be repeatedly raised –
- Error, which is rarely a topic of polite conversation, must be openly discussed and understood by data providers and users alike. Absent understanding, the term error produces only fear. The role of analytics is to reduce uncertainty in business decisions, and the more open the discussion, the more intelligent and thoughtful the use of that data will be.
- The lure of “free” census data will lose its luster as users discover that, because of privacy budgets, the data for many small areas appears haphazard and confusing. Consider instead experienced sources of this data that have thoughtfully recalibrated the local data to enhance its usability and have undertaken the major effort involved to permit historical time series analysis.