A tremendous effort in the analytics world is devoted to the task of data preparation and cleansing. Data exchanges refer to ‘curated’ data, which suggests that the suppliers of that data have gone to the trouble of estimating missing fields, reining in outliers, and harmonizing the data with other known and trusted data sources at various geographic aggregations. Users of that curated data then rely on the dataset for their models, often automated, for making informed business decisions. If the goal of analytics is to reduce uncertainty in business decisions, the minimization of error must be a priority at all stages of the effort – since error in source data is not only propagated but magnified.
Consider it equivalent to having a termite infestation in your house. Superficially, everything looks perfectly fine, as the frame of the house is covered by layers of drywall, paint, siding and roofing materials. Sooner or later though, the foundational rot erupts at the surface – sagging floors and cracking drywall – but by then, the damage done is substantial. Structural rehabilitation of the bones of a house is an expensive and time-consuming effort. Tenting the house early in the process is, in comparison, extremely cheap, despite the neighbors' jokes about the circus coming to town.
The notion that error would be deliberately induced into a foundational dataset is close to a moral issue. Who would do such a dastardly deed?
What is the Issue?
In years past, the census has – as required by law – made substantial efforts at protecting the privacy of individuals. As the genealogy world well knows, the physical records, which contain names and addresses, are sealed for decades. When the census included both the short form and the long form, the sensitive personal data found in the long form was reasonably well protected: it was based on a sample, and techniques were employed to “borrow” characteristics between similar, nearby census blocks. With the demise of the long form – replaced by the American Community Survey (ACS) – the census now consists only of fully enumerated geographic areas (with some error, of course). As a result, the data for small areas can be used in conjunction with other databases (mailing lists, property records, etc.) to potentially identify individuals within them.
We have recently been talking about the census concept of a “privacy budget” and its potential effects on the 2020 data releases. Detailed discussions of those issues can be found on the AGS blog:
- Our rather tongue-in-cheek exposé of mermaids at http://appliedgeographic.com/2021/09/mermaids-and-census-privacy-concerns/
The unpleasant conclusion is that the data has been seriously corrupted, so much so that a significant number of census block groups have statistically impossible data, among them –
- entire blocks of unsupervised children in households (no adults)
- ghost communes, where there are occupied dwellings with no people
- baseball team size families, complete with a stocked bullpen
For every identified impossibility, there lurk underneath it at least ten improbabilities, and these are just the baseline numbers. The real meat of the 2020 census is found in the detailed tables which address key population characteristics (age, sex, race, Hispanic origin, ancestry) and household characteristics (household size and structure).
The privacy “budget” was essentially exhausted at the block group level with the release of the general population counts, and the Census is considering releasing the detailed tables only to the Census Tract level. It is not hard to understand why. Massive reallocation was required just to release top-level statistics. Imagine what will need to be done to publish a table of population by age and sex.
In such a table, the worst-case scenario is a value of one: showing that there is one female age 20-24 in a block allows that individual to be identified. At a certain level of geographic aggregation, the data must match the actual totals – the offending cell must be modified, and this means that the value must be changed in the opposite direction for some nearby block. Even at the block group level, there will be a great many cells with a value of 1 or 2, and each adjustment affects at least two block groups. Multiply this through, and you quickly see how pervasive the issue becomes.
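The arithmetic behind that argument can be sketched in a few lines. This is only an illustration of the compensating adjustment described above, not the Census Bureau's actual disclosure avoidance procedure; the block names, the counts, and the `swap_out_unique` helper are all invented for the example.

```python
# Hypothetical counts of females aged 20-24 in three neighbouring blocks
# that share one parent area. All numbers are invented for illustration.
blocks = {"A": 1, "B": 7, "C": 4}
parent_total = sum(blocks.values())


def swap_out_unique(cells, offender, neighbour):
    """Move the people counted in `offender` into `neighbour`.

    The parent total is unchanged, because whatever is removed from one
    block is added to the other.
    """
    adjusted = dict(cells)
    moved = adjusted[offender]
    adjusted[offender] -= moved
    adjusted[neighbour] += moved
    return adjusted


adjusted = swap_out_unique(blocks, "A", "B")

# Blocks whose published value now differs from the enumerated value.
changed = [name for name in blocks if blocks[name] != adjusted[name]]

print(adjusted, "parent total:", sum(adjusted.values()), "changed:", changed)
```

Hiding a single identifiable cell alters two published values, which is why the number of distorted cells grows so quickly once every value of 1 or 2 has to be treated this way.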
From an operational standpoint, the goal of maintaining privacy while maintaining the essence of each geographic unit is an almost impossible task. The published redistricting results clearly indicate that the problem was not solved by one-at-a-time trading of characteristics between nearby areas, but instead relied upon bulk operations which radically change the essential character of each geographic unit. The presence of statistical impossibilities is clear evidence of this.
The ACS to the rescue? Logic would suggest that the ACS analysts would have access to the original census data to both structure their sampling and extrapolate the results. Given some of the comments and discussions we have seen, this is not necessarily the case.
Alas, all is not well in ACS-land. The 2020 1-year series is delayed until the end of November and is being touted as “experimental”. We expect that the tables will be little more than asterisks punctuated by the occasional numeric entry. The 2020 survey occurred at the pandemic peak and response rates were substantially lower than usual. Worse, in-person visits were cancelled, and the results are seriously biased. The 2021 ACS should be much improved, but likely not at normal quality levels.
Don’t expect the 1-year series to be back to normal until late 2023, and the critical 5-year series until 2026.
What remains is a census made far less usable by privacy concerns, and an ACS series with noticeable deficiencies for the next several years.
There are several companies already trumpeting “We have 2020 Census Data” and in some cases, going as far as to say that “we have curated it”. Let us be clear on this point. If all you do is load the raw data into tables, clean up the labels, and cross-check the totals against national values, you have not curated it. Worse yet, in none of these cases have we seen any mention of the data quality and usability issues. Most users remain blissfully unaware of these issues, and it is incumbent upon responsible data suppliers to ensure that users understand the limitations of any database they provide. Our philosophy at AGS has always been to educate users about both the strengths and weaknesses of all databases we create or provide, and we do not shy away from the concept of error. Error is simply uncertainty, and no data is without error. Higher error rates do not render a dataset unusable; they simply increase our level of uncertainty about decisions we may make using it.
The ACS will recover from its pandemic issues, but this will take a few years to work through. In the meantime, our goals are as follows:
- to educate both our business partners and end users on the nature and scope of the issue and, equally importantly, its impact on decision making processes
- to utilize the geostatistical techniques that we have developed over the decades to enhance the usability of the data as much as possible by reining in the statistical impossibilities and using multi-tiered maximum likelihood models to provide a consistent and reasonably accurate benchmark point
Here at Applied Geographic Solutions, we have been working with census data for several decades and have developed and refined a powerful set of tools for analyzing and manipulating small area data.
Spatially Aware Matrix Mathematics
Users of census data have long faced the issue of ever-changing geographic units, which hamper time series analysis. Administrative units such as cities and towns are very unstable over time, and there have been changes even at the county level. The development of the census tract program has alleviated some of those issues, but tracts are often too large in terms of geographic area and population to be useful for many purposes.
Over the years, we have painstakingly migrated the historical census data with each decade to the latest block group boundaries. Conceptually, it is a simple matter of allocating historical block and block group level data to the new census blocks, and reaggregating to the block group level.
While this seems quite simple for one-dimensional variables (population, households), in practice we must rely on guideposts from both periods and from alternate sources to accurately disaggregate the data from the old boundaries then reaggregate it on the new. An example of this is to carefully compare the age composition of the dwelling units between the two geography sets.
For one-dimensional variables, the final step is to convert the results to integers. As an aside, our statistical side would prefer to leave the data as is, but most users are strangely uncomfortable with the concept of 3.17348 people in a block group.
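One common way to do that integer conversion without losing the control total is the largest-remainder method, sketched below. This is a generic technique shown for illustration, not necessarily the procedure AGS uses; the function name and the sample values are assumptions made for the example.

```python
import math


def round_preserving_total(values):
    """Round floats to integers whose sum equals the rounded overall total.

    Largest-remainder method: floor every value, then hand the leftover
    units to the entries with the largest fractional parts.
    """
    target = round(sum(values))
    floors = [math.floor(v) for v in values]
    shortfall = target - sum(floors)
    # Indices ordered from largest to smallest fractional part.
    order = sorted(
        range(len(values)),
        key=lambda i: values[i] - floors[i],
        reverse=True,
    )
    result = list(floors)
    for i in order[:shortfall]:
        result[i] += 1
    return result


# Invented fractional populations for four block groups (they sum to 20).
estimates = [3.17348, 10.45, 0.38, 5.99652]

naive = [round(v) for v in estimates]
rounded = round_preserving_total(estimates)

print("naive:", naive, "sum", sum(naive))
print("controlled:", rounded, "sum", sum(rounded))
```

With these values, rounding each estimate independently loses a person from the total, while the controlled version keeps the total intact by rounding one borderline value up instead of down.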
For two-dimensional tables (such as age by sex), and multidimensional tables, the techniques are much more complex. Iterative proportional fitting (IPF) techniques are generally used, but these are computationally intensive and often fail to reach convergence, especially when dealing with tables beyond two dimensions. While we make use of IPF techniques, we prefer to use maximum entropy models which are one-pass solutions that force the values of a matrix to sum to their target marginal totals while minimizing the disruption of the structural integrity of the relationships between the matrix cells. This avoids a common IPF problem that emerges because the techniques lack memory and may make repeated adjustments to cell elements that lead to distorted but stabilized results.
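For readers who have not met IPF before, a minimal two-dimensional version is sketched below. It shows only the textbook row-then-column scaling loop, not the maximum entropy models mentioned above, whose details are not given here. The function name, the seed table, and the target margins are invented, and the sketch assumes a seed with no all-zero rows or columns and margins that agree on the grand total.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Scale a 2-D seed table until its margins match the targets.

    Returns the fitted table and the number of iterations used.
    Assumes every row and column of the seed has a positive sum.
    """
    table = [[float(v) for v in row] for row in seed]
    for iteration in range(1, max_iter + 1):
        # Scale every row so it sums to its target.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [v * target / row_sum for v in table[i]]
        # Scale every column so it sums to its target.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        # Columns now match (up to floating-point error); stop once the
        # rows do as well.
        worst = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if worst < tol:
            return table, iteration
    return table, max_iter


# Invented age (rows) by sex (columns) seed table and new control totals.
seed = [[12, 10], [20, 22], [8, 9]]
row_targets = [25.0, 40.0, 15.0]   # age groups
col_targets = [38.0, 42.0]         # sex

fitted, n_iter = ipf(seed, row_targets, col_targets)
for row in fitted:
    print([round(v, 3) for v in row])
print("iterations:", n_iter)
```

The fitted cells are fractional, and every pass rescales the whole table again, which gives a sense of why the cost climbs as more dimensions and margins are added.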
These multi-dimensional matrix techniques work in floating-point arithmetic, so an age by sex by race table will be in fractional people. Simply rounding these values to integers will result in sparse tables which will once again not sum to the target totals.
Here is where the geographic experience comes in, along with some fuzzy logic thinking. Block groups are nested within census tracts, census tracts within counties, and counties within states. Since no state boundary changes have occurred in quite some time, the individual cells of the state level matrices will be integers, and by successively working up and down the hierarchy tree, it is possible to reasonably allocate down to the block group level. An issue that emerges is the scale gap between the census tract and county levels. In some cases, that gap can be too large to overcome – Los Angeles County, for example. Amalgamations of census tracts can be used here where the boundaries have been consistent from one census to the next.
Lessons from Canada
The census of Canada has long utilized rather disruptive privacy shields in the release of data, using a random-rounding technique that rounds each value up or down to a multiple of five. For most geographic levels, all numbers end in 0 or 5 and, if the total population is under a minimum threshold, no detailed data is presented at all. The results at the dissemination area level (roughly equivalent to a block group) are, lacking a better term here, lumpy. The distribution of population by age will not sum to the total, and the cross-tabulation of age by sex will not sum to either major dimension.
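A short sketch shows why independently rounded cells stop agreeing with their total. The rounding probabilities used here (round up with probability equal to the remainder divided by five, so the result is unbiased on average) are an assumption for the example rather than a statement of Statistics Canada's exact rules, and the counts are invented.

```python
import random


def random_round(value, base=5, rng=random):
    """Round a count up or down to a multiple of `base`.

    Assumed rule: round up with probability remainder/base, otherwise
    round down. Exact multiples are left unchanged.
    """
    remainder = value % base
    if remainder == 0:
        return value
    lower = value - remainder
    return lower + base if rng.random() < remainder / base else lower


rng = random.Random(2020)  # fixed seed so repeated runs give the same draw

# Invented true population counts by age group for one small area.
age_counts = [23, 41, 37, 18, 12]

published_cells = [random_round(c, rng=rng) for c in age_counts]
published_total = random_round(sum(age_counts), rng=rng)

print("published cells:", published_cells, "sum", sum(published_cells))
print("published total:", published_total)
```

Because each cell and the total are rounded separately, the published cells can add up to a different number than the published total, which is the lumpiness described above.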
The techniques AGS has developed and used in the United States have been demonstrated to work even in the extreme conditions imposed by Statistics Canada and result in data which is both internally and hierarchically consistent. Our extensive experience with decades of census data in each country puts AGS in a unique position to properly curate the 2020 census data.
Our 2020 Census Approach
The privacy budget concept adds a further layer of complexity, in that even the base population counts at the census block level have been modified, sometimes to the point of statistical impossibility.
Our approach to making the 2020 census redistricting release usable includes –
- Using multiple levels of hierarchically organized geographic units to harmonize the data from the block level to the known state totals by utilizing census tracts, counties, and stable sub-county temporary areas
- Adding external data from the ACS public use micro samples to develop maximum likelihood models of sub-tables (such as households by size)
- Using available historical data (ACS, census) to refine the micro-distribution of certain variables such as vacant dwellings and household size
- Using the spatial relationships between census geographies (adjacency and distance measures) to shift specific variables between units within parent geographies in an intelligent manner
Most users of census data rarely look at individual block group data, but rather focus on trade area aggregations (radius and drive time areas) or standard geographic aggregates such as ZIP codes. Many of those users will not be aware of the issues that we address here, since many of these problems are resolved with larger geographic areas. That said, some will notice the problems when individual blocks or block groups are mapped, which, in our earlier termite analogy, demonstrates the structural damage that otherwise lies hidden below the surface.
Our considerable effort here will not result in more accurate data, as it is impossible to know the actual values. Nonetheless, it will be both internally and spatially consistent, and will effectively present the maximum likelihood solution. When used in combination with the ACS over the coming decade, it will be much more usable as a geographic base.
For more information on the issues related to the census data, and on how AGS plans to mitigate them, please feel free to contact us at firstname.lastname@example.org and we will be happy to discuss these issues, and how they ought to impact your business decision making, in more detail.