Every representative democracy has a fundamental need to undertake a regular enumeration of its citizens to ensure that equal representation can be maintained as best as possible. But once you have gone to considerable trouble to count the people, it is quite reasonable that the census asks questions which might be broadly useful for public policy decisions. In other words, since we’re already here, how many bathrooms do you have?
At the same time, the census must ensure that individual respondents cannot be identified. Should be simple, right? But to find out who has been counted, you must enumerate all possible addresses, then ask respondents what their address is so that you can tell who didn’t respond. That data is not released into the wild on purpose, so no big deal, right?
Having counted everybody as best we can, how do we divvy up the territory into a set number of districts? Data must be released for geographic areas small enough to allow districts of equal population to be created. Political concerns require much more than simple counts to ensure that everyone is represented as well as possible. Details such as race, age, income, and sex are required. Now we run the risk of identifying individuals if we are not very careful. The smaller the geographic area, the more likely it is that an individual can be identified. Likewise, the more we divide the population of an area into groups (such as an age by sex cross tabulation), the greater the risk.
Both the US Census Bureau and Statistics Canada have long employed methods to minimize privacy risk that include a mixture of three basic elements:
- Suppression – by not publishing the data at small geographic levels
- Modification – by injecting error into the data
- Substitution – by swapping the data for households between geographic areas
Canadian census data releases have universal modification (random rounding) with selected suppression. Every individual count is randomly rounded up or down to a multiple of five, so all published numbers end in 0 or 5. If there are 7 males aged 45-49 in an area, the data would be released as either 5 or 10. The problem is that the same absolute error (anything under 5) is introduced to each data element, regardless of its size and the need for it in the first place. Even done correctly, the error on a value of 100 is at most about +/- 5%, but on a value of 5 it can approach +/- 100%.
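The rounding described above can be sketched in a few lines of Python. This is an illustration only, not Statistics Canada's actual procedure: the function names are hypothetical, and the choice to round up with probability proportional to the remainder (so that the published value is correct on average) is an assumption, since the text only says that a 7 becomes either 5 or 10.

```python
import random

def random_round(count, base=5, rng=random):
    """Randomly round a non-negative count to an adjacent multiple of `base`.

    Assumption: round up with probability remainder/base, which keeps the
    expected published value equal to the true count.
    """
    remainder = count % base
    if remainder == 0:
        return count  # already a multiple of base, left unchanged
    lower = count - remainder
    if rng.random() < remainder / base:
        return lower + base
    return lower

def worst_relative_error(count, base=5):
    """Largest possible |published - true| / true for a positive true count."""
    remainder = count % base
    if remainder == 0:
        return 0.0
    worst_absolute = max(remainder, base - remainder)
    return worst_absolute / count

for true_count in (1, 7, 102):
    print(true_count, random_round(true_count), worst_relative_error(true_count))
```

The worst-case absolute error never exceeds 4, but relative to the true count it is about 3% for 102, about 43% for 7, and 400% for a count of 1, which is the imbalance the paragraph above describes.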
American census releases have largely used suppression and substitution. Either you can’t look at the table for a particular level of geography at all, or there is error injected in the tables so that the individual cells of a cross tabulation are affected without disrupting the marginal totals.
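To see how cells can change while the marginal totals stay put, consider the toy sketch below. It is not the Census Bureau's method; the function names and the example table are made up for illustration. It adds and subtracts the same amount in a checkerboard pattern across two rows and two columns, which leaves every row and column total untouched.

```python
def perturb_keeping_margins(table, r1, r2, c1, c2, delta=1):
    """Return a copy of `table` with four cells shifted by +/- delta.

    Each affected row and column receives one +delta and one -delta, so all
    row and column totals are preserved. Note that this toy version does not
    guard against a cell going negative.
    """
    new = [row[:] for row in table]
    new[r1][c1] += delta
    new[r1][c2] -= delta
    new[r2][c1] -= delta
    new[r2][c2] += delta
    return new

def margins(table):
    """Return (row totals, column totals) for a rectangular table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals

# Hypothetical age-by-sex counts for one small area.
original = [[12, 9], [4, 7], [3, 5]]
released = perturb_keeping_margins(original, r1=0, r2=1, c1=0, c2=1)

print(released)
print(margins(original) == margins(released))
```

Anyone checking the released table against published totals sees nothing amiss, yet no individual cell can be fully trusted.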
Whether we like these methods or not – or even know they are being used – the effects on overall data quality and usability are somewhat understood. But things have changed. Clever people with powerful computers have discovered that by adding external information, you can thwart these simple privacy controls.
To illustrate the problem, imagine for a moment that we have a copy of a famous painting which has been entirely covered up with tiny squares of paper. Our goal is to uncover as many squares as possible without allowing you to identify the painting. A strange game show concept, we agree, but bear with us here.
We can make the game harder by deciding that there are areas of the painting which, if we showed them, would end the game immediately (suppression). We could also blur the image by changing pixel values by a random but bounded amount (modification) or by swapping pixel values in a systematic manner (substitution). The more you apply these methods, the longer the game is likely to go on if the contestants are allowed to bring only their memories. Human contestants tend to focus on the parts of the image that define it, relying on positive identification. Call this the Aha! method.
But what happens if one of the contestants arrives with a laptop containing images of paintings? The exposed portions of the image can be statistically compared to the library of images, and with each newly revealed square more candidates are eliminated by recomputing the probability that the images could be the same. In effect, the computer simply eliminates choices until one emerges as the most likely.
Modification techniques, especially bounded ones like the random rounding method, are quickly overcome since the pixel values will have a known deviation range. That is, the particular shade of yellow of the pixel is less important than the fact that it is yellow and not red. The injected error might make the picture less pleasing to the eye, but it doesn’t prevent the computer from statistically matching it.
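A minimal sketch of this elimination game follows, under simplifying assumptions: each painting is reduced to a short list of pixel values, the library and its contents are invented for the example, and candidates are ruled out by a hard test against the known noise bound rather than by the probability calculation a real attacker would use.

```python
def surviving_candidates(library, revealed, max_noise):
    """Return the names of library images consistent with every revealed pixel.

    library:   dict mapping image name -> list of pixel values
    revealed:  dict mapping pixel index -> observed (possibly modified) value
    max_noise: largest absolute change the modification step may apply
    """
    survivors = []
    for name, pixels in library.items():
        consistent = all(
            abs(pixels[index] - value) <= max_noise
            for index, value in revealed.items()
        )
        if consistent:
            survivors.append(name)
    return survivors

# Hypothetical three-image library, four pixels each.
library = {
    "sunflowers": [200, 210, 90, 40],
    "starry_night": [30, 60, 200, 220],
    "water_lilies": [190, 205, 200, 60],
}

# One revealed square with a loose bound leaves two candidates...
print(surviving_candidates(library, {0: 197}, max_noise=10))
# ...a second square and a tighter bound leave only one.
print(surviving_candidates(library, {0: 197, 2: 93}, max_noise=5))
```

Because the noise is bounded, it never moves a pixel far enough to make the true image look inconsistent, so the true image always survives while the others drop away.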
The substitution techniques are slightly more resistant, especially if they are applied well – but the computer has the advantage of knowing how many pixels of each color it should find in total. Even blurry images can be eliminated by probability relatively easily.
This leaves suppression. Unlike the human eye, which is drawn to the distinctive features of the painting, the computer considers each pixel to be of equal importance and can eliminate non-candidate images on the basis of the seemingly unimportant areas of the picture. Show enough of these unimportant pixels, and the image will be known by a process of elimination.
In other words, the levels of privacy protection used in the past will not work in the current era of fast computing and an incredible array of external data which can be brought to the party. The replacement which will be employed by the US Census Bureau for the 2020 release is much more complex and is heavily dependent upon the data itself. This means that the exact nature of the solution is not knowable in advance. At this date, the Bureau has still not been able to tell us definitively what data will be released on what date.
What we do know is that the overall quality of the census as a data source will by necessity be reduced because of a combination of suppression and error injection. Data providers will need to adjust to these conditions by bringing more and more external data to the party.
For more information on the changes coming to the Census, we recommend the following resource:
https://www.census.gov/library/video/2021/protecting-privacy-in-census-bureau-statistics.html