Over the years, we have worked with many datasets that attempt to avoid disclosing private information on individuals (and businesses) that governments and companies collect.

But what exactly is ‘private information’?

Private information is data that can be used, directly or indirectly, to identify individuals or entities (which we will call ‘personal’) and that has not otherwise been voluntarily released by the affected party.

The current laws governing data released by the United States government have evolved over time (“Census Protections Evolve Continuously to Address Emerging Threats”, 2020, www.census.gov/library/stories/2020/02/through-the-decades-how-the-census-bureau-protects-your-privacy.html).

Interestingly enough, until 1850 the initial results of the decennial census were typically posted in public areas so that individuals could correct information or add missing information. Over time, disclosure laws were enacted to protect business information, and these were extended to cover both direct and indirect disclosure, and eventually to include information on people.

Despite the ongoing risk of direct disclosure through the hacking of government information systems, the main focus of privacy policy of late has been indirect disclosure. If the published data for a census block covers only one household, then clearly any detailed data can be tagged back to that household. From there it is an easy task to link that single household to external records that include name, address, household composition, and so forth.

Truthfully, there has been no disclosure of private information, as that information has been repeatedly and voluntarily yielded by the household. After all, most of us regularly give up private information in exchange for a service. Navigation. Ad-free games on your phone. Having a bank account. Leasing an apartment. Purchasing a property. Even putting your last name on your mailbox.

And remember that it is only by linking census data with these external databases that identification is possible. At best, such a linkage confirms what you already knew. It can rarely disclose information that is still private.

Truth is that nobody really trusts corporations with that data, but it is a cost of participating in modern life. With government, giving up data really isn’t voluntary, and nobody trusts them to respect our personal data. That’s nothing new — a core concept of our constitution reveals that fundamental distrust.

In our view, the problem is that the Census Bureau has taken the privacy concept far beyond what the law requires and is threatening to take it further – even to the point of applying its disclosure avoidance rules to sample surveys.

The 2020 census documentation makes it clear that there are only two things that are ‘true’ at the block level:

  • The dwelling unit count has not been modified
  • A zero in the group quarters population means zero present, while a non-zero value means just that and nothing more

In years past, the Census Bureau used a household swapping model that preserved the counts of at least population, households, and population in households. More detailed tables were subjected to some noise injection, but without making users contend with the mathematical impossibilities of the 2020 census. Had the disclosure avoidance worked, we could understand staying the course. But doubling down on a strategy that failed to achieve its primary objective will make the census less usable, and likely without fulfilling the goal.

A compromise? Don’t mess with the basic counts (households, population) for an area; mess with the geographic units for which they are reported. Choose a minimum threshold count of households for data release at the block level, and amalgamate blocks with adjacent blocks until all reporting blocks meet the standard. Revert to the tried-and-true household swapping methods. We don’t believe that preventing indirect disclosure is even possible without torturing the data to the point where it fails to fulfil its primary objectives. Are there pitfalls here? Of course – we can see that some blocks, and even some block groups, will by necessity be disjointed in order to capture the intricate geography of the place and county levels – but this is a minor issue at best. We will happily live with more complex geographic units if we get data that actually reflects reality at the smallest published scale.
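To make the amalgamation idea concrete, here is a minimal sketch of how it might work. It is not the Census Bureau's method or any production algorithm. The function name `amalgamate_blocks`, the dictionary inputs, and the greedy rule (merge the smallest under-threshold unit into its smallest neighbour) are all our own assumptions for illustration. A real implementation would also have to respect place and county boundaries, which this sketch ignores.

```python
def amalgamate_blocks(households, adjacency, threshold):
    """Greedy sketch: merge under-threshold blocks into adjacent units.

    households: dict of block_id -> household count (hypothetical input)
    adjacency:  dict of block_id -> set of neighbouring block_ids
    threshold:  minimum household count for a unit to be reported

    Returns (unit_of, totals): the reporting unit assigned to each block,
    and the household count of each reporting unit.
    """
    # Every block starts out as its own reporting unit.
    unit_of = {b: b for b in households}
    members = {b: {b} for b in households}
    totals = dict(households)

    def neighbours(unit):
        # Other units that touch any member block of `unit`.
        found = set()
        for block in members[unit]:
            for other in adjacency.get(block, ()):
                if unit_of[other] != unit:
                    found.add(unit_of[other])
        return found

    while True:
        # Units that are too small and still have something to merge with.
        small = [u for u in totals if totals[u] < threshold and neighbours(u)]
        if not small:
            break
        # Assumed rule: take the smallest unit and fold it into its
        # smallest neighbour; ties are broken by id for repeatability.
        unit = min(small, key=lambda u: (totals[u], u))
        target = min(neighbours(unit), key=lambda u: (totals[u], u))
        for block in members[unit]:
            unit_of[block] = target
        members[target] |= members.pop(unit)
        totals[target] += totals.pop(unit)

    return unit_of, totals


# Four made-up blocks in a row: A - B - C - D.
example_households = {"A": 1, "B": 3, "C": 12, "D": 40}
example_adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
unit_of, totals = amalgamate_blocks(example_households, example_adjacency, 10)
print(unit_of)
print(totals)
```

Counts are only ever added together, never altered, so the household total across all reporting units equals the total across the original blocks. A unit with no neighbours at all stays below the threshold in this sketch; what to do with such a unit (suppress it, or merge it across a boundary) is a policy choice we leave open.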