The Canadian census has for some decades used a random rounding technique to address privacy concerns in its data. Essentially, the method rounds each number up or down to a multiple of five according to a probability table. A value of 1, for example, would have an 80% probability of being rounded to 0 and a 20% probability of being rounded to 5.

This rounding is applied to each data element released at the Dissemination Area (DA) level.
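The rounding rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the probability of rounding up is the remainder divided by five (which reproduces the 80%/20% split for a value of 1); the function name `random_round` and its parameters are hypothetical, and the actual census procedure may differ in its details.

```python
import random

BASE = 5  # rounding base described in the text


def random_round(value, base=BASE, rng=random):
    """Round a non-negative integer count up or down to a multiple of `base`.

    Assumption: the chance of rounding up is remainder / base, so the
    expected value of the rounded result equals the original value.
    """
    remainder = value % base
    lower = value - remainder
    if remainder == 0:
        return lower  # already a multiple of the base, left unchanged
    # rng.random() is uniform on [0, 1); round up with probability remainder / base.
    if rng.random() < remainder / base:
        return lower + base
    return lower


# Simulate many roundings of the value 1 with a seeded generator.
rng = random.Random(42)
draws = [random_round(1, rng=rng) for _ in range(100_000)]
share_zero = draws.count(0) / len(draws)   # should be close to 0.80
mean_value = sum(draws) / len(draws)       # should be close to 1
print(f"share rounded to 0: {share_zero:.3f}, mean of rounded values: {mean_value:.3f}")
```

Over many draws the rounded values average out to the original count, which is why the rounding matters little once many areas are aggregated.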

For many applications, data at the DA level is aggregated by radius or drive time, and the impact of the error is minimal. But what happens when you use this data at its native level? As part of our US-Canada segmentation project, one of the first things we noticed was that the data for Canada is more “lumpy”, especially for areas with relatively small populations. When you are trying to group similar areas together, this lumpiness has a significant impact, since the effect is to polarize the data.

If random rounding has been done well, it is unbiased: the signed errors are evenly distributed and cancel out, so the average error across many values should be nearly zero. The expected absolute error of an individual value, however, does not cancel and can approach 2.5. The smaller the original number, the larger this error is as a percentage of its value. An error of 2.5 is about 0.25% of a count near 1,000, but 25% of a count near 10.
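To make the arithmetic concrete, the expected absolute error can be worked out for individual counts. The sketch below assumes the same rule as the 80%/20% example earlier (round up with probability remainder divided by five); the helper name `expected_abs_error` is hypothetical. Under that assumption the expected absolute error peaks at 2.4 for counts ending in 2 or 3, in line with the figure of roughly 2.5 quoted above, and is zero for counts that are already multiples of five.

```python
BASE = 5  # rounding base described in the text


def expected_abs_error(value, base=BASE):
    """Expected absolute rounding error for one count.

    Assumption: the count rounds down (error = r) with probability
    (base - r) / base and up (error = base - r) with probability r / base,
    where r is the remainder of the count modulo the base.
    """
    r = value % base
    return r * (base - r) / base + (base - r) * r / base


# Compare small and large counts with the same remainder.
for value in (1, 2, 12, 1002):
    err = expected_abs_error(value)
    print(f"count={value:>5}  expected |error|={err:.1f}  relative={err / value:.2%}")
```

The absolute error is the same for 12 and 1002, but as a share of the count it differs by nearly two orders of magnitude, which is the source of the lumpiness in small areas.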

Complicating this is the fact that by its very nature, segmentation attempts to group areas together – in effect, looking for the lumps. Worse, the more interesting the variable, the more likely it is to exhibit lumpy behavior. Group quarters populations tend to be a small proportion of the total population, and tend to be spatially clustered (e.g. military bases, college dormitories) to begin with. Add random rounding to this, and the landscape looks even more clustered than it actually is.

Adding to our statistical woes is the simple fact that the average Canadian dissemination area has 684 people, compared with 1,532 people in the average US block group.

As a result, the combined segmentation analysis for the two countries tends to polarize the Canadian data into a smaller number of concentrated segments, which are less stable than the US-dominant segments.

To address this, we have employed some rather unusual techniques that attempt to overcome the lumpiness of the Canadian data, or at least minimize it, so that we can distinguish between demographic characteristics that are truly different and those that are primarily statistical effects. It is problems like this that remind us that data science is often more of a data art.