The discussion of methodology is always a balancing act. At one end of the scale are users who hunger for the gritty technical details that one would expect in a doctoral thesis in applied mathematics. Most prefer a short document that covers the basic underlying data sources but without the embedded equations or statistical terms. A few (none of our readers, of course) just want to know that a statement exists and that you aren’t just winging it. From a commercial standpoint, giving away too many details, especially the innovative ones, can be risky.
For our purposes here, we will simply lay out some of the key source data which we currently utilize, then give some vague hints as to the future directions we are actively exploring and testing.
We have heard of “top-down” and “bottom-up” style estimates as if one must do one or the other. The reality is more akin to a three-dimensional ping-pong game, where different data elements are estimated from multiple sources at different geographic aggregation scales. Everything must ultimately balance out close to reliable, or at least widely accepted, sources, which often disagree with each other. The result is a complex system of successive aggregation and disaggregation of data elements at various scales.
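To make the balancing idea concrete, here is a minimal sketch of one such step: scaling a set of bottom-up local estimates so that they sum to a top-down control total. It illustrates the general technique only and is not our production method. The function name, area labels, and figures are all hypothetical.

```python
def control_to_total(local_estimates, control_total):
    """Scale bottom-up estimates proportionally so they sum to a control total.

    local_estimates: dict mapping an area id to its raw estimate (hypothetical data).
    control_total: the accepted figure for the larger area that contains them.
    """
    raw_sum = sum(local_estimates.values())
    if raw_sum == 0:
        raise ValueError("cannot scale estimates that sum to zero")
    factor = control_total / raw_sum
    return {area: value * factor for area, value in local_estimates.items()}


# Hypothetical block group household estimates and a county-level control.
block_groups = {"BG-1": 400.0, "BG-2": 600.0, "BG-3": 1000.0}
controlled = control_to_total(block_groups, 2200.0)
```

In practice a step like this would be repeated across several variables and geographic levels, which is where the ping-pong comes in.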
But what source data is brought to this party? The decennial census has historically been the foundation for each decade’s estimates, but with the demise of the “long form”, the primary source has become a mixture of the more accurate decennial counts and the annually released American Community Survey (ACS) estimates. Most of the interesting variables, such as income, occupation, and housing values, are sampled on an ongoing basis rather than enumerated once every ten years. Most federal agencies produce statistics on their particular spheres of influence, which can be used to supplement the ACS estimates, albeit in some cases with significant caveats.
The real work in estimation occurs at the local level. While most users are comfortable with block group level data for their work, we do all our core modeling at the census block level. In relative order of importance, in addition to the above federal government sources, we use the following:
- Private saturation mailing list counts at the ZIP+4 level. These are accurate estimates of occupied dwelling units at the local level, and a substantial improvement over post office derived data for the same areas. Private firms have a clear interest in obtaining and maintaining accurate counts of occupied housing units, because direct mail response rates suffer when mail counts include vacant or non-existent housing. Unlike post office counts, these are useful for tracking both decline and growth, and have relatively short and predictable lag times.
- Post office ZIP+4s by type and locations over time, with an understanding of the limitations of postal geographies.
- The Open Address and National Address Databases are used to pinpoint the location of ZIP+4 records, which is essential for identifying exactly where growth is occurring within census block groups.
- The PUMS (public use microdata sample) databases which accompany the decennial census and the annual ACS release are used extensively to model the complex relationships between demographic characteristics.
- In our latest enhancements, we are using national building footprint data to assist in the transition between the 2010 and 2020 block boundaries, and moving forward, we will be using this source as an additional indicator of local, recent growth.
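As an illustration of how building footprints might help bridge the 2010 and 2020 block boundaries, the sketch below builds a simple crosswalk: each 2010 block's value is split among 2020 blocks in proportion to where its buildings fall. This is an assumed, simplified approach for illustration only. The function names, block ids, and counts are hypothetical, and the step of assigning each footprint to its 2010 and 2020 block (a spatial join) is taken as already done.

```python
from collections import defaultdict


def footprint_crosswalk(footprints):
    """Build 2010-to-2020 allocation weights from building footprints.

    footprints: iterable of (block_2010, block_2020) pairs, one per building.
    Returns {block_2010: {block_2020: share}} where shares sum to 1 per 2010 block.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for old_block, new_block in footprints:
        counts[old_block][new_block] += 1
    weights = {}
    for old_block, targets in counts.items():
        total = sum(targets.values())
        weights[old_block] = {new: n / total for new, n in targets.items()}
    return weights


def reallocate(values_2010, weights):
    """Move 2010 block values onto 2020 blocks using the crosswalk weights.

    Note: a 2010 block with no footprints has no weights and is skipped here;
    a real workflow would need a fallback such as area weighting.
    """
    out = defaultdict(float)
    for old_block, value in values_2010.items():
        for new_block, share in weights.get(old_block, {}).items():
            out[new_block] += value * share
    return dict(out)


# Hypothetical example: block A has three buildings, two of which fall in 2020 block X.
footprints = [("A", "X"), ("A", "X"), ("A", "Y"), ("B", "Y")]
weights = footprint_crosswalk(footprints)
blocks_2020 = reallocate({"A": 30.0, "B": 10.0}, weights)
```

Counting buildings equally is the crudest possible weighting; footprint area or building type could be used instead where that information is available.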
But like baking, getting the right ingredients, even in the right proportions, doesn’t guarantee a good result. Key to our methodologies is a reliance on decades of geographic knowledge that ranges from urban systems theory to spatial statistics and modeling techniques.
We have several projects underway to further enhance the precision of our small area estimates over the coming decade. These include:
- Using multiple consumer lists, not for the purpose of counting population, but for quickly identifying new addresses and new streets over time. The presence of a household at a new address will help to determine when new construction is actually occupied. To a much lesser extent, given the known problems with consumer lists, we will be tracking household and population changes over time in these files, as well as selected demographic characteristics which are often available for household records.
- Using multiple cartographic sources to track the addition of new street segments, extension of existing segments, and in some cases, even the removal of existing streets which might affect the local household count.
- Using shape recognition algorithms and satellite imagery to track new housing developments as they occur. Initially, efforts will be confined to the fringes of the faster growing metropolitan areas and used to update the building footprint database between major updates.
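The first of these projects, spotting new addresses and new streets in consumer lists, comes down to comparing snapshots of a list over time. The sketch below shows the idea with two hypothetical snapshots. The function names and the (house number, street) record layout are assumptions for illustration, and real lists would need address standardization before any comparison.

```python
def new_addresses(previous, current):
    """Return addresses present in the current snapshot but not the previous one."""
    return sorted(set(current) - set(previous))


def new_streets(previous, current):
    """Return street names that appear only in the current snapshot."""
    previous_streets = {street for _, street in previous}
    current_streets = {street for _, street in current}
    return sorted(current_streets - previous_streets)


# Hypothetical snapshots of (house_number, street) records from two list releases.
previous = {("101", "Oak St"), ("103", "Oak St")}
current = {
    ("101", "Oak St"),
    ("103", "Oak St"),
    ("105", "Oak St"),
    ("12", "Birch Ln"),
}

added = new_addresses(previous, current)
streets = new_streets(previous, current)
```

Here `added` picks up both the infill address on an existing street and the address on a street not seen before, while `streets` isolates the latter, which is the stronger signal of a new development.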
For those desiring more detail, we do have some bedtime reading that we can share with you. If that is insufficient, give us a call and we will happily talk – seemingly endlessly – about spatial data methodologies. For those at the other end of the scale? Yes, we have data and a methodology statement. We hope that this satisfies those who live safely within two standard deviations of the mean. Call us if it doesn’t!