Over the past several years, the term ‘data science’ has emerged as one of those fashionable buzzwords that I hope will soon disappear. The term data science implies that its practitioners operate within the firm guidelines of bias-free and replicable empirical science.
During my generally unremarkable high school tenure, I faithfully arrived at 7:30 every single morning for band practice and was required to take music as one of my optional courses each year. The band director was always impressed by my technical abilities, and especially by my ability to play an unknown piece impeccably in the dreaded ‘sight reading’ portion of the final exams. For a high school student, I had a reasonably firm grasp of music theory and both an understanding of and appreciation for the sheer beauty of its mathematical expression.
My problem, though, as he repeatedly told me, was that I was a technician. As much as I understood that the mathematics of music could evoke a tremendous range of emotional responses, I was simply not a musician. I may have been a music scientist, but I was not an artist.
To my mind, the problem with the term ‘data scientist’ is that it focuses the attention on the toolkit rather than the artist who wields the tools, relegating the artist to at most a technician. Rarely has there been a non-trivial data problem solved by the execution of a series of steps from a recipe book. Data, and especially data related to the behavior of people, is just too messy and error-laden for that to ever work. Indeed, it is that very messiness which makes data interesting and worthwhile to study in the first place.
It has often been repeated that if you torture a dataset enough it will confess to anything. Through transformations and sketchy statistical techniques, you can pretty much cook the outcome any way you want. And therein lies the problem. The notion that data analysis occurs in a bias-free environment is naïve at best. It is shockingly easy to insert your own a priori notions into the analysis, or worse yet, to yield to the expectations of a client, resulting in models that fit the data beautifully but perform extremely poorly in practice.
The term data science can then be used to justify the results – after all, who can argue with the science? – when in fact the results are a delicate blend of the tools of data science and the artistry of the practitioner. Both components are necessary, and if you have ever heard an artist make amazing music with whatever objects are placed in front of them, you will understand that it is the artistry that makes music. The tools make noise.
But back to my original complaint. I dislike the term ‘data scientist’ slightly more than I dislike the term ‘GIS Professional’, about which I am rather less than enthusiastic.
Why? Both data science and GIS are scientific tools based on mathematics. Complex? Certainly. Useful? Absolutely.
But would you hire someone who says that they are a “power tools specialist” to design and build your house? Probably not. The data science label makes the tool the object of attention rather than the knowledge and experience of the user of those tools. Put another way, given the choice between an experienced site location analyst who clings to spreadsheets and a highly skilled practitioner of data science with no retail location experience, I will choose the former. Every time.