I recently read an op-ed piece in the New York Times, “Eight (No, Nine!) Problems with Big Data.” The article was written by two NYU professors, one a psychologist and one a computer scientist. I was looking forward to some interesting new arguments and points of view. I was in for a disappointment.
Here is an abridged version of the 9 problems they see with Big Data.
- Big data is good at detecting correlations but doesn’t tell us which correlations are meaningful.
- Big data is helpful as an “adjunct to scientific inquiry” but doesn’t replace domain expertise.
- Tools produced with big data can be easily “gamed,” reducing their long-term utility.
- Google’s Flu Trends app doesn’t work as well as it once did.
- If big data techniques are used for both data collection and analysis, there may be pitfalls.
- Big data finds “too many correlations” because it explores too much data.
- Big data is “prone to giving scientific-sounding solutions to hopelessly imprecise questions” like ranking Francis Scott Key as history’s 19th best poet.
- Big data is “best when analyzing things that are extremely common”. Huh?
- There is too much hype surrounding Big Data.
Let’s take a look at these one at a time.
Problem #1 is straight from the Statistics 101 textbook: correlation does not imply causation. So big data in the hands of someone who never studied statistics could be a problem. Of course this has always been an important axiom for data analysts. Nothing new or significant about this problem with the advent of big data.
Problem #2 reminds us that we will still need biologists and other scientists with specific domain expertise because big data can’t do the job on its own. Yes, despite coordinated efforts, statisticians and computer scientists have failed to supplant all other scientific disciplines and domain expertise will continue to be valuable. Again, this issue predates use of the term big data.
Problem #3 warns us that if someone builds a tool with a simple algorithm (e.g., grading papers by looking for use of sophisticated words) then people will be able to figure it out. Darn it. You mean that feeble attempts at laziness might not work…even with big data? I’m starting to see a pattern here.
Problem #4 is clearly alarming. Google’s Flu Trends doesn’t seem to work well any longer. It was so cool at first. And, now, the whole internet has changed. Jeez. Why would the internet change? It’s sort of like analyzing people and finding a way to predict behavior and then those people change their behavior. Why do they do that? It’s a horrible dilemma but we seem to be stuck re-analyzing data over and over to keep track of changes. Brutal. Why doesn’t big data fix that?
Problem #5 implies that before big data no one ever built a model from data they collected themselves and then failed to validate that model against an independent third-party source. In other words, model validation is still important. Yes, again, even with big data. Damn it!! I still have to pay attention!
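For anyone who wants to see why validation still matters, here is a minimal sketch in Python. It is my own toy illustration, not anything from the op-ed: the labels are coin flips, and `memorizing_model` is a hypothetical stand-in for any model that fits its own data too well. It looks perfect on the data it was built from and falls to roughly coin-flip accuracy on data it has never seen.

```python
import random


def memorizing_model(train):
    """Hypothetical 'model' that just memorizes its training examples."""
    table = dict(train)
    # For inputs it has never seen, it falls back to guessing 0.
    return lambda x: table.get(x, 0)


def accuracy(model, data):
    """Fraction of (input, label) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


rng = random.Random(42)
# Labels are pure coin flips, so there is nothing real to learn.
data = [(i, rng.randint(0, 1)) for i in range(1000)]
rng.shuffle(data)
train, holdout = data[:700], data[700:]

model = memorizing_model(train)
train_acc = accuracy(model, train)      # scored on the data it was built from
holdout_acc = accuracy(model, holdout)  # scored on independent data

print(f"training accuracy: {train_acc:.2f}")
print(f"holdout accuracy:  {holdout_acc:.2f}")
```

The only way to catch the problem is the second number, and that was true long before anyone said “big data.”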
Problem #6 is a major issue if you don’t know anything at all about statistics or data analysis – just like problem #1. The issue here though is that there are correlations everywhere because there is simply too much data analysis going on. Were we better off when we only analyzed a few data sets? I don’t really get this one and I’m surprised the computer science professor allowed this to go to print. Doing more data analysis doesn’t prevent bad interpretation but it doesn’t hurt anything either.
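To make the “too many correlations” complaint concrete, here is a rough sketch in Python. The setup is my own assumption, not the authors’: every variable is pure random noise, each has 30 observations, and 0.36 is approximately the cutoff for a “significant” correlation at the usual 5% level with that sample size. The function names are made up for illustration.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_vars, n_obs=30, threshold=0.36, seed=0):
    """Count pairs of pure-noise variables whose |r| exceeds the threshold."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(columns, 2))
    hits = sum(1 for x, y in pairs if abs(pearson(x, y)) > threshold)
    return hits, len(pairs)


for n_vars in (5, 100):
    hits, total = count_chance_correlations(n_vars)
    print(f"{n_vars} noise variables: {hits} of {total} pairs look 'significant'")
```

The share of false alarms stays near 5% no matter how many variables you test, but the count grows with the number of pairs. That is an argument for stricter thresholds and follow-up validation when you test a lot of things, not an argument for analyzing less data.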
Problem #7 is the anti-positivist angle. Talk to any social scientist who hated math or statistics. They refer to any effort to quantify as “positivism” and lump this sort of research into a bucket full of other horrible practices like voter discrimination and other parts of the GOP platform. (Aside: if you didn’t go to grad school, imagine a chain-smoking, hand-waving intellectual wannabe who uses big words but on the inside is terrified they’ll be discovered as not terribly insightful.) I’ll bet the computer scientist secretly liked seeing Francis Scott Key as #19 on the poet list. Who says he’s not a poet? But the psychologist doesn’t want us to forget that humans have a right to make arbitrary distinctions between those who rhyme with spoken word and those who rhyme with lyrics set to music. To me, these ranking efforts can provide uniquely useful insights. Sometimes they must be disregarded as whimsical, but not always. And what’s wrong with a whimsical perspective from time to time?
Problem #8 is sort of like … well, we really want to get to 8 or 9 problems. That’s a big number, and that way no one will really want to read the entire article; because there are so many problems, they will simply assume we’re correct and move on to the next article. Big data is good at analyzing “common” things? What does that mean? So big data is good for analyzing baseball and apple pie but it’s bad for analyzing tennis and zucchini bread? There are rules about how much significance can be attributed to inferential findings (again, this is Stats 101... okay, maybe Stats 102), but there’s nothing problematic about looking for a needle in a haystack. The example they give, something to do with translating a book review, has nothing to do with big data and everything to do with the thorny task of language translation. This may come as a shock, but “big data” is better with numbers than it is with text.
Problem #9: too much big data hype? Perhaps. To me, it’s very exciting that advances in computational power allow us to explore possible solutions to problems that were intractable just a few years ago. Maybe Big Data today is like disco in the 1970s. The Bee Gees were hot, but the popularity faded a few years later. Or maybe it’s more like the internet in the 1990s. There was way too much hoopla. Remember pets.com? What a joke. After 20 years the internet hasn’t really lived up to the hype…well, except now I work from a home office using a Google Chromebook purchased on Amazon. And you’re reading this on my blog.