What Can Search Predict?

Monday, November 30, 2009
By Sharad Goel


Statistical wisdom is sometimes found in unusual places. Take, for example, the following exchange in “Whip It,” a story of roller-derby-loving misfits:

Team: (chanting, after a game) We’re number two! We’re number two!
Coach: You guys came in second out of two teams.
Team: Woo!

So what’s the moral? Well, there’s been a fair amount of buzz this year over the finding that web search behavior is correlated with offline outcomes, including flu incidence and economic activity (e.g., auto and home sales). For example, an increase in queries for “flu” and “cold” is often associated with a rise in actual flu caseloads. The correlation between search volume and flu levels in fact seems quite good, hovering around 0.95. My initial reaction to these results was somewhat critical, and it boils down to the observation that simple statistical models are comparable to, and often even better than, search-based predictions. Invoking the first lesson of Whip It: Success is best evaluated in context. To give another example of this principle, it may sound impressive to predict with 97% accuracy whether it’ll be cloudy in Cold Bay, AK, at least until you learn Cold Bay is sunny only 10 days a year. This assessment, however, should be weighed against the second lesson of Whip It: Number two can still be cool. The fact that search is so highly correlated with health and economic outcomes says something interesting about human behavior, regardless of whether or not it’ll help save the world.
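The Cold Bay example comes down to a bit of arithmetic, sketched here in Python. The only number taken from the post is the 10 sunny days a year; the function names and the "skill" measure (improvement over the trivial forecast, as a fraction of the room left to improve) are illustrative assumptions, not anything from the studies discussed.

```python
def always_majority_accuracy(minority_days, total_days=365):
    """Accuracy of a forecaster that always predicts the more common outcome."""
    majority_days = max(minority_days, total_days - minority_days)
    return majority_days / total_days


def skill_over_baseline(model_accuracy, baseline_accuracy):
    """Improvement over the baseline as a fraction of the room left to improve.

    0 means no better than the baseline; negative means worse. (Assumed
    definition, chosen for illustration.)
    """
    return (model_accuracy - baseline_accuracy) / (1.0 - baseline_accuracy)


sunny_days = 10  # figure quoted in the post for Cold Bay, AK
baseline = always_majority_accuracy(sunny_days)
print(f"Always predicting 'cloudy': {baseline:.1%} accurate")  # 97.3%

# A forecaster that is right 97% of the time does not beat that trivial rule.
print(f"Skill of a 97%-accurate forecaster: {skill_over_baseline(0.97, baseline):+.3f}")
```

Seen this way, the 97% figure is slightly below what a forecaster gets by saying "cloudy" every day, which is the sense in which success is best evaluated in context.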

Motivated by these examples, fellow yahoos (Jake Hofman, Sébastien Lahaie, Dave Pennock, and Duncan Watts) and I recently investigated the extent to which search behavior predicts the success of cultural products, namely movies, video games, and music. In a departure from past work that has focused on real-time reporting of current activity (e.g., flu trends), what Choi and Varian cleverly call “predicting the present,” our objective was to predict future events, typically days to weeks in advance. Specifically, we use query volume to forecast opening-weekend box-office revenue for feature films, first-month sales of video games, and the rank of songs on the Billboard Hot 100. In all cases that we consider, we find search volume on its own is predictive of future outcomes, but search is nevertheless often outperformed by baseline models trained on publicly available data; combining search and baseline models generally leads to modest improvements.
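To make the shape of that comparison concrete, here is a minimal sketch on purely synthetic data. It is not the paper's method or data: the feature names (`baseline_signal`, `search_volume`), the noise levels, and the plain least-squares fit are all assumptions chosen for illustration.

```python
import random


def fit_ols(rows, y):
    """Least-squares fit with an intercept, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    b = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    # Gaussian elimination with partial pivoting, then back-substitution.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (b[i] - s) / A[i][i]
    return beta


def predict(beta, rows):
    return [beta[0] + sum(w * v for w, v in zip(beta[1:], r)) for r in rows]


def rmse(pred, y):
    return (sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)) ** 0.5


def make_synthetic(n, seed=0):
    """Made-up products: a hidden 'interest' level drives every column."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        interest = rng.gauss(0, 1)
        baseline_signal = interest + rng.gauss(0, 0.3)  # hypothetical public indicator
        search_volume = interest + rng.gauss(0, 0.5)    # hypothetical query volume
        outcome = interest + rng.gauss(0, 0.3)          # hypothetical sales figure
        data.append((baseline_signal, search_volume, outcome))
    return data


def compare_models(n_train=200, n_test=200, seed=0):
    """Fit baseline-only, search-only, and combined models; report RMSE."""
    data = make_synthetic(n_train + n_test, seed)
    train, test = data[:n_train], data[n_train:]
    y_train = [row[2] for row in train]
    y_test = [row[2] for row in test]
    feature_sets = {"baseline": (0,), "search": (1,), "combined": (0, 1)}
    results = {}
    for name, cols in feature_sets.items():
        x_train = [[row[c] for c in cols] for row in train]
        x_test = [[row[c] for c in cols] for row in test]
        beta = fit_ols(x_train, y_train)
        results[name] = {
            "train_rmse": rmse(predict(beta, x_train), y_train),
            "test_rmse": rmse(predict(beta, x_test), y_test),
        }
    return results


if __name__ == "__main__":
    for name, scores in compare_models().items():
        print(f"{name:9s} train RMSE {scores['train_rmse']:.3f}  "
              f"held-out RMSE {scores['test_rmse']:.3f}")
```

On the training rows the combined fit can never do worse than either single-feature fit, so the held-out error is the number to watch. With these made-up noise levels the baseline feature is the cleaner signal by construction, so the run only illustrates how such a comparison is set up; it does not confirm any finding.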

Whether web search is useful in predicting real-world activity is therefore likely a matter of circumstance and necessity: On the one hand, across a variety of domains, adding search to baseline models does not dramatically boost performance; but on the other hand, in certain situations, such baseline estimates may be difficult to generate, and for some applications, even small gains in performance are valuable. In other words, the benefit of web search as a prediction tool may have less to do with its superiority over other methods than with its generality, low cost, and real-time nature.

NB: For more details, check out our paper.

Illustration by Kelly Savage

  • Bheema V

    I thought the whole deal about the search volume based predictor was the ‘real-time’ part.

    The paper in Nature on Google’s effort to track flu status makes this point a number of times (a one-day lag when using search trends vs. a 1-2 week reporting lag when waiting for the CDC to respond).

  • http://www.cam.cornell.edu/~sharad/ Sharad Goel

    Even though the CDC may have a 1-2 week delay in reporting ground-truth flu caseloads, there is still a lot of (non-search) information available for estimating current flu levels (e.g., CDC reports from the last few weeks). So the relevant question is: at any instant in time, how much does search boost performance over a baseline tracking model? As it turns out, assuming a 1 week reporting delay, the boost is negligible; and with a 2 week delay, there is a real, but still relatively small, improvement. In other words, the real-time information (i.e., search volume) does not add much to the stale information (i.e., actual flu caseloads from a week ago). See The Future is Yesterday for more discussion.

    In any case, flu reporting lags may be a thing of the past, as the CDC has recently adopted a near real-time flu tracking system.

  • http://lingpipe-blog.com/ Bob Carpenter

    There are companies like Health Monitoring Systems that use natural language classifiers over emergency room chief complaints (short text descriptions of symptoms) to add predictors for bio-surveillance (e.g., tracking flu outbreaks or localizing botulism outbreaks).

    I wonder if search load would be a useful predictor given chief complaint data. Or the richer data feed that CDC’s now getting.

    The problem I’ve seen discussed relative to using search for bio-surveillance is that if a celebrity gets sick, searches for whatever they have spike. That may not matter for forward prediction, or may itself be predictable given co-searches for the celebrity.

  • http://www.decisionsciencenews.com Dan Goldstein

    Nice graphs!

  • http://www.acepetmemorials.com Dan

    Interesting read and a nice view on search prediction. Nonetheless, harnessing this data has lots of potential opportunities.

  • http://www.gainesville-marketing.com Herb

    interesting post…from a fellow analytics geek.

    You said: “search volume on its own is predictive of future outcomes, but search is nevertheless often outperformed by baseline models trained on publicly available data; combining search and baseline models generally leads to modest improvements.”

    I am curious to see if this continues to hold true as search evolves and tracking improves. We track trends for customers whose products are seasonal and even regional.

  • http://www.blockarchitects.co.uk Lorna

    Found this article interesting. Web search is not only a tool for predicting future outcomes but also an insight into human behaviour.

  • http://squarecowmovers.com Austin Myers

    Yes, Lorna, you are right. It also predicts business venture outcomes. Many businesses are starting to look at web search statistics to predict which market will give them the biggest profit in the long run.

  • http://www.marketingforsmallbusiness.mobi Elizabeth

    That’s an interesting point, Austin. Who knew that web search is such an important tool!
