What Can Search Predict?

Monday, November 30, 2009
By Sharad Goel


Statistical wisdom is sometimes found in unusual places. Take, for example, the following exchange in “Whip It,” a story of roller-derby-loving misfits:

Team: (chanting, after a game) We’re number two! We’re number two!
Coach: You guys came in second out of two teams.
Team: Woo!

So what’s the moral? Well, there’s been a fair amount of buzz this year over the finding that web search behavior is correlated with offline outcomes, including flu incidence and economic activity (e.g., auto and home sales). For example, an increase in queries for “flu” and “cold” is often associated with a rise in actual flu caseloads. In fact, the correlation between search volume and flu levels seems quite good, hovering around 0.95. My initial reaction to these results was somewhat critical, and boiled down to the observation that simple statistical models are comparable to, and often even better than, search-based predictions. Invoking the first lesson of Whip It: Success is best evaluated in context. To give another example of this principle, it may sound impressive to predict with 97% accuracy whether it’ll be cloudy in Cold Bay, AK, at least until you learn that Cold Bay is sunny only 10 days a year. This assessment, however, should be weighed against the second lesson of Whip It: Number two can still be cool. The fact that search is so highly correlated with health and economic outcomes says something interesting about human behavior, regardless of whether it’ll help save the world.
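To make the Cold Bay arithmetic concrete, here is a minimal sketch of the base-rate check. It assumes a 365-day year and takes the "10 sunny days" figure at face value; the function name `always_cloudy_accuracy` is made up for illustration.

```python
DAYS_PER_YEAR = 365  # assumption: non-leap year
SUNNY_DAYS = 10      # figure quoted in the post


def always_cloudy_accuracy(sunny_days: int, total_days: int = DAYS_PER_YEAR) -> float:
    """Accuracy of a forecaster that predicts "cloudy" every single day."""
    cloudy_days = total_days - sunny_days
    return cloudy_days / total_days


accuracy = always_cloudy_accuracy(SUNNY_DAYS)
print(f"Always guessing cloudy: {accuracy:.1%} accurate")  # 97.3%
```

A forecaster with no information at all already scores about 97%, so a model claiming 97% accuracy has to be judged against that baseline rather than against zero.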

Motivated by these examples, fellow yahoos (Jake Hofman, Sébastien Lahaie, Dave Pennock, and Duncan Watts) and I recently investigated the extent to which search behavior predicts the success of cultural products, namely movies, video games, and music. Past work has focused on real-time reporting of current activity (e.g., flu trends), which Choi and Varian cleverly call “predicting the present.” In a departure from that work, our objective was to predict future events, typically days to weeks in advance. Specifically, we used query volume to forecast opening-weekend box-office revenue for feature films, first-month sales of video games, and the rank of songs on the Billboard Hot 100. In all the cases we considered, search volume on its own was predictive of future outcomes, but search was nevertheless often outperformed by baseline models trained on publicly available data; combining search and baseline models generally led to modest improvements.

Whether web search is useful in predicting real-world activity is therefore likely a matter of circumstance and necessity: On the one hand, across a variety of domains, adding search to baseline models does not dramatically boost performance; but on the other hand, in certain situations, such baseline estimates may be difficult to generate, and for some applications, even small gains in performance are valuable. In other words, the benefit of web search as a prediction tool may have less to do with its superiority over other methods than with its generality, low cost, and real-time nature.

NB: For more details, check out our paper.

Illustration by Kelly Savage
