Yes, You Are (Maybe) Overconfident
218 of you took our calibration quiz, not counting the 10% of submissions that we had to throw out for being incomplete, giving ranges with the min greater than the max, or failing other sanity checks. (Here’s the raw data.)
The bad news is that you’re terrible at making 90% confidence intervals. For example, not a single person had all 10 of their intervals contain the true answer, which, if everyone were perfectly calibrated, should’ve happened by chance to 35% of you. Getting fewer than 6 good intervals should, statistically, not have happened to anyone. How many actually had 5 or fewer good intervals? 76% of you.
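For the curious, here is a minimal sketch of where those two "perfectly calibrated" numbers come from. It assumes each of the 10 intervals independently has a 90% chance of containing the true answer, so the number of good intervals follows a binomial distribution. The names `binom_pmf`, `N_QUESTIONS`, `HIT_RATE`, and `N_RESPONDENTS` are ours, made up for illustration.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

N_QUESTIONS = 10      # intervals per respondent
HIT_RATE = 0.9        # assumption: perfect calibration, independent questions
N_RESPONDENTS = 218   # valid submissions reported in the post

# Chance that all 10 intervals contain the true answer
p_all_ten = binom_pmf(N_QUESTIONS, N_QUESTIONS, HIT_RATE)

# Chance of 5 or fewer good intervals (k = 0..5)
p_five_or_fewer = sum(binom_pmf(k, N_QUESTIONS, HIT_RATE) for k in range(6))

# Expected number of respondents with 5 or fewer good intervals
expected_people = N_RESPONDENTS * p_five_or_fewer

print(f"P(all 10 good)        = {p_all_ten:.3f}")        # 0.349
print(f"P(5 or fewer good)    = {p_five_or_fewer:.4f}")  # 0.0016
print(f"Expected such readers = {expected_people:.2f}")  # 0.36
```

So among 218 perfectly calibrated respondents you would expect well under one person to land at 5 or fewer, which is why "not anyone" is a fair summary.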
Here’s a histogram of the number of good intervals you got, out of 10:
The overlaid phantom histogram is what it would look like if it were really the case that every interval people gave had a 90% chance of containing the true answer. In other words, you should’ve made your intervals much wider. When we ask for a 90% confidence interval there’s in fact only a 41% chance that your interval contains the true answer.
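If you want to reproduce something like the phantom histogram yourself, the sketch below computes the expected number of respondents at each score, again assuming every interval independently has a 90% chance of being good. This is our guess at a reasonable way to compute it, not necessarily how the original graph was made, and `phantom_histogram` is a hypothetical helper name.

```python
from math import comb

def phantom_histogram(n_respondents: int, n_questions: int = 10,
                      hit_rate: float = 0.9) -> list[float]:
    """Expected number of respondents with k good intervals, for k = 0..n_questions.

    Assumes each interval is good independently with probability hit_rate.
    """
    return [
        n_respondents
        * comb(n_questions, k)
        * hit_rate**k
        * (1 - hit_rate) ** (n_questions - k)
        for k in range(n_questions + 1)
    ]

counts = phantom_histogram(218)  # 218 valid submissions, per the post
for k, expected in enumerate(counts):
    print(f"{k:2d} good intervals: {expected:6.1f} respondents expected")
```

Nearly all of the expected mass sits at 8, 9, and 10 good intervals, which is the shape the real histogram fails to match.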
We ran this quiz on Mechanical Turk as well, and you marginally outperformed the turkers. The histogram of turkers’ good intervals is indicated by the red dots in the above graph. They failed our sanity checks at almost twice the rate (19%) of Messy Matters readers, and among the remaining responses the mean number of good intervals was 3.5 out of 10.
The more we’ve thought about (and read the literature on — or rather, consulted endlessly with Dan Goldstein, who knows the literature on) these kinds of overconfidence results, however, the less clear it is that the moral of this quiz is simply “people are overconfident”. For one thing, overconfidence depends on the question. The fraction of good intervals in your responses ranged from 23% (the length of the Nile and the gestation period of an Asian elephant) to 75% (number of OPEC countries). Of course, even 75% is not the 90% that was asked for.
More interestingly, in an ongoing follow-up study on Mechanical Turk we’re finding that once people have given their intervals, more than half of them realize in retrospect that too few of their intervals are good. This suggests that people can learn to perform much better at this task.
Obligatory Wisdom of Crowds Demonstration
It’s not a fair demonstration since people weren’t asked for their best guesses, but here’s a table of median lower bounds, upper bounds, and midpoints of everyone’s ranges. Interestingly, people’s upper bounds are overall most accurate.
| | MLK | Nile | OPEC | Bible | Moon | 747 | Mozart | Elephant | Tokyo | Ocean |
---|---|---|---|---|---|---|---|---|---|---|
True | 39 | 4132 | 12 | 39 | 2160 | 390000 | 1756 | 645 | 5959 | 35994 |
Min | 35 | 900 | 6 | 8 | 1000 | 20000 | 1700 | 180 | 5000 | 13500 |
Mid | 45 | 1750 | 13 | 15 | 3500 | 63250 | 1725 | 320 | 8000 | 30000 |
Max | 55 | 3000 | 20 | 20 | 5000 | 100000 | 1790 | 400 | 10000 | 40000 |
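One rough way to check the "upper bounds are most accurate" claim is to count, question by question, which of the three medians lands closest to the true value. The sketch below does that with the numbers from the table. Using absolute error as the yardstick is our choice for illustration, and `closest_bound_counts` is a made-up helper name.

```python
questions = ["MLK", "Nile", "OPEC", "Bible", "Moon",
             "747", "Mozart", "Elephant", "Tokyo", "Ocean"]
true_vals = [39, 4132, 12, 39, 2160, 390000, 1756, 645, 5959, 35994]

# Median lower bounds, midpoints, and upper bounds, copied from the table
medians = {
    "Min": [35, 900, 6, 8, 1000, 20000, 1700, 180, 5000, 13500],
    "Mid": [45, 1750, 13, 15, 3500, 63250, 1725, 320, 8000, 30000],
    "Max": [55, 3000, 20, 20, 5000, 100000, 1790, 400, 10000, 40000],
}

def closest_bound_counts(true_vals, medians):
    """Count how often each row (Min, Mid, Max) is closest to the truth."""
    wins = {name: 0 for name in medians}
    for i, truth in enumerate(true_vals):
        # Pick the row whose median has the smallest absolute error
        best = min(medians, key=lambda name: abs(medians[name][i] - truth))
        wins[best] += 1
    return wins

wins = closest_bound_counts(true_vals, medians)
print(wins)  # {'Min': 3, 'Mid': 2, 'Max': 5}
```

By this measure the median upper bound is closest on 5 of the 10 questions, ahead of the lower bound (3) and the midpoint (2).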
Thanks to Sharad Goel, Dan Goldstein, Bethany Soule, Dan Kaminsky, and Michael J.J. Tiffany.
Image: Kelly Savage