# Yes, You Are (Maybe) Overconfident

Wednesday, March 31, 2010
By dreeves

218 of you took our calibration quiz, not counting the 10% of submissions that had to be thrown out for being incomplete, giving ranges with the min greater than the max, or failing other sanity checks. (Here’s the raw data.)

The bad news is that you’re terrible at making 90% confidence intervals. For example, not a single person had all 10 of their intervals contain the true answer, which, if everyone were perfectly calibrated, should’ve happened by chance to 35% of you. Getting fewer than 6 good intervals should, statistically, not have happened to anyone. How many actually had 5 or fewer good intervals? 76% of you.
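The two benchmark figures above follow from a binomial model. Here is a minimal sketch of that arithmetic, assuming each of the 10 intervals independently has a 90% chance of containing the true answer (the independence assumption and the helper name `binom_pmf` are ours, not part of the quiz analysis):

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N_QUESTIONS = 10     # intervals per respondent (from the post)
HIT_RATE = 0.9       # hit rate of a perfectly calibrated respondent
N_RESPONDENTS = 218  # valid submissions (from the post)

# Chance that all 10 intervals contain the true answer.
p_all_ten = binom_pmf(N_QUESTIONS, N_QUESTIONS, HIT_RATE)

# Chance of 5 or fewer good intervals (k = 0..5).
p_at_most_five = sum(binom_pmf(k, N_QUESTIONS, HIT_RATE) for k in range(6))

print(f"P(all 10 good) = {p_all_ten:.3f}")
print(f"P(5 or fewer)  = {p_at_most_five:.5f}")
print(f"Expected respondents with 5 or fewer: {N_RESPONDENTS * p_at_most_five:.2f}")
```

Under these assumptions the first number rounds to 35%, and the expected number of respondents with 5 or fewer good intervals is well below one, which is why it "should not have happened to anyone".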

Here’s a histogram of the number of good intervals you got, out of 10:

The overlaid phantom histogram is what it would look like if it were really the case that every interval people gave had a 90% chance of containing the true answer. In other words, you should’ve made your intervals much wider. When we ask for a 90% confidence interval there’s in fact only a 41% chance that your interval contains the true answer.
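For readers who want to reproduce something like the phantom histogram, here is a rough sketch that computes expected counts per bin for 218 respondents. It assumes every interval is an independent draw with the same hit rate; the 41% column is only a hypothetical comparison under that same simplification, not the observed histogram, and the function name `expected_histogram` is ours.

```python
from math import comb


def expected_histogram(n_people: int, n_questions: int, hit_rate: float) -> list[float]:
    """Expected number of people with k good intervals, for k = 0..n_questions."""
    return [
        n_people * comb(n_questions, k) * hit_rate**k * (1 - hit_rate) ** (n_questions - k)
        for k in range(n_questions + 1)
    ]


calibrated = expected_histogram(218, 10, 0.90)     # what was asked for
observed_rate = expected_histogram(218, 10, 0.41)  # hit rate reported in the post

print("good  at 90%  at 41%")
for k, (a, b) in enumerate(zip(calibrated, observed_rate)):
    print(f"{k:>4}  {a:6.1f}  {b:6.1f}")
```

The calibrated column piles up at 9 and 10 good intervals, while the 41% column peaks around 4.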

We ran this quiz on Mechanical Turk as well and you marginally outperformed the turkers. The histogram of turkers’ good intervals is indicated by the red dots in the above graph. They failed our sanity checks at almost twice the rate (19%) of Messy Matters readers, and among the remaining responses the mean number of good intervals was 3.5 out of 10.

The more we’ve thought about (and read the literature on — or rather, consulted endlessly with Dan Goldstein, who knows the literature on) these kinds of overconfidence results, however, the less clear it is that the moral of this quiz is simply “people are overconfident”. For one thing, overconfidence depends on the question. The fraction of good intervals in your responses ranged from 23% (the length of the Nile and the gestation period of an Asian elephant) to 75% (number of OPEC countries). Of course, even 75% is not the 90% that was asked for.

More interestingly, in an ongoing follow-up study on Mechanical Turk we’re finding that once people have given their intervals, more than half of them realize in retrospect that too few of their intervals are good. This suggests that people can learn to perform much better at this task.

### Obligatory Wisdom of Crowds Demonstration

It’s not a fair demonstration since people weren’t asked for their best guesses, but here’s a table of median lower bounds, upper bounds, and midpoints of everyone’s ranges. Interestingly, people’s upper bounds are overall most accurate.

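| | MLK | Nile | OPEC | Bible | Moon | 747 | Mozart | Elephant | Tokyo | Ocean |
|---|---|---|---|---|---|---|---|---|---|---|
| True | 39 | 4132 | 12 | 39 | 2160 | 390000 | 1756 | 645 | 5959 | 35994 |
| Min | 35 | 900 | 6 | 8 | 1000 | 20000 | 1700 | 180 | 5000 | 13500 |
| Mid | 45 | 1750 | 13 | 15 | 3500 | 63250 | 1725 | 320 | 8000 | 30000 |
| Max | 55 | 3000 | 20 | 20 | 5000 | 100000 | 1790 | 400 | 10000 | 40000 |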
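One simple way to check the claim about upper bounds is to count, for each question, which of the three medians lands closest to the true value. The sketch below uses the numbers from the table above; scoring by absolute distance is our choice of measure, and other measures (relative error, say) could rank things differently.

```python
questions = ["MLK", "Nile", "OPEC", "Bible", "Moon",
             "747", "Mozart", "Elephant", "Tokyo", "Ocean"]
true_vals = [39, 4132, 12, 39, 2160, 390000, 1756, 645, 5959, 35994]
medians = {
    "Min": [35, 900, 6, 8, 1000, 20000, 1700, 180, 5000, 13500],
    "Mid": [45, 1750, 13, 15, 3500, 63250, 1725, 320, 8000, 30000],
    "Max": [55, 3000, 20, 20, 5000, 100000, 1790, 400, 10000, 40000],
}

# For each question, find which median is closest to the truth.
wins = {name: 0 for name in medians}
for i, truth in enumerate(true_vals):
    closest = min(medians, key=lambda name: abs(medians[name][i] - truth))
    wins[closest] += 1

# Questions where the median range [Min, Max] contains the true answer.
contained = [
    q for i, q in enumerate(questions)
    if medians["Min"][i] <= true_vals[i] <= medians["Max"][i]
]

print("Closest median, by count:", wins)
print("Median range contains truth for:", contained)
```

By this count the median upper bound is closest on five of the ten questions, and the median range contains the true answer on six of them.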

Thanks to Sharad Goel, Dan Goldstein, Bethany Soule, Dan Kaminsky, and Michael J.J. Tiffany.

Image: Kelly Savage

• http://markjstock.org Mark Stock

As an engineer for whom numbers are (almost) holy, I was happy to see that my intervals were right on target (9/10 answers in my range), though the competitive part of me still wanted to get them all within the tightest intervals possible.

Your publishing of the raw data may have made me *more* confident than I would be normally. If I were to take a similar test again, my currently-big head will probably tempt me to tighten my ranges.

After browsing through some of the raw data, I wonder how the width of the ranges compares to their “accuracy”. Are test-takers split into two groups (those who have a good idea what the answer is, and those who have never seen the correct number and are simply guessing), or is there a smooth range (everyone “guesses” but the people who know the answer have a tighter range)?

• http://www.investomy.com Antony

From the data table, I suggest the test results could be better interpreted if we remove the questions where the true answer is outside the min and max. When no one taking the test makes a correct guess, it is not about confidence. It may only show that the question fell outside their knowledge and the answer entered was a blind guess.

• Dave

Here’s my problem with this test. First, there exist questions that are directly misleading. For example, I’m betting the average gestation period for any animal chosen at random is nowhere near 645 days. Thus, since I know nothing about animals, I pick a lower value based on the fact that many animals will have a lower gestation period (I may in this case be correct in my 90% confidence interval; note I actually put 700 as my upper threshold, so I got the question correct). Secondly, I think there is a psychological effect going on here. I don’t want to appear stupid if my upper bound value is MILES from the right answer. That being said, I still believe people are overconfident.

