This tutorial is a shortened version of a series of blog posts on probabilistic thinking.
Probabilistic question answering
The usual form of a multiple-choice question consists of a short question, sometimes a brief text clarifying the technical terms it uses, and then a list of answers from which you must choose one (for instance by ticking a box). We will go through an example of such a question and the most common thought process when answering it, and then show how this differs from the probabilistic point of view.
The question is the following: "Which of these planets is closest to the sun?". Four possible answers are listed: "Earth", "Mars", "Mercury", "Venus". Perhaps you knew the correct answer and went straight for the corresponding box. Or maybe your last astronomy class was a while ago and you are not entirely sure of the answer: for instance you remember that Mercury and Venus are the first two, but are not sure of their order. In this case, you would rapidly rule out the Earth and Mars answers and tick one of the other two at random.
When answering probabilistically, things happen differently. You must assign a probability to each answer. In the first step, you recall that Mercury and Venus are the first two, but you can't immediately rule out the other two: you must still assign a precise value to each answer. You can view this as reserving a proportion of your "budget" for the two answers that you think are most likely to be incorrect. How much you want to reserve (20%, 10%, 5%?) depends on how sure you are that neither of those answers is correct, that is, on how well you remember that Mercury and Venus are the two closest planets to the sun. If you have decided that you are 90% sure that the answer is either Mercury or Venus, it remains to allocate this budget between the two answers: will it be 45% and 45%, or rather 60% and 30%? There are many more possibilities available than when there was a single box to tick, each corresponding to a different state of knowledge about the relative ordering of the two planets. Answering may take a little longer than in the usual form.
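As a concrete illustration, the two allocations mentioned above could be written down as probability distributions over the four answers (the numbers are just the ones from the example, not a recommendation):

    # Two possible probabilistic answers for the example question,
    # both reserving 10% for the answers believed to be incorrect.
    no_preference   = {"Earth": 0.05, "Mars": 0.05, "Mercury": 0.45, "Venus": 0.45}
    mercury_leaning = {"Earth": 0.05, "Mars": 0.05, "Mercury": 0.60, "Venus": 0.30}

    # Each allocation must use the whole budget, i.e. sum to 1.
    assert abs(sum(no_preference.values()) - 1.0) < 1e-9
    assert abs(sum(mercury_leaning.values()) - 1.0) < 1e-9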
Logarithmic scoring
Choosing which probability to assign to each answer is hard. The entire purpose of the experiment is to train yourself to do it intuitively, by simply trying to maximize your score without thinking about anything else. Your score for a question is the logarithm of the probability you assigned to the correct answer, and for a sequence of questions your total score is the sum of the individual scores. Here is a different way to view it.
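In code, the scoring rule can be sketched as follows (a minimal illustration, not the software's actual implementation; the predictions below are made up):

    import math

    def question_score(prediction, correct_answer):
        # Score for one question: log of the probability assigned to the correct answer.
        return math.log(prediction[correct_answer])

    def total_score(predictions, correct_answers):
        # Score for a sequence of questions: sum of the individual scores.
        return sum(question_score(p, a) for p, a in zip(predictions, correct_answers))

    # Example: 90% on the correct answer of the first question, 45% on the second.
    predictions = [{"Mercury": 0.90, "Venus": 0.10},
                   {"Earth": 0.05, "Mars": 0.05, "Mercury": 0.45, "Venus": 0.45}]
    print(total_score(predictions, ["Mercury", "Mercury"]))  # log(0.9) + log(0.45) ≈ -0.90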
You get a fixed number of tokens before the game begins, and you will have to (probabilistically) answer a sequence of questions. Each submission consists of dividing your remaining tokens among the possible answers, the probability you assign to each answer corresponding to the proportion of tokens attached to it. Once the correct answer is revealed, you lose all tokens except those attached to the correct answer, and for the next question you redistribute only the remaining tokens, and so on. From this point of view, the logarithm acts only as a trick to make scores additive instead of multiplicative: the proportion of the initial tokens that remain at any time is the exponential of your score.
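A small simulation makes the equivalence concrete (again only a sketch, with made-up numbers): after each question you keep only the fraction of tokens placed on the correct answer, and the fraction of the initial tokens you still hold equals the exponential of the summed log scores.

    import math

    initial_tokens = 1000.0
    tokens = initial_tokens
    log_score = 0.0

    # Probability placed on the correct answer of each question (made-up values).
    probs_on_correct = [0.9, 0.45, 0.6]

    for p in probs_on_correct:
        tokens *= p               # keep only the tokens placed on the correct answer
        log_score += math.log(p)  # additive version of the same information

    # The two bookkeeping methods agree: remaining fraction = exp(total log score).
    assert abs(tokens / initial_tokens - math.exp(log_score)) < 1e-9
    print(tokens, math.exp(log_score))  # 243.0 tokens left, i.e. a fraction of 0.243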
This scoring rule incentivizes truth-telling. If you don't know the answer, then you will maximize your expected score by accurately reporting your uncertainty (see the blog posts for a proof). This can be understood intuitively through the drastic consequences of assigning a low probability to the correct answer: you lose a large majority of your tokens at once. If you place too many tokens on an answer that you consider unlikely, you will have fewer tokens to place on the likely answers and will run out of them faster. Conversely, if you do not place enough tokens on a likely answer, you will be punished and lose all the rest. In order to minimize your loss of tokens, you must place sufficiently many tokens on every answer that is likely, which over time will push you to distribute your tokens in a way that accurately reflects your uncertainty.
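The following sketch illustrates this numerically rather than proving it: with a true belief of 70%/30% between two answers (arbitrary values chosen for the example), the expected log score is highest when the reported probabilities match that belief.

    import math

    belief = {"Mercury": 0.7, "Venus": 0.3}  # what you actually think

    def expected_score(report, belief):
        # Expected log score if the correct answer is drawn according to `belief`
        # while you report the probabilities in `report`.
        return sum(belief[a] * math.log(report[a]) for a in belief)

    truthful = expected_score(belief, belief)                          # ≈ -0.61
    shifted  = expected_score({"Mercury": 0.9, "Venus": 0.1}, belief)  # ≈ -0.77
    hedged   = expected_score({"Mercury": 0.5, "Venus": 0.5}, belief)  # ≈ -0.69
    print(truthful, shifted, hedged)  # the truthful report gives the highest expected score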
Calibration
After answering many questions, we can start to measure a different property of your answers. We gather all answers to which you assigned a probability of 10%, and count the proportion of those answers that were correct. If fewer than 10% of them were correct, it means you overestimated their chance of being correct and should have predicted a lower value. Conversely, if more than 10% of them were correct, then you underestimated it and should have predicted a higher value. If the probability you predict consistently matches the proportion of answers that turn out to be correct, we say your answers are calibrated. Good calibration is something you should strive for, not just because it will marginally improve your score, but more importantly because it will make it much easier for others to use your predictions effectively.
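The measurement described above can be sketched as follows (a minimal version; real calibration reports typically group nearby probabilities into bins, and the history below is invented):

    from collections import defaultdict

    # (probability assigned to an answer, whether that answer turned out to be correct)
    history = [(0.10, False), (0.10, False), (0.10, True),
               (0.90, True), (0.90, True), (0.90, False)]

    groups = defaultdict(list)
    for prob, correct in history:
        groups[prob].append(correct)

    for prob, outcomes in sorted(groups.items()):
        observed = sum(outcomes) / len(outcomes)
        # Calibrated answers have an observed frequency close to the predicted probability.
        print(f"predicted {prob:.0%}, observed {observed:.0%} over {len(outcomes)} answers")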
Frequently Asked Questions
What does it mean to assign a probability to a past event?
If it bothers you to speak about probabilities for events that have already happened, you can stick with the token interpretation: you are given a fixed number of tokens and must allocate tokens to each answer, before losing all tokens placed on incorrect answers, no probabilities involved. If you'd like to dig deeper into the probabilistic interpretation, the "true probability" you have in mind for future events corresponds to aleatoric uncertainty. In contrast, this experiment is more concerned with epistemic uncertainty, which stems from what you don't know about the answer (valid also for past events) rather than what cannot be known about the answer (valid only for future events).
What if I have no idea about the answer?
It can be hard to quantify how certain you are about an answer you think is correct, but it is often even harder to quantify your uncertainty when you think you don't know the correct answer. The trap to avoid is thinking there is a one-size-fits-all "I don't know" answer. Every question comes with a reference prediction, usually with the same probability for every possible answer, but this reference is chosen at question creation to provide a reference score; it does not constitute a know-nothing prediction that you can fall back on when you're clueless. The reason for this is that you can never really know nothing about the answer; it is only hard to quantify what you know when you know little, but that is also where you can most easily make progress. The question is not written in an alien language you can't understand, and the answers are not interchangeable: your task is to carefully examine each answer and assess its plausibility. This is a long, tedious, and sometimes stressful process, but questions for which you are unsure of the answer are those where your score will be lowest, and even small variations in your prediction can have a large compounded impact on your overall score.
What if the question is ill-posed or ambiguous?
To improve your score on a question where you are almost certain of the correct answer, you must put almost all of the probability on that answer, leaving very little on the others. This means that if the answer marked as correct in the software is wrong, you will incur a huge drop in score, as though you had been very sure of an incorrect answer, and the logarithmic scoring rule is very punishing in that regard. This can happen when there is an unexpected problem during question creation (typos happen), but also, more importantly and much more insidiously, when there is a fundamental flaw in the question. For instance, if a question is not factual in nature and the person who created it is very opinionated, the "correctness" of the chosen answer is questionable, and even more so if the statement of the question is ambiguous. The software cannot distinguish these cases, so you should try to predict the answer that was marked correct by the person creating the question, even if that answer is not, properly speaking, "correct" under a different interpretation, which might sometimes require leveraging side information about the person asking the question. Hopefully the safeguards in place prevent most such cases.
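To get a feel for the size of the penalty, here is a quick computation with arbitrary numbers: being 98% sure of the true answer while the remaining 2% is the one marked correct costs you 98% of your tokens on a single question.

    import math

    score_confident_right = math.log(0.98)  # ≈ -0.02: almost no tokens lost
    score_confident_wrong = math.log(0.02)  # ≈ -3.91: only 2% of your tokens survive
    print(score_confident_right, score_confident_wrong)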
How do I improve my calibration ?
Your calibration will improve naturally with time. Simply being aware of the concept of calibration when forming your predictions, and trying to maximize your score, should be sufficient to gradually improve your calibration as you answer more and more questions. People are most often overconfident at the start, which leads them to incorrectly assign very low probabilities to rather likely events and to lose a large number of points when such an event occurs. After this punishment for overconfidence, the fear of losing that many points again should naturally curb the behavior and steer you in the right direction. You can check your current calibration curves and get more targeted tips for improving them on your profile page.