# Information entropy

## Video transcript

Voiceover: Imagine two machines. They both output messages from an alphabet of A, B, C, or D. Machine One generates each symbol randomly; they all occur 25% of the time. Machine Two generates symbols according to the following probabilities: A occurs 50% of the time, D 25% of the time, and B and C each 12.5% of the time. Which machine is producing more information? Claude Shannon cleverly rephrased the question: if you had to predict the next symbol from each machine, what is the minimum number of yes-or-no questions you would expect to ask?

Let's look at Machine One. The most efficient way is to pose a question which divides the possibilities in half. For example, our first question could cover any two symbols, such as "is it A or B?", since there is a 50% chance of A or B and a 50% chance of C or D. After getting the answer, we can eliminate half of the possibilities, and we will be left with two symbols, both equally likely. So we simply pick one, such as "is it A?", and after this second question we will have correctly identified the symbol. We can say the uncertainty of Machine One is two questions per symbol.

What about Machine Two? As with Machine One, we could ask two questions to determine the next symbol. However, this time the probability of each symbol is different, so we can ask our questions differently. Here A has a 50% chance of occurring, and all the other letters add up to 50%. We could start by asking "is it A?". If it is A, we are done, with only one question in this case. Otherwise, we are left with two equally likely outcomes: D, or B and C. We could then ask, "is it D?". If yes, we are done with two questions. Otherwise, we have to ask a third question to identify which of the last two symbols it is. On average, how many questions do you expect to ask to determine a symbol from Machine Two?

This can be explained nicely with an analogy. Let's assume instead we want to build Machine One and Machine Two, and we can generate symbols by bouncing a disc off a peg in one of two equally likely directions.
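The questioning strategy for Machine Two can be sketched in a few lines of Python (a minimal illustration of the transcript's strategy; the function name is ours, not Shannon's):

```python
# Number of yes/no questions the strategy above asks to identify each
# symbol from Machine Two: "is it A?", then "is it D?", then a final
# question to separate B from C.
def questions_needed(symbol):
    if symbol == "A":
        return 1  # answered by the first question
    if symbol == "D":
        return 2  # "is it A?" no, then "is it D?" yes
    return 3      # a third question separates B from C

for s in "ABCD":
    print(s, questions_needed(s))
```

Common symbols are resolved with few questions and rare ones with more, which is exactly what makes the average come out below two.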
Based on which way it falls, we can generate a symbol. With Machine One, we need to add a second level, or a second bounce, so that we have two bounces, which lead to four equally likely outcomes. Based on where the disc lands, we output A, B, C, or D.

Now Machine Two. In this case, the first bounce leads either to an A, which occurs 50% of the time, or to a second bounce, which can either output a D, which occurs 25% of the time, or lead to a third bounce, which then leads to either B or C, each 12.5% of the time. Now we just take a weighted average, as follows: the expected number of bounces is the probability of symbol A times one bounce, plus the probability of B times three bounces, plus the probability of C times three bounces, plus the probability of D times two bounces. This works out to 1.75 bounces.

Notice the connection between yes-or-no questions and fair bounces. The expected number of questions is equal to the expected number of bounces. Machine One requires two bounces to generate a symbol, and guessing an unknown symbol requires two questions. Machine Two requires 1.75 bounces, so we need to ask 1.75 questions on average. This means that if we need to guess a hundred symbols from both machines, we can expect to ask 200 questions for Machine One and 175 questions for Machine Two. Machine Two is therefore producing less information, because there is less uncertainty, or surprise, about its output, and that's it. Claude Shannon calls this measure of average uncertainty "entropy", and he uses the letter H to represent it. The unit of entropy Shannon chooses is based on the uncertainty of a fair coin flip, and he calls this "the bit", which is equivalent to a fair bounce. We can arrive at the same result using our bounce analogy. Entropy, or H, is the summation, for each symbol, of the probability of that symbol times the number of bounces. The difference is: how do we express the number of bounces in a more general way?
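The weighted average above can be checked directly; here is a quick sketch using the probabilities and bounce counts from the transcript:

```python
# Probability of each Machine Two symbol, and the number of fair bounces
# (equivalently, yes/no questions) needed to reach it in the tree.
probabilities = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}
bounces = {"A": 1, "B": 3, "C": 3, "D": 2}

expected_bounces = sum(probabilities[s] * bounces[s] for s in probabilities)
print(expected_bounces)  # 1.75
```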
As we've seen, the number of bounces depends on how far down the tree we are. We can simplify this by saying that the number of bounces equals the logarithm base two of the number of outcomes at that level. The number of outcomes at a level is also based on the probability: the number of outcomes at a level equals one divided by the probability of that outcome. The number of bounces therefore equals the logarithm base two of one over the probability of that symbol, which gives us our final equation: entropy, or H, is the summation, for each symbol, of the probability of that symbol times the logarithm base two of one over the probability of that symbol. Shannon writes this slightly differently, inverting the expression inside the logarithm, which introduces a negative sign; both formulas give the same result.

Let's summarize. Entropy is maximum when all outcomes are equally likely. Any time you move away from equally likely outcomes, or introduce predictability, the entropy must go down. The fundamental idea is that if the entropy of an information source drops, we can ask fewer questions to guess the outcome. Thanks to Shannon, the bit, which is the unit of entropy, is adopted as our quantitative measure of information, or measure of surprise.
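The final formula can be written out as a small function (a sketch; `entropy` is our name for it, and symbols with zero probability are skipped since they contribute nothing):

```python
import math

def entropy(probabilities):
    """Shannon entropy: H = sum over symbols of p * log2(1/p), in bits per symbol."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

# Machine One: four equally likely symbols.
# Machine Two: A 50%, D 25%, B and C 12.5% each.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```

Note that the results match the question-counting argument exactly: two questions per symbol for Machine One, 1.75 for Machine Two.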