
Video transcript

Imagine two machines. They both output messages from an alphabet of A, B, C, or D. Machine 1 generates each symbol randomly: they all occur 25% of the time. Machine 2 generates symbols according to the following probabilities. Now, which machine is producing more information? Claude Shannon cleverly rephrased the question: if you had to predict the next symbol from each machine, what is the minimum number of yes-or-no questions you would expect to ask?

Let's look at Machine 1. The most efficient way is to pose a question which divides the possibilities in half. For example, our first question could ask about any two symbols, such as "Is it A or B?", since there is a 50% chance of A or B and a 50% chance of C or D. After getting the answer, we can eliminate half of the possibilities, and we will be left with two symbols, both equally likely. So we simply pick one, such as "Is it A?", and after this second question we will have correctly identified the symbol. So we can say the uncertainty of Machine 1 is two questions per symbol.

Now what about Machine 2? As with Machine 1, we could ask two questions to determine the next symbol. However, this time the probability of each symbol is different, so we can ask our questions differently. Here A has a 50% chance of occurring, and all other letters add up to 50%. So we could start by asking "Is it A?" If it is A, we are done, with only one question in this case. Otherwise we are left with two equal outcomes: D, or B and C. So we could ask "Is it D?" If yes, we are done with two questions. Otherwise we have to ask a third question to identify which of the last two symbols it is. On average, how many questions do you expect to ask to determine a symbol from Machine 2?

This can be explained nicely with an analogy. Let's assume instead we want to build Machine 1 and Machine 2, and we can generate symbols by bouncing a disk off a peg into one of two equally likely directions. Based on which way it falls, we can generate a symbol. So with Machine 1 we need to add a second level, or a second bounce, so that we have two bounces which lead to four equally likely outcomes, and based on where the disk lands, we output A, B, C, or D. Now for Machine 2: in this case, the first bounce leads either to an A, which occurs 50% of the time, or to a second bounce, which then can either output a D, which occurs 25% of the time, or lead to a third bounce, which then leads to either B or C, each 12.5% of the time. So now we just take a weighted average as follows: the expected number of bounces is the probability of symbol A times one bounce, plus the probability of B times three bounces, plus the probability of C times three bounces, plus the probability of D times two bounces. This works out to 1.75 bounces.

Now notice the connection between yes-or-no questions and fair bounces. The expected number of questions is equal to the expected number of bounces. So Machine 1 requires two bounces to generate a symbol, just as guessing an unknown symbol requires two questions. Machine 2 requires 1.75 bounces, and we need to ask 1.75 questions on average. This means that if we need to guess a hundred symbols from both machines, we can expect to ask 200 questions for Machine 1 and 175 questions for Machine 2. So Machine 2 is producing less information, because there is less uncertainty, or surprise, about its output.

And that's it. Claude Shannon calls this measure of average uncertainty entropy, and he uses the letter H to represent it. The unit of entropy Shannon chooses is based on the uncertainty of a fair coin flip, and he calls this the bit, which is equivalent to a fair bounce. We can arrive at the same result using our bounce analogy: entropy, or H, is the summation, for each symbol, of the probability of that symbol times the number of bounces. Now the question is how to express the number of bounces in a more general way. As we've seen, the number of bounces depends on how far down the tree we are, and we can simplify this by saying that the number of bounces equals the logarithm base 2 of the number of outcomes at that level. The number of outcomes at a level is in turn based on the probability: the number of outcomes at a level equals one divided by the probability of that outcome. So the number of bounces actually equals the logarithm base 2 of one over the probability of that symbol, which gives us our final equation: entropy, or H, is the summation, for each symbol, of the probability of that symbol times the logarithm base 2 of one over the probability of that symbol. Shannon writes this slightly differently, inverting the expression inside the logarithm, which causes us to add a negative sign, though both formulas give the same result.

So let's summarize. Entropy is maximum when all outcomes are equally likely. Any time you move away from equally likely outcomes, or introduce predictability, the entropy must go down. Now, the fundamental idea is that if the entropy of an information source drops, that means we can ask fewer questions to guess the outcome. And thanks to Shannon, the bit, which is the unit of entropy, is adopted as our quantitative measure of information, or measure of surprise.
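The final equation above can be checked with a few lines of Python. This is a minimal sketch, not part of the video; the `entropy` helper is a direct translation of H = sum over symbols of p times log base 2 of (1/p), and the Machine 2 probabilities (A = 50%, D = 25%, B = C = 12.5%) are the ones stated in the transcript.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over each symbol."""
    return sum(p * log2(1 / p) for p in probs.values() if p > 0)

# Machine 1: all four symbols equally likely.
machine_1 = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Machine 2: the skewed distribution from the transcript.
machine_2 = {"A": 0.5, "B": 0.125, "C": 0.125, "D": 0.25}

print(entropy(machine_1))  # 2.0 bits: two yes-or-no questions per symbol
print(entropy(machine_2))  # 1.75 bits: 1.75 questions per symbol on average
```

The two results match the bounce analogy exactly: Machine 1 needs 2 questions per symbol, Machine 2 only 1.75, which is why Machine 2 carries less information per symbol.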