If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Main content

Information entropy

Finally we arrive at our quantitative measure of entropy. Created by Brit Cruise.

Want to join the conversation?

  • male robot hal style avatar for user motogeeeksatyam
    can't get how machine 1 gives more information than machine 2 just because data coming out of it is more uncertain. How uncertainty is related to information. What exactly do we mean by information here. To me information is "what is conveyed or represented by a particular arrangement or sequence of things" are we talking about this information here.

    Also I would like to tell what more information to me means. for that first thing I have to choose is for what I need information. For example I choose that I need information about "Computers" and then two machines gave me information as below:
    Machine 1: computer is an electronic device.
    Machine 2: computer is an electronic device which can perform huge arithmetic operations in fraction of seconds.
    In the above i would say machine 2 is more informative as it gives more detailed information about the subject I wanted to know. How is this related to uncertainty. Is the second sentence produced by machine 2 is more uncertain than machine 1.

    Is this entropy being used in some real world application, if yes then if you could have quoted a real world example in the video it would have been more clear. Why should I know What is entropy? Where it will help me? or where it helps the world to know about entropy
    (14 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user mnedelko
      AHHH! I think I have an answer, but it would be great if the community could verify. The whole question about 'which machine produces more information' could also be rephrased as 'which machine produces letters with less certainty'. The expression 'information' to me here is chosen a little counter intuitively. I see information as something that 'illuminates' a problem. This video on the other hand identified information as the total amount of variables, choices and outcomes the machine produces and which we have to work through to arrive at the right sequence of letters. Seeing as MACHINE 1 requires us to do MORE word by asking MORE questions it produces more 'information'. 'Information' here is not something 'informative' but rather something we need to wade through to get to the right answer.
      (29 votes)
  • aqualine ultimate style avatar for user Thos Os Sal's Strong
    When Brit says #bounces = log(2)(#outcomes), what exactly does that mean? I didn't quite understand that.
    (9 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user michaelyhuang23
      Consider the reverse of the equation, that is
      2^#bounces = #outcomes

      What Brit is referring to is the tree of bouncing coins that he drew.

      After the first bounce, we arrive at 2 equally likely outcomes (either to the right or to the left). After the second bounce, we have 4 equally likely positions that the coin can land (4 equally likely outcomes). In general, after k bounces, we have 2^k equally likely positions that we can land on, thus the equation I wrote above.

      1/p = #outcomes because the probability of each outcome is 1 over the number of outcomes (since each outcome is equally likely in the fair bouncing machine)
      (4 votes)
  • duskpin ultimate style avatar for user Roshni
    At , how could there be 1.75 questions?
    (7 votes)
    Default Khan Academy avatar avatar for user
    • leaf red style avatar for user Noble Mushtak
      This is because we took a weighted sum and got a sum of 1.75.

      We got our weighted sum by taking the sum of the products of the probability of something happening (the weight) by the number of bounces it takes for that to happen (the data).
      The probability of getting A is 0.5 and it takes 1 bounce to get to A. 0.5*1=0.5
      The probability of getting B is 0.125 and it takes 3 bounces to get to B. 0.125*3=0.375
      The probability of getting C is 0.125 and it takes 3 bounces to get to C. 0.125*3=0.375
      The probability of getting D is 0.25 and it takes 2 bounces to get to D. 0.25*2=0.5
      Now we take the average of those products: 0.5+0.375+0.375+0.5=1.75

      I hope this helps!
      (17 votes)
  • old spice man green style avatar for user Gavriel Feria
    According to the rules of logarithms:sum(i=1,n) p_i * -1 log2(p), which means entropy is negative (for n odd)? That's weird . . . .
    (4 votes)
    Default Khan Academy avatar avatar for user
    • male robot hal style avatar for user Cameron
      Entropy is 0 or positive
      Here's why:
      Since p i s a probability we can say: 0 <= p <= 1
      If p =1 then log_2(p) = 0 and -0 is just 0 (not negative)

      For 0 < p < 1
      log_2 of a value >0 and < 1 is negative
      thus log_2(p) is a negative value
      thus -p * log_2(p) is a positive value (a negative value times a negative is positive)

      As a side note -p * log_2(p) = p * log_2 (1/p) if that form seems more intuitive

      Hope this makes sense
      (8 votes)
  • blobby green style avatar for user Ryan
    Why is this considered a measure of information, rather than of information efficiency? Is calling it the first just a different way of saying it's the second? The reason I ask this is that Nick Corrado's answer to the top comment seems to imply it's more like a measure of efficiency, if deleting unnecessary letters means it has more information.
    (6 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user Andreas Klebinger
      Efficiency always depends on the use case. If you delete all unnecessary characters it might contain the same amount of information so be more efficient in information/page but less efficient for a means of communication because it's harder to read.

      Since entropy isn't about the kind or representation of information it's not helpful to think of it as efficiency. Since that would require you to have a use case in mind.
      (2 votes)
  • leaf red style avatar for user Noble Mushtak
    http://en.wikipedia.org/wiki/Digital_Signature_Algorithm#Sensitivity
    The above link describes the DSA algorithm.

    Here, it says that "the entropy, secrecy, and uniqueness of the random signature k is critical." I understand how, given a signature, leaking k would allow someone to solve for the private key, x, but how does the entropy (the number of questions needed to find k, correct?) have to do with hiding x?
    (3 votes)
    Default Khan Academy avatar avatar for user
    • hopper cool style avatar for user Shane Pyman
      Your looking for a high entropy in order to stop someone from randomly guessing what k (or any other value) might be. Encryption of this sort can be broken using brute force hacking (randomly guessing at the answer a lot of times) and the encryption is only as strong as its weakest link. Encryption of this type can always be brute forced if a powerful enough computer is given enough time so a high entropy is needed to increase the time it would take a computer to guess the correct code. This not only increases the security of the given algorithm but also acts as a deterrent (if the answer would take a typical computer 1000 years to guess hopefully that will deter 1000 people from trying for a year)

      if your interested check out this video
      http://www.youtube.com/watch?v=feKCO5BzjFg
      about 18 min in theres an interesting bit on semi-prime numbers that applies
      cheers
      (6 votes)
  • purple pi purple style avatar for user Hannah Elaine
    What is the yellow, Z- shaped symbol at called? I want to look it up so I can learn more about it.
    (2 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user anurag4102
    why is the #outcomes 1/p? please explain
    (4 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user Ryan
    @ Why log_2 of the number of outcomes? I'm not very good with logarithms.
    (3 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user dkm
    At we compute E(# bounces) = P(A)*1 + P(B)*3 + P(C)*3 + P(D)*2. In particular, the number of bounces for the outcome "B" is 3, or (# bounces_B) = 3. Fair enough.

    Then, around , we say (# bounces_i) = log_2(# outcomes_i). So, in order to have (# bounces_B) = 3, we need (# outcomes_B) = 8. But it seems there are only two "outcomes at that level:" B and C. What am I missing?
    (3 votes)
    Default Khan Academy avatar avatar for user

Video transcript

Voiceover: Imagine two machines. They both output messages from an alphabet of A, B, C, or D. Machine One generates each symbol randomly, they all occur 25% of the time, while Machine Two generates symbols according to the following probabilities. Which machine is producing more information? Claude Shannon cleverly rephrased the question. If you had to predict the next symbol from each machine, what is the minimum number of yes or no questions you would expect to ask? Let's look at Machine One. The most efficient way is to pose a question which divides the possibilities in half. For example, our first question, we could ask if it is any two symbols, such as "is it A or B?", since there is a 50% chance of A or B and a 50% chance of C or D. After getting the answer, we can eliminate half of the possibilities, and we will be left with two symbols, both equally likely. So we simply pick one, such as "is it A?", and after this second question, we will have correctly identified the symbol. We can say the uncertainty of Machine One is two questions per symbol. What about Machine Two? As with Machine One, we could ask two questions to determine the next symbol. However this time, the probability of each symbol is different, so we can ask our questions differently. Here A has a 50% chance of occurring, and all other letters add to 50%. We could start by asking "is it A?", if it is A we are done, only one question in this case. Otherwise, we are left with two equal outcomes, D or, B and C We could ask, "is it D?". If yes, we are done with two questions. Otherwise, we have to ask a third question to identify which of the last two symbols it is. On average, how many questions do you expect to ask, to determine a symbol from Machine Two? This can be explained nicely with an analogy. Let's assume instead we want to build Machine One and Machine Two, and we can generate symbols by bouncing a disc off a peg in one of two equally likely directions. Based on which way it falls, we can generate a symbol. With Machine One, we need to add a second level, or a second bounce, so that we have two bounces, which lead to four equally likely outcomes. Based on where the disc lands, we output A, B, C, or D. Now Machine Two. In this case, the first bounce leads to either an A, which occurs 50% of the time, or else we lead to a second bounce, which then can either output a D, which occurs 25% of the time, or else it leads to a third bounce, which then leads to either B or C, 12.5% of the time. Now we just take a weighted average as follows. The expected number of bounces is the probability of symbol A times one bounce, plus the probability of B times three bounces, plus the probability of C times three bounces, plus the probability of D times two bounces. This works out to 1.75 bounces. Notice the connection between yes or no questions and fair bounces. The expected number of questions is equal to the expected number of bounces. So Machine One requires two bounces to generate a symbol, while guessing an unknown symbol requires two questions. Machine two requires 1.75 bounces. We need to ask 1.75 questions on average, meaning if we need to guess a hundred symbols from both machines, we can expect to ask 200 questions for Machine One, and 175 questions for Machine Two. This means that Machine Two is producing less information because there is less uncertainty, or surprise, about it's output, and that's it. Claude Shannon calls this measure of average uncertainty "entropy", and he uses the letter H to represent it. The unit of entropy Shannon chooses, is based on the uncertainty of a fair coin flip, and he calls this "the bit", which is equivalent to a fair bounce. We can arrive at the same result using our bounce analogy. Entropy or H is the summation for each symbol, of the probability of that symbol times the number of bounces. The difference is, how do we express number of bounces in a more general way? As we've seen, number of bounces depends how far down the tree we are. We can simplify this by saying that the number of bounces equals the logarithm base two of the number of outcomes at that level. The number of outcomes at a level, is also based on the probability, where the number of outcomes at a level equals one divided by the probability of that outcome. Number of bounces actually equals the logarithm base two of one over the probability of that symbol, which gives us our final equation. Entropy or H, is the summation for each symbol of the probability of that symbol times the logarithm base two of one over the probability of that symbol. Shannon writes this slightly different, which just inverts the expression inside the logarithm which causes us to add a negative, though both formulas give the same result. Let's summarize. Entropy is maximum when all outcomes are equally likely. Any time you move away from equally likely outcomes, or introduce predictability, the entropy must go down. The fundamental idea is that, if the entropy of an information source drops, that means we can ask fewer questions to guess the outcome. Thanks to Shannon, the bit, which is the unit of entropy, is adopted as our quantitative measure of information, or measure of surprise.