If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Main content

Lossless bit compression

AP.CSP:
DAT‑1 (EU)
,
DAT‑1.D (LO)
,
DAT‑1.D.1 (EK)
,
DAT‑1.D.2 (EK)
,
DAT‑1.D.3 (EK)
Computers represent all data in binary, so all types of files, from text to images to videos, are ultimately sequences of bits. Regardless of whether the bits represent a document or a GIF, computers can use a bit compression technique called Huffman coding.

Huffman coding algorithm

Let's see how it works with a simple textual example. This example language uses only 4 different characters, and yet is incredibly important to us: it's the language used to represent DNA and is made up of sequences of four characters A, C, G and T.
For example, the 4.6 million characters representing an E.coli DNA sequence happens to start with:
agcttttcattct
Since we need to represent four characters, a computer would typically represent each character using 2 bits, such as:
characterbinary code
a00
c01
g10
t11
The 13 characters above would be written using 26 bits as follows - notice that we don't need gaps between the codes for each bits.
00100111111111010011110111
But we can do better than this. In the short sample text above the letter "t" is more common than the other letters ("t" occurs 7 times, "c" 3 times, "a" twice, and "g" just once). If we give a shorter code to "t", then we'd be using less space 54% of the time (7 out of 13 characters). For example, we could use the codes:
characterbinary code
a010
c00
g011
t1
Then our 13 characters would be coded as:
0100110011110001011001
That's just 22 bits, four less bits than our original encoding. That may not seem like a lot, but imagine if we used an optimization like that on the entire 4.6 million characters of the DNA!

Huffman decoding

You might be scratching your head at the new binary codes we're using, with all different lengths. Is it still possible to decode it reliably? Yes, with the right set of codes.
Try it yourself!
Decode the following bits using the optimized binary codes.
111001
characterbinary code
a010
c00
g011
t1
Make sure you start at the first bit on the left, and match up the codes from left to right. What DNA string do you come up with?
Choose 1 answer:
Choose 1 answer:

That's the beauty of Huffman coding: the algorithm gives us a way to come up with a set of binary codes for a given sequence that ensures the data can be reconstructed unambiguously and reliably.

Uses for Huffman coding

Many file formats utilize some kind of Huffman coding to reduce the size of their file. Fax machines use Huffman coding after using RLE on the black and white runs. PNG images compress using LZ77, an algorithm similar to the text compression technique we learned, combined with Huffman coding on the results.

🙋🏽🙋🏻‍♀️🙋🏿‍♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!

Want to join the conversation?

  • hopper cool style avatar for user Alex Ewart
    With Huffman coding, does it take every 2 bits, so 00, 01, 10, or 11, convert them to a, g, t, or c, and then re-convert them to binary as 1, 00, 010, and 001 based on which appears most often? What if the letters appear the same amount of times so that Huffman coding expands it rather than compressing?
    (7 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user Baraka Mujtaba
    Hi. Why are A and G given one more bit?
    (4 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user Andrew Cook
      A and G are assigned one more bit because they are the least common letters in the sequence.

      If they were only assigned 2 bits, there would be confusion when decoding messages. Imagine we had the following code:
      a - 01
      c - 00
      g - 10
      t - 1


      If I asked you to decode the message 101, it would not be clear whether the original message was ta or gt.
      (27 votes)
  • hopper happy style avatar for user Fredrick Nganga
    Hello, is the Huffman optimized binary codes universally standard?
    (3 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user Romeo Jeng
      Hi Fredrick,

      I don't think the Huffman codes are universally standard. I did a short search and found a short snippet of the rules used in the coding:

      1. The code length of a character depends on how frequently it occurs in the given text.
      2. The character which occurs most frequently gets the smallest code.
      3. The character which occurs least frequently gets the largest code.

      Based on the above, it seems to me that the Huffman encoding is done on-the-fly, so the same characters would be stored as different codes depending on its frequency of appearance in a file.

      The file can then come with metadata that provides info on what codes represent what characters. This is just my guess, but I hope it helps!
      (10 votes)
  • blobby green style avatar for user NAVEED RIAZ
    So a huffman coded file will always have a decode algorithm also?
    (1 vote)
    Default Khan Academy avatar avatar for user
    • aqualine ultimate style avatar for user Martin
      I'm not sure I understand you correctly.
      With a Huffman code, you need a translation table to convert the Huffman code into something readable. In fact, a translation table or an algorithm to decode a specific coded message is always needed, no matter the code.
      So if you want to read a code someone needs to either give you a means of decoding the code or you have to break the code. Breaking the code is in some cases really easy and in others almost impossible.
      (9 votes)
  • leaf red style avatar for user layaz7717
    Why wasn't one utilized for one of the remaining letters ((b, c, or d-in the DNA section)? Wouldn't that make the code even shorter?
    (1 vote)
    Default Khan Academy avatar avatar for user
    • aqualine ultimate style avatar for user Martin
      Sometimes the best way to see why is something is done the way it's done is to try to do it differently.

      So we have a, c, g and t and we want to encode them. So Pamela always used two bits in the first example, so let's try something else.

      a - 0
      c - 1
      g - 00
      t - 01

      So this should give use shorter codes, so let's encode the string aaggct, that gives us

      000000101 instead of

      000010100111 with the original code

      so the new code is a lot shorter.

      So now send the codeword and the encoding table to someone (that someone being ourselves) and we decode, that gives us

      aaaaaacac or gggct or ggaact or gaagcac or ... (a lot more possibilities) using the new code
      or just
      aaggct with the original.

      You see while the new code is shorter it's also not very good because decoding it can create different results.

      With the code in the first example, you always know that two bits are one letter, so there is no way to go about decoding it the wrong way.

      With the Huffman code, that one is "prefix-free" (a very important characteristic), no codeword has a prefix that could be another code word. So if you have the codeword 1 you never have the codeword 11 or 111 or something similar. That way decoding it will always give you the same result.
      That's why its such a great encoding scheme, its very short, but still prevents coding errors.
      (7 votes)
  • aqualine ultimate style avatar for user Huzaifah A.
    I did not understand how a DNA code can be used in computing, when it is a biological term?
    (1 vote)
    Default Khan Academy avatar avatar for user
    • aqualine ultimate style avatar for user Martin
      Adenine (A), cytosine (C), guanine (G), thymine(T) are the building blocks of DNA. If you're doing research on DNA you can represent those biological structures on your computer to do research.
      Here those blocks are introduced as example of the advantages of compression, by using the Huffman encoding you can reduce the amount of space the DNA strings require.
      (5 votes)
  • blobby green style avatar for user Lorenzo Hess
    Why are we able to represent a,c,t,g using 1, 2, or 3 bits, instead of 2 bits each? If we need to represent 4 characters with 2 bits each, don't we always have to include 2 bits to represent the characters? What allows Huffman compression to assign a single bit to a character?
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      When choosing a set of binary codes (whose lengths are unknown during decompression) for a set of characters, the only rule we have to follow is that no code is a prefix for another code (i.e. no code appears at the beginning of another code). 2 bits is the minimum number of bits required to be able to have 4 codes of equal length, however, we could also choose 4 codes that are 6 bits each or codes that are {3, 4, 5, 6} bits long. If we want to, we can even make one of the codes 1 or 0 as long as that bit does not appear at the beginning of any other code.
      (2 votes)
  • blobby green style avatar for user Daiman Webb
    How is this any better? You have 2 binary bits as opposed to a single letter. What makes this represent less data if not more?
    (1 vote)
    Default Khan Academy avatar avatar for user
    • aqualine ultimate style avatar for user Martin
      It depends on how long your encoded string is, it is actually possible for the encoding to be impractical.

      But in this case you have to keep in mind, that computer only reads 1 and 0. So a doesn't exist, so you need to find a way to express as a binary string.
      (3 votes)
  • leafers ultimate style avatar for user MIHINI
    What does a, g, c, and t stand for?
    (1 vote)
    Default Khan Academy avatar avatar for user
  • male robot johnny style avatar for user Aditya Chauhan
    that means in cases where it is impossible to read the file without the table can be transfered without the table linkd to it and the table is already at the users side which can make the tansfer really secure.

    correct me please if i m wrong!
    (1 vote)
    Default Khan Academy avatar avatar for user