Main content

### Course: AP®︎/College Computer Science Principles > Unit 1

Lesson 6: Lossless data compression# Lossless bit compression

Computers represent all data in binary, so all types of files, from text to images to videos, are ultimately sequences of bits. Regardless of whether the bits represent a document or a GIF, computers can use a bit compression technique called

**Huffman coding**.### Huffman coding algorithm

Let's see how it works with a simple textual example. This example language uses only 4 different characters, and yet is incredibly important to us: it's the language used to represent DNA and is made up of sequences of four characters A, C, G and T.

For example, the 4.6 million characters representing an E.coli DNA sequence happens to start with:

`agcttttcattct`

Since we need to represent four characters, a computer would typically represent each character using 2 bits, such as:

character | binary code |
---|---|

a | 00 |

c | 01 |

g | 10 |

t | 11 |

The 13 characters above would be written using 26 bits as follows - notice that we don't need gaps between the codes for each bits.

`00100111111111010011110111`

But we can do better than this. In the short sample text above the letter "t" is more common than the other letters ("t" occurs 7 times, "c" 3 times, "a" twice, and "g" just once). If we give a shorter code to "t", then we'd be using less space 54% of the time (7 out of 13 characters). For example, we could use the codes:

character | binary code |
---|---|

a | 010 |

c | 00 |

g | 011 |

t | 1 |

Then our 13 characters would be coded as:

`0100110011110001011001`

That's just 22 bits, four less bits than our original encoding. That may not seem like a lot, but imagine if we used an optimization like that on the entire 4.6 million characters of the DNA!

### Huffman decoding

You might be scratching your head at the new binary codes we're using, with all different lengths. Is it still possible to decode it reliably? Yes, with the right set of codes.

That's the beauty of Huffman coding: the algorithm gives us a way to come up with a set of binary codes for a given sequence that ensures the data can be reconstructed unambiguously and reliably.

### Uses for Huffman coding

Many file formats utilize some kind of Huffman coding to reduce the size of their file. Fax machines use Huffman coding after using RLE on the black and white runs. PNG images compress using LZ77, an algorithm similar to the text compression technique we learned, combined with Huffman coding on the results.

🙋🏽🙋🏻♀️🙋🏿♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!

## Want to join the conversation?

- Hi. Why are A and G given one more bit?(9 votes)
`A`

and`G`

are assigned one more bit because they are the least common letters in the sequence.

If they were only assigned 2 bits, there would be confusion when decoding messages. Imagine we had the following code:`a - 01`

c - 00

g - 10

t - 1

If I asked you to decode the message`101`

, it would not be clear whether the original message was`ta`

or`gt`

.(62 votes)

- With Huffman coding, does it take every 2 bits, so 00, 01, 10, or 11, convert them to a, g, t, or c, and then re-convert them to binary as 1, 00, 010, and 001 based on which appears most often? What if the letters appear the same amount of times so that Huffman coding expands it rather than compressing?(12 votes)
- Hello, is the Huffman optimized binary codes universally standard?(4 votes)
- Hi Fredrick,

I don't think the Huffman codes are universally standard. I did a short search and found a short snippet of the rules used in the coding:

1. The code length of a character depends on how frequently it occurs in the given text.

2. The character which occurs most frequently gets the smallest code.

3. The character which occurs least frequently gets the largest code.

Based on the above, it seems to me that the Huffman encoding is done on-the-fly, so the same characters would be stored as different codes depending on its frequency of appearance in a file.

The file can then come with metadata that provides info on what codes represent what characters. This is just my guess, but I hope it helps!(13 votes)

- Could a RLE encoder also be used for DNA where encode

abcdddacddd as

1,1,1,3,1,0,1,3 for example? Could you combine the two encoding methods?

Or am I confusing two different things?(2 votes)- Yes, you could use RLE encoding to represent DNA sequences. (Personally, I don't think it would be very effective given that there are not likely to be many long repetitions of the same character in the sequence though.)

Yes, you can combine RLE and Huffman coding.

Let's take the example you gave, "abcdddacddd". The RLE encoding would be "1,1,1,3,1,0,1,3" which would be represented as "0101011101000111" in binary. Then, we can use Huffman coding with the following codes; 0 = 100, 1 = 0, 2 = 111, 3 = 110. This gives the final binary representation of "00011001000110". After using the Huffman codes, we save another 2 bits.(12 votes)

- Why wasn't one utilized for one of the remaining letters ((b, c, or d-in the DNA section)? Wouldn't that make the code even shorter?(1 vote)
- Sometimes the best way to see why is something is done the way it's done is to try to do it differently.

So we have a, c, g and t and we want to encode them. So Pamela always used two bits in the first example, so let's try something else.

a - 0

c - 1

g - 00

t - 01

So this should give use shorter codes, so let's encode the string aaggct, that gives us

000000101 instead of

000010100111 with the original code

so the new code is a lot shorter.

So now send the codeword and the encoding table to someone (that someone being ourselves) and we decode, that gives us

aaaaaacac or gggct or ggaact or gaagcac or ... (a lot more possibilities) using the new code

or just

aaggct with the original.

You see while the new code is shorter it's also not very good because decoding it can create different results.

With the code in the first example, you always know that two bits are one letter, so there is no way to go about decoding it the wrong way.

With the Huffman code, that one is "prefix-free" (a very important characteristic), no codeword has a prefix that could be another code word. So if you have the codeword 1 you never have the codeword 11 or 111 or something similar. That way decoding it will always give you the same result.

That's why its such a great encoding scheme, its very short, but still prevents coding errors.(12 votes)

- So a huffman coded file will always have a decode algorithm also?(0 votes)
- I'm not sure I understand you correctly.

With a Huffman code, you need a translation table to convert the Huffman code into something readable. In fact, a translation table or an algorithm to decode a specific coded message is always needed, no matter the code.

So if you want to read a code someone needs to either give you a means of decoding the code or you have to break the code. Breaking the code is in some cases really easy and in others almost impossible.(11 votes)

- I did not understand how a DNA code can be used in computing, when it is a biological term?(1 vote)
- Adenine (A), cytosine (C), guanine (G), thymine(T) are the building blocks of DNA. If you're doing research on DNA you can represent those biological structures on your computer to do research.

Here those blocks are introduced as example of the advantages of compression, by using the Huffman encoding you can reduce the amount of space the DNA strings require.(5 votes)

- I was a little confused at this part: "Since we need to represent four characters, a computer would typically represent each character using 2 bits." Why wouldn't the computer just use 1 bit per character?(0 votes)
- Suppose we need to represent the following characters with one bit each; a, b, c, and d. A single bit has only two values, 0 and 1. Let's arbitrarily say that a = 0 and b = 1. Now, what values do we assign to c and d? We already used both values a bit can represent, so we would need to repeat values we already used. So, we will just say that c = 1 and d = 0. According to our codes, the string "abbc" becomes "0111". If we give the binary string to someone else and ask them to decode it, however, they will not be able to. The 0 at the beginning could be a or d, while the remaining 1s could be b or c. By repeating codes, we have made it impossible to decode the binary string and get the original string back. Based on this, every character should have a unique code, thus we need at least as many codes as characters. Two bits can create 4 different codes and we have 4 characters, so we use 2 bits for each character.(6 votes)

- Why are we able to represent a,c,t,g using 1, 2, or 3 bits, instead of 2 bits each? If we need to represent 4 characters with 2 bits each, don't we always have to include 2 bits to represent the characters? What allows Huffman compression to assign a single bit to a character?(2 votes)
- When choosing a set of binary codes (whose lengths are unknown during decompression) for a set of characters, the only rule we have to follow is that no code is a prefix for another code (i.e. no code appears at the beginning of another code). 2 bits is the minimum number of bits required to be able to have 4 codes of equal length, however, we could also choose 4 codes that are 6 bits each or codes that are {3, 4, 5, 6} bits long. If we want to, we can even make one of the codes 1 or 0 as long as that bit does not appear at the beginning of any other code.(2 votes)

- In the DNA example, I can see it can be decoded reliably if the start of the sequence is known. If you started partway through the sequence you could end up with different letters or incomprehensible sequences of bits. Does Huffman coding/decoding only work when the start of the sequence is known? Or is there some way to deliminate a character even if starting in an unknown position? Thanks.(1 vote)
- The typical creation of Huffman codes does not allow decoding from an arbitrary position in the stream. After considering this scenario, however, I believe that if you added the additional constraint that the codes be suffix-free (in addition to being prefix-free) it would be impossible to construct a sequence that ends up being completely valid when decoding from the middle of a code. So, if you tried to decode the sequence starting from the middle of a code, you would always end up with an invalid code at the end, which would let you know that you began in the middle of a code. You could then try to decode the sequence again starting 1 bit over from the start and keep repeating this process until you find a sequence that is completely decodable. This is not a particularly great solution though. It only works if the sequence to decode is finite in length. If the sequence and/or codes are long, this could take a while to find the beginning of the next code. In addition, this makes the process of constructing the codes more difficult and may make the codes longer, which means that the coding will be less efficient. There might be other (better) ways to do this, but after searching briefly online, I didn't see anything.(3 votes)