Storing text in binary

AP.CSP: DAT‑1 (EU), DAT‑1.A (LO), DAT‑1.A.2 (EK), DAT‑1.A.6 (EK), DAT‑1.A.7 (EK)
Computers store more than just numbers in binary. But how can binary numbers represent non-numbers such as letters and symbols?
As it turns out, all it requires is a bit of human cooperation. We must agree on encodings, mappings from a character to a binary number.

A simple encoding

For example, what if we wanted to store the following symbols in binary?
☮️❤️😀
We can invent this simple encoding:
Binary    Symbol
01        ☮️
10        ❤️
11        😀
Let's call it the HPE encoding. It helps for encodings to have names, so that programmers know they're using the same encoding.
If a computer program needs to store the ❤️ symbol in computer memory, it can store 10 instead. When the program needs to display 10 to the user, it can remember the HPE encoding and display ❤️ instead.
Computer programs and files often need to store multiple characters, which they can do by stringing each character's encoding together.
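As a rough sketch of that idea, the following Python snippet strings HPE codes together and splits them apart again. The names `encode_hpe` and `decode_hpe` are made up for this illustration, and the symbols are written as Unicode escapes so the listing stays unambiguous.

```python
# The three HPE symbols, written as escapes: peace, heart, grinning face.
PEACE = "\u262e\ufe0f"
HEART = "\u2764\ufe0f"
SMILE = "\U0001f600"

# The HPE table: each symbol maps to a 2-bit code.
HPE_CODES = {PEACE: "01", HEART: "10", SMILE: "11"}
HPE_SYMBOLS = {bits: symbol for symbol, bits in HPE_CODES.items()}


def encode_hpe(symbols):
    """Concatenate the 2-bit code of every symbol into one bit string."""
    return "".join(HPE_CODES[s] for s in symbols)


def decode_hpe(bits):
    """Read the bit string two bits at a time and look up each symbol."""
    return [HPE_SYMBOLS[bits[i:i + 2]] for i in range(0, len(bits), 2)]


message = [HEART, SMILE, PEACE]
encoded = encode_hpe(message)
print(encoded)                         # 101101
print(decode_hpe(encoded) == message)  # True
```

Because every code has the same length, the decoder never has to guess where one symbol ends and the next begins.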
A program could write a file called "msg.hpe" with this data:
010111111010
A program on another computer that understands the HPE encoding can then open "msg.hpe" and display the sequence of symbols.
Check your understanding
What sequence would the program display?

The HPE encoding uses only 2 bits per symbol, which limits how many symbols it can represent.
Check your understanding
How many symbols can the 2-bit encoding represent?

However, with more bits of information, an encoding can represent enough letters for computers to store messages, documents, and webpages.
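The reason more bits help is that each extra bit doubles the number of distinct patterns. As a short worked formula, with n standing for the number of bits and N for the number of patterns (both symbols are introduced here only for illustration):

```latex
N = 2^{n}, \qquad 2^{7} = 128, \qquad 2^{8} = 256
```

So a 7-bit encoding has room for 128 symbols, and an 8-bit encoding has room for 256.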

ASCII encoding

ASCII was one of the first standardized encodings. It was invented back in the 1960s when telegraphy was the primary form of long-distance communication, but is still in use today on modern computing systems. [1]
Teletypists would type messages on teleprinters such as this one:
Photo of a teletype machine, composed of a mechanical keyboard, a piece of paper coming out with typed letters, and a mechanism for reading input paper strips.
An ASR 33 teletype machine. Image source: Marcin Wichary
The teleprinter would then use the ASCII standard to encode each typed character into binary and then store or transmit the binary data.
This page from a 1972 teleprinter manual shows the 128 ASCII codes:
A scanned chart of ASCII encodings.
ASCII chart from TermiNet 300 printer. Image source: Wikipedia
Each ASCII character is encoded in binary using 7 bits. In the chart above, the column heading indicates the first 3 bits and the row heading indicates the final 4 bits. The very first character is "NUL", encoded as 0000000.
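To make the chart layout concrete, here is a minimal Python sketch that splits a character's 7-bit code into its column bits and row bits. The function name `chart_position` is hypothetical, and the sketch assumes the input is a single ASCII character.

```python
def chart_position(char):
    """Split a character's 7-bit ASCII code into chart column and row bits."""
    bits = format(ord(char), "07b")  # e.g. "A" -> "1000001"
    return bits[:3], bits[3:]        # (first 3 bits, final 4 bits)


print(chart_position("A"))     # ('100', '0001')
print(chart_position("\x00"))  # NUL -> ('000', '0000')
```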
The first 32 codes represent "control characters," characters which cause some effect besides printing a letter. "BEL" (encoded in binary as 0000111) caused an audible bell or beep. "ENQ" (encoded as 0000101) represented an enquiry, a request for the receiving station to identify itself.
The control characters were originally designed for teleprinters and telegraphy, but many have been re-purposed for modern computers and the Internet, especially "CR" and "LF". "CR" (0001101) represented a "carriage return" on teleprinters, moving the printing head to the start of the line. "LF" (0001010) represented a "line feed", moving the printing head down one line. Modern Internet protocols, such as HTTP, FTP, and SMTP, use a combination of "CR" + "LF" to represent the end of lines.
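Both characters are easy to inspect in Python, where they are written as the escapes `\r` and `\n`. The request text below is a made-up example of the kind of message an HTTP client might send, not output captured from a real one.

```python
CR, LF = "\r", "\n"  # Python escapes for carriage return and line feed

print(format(ord(CR), "07b"))  # 0001101
print(format(ord(LF), "07b"))  # 0001010

# A hypothetical HTTP request, with every line terminated by CR + LF.
request = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
raw = request.encode("ascii")
print(raw.count(b"\r\n"))      # 3
```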
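The remaining 96 ASCII codes look much more familiar: apart from the very last one ("DEL", another control character), they are printable characters such as letters, digits, and punctuation.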
Here are the first 8 uppercase letters:
Binary     Character
1000001    A
1000010    B
1000011    C
1000100    D
1000101    E
1000110    F
1000111    G
1001000    H
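Stringing these codes together works just like the HPE example, only with 7 bits per character instead of 2. A small Python sketch, assuming plain ASCII input (the function names are illustrative):

```python
def encode_ascii7(text):
    """Concatenate the 7-bit ASCII code of every character."""
    return "".join(format(ord(ch), "07b") for ch in text)


def decode_ascii7(bits):
    """Split the bit string into 7-bit groups and map each back to a character."""
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))


encoded = encode_ascii7("FACE")
print(encoded)                 # 1000110100000110000111000101
print(decode_ascii7(encoded))  # FACE
```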
Following the ASCII standard, we can encode a four-letter message into binary:
1000011100100010001011000110
Check your understanding
What word does that ASCII-encoded binary data represent?

There are several problems with the ASCII encoding, however.
The first big problem is that ASCII only includes letters from the English alphabet and a limited set of symbols.
A language that uses fewer than 128 characters could come up with its own version of ASCII to encode text in just that language, but what about a text file with characters from multiple languages? ASCII couldn't encode a string like: "Hello, José, would you care for Glühwein? It costs 10 €".
And what about languages with thousands of logograms? ASCII could not encode enough logograms to cover a Chinese sentence like "你好,想要一盘饺子吗?十块钱。"
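Python makes this limitation easy to observe: asking for an ASCII encoding of the multilingual sentence raises an error at the first character that has no ASCII code. This is a minimal sketch, with the non-ASCII letters written as escapes so the listing is unambiguous.

```python
# "Hello, José, would you care for Glühwein? It costs 10 €"
text = "Hello, Jos\u00e9, would you care for Gl\u00fchwein? It costs 10 \u20ac"

try:
    text.encode("ascii")
except UnicodeEncodeError as error:
    print(error.reason)             # ordinal not in range(128)
    print(repr(text[error.start]))  # 'é'

# Only these three characters fall outside the 128 ASCII codes.
non_ascii = [ch for ch in text if ord(ch) > 127]
print(non_ascii)                    # ['é', 'ü', '€']
```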
The other problem with the ASCII encoding is that it uses 7 bits to represent each character, whereas computers typically store information in bytes—units of 8 bits—and programmers don't like to waste memory.
When the earliest computers first started using ASCII to encode characters, different computers came up with various ways to use the spare eighth bit. For example, HP computers used the eighth bit to represent characters used in European countries (e.g. "£" and "Ü"), TRS-80 computers used the bit for colored graphics, and Atari computers used the bit for inverted white-on-black versions of the first 128 characters. [2]
The result? An "ASCII" file created in one application might look like gobbledygook when opened in another "ASCII"-compatible application.
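The same effect can be reproduced today with two real 8-bit extensions of ASCII that Python ships with, Latin-1 and code page 437. They are used here only as stand-ins, since they are not the HP, TRS-80, or Atari schemes mentioned above.

```python
byte = bytes([0b10100011])  # eighth bit set: decimal 163, hex A3

# One byte, two different "ASCII-compatible" interpretations.
print(byte.decode("latin-1"))  # £
print(byte.decode("cp437"))    # ú

# Bytes with the eighth bit clear decode identically under both.
print(b"ASCII".decode("latin-1") == b"ASCII".decode("cp437"))  # True
```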
Computers needed a new encoding, an encoding based on 8-bit bytes that could represent all the languages of the world.

Unicode

But first, how many characters do you even need to represent the world's languages? Which characters are basically the same across languages, even if they have different sounds?
In 1987, a group of computer engineers attempted to answer those questions. They eventually came up with Unicode, a universal character set which assigns a "code point" (a number, usually written in hexadecimal) and a name to each character. [3]
For example, the character "ą" is assigned to "U+0105" and named "Latin Small Letter A with Ogonek". There's a character that looks like "ą" in 13 languages, such as Polish and Lithuanian. Thus, according to Unicode, the "ą" in the Polish word "robią" and the "ą" in the Lithuanian word "aslą" are both the same character. Unicode saves space by unifying characters across languages.
But there are still quite a few characters to encode. The Unicode character set started with 7,129 named characters in 1991 and has grown to 137,929 named characters in 2019. The majority of those characters describe logograms from Chinese, Japanese, and Korean, such as "U+6728" which refers to "木". It also includes over 1,200 emoji symbols ("U+1F389" = "🎉"). [4]
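Code points and names can be looked up with Python's standard `unicodedata` module. A short sketch using the three characters mentioned above:

```python
import unicodedata

a_ogonek = "\u0105"  # the character assigned to U+0105

print(f"U+{ord(a_ogonek):04X}")            # U+0105
print(unicodedata.name(a_ogonek))          # LATIN SMALL LETTER A WITH OGONEK

# chr() goes the other way, from code point to character.
print(unicodedata.name(chr(0x6728)))       # CJK UNIFIED IDEOGRAPH-6728
print(unicodedata.name(chr(0x1F389)))      # PARTY POPPER
```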
Unicode is a character set, but it is not an encoding. Fortunately, another group of engineers tackled the problem of efficiently encoding Unicode into binary.

UTF-8

In 1992, computer scientists invented UTF-8, an encoding that is compatible with ASCII encoding but also solves its problems. [5]
UTF-8 can describe every character from the Unicode standard using either 1, 2, 3, or 4 bytes.
When a computer program is reading a UTF-8 text file, it knows how many bytes represent the next character based on how many 1 bits it finds at the beginning of the byte.
Number of bytes    Byte 1      Byte 2      Byte 3      Byte 4
1                  0xxxxxxx
2                  110xxxxx    10xxxxxx
3                  1110xxxx    10xxxxxx    10xxxxxx
4                  11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
If there are no 1 bits in the prefix (if the first bit is a 0), that indicates a character represented by a single byte. The remaining 7 bits of the byte are used to represent the original 128 ASCII characters. That means a sequence of ASCII characters stored one per 8-bit byte, with the leading bit set to 0, is also a valid UTF-8 sequence.
Two bytes beginning with 110 are used to encode the rest of the characters from Latin-script languages (e.g. Spanish, German) plus other languages such as Greek, Hebrew, and Arabic. Three bytes starting with 1110 encode most of the characters for Asian languages (e.g. Chinese, Japanese, Korean). Four bytes starting with 11110 encode everything else, from rarely used historical scripts to the increasingly commonly used emoji symbols.
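The prefix rule can be sketched in a few lines of Python. The function names here are hypothetical, and the sketch assumes well-formed input: it only inspects the first byte of each character and does not check that the following bytes really start with 10.

```python
def sequence_length(first_byte):
    """Return how many bytes a UTF-8 character uses, judged by its first byte."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx
        return 4
    raise ValueError("not a valid first byte (continuation bytes start with 10)")


def count_characters(data):
    """Count characters by hopping from one first byte to the next."""
    count, i = 0, 0
    while i < len(data):
        i += sequence_length(data[i])
        count += 1
    return count


# "A", "é", "木", and "🎉": one character from each row of the table.
data = "A\u00e9\u6728\U0001f389".encode("utf-8")
print(len(data))               # 10
print(count_characters(data))  # 4
```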
Check your understanding
According to the UTF-8 standard, how many characters are represented by these 8 bytes?
01001001 11110000 10011111 10010010 10011001 11100010 10010011 10001010

Most modern programming languages have built-in support for UTF-8, so most programmers never need to know exactly how to convert from characters to binary.
✏️ Try out using JavaScript to encode strings in UTF-8 in the form below. Play around with multiple languages and symbols.
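A rough equivalent of that experiment, sketched in Python rather than JavaScript, uses the built-in `str.encode` method. The helper name `utf8_bits` is made up for this example.

```python
def utf8_bits(text):
    """Return the UTF-8 bytes of a string as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


print(utf8_bits("Hi"))      # 01001000 01101001
print(utf8_bits("1"))       # 00110001 (the character "1", not the number 1)
print(utf8_bits("\u00e9"))  # 11000011 10101001 (the letter é)
```

Note that typing a digit encodes the character for that digit, which is why "1" comes out as 00110001 (decimal 49, its ASCII code) rather than 00000001.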
The UTF-8 encoding standard is now the dominant encoding of HTML files on the web, accounting for 94.5% of webpages as of December 2019. [6]
🔎 If you right click and select "view page source" on this webpage right now, you can search for the string "utf-8" and see that this webpage is encoded as UTF-8.
Generally, a good encoding is one that can represent the maximum amount of information with the fewest bits. UTF-8 is a great example of that, since it can encode common English letters with just 1 byte but is flexible enough to encode thousands of letters with additional bytes.
UTF-8 is only one possible encoding, however. UTF-16 and UTF-32 are alternative encodings that are also capable of representing all Unicode characters. There are also language-specific encodings such as Shift-JIS for Japanese. Computer programs can use the encoding that best suits their needs and constraints.
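To get a feel for the trade-offs, this sketch compares how many bytes the same text needs in each encoding. The sample strings and the helper name `encoded_sizes` are chosen only for illustration; the "-le" codec variants are used so that no byte-order mark is added to the count.

```python
ENCODINGS = ["utf-8", "utf-16-le", "utf-32-le", "shift_jis"]


def encoded_sizes(text):
    """Map each encoding name to the number of bytes it needs for the text."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


english = "Hello"
japanese = "\u3053\u3093\u306b\u3061\u306f"  # "konnichiwa" in hiragana

print(encoded_sizes(english))
# {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20, 'shift_jis': 5}
print(encoded_sizes(japanese))
# {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20, 'shift_jis': 10}
```

For this sample, UTF-8 is the most compact choice for the English text, while UTF-16 and Shift-JIS are more compact for the Japanese text.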

🙋🏽🙋🏻‍♀️🙋🏿‍♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!

Want to join the conversation?

  • Audditiya Gangopadhyay
    This article was very informative and helpful overall. However, I got confused during the JavaScript part, where I had to type in a number and it would convert that number to UTF-8. For example, I typed in the number 1 and it was translated to 00110001, which is neither a binary conversion nor a hexadecimal conversion. I am not understanding the correlation between these two values. Can you please explain? Thank you so much!
    (11 votes)
    • Shane McGookey
      When you typed in the number '1', it was being represented as a character rather than a numeric 1. Therefore, it will be stored in binary using its ASCII representation. The character '1' is encoded as the decimal number 49 per ASCII standards, and 49 represented in binary is 00110001. I hope that helps to answer your question!
      (5 votes)
  • layaz7717
    How does the computer know if we are trying to represent number or letters, etc. in binary? Do we have to put a certain thing before the binary code in order to tell it what to convert it to?
    (5 votes)
  • Mathilde
    In the JS part, I typed in "Hello world" and all the bytes in the encoded sequence were starting with 0. I expected it to start with 110 as it's a language. Or have I not properly understood something ? Thanks for your answer :-)
    (4 votes)
  • Humaira Islam
    Sorry that I'm saying it like this!!
    Why did you guys make this so much more difficult? :(
    (4 votes)
  • HIKIKOMORI
    Hi, I encountered an exercise question where I got the following binary data: 10001110 01011011 00111000 10001110 10101011 00101111 01100100 00010110 10111000 11000111 And later was asked how many characters are encoded by this binary data according to UTF-8 encoding.

    To my understanding the answer should be the following: 10001110 (don't know!) 01011011 (single byte character) 00111000 (single byte character) 10001110 (don't know!) 10101011 (don't know!) 00101111 (single byte character) 01100100 (single byte character) 00010110 (single byte character) 10111000 (don't know!) 11000111 (don't know!).

    I got the answer correct but I still don't understand what the 'don't know' bytes mean and what are they doing here? I think, it should give an error message if it were to run in a computer. I would really appreciate if anyone could help :-)
    (3 votes)
    • KLaudano
      "don't know" bytes are used as a continuation for multi-byte characters where the first byte of the character indicates how many bytes are used in total for that character. As you said, this particular set of bytes should result in an error since there are continuation bytes used outside of a multi-byte character.
      (5 votes)
  • Benjamin.O
    All the possible UTF-8 Sequences on a standard keyboard

    or -=+`1234567890_)(*&^%$£"!¬,./;'#[]<>?:@~{}qwertyuiopasdfghjklzxcvbnm\|QWERTYUIOPASDFGHJKLZXCVBNM
    (5 votes)
  • cue6christine
    Why are numbers read backwards in binary ie right to left but words and symbols appear to be left to right? I noticed this when doing the emoji question peace signs, smiley faces and hearts. I originally thought the answer would be Heart Heart Smiley Smiley Peace Peace.
    (3 votes)
    • KLaudano
      Whether the binary data is stored "left to right" or "right to left" actually depends on the machine that is storing data. Some machines use "big-endian" (storing the most significant byte first) and others use "little-endian" (storing the least significant byte first).
      (2 votes)
  • aqunyw
    I think "UTF-8 is only one possible encoding, however. " sentence is lacking "not" because it's somewhat confusing.
    (3 votes)
  • Prisha B.
    How do computers do arithmetic? When we do 5*6, we know what to do, but what does a computer do?
    (1 vote)
  • layaz7717
    Who decided how the alphabet would be written in standard binary? Why didn't they just start from one, or even use 8 bits so that we would have a byte? It seems like they generally use 7?
    (1 vote)
    • Shane McGookey
      ASCII (American Standard Code for Information Interchange) is the standard for encoding the alphabet into a binary representation, and ASCII was developed under the oversight of the American Standards Association in the 1960s.

      ASCII encoding was originally done using 7 bits because 8-bit bytes had yet to become popularized as the standard. The encoding is generally viewed as an 8-bit (1 byte) encoding nowadays, even though it is still limited to its original 7-bit constraints.
      (4 votes)