The genetic code

How is the information in an mRNA sequence decoded to make a polypeptide? Learn how groups of three nucleotides, called codons, specify amino acids (as well as start and stop signals for translation).


Have you ever written a secret message to one of your friends? If so, you may have used some kind of code to keep the message hidden. For instance, you may have replaced the letters of the word with numbers or symbols, following a particular set of rules. In order for your friend on the other end to understand the message, he or she would need to know the code and apply the same set of rules, in reverse, to figure out what you had written.
As it turns out, decoding messages is also a key step in gene expression, the process in which information from a gene is used to construct a protein (or other functional product). How are the instructions for building a protein encoded in DNA, and how are they deciphered by the cell? In this article, we'll take a closer look at the genetic code, which allows DNA and RNA nucleotide sequences to be translated into the amino acids they represent.

Overview: Gene expression and the genetic code

Genes that provide instructions for proteins are expressed in a two-step process.
  • In transcription, the DNA sequence of a gene is "rewritten" using RNA nucleotides. In eukaryotes, the RNA must go through additional processing steps to become a messenger RNA, or mRNA.
  • In translation, the sequence of nucleotides in the mRNA is "translated" into a sequence of amino acids in a polypeptide (protein or protein subunit).
Cells decode mRNAs by reading their nucleotides in groups of three, called codons. Each codon specifies a particular amino acid, or, in some cases, provides a "stop" signal that ends translation. In addition, the codon AUG has a special role, serving as the start codon where translation begins. The complete set of correspondences between codons and amino acids (or stop signals) is known as the genetic code.
The mRNA sequence is: 5'-AUGAUCUCGUAA-3'
Translation involves reading the mRNA nucleotides in groups of three, each of which specifies an amino acid (or provides a stop signal indicating that translation is finished).
  • AUG: Methionine
  • AUC: Isoleucine
  • UCG: Serine
  • UAA: "Stop"
Polypeptide sequence: (N-terminus) Methionine-Isoleucine-Serine (C-terminus)
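The decoding in this example can be sketched in a few lines of Python. This is only an illustration: the table holds just the four codons used above (the full code has 64), and the names `CODON_TABLE` and `translate` are made up for this sketch.

```python
# Codon assignments taken from the example above; a real table has 64 entries.
CODON_TABLE = {
    "AUG": "Methionine",
    "AUC": "Isoleucine",
    "UCG": "Serine",
    "UAA": None,  # None marks a stop codon
}

def translate(mrna):
    """Read mrna three nucleotides at a time and return the amino acids."""
    polypeptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:  # stop codon: translation is finished
            break
        polypeptide.append(amino_acid)
    return polypeptide

print("-".join(translate("AUGAUCUCGUAA")))  # Methionine-Isoleucine-Serine
```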
In the rest of this article, we'll look more closely at the genetic code. First, we'll see how it was discovered. Then, we'll look more deeply at its properties, seeing how it can be used to predict the polypeptide encoded by an mRNA.

Code crackers: How the genetic code was discovered

To crack the genetic code, researchers needed to figure out how sequences of nucleotides in a DNA or RNA molecule could encode the sequence of amino acids in a polypeptide.
Why was this a tricky problem? In one of the simplest potential codes, each nucleotide in a DNA or RNA molecule might correspond to one amino acid in a polypeptide. However, this code cannot actually work, because there are 20 amino acids commonly found in proteins and just 4 nucleotide bases in DNA or RNA. Thus, researchers knew that the code must involve something more complex than a one-to-one matching of nucleotides and amino acids.

The triplet hypothesis

In the mid-1950s, the physicist George Gamow extended this line of thinking to deduce that the genetic code was likely composed of triplets of nucleotides. That is, he proposed that a group of 3 successive nucleotides in a gene might code for one amino acid in a polypeptide.
Gamow's reasoning was that even a doublet code (2 nucleotides per amino acid) would not work, as it would allow for only 16 ordered groups of nucleotides (4² = 16), too few to account for the 20 standard amino acids used to build proteins. A code based on nucleotide triplets, however, seemed promising: it would provide 64 unique sequences of nucleotides (4³ = 64), more than enough to cover the 20 amino acids.
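The counting argument can be checked by listing every ordered group of nucleotides of a given length. The short Python sketch below does this; the helper name `count_codes` is hypothetical and exists only for illustration.

```python
from itertools import product

BASES = "ACGU"    # the 4 RNA nucleotides
AMINO_ACIDS = 20  # standard amino acids that need a code word

def count_codes(length):
    """Number of distinct ordered groups of `length` nucleotides."""
    return len(list(product(BASES, repeat=length)))

for length in (1, 2, 3):
    n = count_codes(length)
    print(length, n, "enough" if n >= AMINO_ACIDS else "too few")
# 1 4 too few
# 2 16 too few
# 3 64 enough
```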

Nirenberg, Khorana, and the identification of codons

Gamow’s triplet hypothesis seemed logical and was widely accepted. However, it had not been experimentally proven, and researchers still did not know which triplets of nucleotides corresponded to which amino acids.
The cracking of the genetic code began in 1961, with work from the American biochemist Marshall Nirenberg. For the first time, Nirenberg and his colleagues were able to identify specific nucleotide triplets that corresponded to particular amino acids. Their success relied on two experimental innovations:
  • A way to make artificial mRNA molecules with specific, known sequences.
  • A system to translate mRNAs into polypeptides outside of a cell (a "cell-free" system). Nirenberg's system consisted of cytoplasm from burst E. coli cells, which contains all of the materials needed for translation.
First, Nirenberg synthesized an mRNA molecule consisting only of the nucleotide uracil (called poly-U). When he added poly-U mRNA to the cell-free system, he found that the polypeptides made consisted exclusively of the amino acid phenylalanine. Because the only triplet in poly-U mRNA is UUU, Nirenberg concluded that UUU might code for phenylalanine. Using the same approach, he was able to show that poly-C mRNA was translated into polypeptides made exclusively of the amino acid proline, suggesting that the triplet CCC might code for proline.
Other researchers, such as the biochemist Har Gobind Khorana at the University of Wisconsin, extended Nirenberg's experiment by synthesizing artificial mRNAs with more complex sequences. For instance, in one experiment, Khorana generated a poly-UC (UCUCUCUCUC…) mRNA and added it to a cell-free system similar to Nirenberg's. He found that the poly-UC mRNA was translated into polypeptides with an alternating pattern of serine and leucine amino acids. These and other results unambiguously confirmed that the genetic code was based on triplets, or codons. Today, we know that serine is encoded by the codon UCU, while leucine is encoded by CUC.
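To see why a poly-UC message fits a triplet code, it helps to split the sequence into groups of three from each possible starting position. The Python sketch below (the function name `codons_in_frame` is invented for this illustration) shows that every frame yields only the alternating triplets UCU and CUC, which matches an alternating two-amino-acid product.

```python
def codons_in_frame(mrna, frame):
    """Split mrna into complete triplets, starting at offset `frame` (0, 1, or 2)."""
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]

poly_uc = "UC" * 9  # UCUCUCUCUCUCUCUCUC

for frame in range(3):
    print(frame, " ".join(codons_in_frame(poly_uc, frame)))
# 0 UCU CUC UCU CUC UCU CUC
# 1 CUC UCU CUC UCU CUC
# 2 UCU CUC UCU CUC UCU
```

By contrast, reading the same message in groups of two would give only UC repeated (or only CU, in the other frame), which would be expected to produce a polypeptide made of a single amino acid.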
By 1965, using the cell-free system and other techniques, Nirenberg, Khorana, and their colleagues had deciphered the entire genetic code. That is, they had identified the amino acid or "stop" signal corresponding to each one of the 64 nucleotide codons. For their contributions, Nirenberg and Khorana (along with another genetic code researcher, Robert Holley) received the Nobel Prize in Physiology or Medicine in 1968.
_Left: Image modified from "Marshall Nirenberg and Heinrich Matthaei," by N. MacVicar (public domain). Right: "Har Gobind Khorana" (public domain)._

Properties of the genetic code

As we saw above, the genetic code is based on triplets of nucleotides called codons, which specify individual amino acids in a polypeptide (or "stop" signals at its end). The codons of an mRNA are “read” one by one inside protein-and-RNA structures called ribosomes, starting near the 5’ end of the mRNA and moving towards the 3’ end. Let's take a closer look at the genetic code in the context of translation.

Types of codons (start, stop, and "normal")

Genetic code table. Each three-letter sequence of mRNA nucleotides corresponds to a specific amino acid, or to a stop codon. UGA, UAA, and UAG are stop codons. AUG is the codon for methionine, and is also the start codon.
_Image credit: "The genetic code," by OpenStax College, Biology (CC BY 3.0)._
Translation always begins at a start codon, which has the sequence AUG and encodes the amino acid methionine (Met) in most organisms. Thus, every polypeptide typically starts with methionine, although the initial methionine may be snipped off in later processing steps. A start codon is required to begin translation, but the codon AUG can also appear later in the coding sequence of an mRNA, where it simply specifies the amino acid methionine.
Once translation has begun at the start codon, the following codons of the mRNA will be read one by one, in the 5' to 3' direction. As each codon is read, the matching amino acid is added to the C-terminus of the polypeptide. Most of the codons in the genetic code specify amino acids and are read during this phase of translation.
Translation continues until a stop codon is reached. There are three stop codons in the genetic code: UAA, UAG, and UGA. Unlike start codons, stop codons don't correspond to an amino acid. Instead, they act as "stop" signals, indicating that the polypeptide is complete and causing it to be released from the ribosome. More nucleotides may appear after the stop codon in the mRNA, but they will not be translated as part of the polypeptide.
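The start, "normal", and stop codon roles described above can be put together in a small Python sketch. Several things here are assumptions made for illustration: amino acids are written with their standard one-letter abbreviations, "*" marks a stop codon, the 64-entry table is built from a compact string ordered by the bases U, C, A, G, and translation simply starts at the first AUG in the sequence, which is a simplification of how a real ribosome selects its start codon. The name `translate` is hypothetical.

```python
from itertools import product

# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
# One-letter amino acid abbreviations; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first in-frame stop."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)

# GGC (skipped) AUG AUC AUG UCG UAA (stop) GCC (not translated)
print(translate("GGCAUGAUCAUGUCGUAAGCC"))  # MIMS
```

In the output, the first M comes from the start codon, while the second M comes from an AUG inside the coding sequence, where it simply specifies methionine.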

Reading frame

The start codon is critical because it determines where translation will begin on the mRNA. Most importantly, the position of the start codon determines the reading frame, or how the mRNA sequence is divided up into groups of three nucleotides inside the ribosome. The same sequence of nucleotides can encode completely different polypeptides depending on the frame in which it's read. The start codon determines which frame is chosen and thus ensures that the correct polypeptide is produced.
To see what a reading frame is, it's helpful to consider an analogy using words and letters. The following message makes sense to us because we read it in the correct frame (divide it correctly into groups of three letters): MOM AND DAD ARE MAD. If we shift the reading frame by grouping letters into threes starting one position later, however, we get: OMA NDD ADA REM AD. The frameshift results in a message that no longer makes sense.
An important point to note here is that the nucleotides in a gene are not physically organized into groups of three. Instead, what constitutes a codon is simply a matter of where the ribosome begins reading, and of what sequence of nucleotides comes after the start codon. Mutations that insert or delete a single nucleotide may alter the reading frame, resulting in the production of a “gibberish” protein similar to the scrambled sentence in the example above.
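The word analogy can be reproduced with a short Python sketch. The helper `group_in_threes` is a made-up name, and the single-letter deletion is chosen arbitrarily to mimic a frameshift mutation.

```python
def group_in_threes(message, offset=0):
    """Split message into groups of three letters, starting at `offset`."""
    return [message[i:i + 3] for i in range(offset, len(message), 3)]

message = "MOMANDDADAREMAD"

# Correct frame
print(" ".join(group_in_threes(message)))            # MOM AND DAD ARE MAD

# Same letters, read starting one position later
print(" ".join(group_in_threes(message, offset=1)))  # OMA NDD ADA REM AD

# Deleting one letter (the A of AND) shifts every group after the deletion
mutant = message[:3] + message[4:]
print(" ".join(group_in_threes(mutant)))             # MOM NDD ADA REM AD
```

Note that the deletion leaves the first group intact but scrambles everything downstream of it, which is why single-nucleotide insertions and deletions can be so disruptive.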

One amino acid, many codons

As previously mentioned, the genetic code consists of 64 unique codons. But if there are only 20 amino acids, what are the other 44 codons doing? As we saw, a few are stop codons, but most are not. Instead, the genetic code turns out to be a degenerate code, meaning that some amino acids are specified by more than one codon. For example, proline is represented by four different codons (CCU, CCC, CCA, and CCG). If any one of these codons appears in an mRNA, it will cause proline to be added to the polypeptide chain.
Most of the amino acids in the genetic code are encoded by at least two codons. In fact, methionine and tryptophan are the only amino acids specified by a single codon. Importantly, the reverse isn't true: each codon specifies just one amino acid or stop signal. Thus, there's no ambiguity (uncertainty) in the genetic code. A particular codon in an mRNA will always be predictably translated into a particular amino acid or stop signal.
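These properties can be checked directly against the standard codon table. The Python sketch below assumes the standard code written with one-letter amino acid abbreviations (P for proline, M for methionine, W for tryptophan, "*" for stop) and a compact table string ordered by the bases U, C, A, G; the variable names are chosen only for this illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
# One-letter amino acid abbreviations; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid (or stop) -> list of codons that specify it
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(sorted(codons_for["P"]))  # ['CCA', 'CCC', 'CCG', 'CCU']

# Amino acids specified by exactly one codon
single = sorted(aa for aa, codons in codons_for.items() if len(codons) == 1)
print(single)  # ['M', 'W']
```

Because `CODON_TABLE` is a dictionary, each codon maps to exactly one value, which mirrors the lack of ambiguity in the code: many codons can point to one amino acid, but never the other way around.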

The genetic code is (nearly) universal

With some minor exceptions, all living organisms on Earth use the same genetic code. This means that the codons specifying the 20 amino acids in your cells are the same as those used by the bacteria inhabiting hydrothermal vents at the bottom of the Pacific Ocean. Even in organisms that don't use the "standard" code, the differences are relatively small, such as a change in the amino acid encoded by a particular codon.
A genetic code shared by diverse organisms provides important evidence for the common origin of life on Earth. That is, the many species on Earth today likely evolved from an ancestral organism in which the genetic code was already present. Because the code is essential to the function of cells, it would tend to remain unchanged in species across generations, as individuals with significant changes might be unable to survive. This type of evolutionary process can explain the remarkable similarity of the genetic code across present-day organisms.
