How is the information in an mRNA sequence decoded to make a polypeptide? Learn how groups of three nucleotides, called codons, specify amino acids (as well as start and stop signals for translation).

Introduction

Have you ever written a secret message to one of your friends? If so, you may have used some kind of code to keep the message hidden. For instance, you may have replaced the letters of the word with numbers or symbols, following a particular set of rules. In order for your friend on the other end to understand the message, he or she would need to know the code and apply the same set of rules, in reverse, to figure out what you had written.
As it turns out, decoding messages is also a key step in gene expression, the process in which information from a gene is used to construct a protein (or other functional product). How are the instructions for building a protein encoded in DNA, and how are they deciphered by the cell? In this article, we'll take a closer look at the genetic code, which allows DNA and RNA nucleotide sequences to be translated into the amino acids they represent.

Overview: Gene expression and the genetic code

Genes that provide instructions for proteins are expressed in a two-step process.
  • In transcription, the DNA sequence of a gene is "rewritten" using RNA nucleotides. In eukaryotes, the RNA must go through additional processing steps to become a messenger RNA, or mRNA.
  • In translation, the sequence of nucleotides in the mRNA is "translated" into a sequence of amino acids in a polypeptide (protein or protein subunit).
Cells decode mRNAs by reading their nucleotides in groups of three, called codons. Each codon specifies a particular amino acid, or, in some cases, provides a "stop" signal that ends translation. In addition, the codon AUG has a special role, serving as the start codon where translation begins. The complete set of correspondences between codons and amino acids (or stop signals) is known as the genetic code.
Genetic code table. Each three-letter sequence of mRNA nucleotides corresponds to a specific amino acid, or to a stop codon. UGA, UAA, and UAG are stop codons. AUG is the codon for methionine, and is also the start codon.
_Image credit: "The genetic code," by OpenStax College, Biology (CC BY 3.0)._
The mRNA sequence is:
5'-AUGAUCUCGUAA-3'
Translation involves reading the mRNA nucleotides in groups of three, each of which specifies an amino acid (or provides a stop signal indicating that translation is finished).
5'-AUG AUC UCG UAA-3'
AUG → Methionine, AUC → Isoleucine, UCG → Serine, UAA → "Stop"
Polypeptide sequence: (N-terminus) Methionine-Isoleucine-Serine (C-terminus)
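The decoding steps in this example can be sketched in a few lines of Python. This is only an illustration: `CODON_TABLE` is a deliberately tiny lookup that holds just the four codons used above (not the full genetic code), and the function name `translate` is our own choice.

```python
# Tiny, illustrative lookup: only the four codons from the example above.
CODON_TABLE = {
    "AUG": "Methionine",
    "AUC": "Isoleucine",
    "UCG": "Serine",
    "UAA": "Stop",
}

def translate(mrna):
    """Read an mRNA string 5' to 3' in groups of three and return the amino acids."""
    amino_acids = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        residue = CODON_TABLE[codon]
        if residue == "Stop":
            break  # a stop codon ends translation and adds no amino acid
        amino_acids.append(residue)
    return amino_acids

polypeptide = translate("AUGAUCUCGUAA")
print("-".join(polypeptide))  # Methionine-Isoleucine-Serine
```

The list is built in reading order, so its first entry corresponds to the N-terminus of the polypeptide and its last entry to the C-terminus.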
In the rest of this article, we'll look more closely at the genetic code. First, we'll see how it was discovered. Then, we'll look more deeply at its properties, seeing how it can be used to predict the polypeptide encoded by an mRNA.

Code crackers: How the genetic code was discovered

To crack the genetic code, researchers needed to figure out how sequences of nucleotides in a DNA or RNA molecule could encode the sequence of amino acids in a polypeptide.
Why was this a tricky problem? In one of the simplest potential codes, each nucleotide in a DNA or RNA molecule might correspond to one amino acid in a polypeptide. However, this code cannot actually work, because there are 20 amino acids commonly found in proteins and just 4 nucleotide bases in DNA or RNA. Thus, researchers knew that the code must involve something more complex than a one-to-one matching of nucleotides and amino acids.

The triplet hypothesis

In the mid-1950s, the physicist George Gamow extended this line of thinking to deduce that the genetic code was likely composed of triplets of nucleotides. That is, he proposed that a group of 3 successive nucleotides in a gene might code for one amino acid in a polypeptide.
Gamow's reasoning was that even a doublet code (2 nucleotides per amino acid) would not work, as it would allow for only 16 ordered groups of nucleotides ($4^2$), too few to account for the 20 standard amino acids used to build proteins. A code based on nucleotide triplets, however, seemed promising: it would provide 64 unique sequences of nucleotides ($4^3$), more than enough to cover the 20 amino acids.
There are 16 unique groups of nucleotides if a doublet code is used, and 64 unique groups if a triplet code is used. Why is this the case? Let's take a closer look at the math behind these statements.

Doublet code

Let’s look at the doublet code first. In a doublet code, an ordered group of two nucleotides codes for one amino acid. How many such groups of two nucleotides can we make? We know that there are 4 different possibilities for each of the 2 nucleotides in the doublet (A, T, C, and G, if we use DNA bases).
If we put an A in the first position, then any of the four nucleotides can occupy the second position, resulting in four combinations (AA, AT, AG, AC) that begin with an A. We can repeat this for T (TT, TA, TC, TG), C (CC, CT, CA, CG), and G (GG, GC, GT, GA). If we count all of these possibilities, we'll find that there are 16 of them in total.
You may find it faster and more foolproof to use a mathematical shortcut to quickly answer this type of question. Because we know there are 4 possible nucleotides for each position in the doublet, and because the order of the two slots matters, we can use the rules of permutations to calculate the number of possible groups as follows:
(4 possibilities for the first slot) $\cdot$ (4 possibilities for the second slot) $=$
$4 \cdot 4 = 16$ possible ordered groups

Triplet code

What about the triplet code? In this case, we can use the same mathematical reasoning, but must add an additional slot to our setup. There are now 3 positions to fill, and each can be occupied by any of the four bases (A, T, C, or G). Since there are 4 possible choices for each position, we can multiply as follows:
(4 possibilities for the first slot) $\cdot$ (4 possibilities for the second slot) $\cdot$ (4 possibilities for the third slot) $=$
$4 \cdot 4 \cdot 4 = 64$ possible ordered groups
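If you'd like to check these counts by brute force, the short Python sketch below lists every ordered group instead of multiplying. The helper name `ordered_groups` is our own, and the bases are written as DNA letters to match the text.

```python
from itertools import product

BASES = "ATCG"  # the four DNA bases, as in the text above

def ordered_groups(length):
    """Return every ordered group of `length` bases as a string."""
    return ["".join(group) for group in product(BASES, repeat=length)]

doublets = ordered_groups(2)
triplets = ordered_groups(3)

print(len(doublets), len(triplets))  # 16 64
print(doublets[:4])                  # ['AA', 'AT', 'AC', 'AG']

# Only the triplet code offers enough groups for 20 amino acids.
print(len(doublets) >= 20, len(triplets) >= 20)  # False True
```

Because each slot is filled independently, the enumeration always produces $4^n$ groups for a code of length $n$, which is the same result as the multiplication shortcut.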

Nirenberg, Khorana, and the identification of codons

Gamow’s triplet hypothesis seemed logical and was widely accepted. However, it had not been experimentally proven, and researchers still did not know which triplets of nucleotides corresponded to which amino acids.
The cracking of the genetic code began in 1961, with work from the American biochemist Marshall Nirenberg. For the first time, Nirenberg and his colleagues were able to identify specific nucleotide triplets that corresponded to particular amino acids. Their success relied on two experimental innovations:
  • A way to make artificial mRNA molecules with specific, known sequences.
  • A system to translate mRNAs into polypeptides outside of a cell (a "cell-free" system). Nirenberg's system consisted of cytoplasm from burst E. coli cells, which contains all of the materials needed for translation.
First, Nirenberg synthesized an mRNA molecule consisting only of the nucleotide uracil (called poly-U). When he added poly-U mRNA to the cell-free system, he found that the polypeptides made consisted exclusively of the amino acid phenylalanine. Because the only triplet in poly-U mRNA is UUU, Nirenberg concluded that UUU might code for phenylalanine. Using the same approach, he was able to show that poly-C mRNA was translated into polypeptides made exclusively of the amino acid proline, suggesting that the triplet CCC might code for proline.
Other researchers, such as the biochemist Har Gobind Khorana at the University of Wisconsin, extended Nirenberg's experiment by synthesizing artificial mRNAs with more complex sequences. For instance, in one experiment, Khorana generated a poly-UC (UCUCUCUCUC…) mRNA and added it to a cell-free system similar to Nirenberg's. The poly-UC mRNA was translated into polypeptides with an alternating pattern of serine and leucine amino acids. These and other results unambiguously confirmed that the genetic code was based on triplets, or codons. Today, we know that UCU is one of the codons for serine, while CUC is one of the codons for leucine.
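To see why the poly-UC result points to triplets, it can help to split the repeating sequence into groups of different sizes. The Python sketch below is a toy illustration of that logic, not a reconstruction of Khorana's actual analysis; the helper `split_into_groups` is a name we made up for this example.

```python
def split_into_groups(mrna, size):
    """Split an mRNA string into consecutive groups of `size` nucleotides.

    Any incomplete group at the 3' end is dropped.
    """
    return [mrna[i:i + size] for i in range(0, len(mrna) - size + 1, size)]

poly_uc = "UC" * 9  # UCUCUCUCUCUCUCUCUC (18 nucleotides)

# Read as doublets, the message is one group repeated over and over.
print(split_into_groups(poly_uc, 2))

# Read as triplets, two different groups alternate.
triplets = split_into_groups(poly_uc, 3)
print(triplets)  # ['UCU', 'CUC', 'UCU', 'CUC', 'UCU', 'CUC']

# Codon assignments stated in the text above.
codon_to_amino_acid = {"UCU": "serine", "CUC": "leucine"}
print([codon_to_amino_acid[codon] for codon in triplets])
```

A doublet reading yields a single repeating group, which could specify only one amino acid, while a triplet reading yields two alternating groups, matching the alternating serine and leucine pattern that was observed.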
By 1965, using the cell-free system and other techniques, Nirenberg, Khorana, and their colleagues had deciphered the entire genetic code. That is, they had identified the amino acid or "stop" signal corresponding to each one of the 64 nucleotide codons. For their contributions, Nirenberg and Khorana (along with another genetic code researcher, Robert Holley) received the Nobel Prize in 1968.
_Left: Image modified from "Marshall Nirenberg and Heinrich Matthaei," by N. MacVicar (public domain). Right: "Har Gobind Khorana" (public domain)._

Properties of the genetic code

As we saw above, the genetic code is based on triplets of nucleotides called codons, which specify individual amino acids in a polypeptide (or "stop" signals at its end). The codons of an mRNA are “read” one by one inside protein-and-RNA structures called ribosomes, starting at the 5’ end of the mRNA and moving towards the 3’ end. Let's take a closer look at the genetic code in the context of translation.

Types of codons (start, stop, and "normal")

Translation always begins at a start codon, which has the sequence AUG and encodes the amino acid methionine (Met) in most organisms. Thus, every polypeptide typically starts with methionine, although the initial methionine may be snipped off in later processing steps. A start codon is required to begin translation, but the codon AUG can also appear later in the coding sequence of an mRNA, where it simply specifies the amino acid methionine.
Once translation has begun at the start codon, the following codons of the mRNA will be read one by one, in the 5' to 3' direction. As each codon is read, the matching amino acid is added to the C-terminus of the polypeptide. Most of the codons in the genetic code specify amino acids and are read during this phase of translation.
The codon table may look kind of intimidating at first. Fortunately, it's organized in a logical way, and it's not too hard to use once you understand this organization.
To see how the codon table works, let's walk through an example. Suppose that we are interested in the codon CAG and want to know which amino acid it specifies.
  1. First, we look at the left side of the table. The axis on the left side refers to the first letter of the codon, so we find C along the left axis. This tells us the (broad) row of the table in which our codon will be found.
  2. Next, we look at the top of the table. The upper axis refers to the second letter of the codon, so we find A along the upper axis. This tells us the column of the table in which our codon will be found.
The row and column from steps 1 and 2 intersect in a single box in the codon table, one containing four codons. It's often easiest to simply look at these four codons and see which one is the one you're looking for.
If you want to use the structure of the table to the maximum, however, you can use the third axis (on the right side of the table) corresponding to the intersect box. By finding the third nucleotide of the codon on this axis, you can identify the exact row within the box where your codon is found. For instance, if we look for G on this axis in our example above, we find that CAG encodes the amino acid glutamine (Gln).
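The three-axis lookup described above maps naturally onto a nested dictionary. The Python sketch below assumes the axes are ordered U, C, A, G and writes amino acids as one-letter symbols, with `*` standing for a stop codon; the names `table` and `look_up` are our own.

```python
BASES = "UCAG"  # axis order assumed for this sketch (U, C, A, G)

# Standard genetic code as one-letter amino acid symbols ("*" = stop),
# listed in first-base, second-base, third-base order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# table[first][second][third] mirrors the left, top, and right axes.
table = {}
symbols = iter(AMINO_ACIDS)
for first in BASES:
    table[first] = {}
    for second in BASES:
        table[first][second] = {third: next(symbols) for third in BASES}

def look_up(codon):
    """Find a codon's amino acid symbol by walking the three axes in turn."""
    first, second, third = codon
    box = table[first][second]  # steps 1 and 2: the box of four codons
    return box[third]           # third axis: the exact row within the box

print(table["C"]["A"])  # {'U': 'H', 'C': 'H', 'A': 'Q', 'G': 'Q'}
print(look_up("CAG"))   # Q, the one-letter symbol for glutamine (Gln)
```

Printing the box first is the code equivalent of glancing at the four codons in the intersecting box before picking out the one you need.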
Translation continues until a stop codon is reached. There are three stop codons in the genetic code, UAA, UAG, and UGA. Unlike start codons, stop codons don't correspond to an amino acid. Instead, they act as "stop" signals, indicating that the polypeptide is complete and causing it to be released from the ribosome. More nucleotides may appear after the stop codon in the mRNA, but will not be translated as part of the polypeptide.
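Putting start and stop codons together, a whole round of translation can be sketched as follows. This Python example makes two simplifying assumptions that are ours, not the article's: translation starts at the first AUG found in the sequence, and amino acids are written as one-letter symbols with `*` marking a stop codon. The sample mRNA is invented for illustration.

```python
BASES = "UCAG"
# Standard genetic code, one-letter symbols ("*" = stop), in U, C, A, G order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(codons, AMINO_ACIDS))

def translate(mrna):
    """Translate from the first AUG (a simplification) until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        symbol = CODON_TABLE[mrna[i:i + 3]]
        if symbol == "*":
            break  # stop codon: the polypeptide is complete
        protein.append(symbol)
    return "".join(protein)

# Made-up mRNA: two leading nucleotides, the coding sequence, then extra
# nucleotides after the stop codon that should not be translated.
mrna = "GG" + "AUGAUCUCGUAA" + "CCCAUGA"
print(translate(mrna))  # MIS (Met-Ile-Ser)
```

The nucleotides after UAA, including a later AUG, do not contribute to the polypeptide, because reading ends at the first in-frame stop codon.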

Reading frame

The start codon is critical because it determines where translation will begin on the mRNA. Most importantly, the position of the start codon determines the reading frame, or how the mRNA sequence is divided up into groups of three nucleotides inside the ribosome. As shown in the diagram below, the same sequence of nucleotides can encode completely different polypeptides depending on the frame in which it's read. The start codon determines which frame is chosen and thus ensures that the correct polypeptide is produced.
To see what reading frame is, it's helpful to consider an analogy using words and letters. The following message makes sense to us because we read it in the correct frame (divide it correctly into groups of three letters): MOM AND DAD ARE MAD. If we shift the reading frame by grouping letters into threes starting one position later, however, we get: OMA NDD ADA REM AD. The frameshift results in a message that no longer makes sense.
An important point to note here is that the nucleotides in a gene are not physically organized into groups of three. Instead, what constitutes a codon is simply a matter of where the ribosome begins reading, and of what sequence of nucleotides comes after the start codon. Mutations that insert or delete a single nucleotide may alter reading frame, resulting in the production of a “gibberish” protein similar to the scrambled sentence in the example above.
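The effect of shifting the frame is easy to reproduce in code. In the Python sketch below, the helper `codons_in_frame` (a name chosen for this example) cuts a sequence into groups of three starting from a given offset. It is applied first to the sentence from the analogy, and then to a short mRNA from which one nucleotide has been deleted.

```python
def codons_in_frame(sequence, offset):
    """Split `sequence` into groups of three, starting at position `offset`.

    Leftover letters at the end that do not fill a group are dropped.
    """
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]

message = "MOMANDDADAREMAD"
print(codons_in_frame(message, 0))  # ['MOM', 'AND', 'DAD', 'ARE', 'MAD']
print(codons_in_frame(message, 1))  # ['OMA', 'NDD', 'ADA', 'REM']

# Deleting a single nucleotide shifts every codon that follows it.
mrna = "AUGAUCUCGUAA"
mutated = mrna[:3] + mrna[4:]       # remove the nucleotide right after AUG
print(codons_in_frame(mrna, 0))     # ['AUG', 'AUC', 'UCG', 'UAA']
print(codons_in_frame(mutated, 0))  # ['AUG', 'UCU', 'CGU']
```

Note that the deletion also removes the in-frame stop codon in this example, so the shifted reading would run on past the original end of the coding sequence.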

One amino acid, many codons

As previously mentioned, the genetic code consists of 64 unique codons. But if there are only 20 amino acids, what are the other 44 codons doing? As we saw, a few are stop codons, but most are not. Instead, the genetic code turns out to be a degenerate code, meaning that some amino acids are specified by more than one codon. For example, proline is represented by four different codons (CCU, CCC, CCA, and CCG). If any one of these codons appears in an mRNA, it will cause proline to be added to the polypeptide chain.
Most of the amino acids in the genetic code are encoded by at least two codons. In fact, methionine and tryptophan are the only amino acids specified by a single codon. Importantly, the reverse isn't true: each codon specifies just one amino acid or stop signal. Thus, there's no ambiguity (uncertainty) in the genetic code. A particular codon in an mRNA will always be predictably translated into a particular amino acid or stop signal.
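One way to check these statements is to invert the codon table, grouping codons by the amino acid they specify. The Python sketch below does this for the standard genetic code, written with one-letter amino acid symbols and `*` for stop; the variable names are our own.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, one-letter symbols ("*" = stop), in U, C, A, G order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = [a + b + c for a in BASES for b in BASES for c in BASES]
codon_table = dict(zip(codons, AMINO_ACIDS))

# Invert the table: amino acid symbol -> list of codons that specify it.
codons_for = defaultdict(list)
for codon, symbol in codon_table.items():
    codons_for[symbol].append(codon)

print(codons_for["P"])       # ['CCU', 'CCC', 'CCA', 'CCG'] (proline)
print(codons_for["*"])       # ['UAA', 'UAG', 'UGA'] (stop codons)

single_codon = sorted(s for s, group in codons_for.items() if len(group) == 1)
print(single_codon)          # ['M', 'W'] (methionine and tryptophan)
print(len(codons_for) - 1)   # 20 amino acids, not counting "*"
```

The forward table is a plain dictionary, so each codon can have only one entry. That mirrors the lack of ambiguity in the code: many codons may share an amino acid, but no codon has two meanings.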

The genetic code is (nearly) universal

With some minor exceptions, all living organisms on Earth use the same genetic code. This means that the codons specifying the 20 amino acids in your cells are the same as those used by the bacteria inhabiting hydrothermal vents at the bottom of the Pacific Ocean. Even in organisms that don't use the "standard" code, the differences are relatively small, such as a change in the amino acid encoded by a particular codon.
A genetic code shared by diverse organisms provides important evidence for the common origin of life on Earth. That is, the many species on Earth today likely evolved from an ancestral organism in which the genetic code was already present. Because the code is essential to the function of cells, it would tend to remain unchanged in species across generations, as individuals with significant changes might be unable to survive. This type of evolutionary process can explain the remarkable similarity of the genetic code across present-day organisms.
This article is licensed under a CC BY-NC-SA 4.0 license.
