The genetic code links groups of nucleotides in an mRNA to amino acids in a protein. Start codons, stop codons, reading frame.

Introduction

Have you ever written a secret message to one of your friends? If so, you may have used a code to keep the message hidden. For instance, you may have replaced the letters of each word with numbers or symbols, following a particular set of rules. In order for your friend to understand the message, they would need to know the code and apply the same set of rules, in reverse, to decode it.
Decoding messages is also a key step in gene expression, in which information from a gene is read out to build a protein. In this article, we'll take a closer look at the genetic code, which allows DNA and RNA sequences to be "decoded" into the amino acids of a protein.

Background: Making a protein

Genes that provide instructions for proteins are expressed in a two-step process.
  • In transcription, the DNA sequence of a gene is "rewritten" in RNA. In eukaryotes, the RNA must go through additional processing steps to become a messenger RNA, or mRNA.
  • In translation, the sequence of nucleotides in the mRNA is "translated" into a sequence of amino acids in a polypeptide (protein chain).
If this is a new concept for you, you may want to learn more by watching Sal's video on transcription and translation.

Codons

Cells decode mRNAs by reading their nucleotides in groups of three, called codons. Here are some features of codons:
  • Most codons specify an amino acid
  • Three "stop" codons mark the end of a protein
  • One "start" codon, AUG, marks the beginning of a protein and also encodes the amino acid methionine
Codons in an mRNA are read during translation, beginning with a start codon and continuing until a stop codon is reached. mRNA codons are read from 5' to 3', and they specify the order of amino acids in a protein from N-terminus (methionine) to C-terminus.
The mRNA sequence is:
5'-AUGAUCUCGUAA-3'
Translation involves reading the mRNA nucleotides in groups of three, each of which specifies an amino acid (or provides a stop signal indicating that translation is finished).
5'-AUG AUC UCG UAA-3'
AUG → Methionine (Start)
AUC → Isoleucine
UCG → Serine
UAA → "Stop"
Polypeptide sequence: (N-terminus) Methionine-Isoleucine-Serine (C-terminus)
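The decoding steps in this example can be sketched in a few lines of Python. This is only an illustration, not a full translator: the `CODON_TABLE` dictionary and the `translate` function are names made up for this sketch, and the table holds only the four codons used in the example above.

```python
# Partial codon table: only the four codons from the example above.
# Any other codon would raise a KeyError in this sketch.
CODON_TABLE = {
    "AUG": "Met",   # methionine, also the start codon
    "AUC": "Ile",
    "UCG": "Ser",
    "UAA": "Stop",
}

def translate(mrna):
    """Read an mRNA 5'->3' from its first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, so nothing is translated
    peptide = []
    # Step through the sequence three nucleotides at a time.
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide

print("-".join(translate("AUGAUCUCGUAA")))  # Met-Ile-Ser
```

The list is built in reading order, so its first element corresponds to the N-terminus and its last to the C-terminus.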
The two ends of a strand of DNA or RNA are different from each other. That is, a DNA or RNA molecule has directionality.
  • At the 5’ end of the chain, the phosphate group of the first nucleotide in the chain sticks out. The phosphate group is attached to the 5' carbon of the sugar ring, which is why this is called the 5' end.
  • At the other end, called the 3’ end, the hydroxyl of the last nucleotide added to the chain is exposed. The hydroxyl group is attached to the 3' carbon of the sugar ring, which is why this is called the 3' end.
Many processes, such as DNA replication and transcription, can only take place in one particular direction relative to the directionality of a DNA or RNA strand.
You can learn more in the article on nucleic acids.
Polypeptides (chains of linked amino acids) have two distinct ends:
  • An N-terminus with an amino group exposed
  • A C-terminus with a carboxyl group exposed
During translation, the polypeptide is built from N- to C-terminus. You can learn more about N- and C-termini in the article on proteins and amino acids.

The genetic code table

The full set of relationships between codons and amino acids (or stop signals) is called the genetic code. The genetic code is often summarized in a table.
The codon table may look kind of intimidating at first. Fortunately, it's organized in a logical way, and it's not too hard to use once you understand this organization.
To see how the codon table works, let's walk through an example. Suppose that we are interested in the codon CAG and want to know which amino acid it specifies.
  1. First, we look at the left side of the table. The axis on the left side refers to the first letter of the codon, so we find C along the left axis. This tells us the (broad) row of the table in which our codon will be found.
  2. Next, we look at the top of the table. The upper axis refers to the second letter of the codon, so we find A along the upper axis. This tells us the column of the table in which our codon will be found.
The row and column from steps 1 and 2 intersect in a single box in the codon table, one containing four codons. It's often easiest to simply look at these four codons and see which one is the one you're looking for.
If you want to use the structure of the table to the maximum, however, you can use the third axis (on the right side of the table) corresponding to the intersect box. By finding the third nucleotide of the codon on this axis, you can identify the exact row within the box where your codon is found. For instance, if we look for G on this axis in our example above, we find that CAG encodes the amino acid glutamine (Gln).
Genetic code table. Each three-letter sequence of mRNA nucleotides corresponds to a specific amino acid, or to a stop codon. UGA, UAA, and UAG are stop codons. AUG is the codon for methionine, and is also the start codon.
Image credit: "The genetic code," by OpenStax College, Biology (CC BY 3.0).
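The three-axis lookup described above can be mimicked with a little index arithmetic. The sketch below assumes the bases are listed in the order U, C, A, G along each axis, and it stores the standard genetic code as a 64-letter string of one-letter amino acid codes ("*" for stop). The names `BASES`, `AMINO_ACIDS`, and `lookup` are made up for this illustration.

```python
BASES = "UCAG"  # assumed order of the bases along each axis

# One-letter amino acid codes for all 64 codons of the standard code.
# The first base changes slowest and the third base fastest; "*" = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with U
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)

def lookup(codon):
    """Return the one-letter amino acid code for an mRNA codon."""
    first, second, third = (BASES.index(base) for base in codon)
    # First letter picks a block of 16, second a group of 4, third one entry.
    return AMINO_ACIDS[16 * first + 4 * second + third]

print(lookup("CAG"))  # Q, the one-letter code for glutamine
```

Each of the three letters narrows the search, just as the first, second, and third axes do in the printed table.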
Notice that many amino acids are represented in the table by more than one codon. For instance, there are six different ways to "write" leucine in the language of mRNA (see if you can find all six).
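If you'd like to check your answer, the sketch below groups all 64 codons by the amino acid they specify. It assumes the standard genetic code written as a 64-letter string of one-letter codes in U, C, A, G order ("*" for stop); the variable names are chosen just for this illustration.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, "*" = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# product() yields UUU, UUC, UUA, UUG, UCU, ... which matches the
# order of the letters in AMINO_ACIDS.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

codons_for = defaultdict(list)
for codon, amino_acid in zip(codons, AMINO_ACIDS):
    codons_for[amino_acid].append(codon)

print(codons_for["L"])   # the six leucine codons
print(codons_for["M"])   # methionine has just one: ['AUG']
print(codons_for["*"])   # the three stop codons
```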
An important point about the genetic code is that it's universal. That is, with minor exceptions, virtually all species (from bacteria to you!) use the genetic code shown above for protein synthesis.

Reading frame

To reliably get from an mRNA to a protein, we need one more concept: that of reading frame. Reading frame determines how the mRNA sequence is divided up into codons during translation.
That's a pretty abstract concept, so let's look at an example to understand it better. The mRNA below can encode three totally different proteins, depending on the frame in which it's read:
mRNA sequence: 5'-UCAUGAUCUCGUAAGA-3'
Read in Frame 1:
5'-UCA UGA UCU CGU AAG A-3'
Ser-STOP-Ser-Arg-Lys
Read in Frame 2:
5'-U CAU GAU CUC GUA AGA-3'
His-Asp-Leu-Val-Arg
Read in Frame 3:
5'-UC AUG AUC UCG UAA GA-3'
Met(Start)-Ile-Ser-STOP
The start codon's position ensures that Frame 3 is chosen for translation of the mRNA.
So, how does a cell know which of these proteins to make? The start codon is the key signal. Because translation begins at the start codon and continues in successive groups of three, the position of the start codon ensures that the mRNA is read in the correct frame (in the example above, in Frame 3).
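Here is a short sketch of the same idea in Python. It splits the example mRNA into codons in each of the three frames, then uses the position of the first AUG to work out which frame would be translated. The function name `codons_in_frame` is made up for this example, and picking the first AUG is a simplification of how cells actually choose a start codon.

```python
def codons_in_frame(mrna, frame):
    """Split an mRNA into complete codons for frame 1, 2, or 3."""
    offset = frame - 1
    # Leftover nucleotides at the end (fewer than three) are dropped.
    return [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]

mrna = "UCAUGAUCUCGUAAGA"
for frame in (1, 2, 3):
    print("Frame", frame, ":", " ".join(codons_in_frame(mrna, frame)))

# Simplifying assumption: translation starts at the first AUG.
start = mrna.find("AUG")
print("Start codon is in frame", start % 3 + 1)  # 3
```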
Mutations (changes in DNA) that insert or delete one or two nucleotides can change the reading frame, causing an incorrect protein to be produced "downstream" of the mutation site:
Illustration shows a frameshift mutation in which the reading frame is altered by the deletion of two nucleotides.
Image credit: "The genetic code: Figure 3," by OpenStax College, Biology, CC BY 4.0.
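To see a frameshift in action, the sketch below deletes nucleotides from a short mRNA and splits the result into codons again. The sequence is the one from the codon example earlier in the article; the mutations themselves are hypothetical and chosen only for illustration.

```python
def codons(mrna):
    """Split an mRNA into complete codons, starting at the first nucleotide."""
    return [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]

original = "AUGAUCUCGUAA"

# Hypothetical mutation 1: delete one nucleotide (the A of the AUC codon).
minus_one = original[:3] + original[4:]

# Hypothetical mutation 2: delete a whole codon (three nucleotides).
minus_three = original[:3] + original[6:]

print(codons(original))     # ['AUG', 'AUC', 'UCG', 'UAA']
print(codons(minus_one))    # ['AUG', 'UCU', 'CGU']
print(codons(minus_three))  # ['AUG', 'UCG', 'UAA']
```

After the single-nucleotide deletion, every codon downstream of the mutation changes and the UAA stop codon is no longer in frame. Deleting three nucleotides removes one codon but leaves the rest of the frame intact.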

How was the genetic code discovered?

The story of how the genetic code was discovered is a pretty cool and epic one. We've placed our version in its own section below, so as not to distract you if you're in a hurry. However, if you have some time, it's definitely interesting reading.

Discovery of the code

To crack the genetic code, researchers needed to figure out how sequences of nucleotides in a DNA or RNA molecule could encode the sequence of amino acids in a polypeptide.
Why was this a tricky problem? Let's imagine a very simple code to get the idea. In this code, each nucleotide in a DNA or RNA molecule might code for one amino acid in a protein. But this code can't actually work, because there are 20 amino acids commonly found in proteins and just 4 nucleotide bases in DNA or RNA.
So, the code had to involve something more complex than a one-to-one matching of nucleotides and amino acids. But what?

The triplet hypothesis

In the mid-1950s, physicist George Gamow extended this line of thinking to predict that the genetic code was likely composed of triplets of nucleotides [1]. That is, he proposed that a group of 3 nucleotides in a gene might code for one amino acid in a protein.
Gamow's reasoning was that even a doublet code (2 nucleotides per amino acid) would not work, as it would allow for only 16 ordered groups of nucleotides (4²), too few to account for the 20 standard amino acids used to build proteins. A code based on nucleotide triplets, however, seemed promising: it would provide 64 unique sequences of nucleotides (4³), more than enough to cover the 20 amino acids.
Gamow had some other not-so-correct ideas about how the code was read (for example, he thought that the triplets overlapped, which we now know is not the case) [1]. However, his core insight – that a triplet code was the "minimum" that could cover all the amino acids – proved to be correct.
There are 16 unique groups of nucleotides if a doublet code is used, and 64 unique groups if a triplet code is used. Why is this the case? Let's take a closer look at the math behind these statements.

Doublet code

Let’s look at the doublet code first. In a doublet code, an ordered group of two nucleotides codes for one amino acid. How many such groups of two nucleotides can we make? We know that there are 4 different possibilities for each of the 2 nucleotides in the doublet (A, T, C, and G, if we use DNA bases).
If we put an A in the first position, then any of the four nucleotides can occupy the second position, resulting in four combinations (AA, AT, AG, AC) that begin with an A. We can repeat this for T (TT, TA, TC, TG), C (CC, CT, CA, CG), and G (GG, GC, GT, GA). If we count all of these possibilities, we'll find that there are 16 of them in total.
You may find it faster and more foolproof to use a mathematical shortcut to answer this type of question. Because we know there are 4 possible nucleotides for each position in the doublet, and because the order of the two slots matters, we can multiply the number of choices for each slot to calculate the number of possible groups as follows:
(4 possibilities for the first slot) · (4 possibilities for the second slot) = 4 · 4 = 16 possible ordered groups
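If you'd rather let a computer do the counting, the short sketch below lists every ordered pair of DNA bases using Python's standard `itertools.product`. The variable name `doublets` is just a label chosen for this example.

```python
from itertools import product

# Every ordered pair of bases; the same base may appear in both slots.
doublets = ["".join(pair) for pair in product("ATCG", repeat=2)]

print(doublets)       # AA, AT, AC, AG, TA, TT, ...
print(len(doublets))  # 16
```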

Triplet code

What about the triplet code? In this case, we can use the same mathematical reasoning, but must add an additional slot to our setup. There are now 3 positions to fill, and each can be occupied by any of the four bases (A, T, C, or G). Since there are 4 possible choices for each position, we can multiply as follows:
(4 possibilities for the first slot) · (4 possibilities for the second slot) · (4 possibilities for the third slot) = 4 · 4 · 4 = 64 possible ordered groups
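The same counting can be checked in Python, along with the "minimum" argument: the sketch below looks for the shortest group length whose number of ordered groups reaches 20. The function `min_codon_length` and its parameter names are made up for this illustration.

```python
from itertools import product

# Every ordered group of three bases.
triplets = ["".join(group) for group in product("ATCG", repeat=3)]
print(len(triplets))  # 64

def min_codon_length(n_bases=4, n_amino_acids=20):
    """Smallest group length that gives at least n_amino_acids groups."""
    length = 1
    while n_bases ** length < n_amino_acids:
        length += 1
    return length

print(min_codon_length())  # 3, because 4**2 = 16 < 20 but 4**3 = 64 >= 20
```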

Matching codons to amino acids

Gamow’s triplet hypothesis seemed logical and was widely accepted. However, it had not been experimentally proven, and researchers still did not know which triplets of nucleotides corresponded to which amino acids.
The cracking of the genetic code began in 1961, with work from the American biochemist Marshall Nirenberg. For the first time, Nirenberg and his colleagues were able to identify specific nucleotide triplets that corresponded to particular amino acids. Their success relied on two experimental innovations:
  • A way to make artificial mRNA molecules with specific, known sequences.
  • A system to translate mRNAs into polypeptides outside of a cell (a "cell-free" system). Nirenberg's system consisted of cytoplasm from burst E. coli cells, which contains all of the materials needed for translation.
First, Nirenberg synthesized an mRNA molecule consisting only of the nucleotide uracil (called poly-U). When he added poly-U mRNA to the cell-free system, he found that the polypeptides made consisted exclusively of the amino acid phenylalanine. Because the only triplet in poly-U mRNA is UUU, Nirenberg concluded that UUU might code for phenylalanine [2]. Using the same approach, he was able to show that poly-C mRNA was translated into polypeptides made exclusively of the amino acid proline, suggesting that the triplet CCC might code for proline [2].
mRNA sequence: 5'-...UUUUUUUUUUUU...-3' (poly-U mRNA)
UUU → phenylalanine (Phe)
Polypeptide sequence: (N terminus)...Phe-Phe-Phe-Phe...(C terminus)
Other researchers, such as the biochemist Har Gobind Khorana at University of Wisconsin, extended Nirenberg's experiment by synthesizing artificial mRNAs with more complex sequences. For instance, in one experiment, Khorana generated a poly-UC (UCUCUCUCUC…) mRNA and added it to a cell-free system similar to Nirenberg's [3,4].
He found that the poly-UC mRNA was translated into polypeptides with an alternating pattern of serine and leucine amino acids. These and other results confirmed that the genetic code was based on triplets, or codons. Today, we know that serine is encoded by the codon UCU, while leucine is encoded by CUC.
mRNA sequence: 5'-...UCUCUCUCUCUC...-3' (poly-UC mRNA)
UCU → serine (Ser)
CUC → leucine (Leu)
Polypeptide sequence: (N terminus)...Ser-Leu-Ser-Leu...(C terminus)
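One way to see why an alternating polypeptide fits a triplet code is to split a poly-UC sequence into groups of different sizes. The sketch below is only an illustration of that reasoning, not a description of how the experiment was analyzed; the `codons` helper and the 18-nucleotide test sequence are assumptions made for this example.

```python
def codons(mrna, offset=0, size=3):
    """Split an mRNA into complete groups of `size`, starting at `offset`."""
    return [mrna[i:i + size]
            for i in range(offset, len(mrna) - size + 1, size)]

poly_uc = "UC" * 9  # UCUCUCUCUCUCUCUCUC

# Read as triplets, every frame alternates between UCU and CUC.
for offset in range(3):
    print(offset, codons(poly_uc, offset))

# Read as doublets, each frame contains only one repeated group,
# which could not give a polypeptide with two alternating amino acids.
print(set(codons(poly_uc, 0, size=2)))  # {'UC'}
print(set(codons(poly_uc, 1, size=2)))  # {'CU'}
```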
By 1965, using the cell-free system and other techniques, Nirenberg, Khorana, and their colleagues had deciphered the entire genetic code. That is, they had identified the amino acid or "stop" signal corresponding to each one of the 64 nucleotide codons. For their contributions, Nirenberg and Khorana (along with another genetic code researcher, Robert Holley) received the Nobel Prize in 1968.
Photographs of Nirenberg and Khorana.
Left: Image modified from "Marshall Nirenberg and Heinrich Matthaei," by N. MacVicar (public domain). Right: "Har Gobind Khorana" (public domain).
I always like to imagine how cool it would have been to be one of the people who discovered the basic molecular code of life. Although we now know the code, there are many other biological mysteries still waiting to be solved (perhaps by you!).

Attribution:

This article is a modified derivative of "The genetic code," by OpenStax College, Biology, CC BY 4.0. Download the original article for free at http://cnx.org/contents/185cbf87-c72e-48f5-b51e-f14f21b5eabd@10.59.
The modified article is licensed under a CC BY-NC-SA 4.0 license.

Works cited:

  1. Lorch, M. (2012, August 16). The most beautiful wrong ideas in science. In Chemistry blog. Retrieved from http://www.chemistry-blog.com/2012/08/16/the-most-beautiful-wrong-ideas-in-science/.
  2. Nirenberg, M. (2004). Historical review: Deciphering the genetic code – a personal account. TRENDS in Biochemical Sciences, 29(1), 46-54. http://dx.doi.org/10.1016/j.tibs.2003.11.009.
  3. Gellene, Denise. (2011, November 14). H. Gobind Khorana, 89, Nobel-winning scientist, dies. The New York Times. Retrieved from http://www.nytimes.com/2011/11/14/us/h-gobind-khorana-1968-nobel-winner-for-rna-research-dies.html?_r=0.
  4. Nobel Media. (2014). Crack the code - how the code was cracked. In Nobelprize.org. Retrieved from https://www.nobelprize.org/educational/medicine/gene-code/history.html.

References:

Arnaud, M.B., Inglis, D.O., Skrzypek, M.S., Binkley, J., Shah, P., Wymore, F., Binkley, G., Miyasato, S.R., Simison, M., and Sherlock, G. (2013). CGD help: Non-standard genetic codes. In Candida genome database. Retrieved from http://www.candidagenome.org/help/code_tables.shtml.
Codon. (2014). In Scitable. Retrieved from http://www.nature.com/scitable/definition/codon-155.
Gellene, Denise. (2011, November 14). H. Gobind Khorana, 89, Nobel-winning scientist, dies. The New York Times. Retrieved from http://www.nytimes.com/2011/11/14/us/h-gobind-khorana-1968-nobel-winner-for-rna-research-dies.html?_r=0.
Guevara Vasquez, F. (2013). Cracking the genetic code. In ACCESS - cryptography 2013. Retrieved from http://www.math.utah.edu/~fguevara/ACCESS2013/Cracking_the_Code.pdf.
Nirenberg/Khorana: Breaking the genetic code. (n.d.). Retrieved from http://www.mhhe.com/biosci/genbio/raven6b/graphics/raven06b/howscientiststhink/14-lab.pdf.
Nirenberg, M. (2004). Historical review: Deciphering the genetic code – a personal account. TRENDS in Biochemical Sciences, 29(1), 46-54. http://dx.doi.org/10.1016/j.tibs.2003.11.009.
Nirenberg, M. and Leder, P. (1964). RNA codewords and protein synthesis. Science, 145(3639), 1399-1407. http://dx.doi.org/10.1126/science.145.3639.1399.
Nirenberg, M. W. and Matthaei, J. H. (1961). The dependence of cell-free protein synthesis in E. coli upon naturally occurring or synthetic polyribonucleotides. PNAS, 47(10), 1588-1602. http://dx.doi.org/10.1073/pnas.47.10.1588.
Office of NIH History. (n.d.). The poly-U experiment. In Deciphering the genetic code: Marshall Nirenberg. Retrieved from https://history.nih.gov/exhibits/nirenberg/HS4_polyU.htm.
Openstax College, Biology. (2015, September 29). The genetic code. In OpenStax CNX. Retrieved from http://cnx.org/contents/GFy_h8cu@9.87:QEibhJMi@8/The-Genetic-Code.
Purves, W. K., Sadava, D. E., Orians, G. H., and Heller, H.C. (2004). The genetic code. In Life: The science of biology (7th ed., pp. 239-241). Sunderland, MA: Sinauer Associates.
Raven, P. H., Johnson, G. B., Mason, K. A., Losos, J. B., and Singer, S. R. (2014). The genetic code. In Biology (10th ed., AP ed., pp. 282-284). New York, NY: McGraw-Hill.
Reece, J. B., Urry, L. A., Cain, M. L., Wasserman, S. A., Minorsky, P. V., and Jackson, R. B. (2011). The genetic code. In Campbell biology (10th ed., pp. 337-340). San Francisco, CA: Pearson.
Söll, D., Ohtsuka, E., Jones, D. S., Lohrmann, R., Hayatsu, H., Nishimura, S., and Khorana, H. G. (1965). Studies on polynucleotides, XLIX. Stimulation of the binding of aminoacyl-sRNA's to ribosomes by ribotrinucleotides and a survey of codon assignments for 20 amino acids. PNAS, 54(5), 1378-1385. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC219908/.