If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Main content

Splitting data into related tables

So far, we've just been working with one table at a time, and seeing what interesting data we can select out of that table. But actually, most of the time, we have our data distributed across multiple tables, and those tables are "related" to each other in some way.
For example, let's say we have a table for logging how well students do on their tests, and we include emails in case we need to email parents about slipping grades:
Peter Rabbitpeter@rabbit.comNutrition95
Alice Wonderlandalice@wonderland.comNutrition92
Peter Rabbitpeter@rabbit.comChemistry85
Alice Wonderlandalice@wonderland.comChemistry95
We might also have a table for logging what books each student reads:
Peter RabbitThe Tale of Mrs. Tiggy-WinkleBeatrix Potter
Peter RabbitJabberwockyLewis Carroll
Alice WonderlandThe Hunting of the SnarkLewis Carroll
Alice WonderlandJabberwockyLewis Carroll
We might also have a table just for detailed student information:
What do you think of these tables? Would you change them in any way?
There is one big thing to realize about these tables: they are describing relational data - as in, they are describing data that relates to each other. Each of these tables describe data related to a particular student, and many of the tables replicate the same data. When the same data is replicated across multiple tables, there can be interesting consequences.
For example, what if a student's email changed? Which tables would we need to change?
We'd need to change the student information table, but since we also included that data in the grades table, we'd also have to find every row about that student, and change the email there too.
It's often preferable to make sure that a particular column of data is only stored in a single location, so there are fewer places to update and less risk of having different data in different places. If we do that, we need to make sure we have a way to relate the data across the tables, which we'll get to later.
Let's say we decide to remove email from the grades table, because we realize it's redundant with the email in the student details table. This is what we'd have:
Peter RabbitNutrition95
Alice WonderlandNutrition92
Peter RabbitChemistry85
Alice WonderlandChemistry95
How could we figure out the email for each student? We could find the row in the student info table, matching by name. What if 2 students had the same name? (Did you know that in Bali, every person has only 1 of 4 possible first names?) We can't rely on name to look up a student, and really, we should never rely on something like name to identify anything uniquely in a table.
So the best thing to do is to remove the student_name and replace that with student_id, since that is a guaranteed unique identifier:
We would make the same change to our books table, using student_id instead of student_name:
1The Tale of Mrs. Tiggy-WinkleBeatrix Potter
1JabberwockyLewis Carroll
2The Hunting of the SnarkLewis Carroll
2JabberwockyLewis Carroll
Notice how we have the title and author repeated twice for Jabberwocky? That's another warning sign that we could break our table up into multiple related tables, so that we don't have to update multiple places if something changes about a book.
We could have a table just about books:
1The Tale of Mrs. Tiggy-WinkleBeatrix Potter
2JabberwockyLewis Carroll
3The Hunting of the SnarkLewis Carroll
And then our student_books table becomes:
I know, this table doesn't look nearly as readable as the old table that had all of information stuffed into every row. But, tables are often not designed to be readable to humans-- they're designed to be the easiest to maintain and least prone to bugs. In many cases, it may be best to split information into multiple related tables, so that there is less redundant data and fewer places to update.
It's important to understand how to use SQL to deal with data that has been split up into multiple related tables, and bring the data back together across the tables when you need it. We do that using a concept called "join"s and that's what I'll show you next.

Want to join the conversation?

  • leaf yellow style avatar for user pratik nagori
    what is the purpose of having so many different tables, when we can have a single table with all information.?
    (143 votes)
    Default Khan Academy avatar avatar for user
  • marcimus purple style avatar for user Isabella
    I have many questions about this topic.
    1. Why can't you make something like a variable in SQL? I mean, you can do it in JavaScript, but why not SQL?
    2. Why do you have to go row by row changing each thing if you can just change a variable name?
    3. And why not put all the information from all the tables into 1 table?
    4. What do you need an id for?
    5. What's the importance of SQL tables?
    Please answer one of my questions.
    (37 votes)
    Default Khan Academy avatar avatar for user
    • spunky sam blue style avatar for user Dalendrion
      These are all good questions. It shows you're really thinking about this. I like that.

      1. SQL itself is not a programming language. It's a Query Language. You would use another programming language to write a program that sends queries to your database.
      Here I use JavaScript with a hypothetical performSqlQuery function.
      var query = "SELECT * FROM students WHERE id = " + studentId;
      var result = performSqlQuery(query);
      This leads straight to...

      2. You can use variables of the programming language where you create your SQL queries.

      3. We want to avoid putting duplicate data into the database.
      You could put things into one table like this:
      │ INVOICE │
      │ invoicenr │ customernr │ date │ name │ address │ zipcode │ city │ country │ iban │ product │ count | price │ vat │
      │ "0003417" │ "93276" │ "2017-09-01" │ "Joseph Smith" │ "120 Penstreet" │ "3829AD" │ "Somecity" │ "Somecountry" │ "GB29NWBK60161331926819" │ "Brause 361 nib" │ 1 │ 2.99 │ 19│
      │ "0003417" │ "93276" │ "2017-09-01" │ "Joseph Smith" │ "120 Penstreet" │ "3029AD" │ "Somecity" │ "Somecountry" │ "GB29NWBK60161331926819" │ "Art paper A4" │ 1 │ 13.29 │ 19│
      │ "0003417" │ "93276" │ "2017-09-01" │ "Joseph Smith" │ "120 Penstreet" │ "3829AD" │ "Somecity" │ "Somecountry" │ "GB29NWBK60161331926819" │ "Pen holder, blue" │ 1 │ 5.19 │ 19│
      (You may have to copy this somewhere to view correctly.)

      The problem here is that a lot of data is the same. Only the product and the price are different.
      Another problem is keeping the data consistent. In this table someone made a slight mistake. Can you spot it? This invoice may never arrive at the correct address.

      It's better to split the data. This way you have less data to store, but more importantly, the data is only in one place. If you want to change someone's ZIP code for example, you don't have to go looking for it in many different places.
      │ CUSTOMER │
      │ id │ customernr │ firstname │ prefix │ lastname │ address_id │ iban │
      │ 1 │ "93276" │ "Joseph" │ "" │ "Smith" │ 1 │ "GB29NWBK60161331926819" │
      │ ADDRESS │
      │ id │ streetname │ housenr │ zipcode │ city │ country │
      │ 1 │ "Penstreet" │ "120" │ "3829AD" │ "Somecity" │ "Somecountry" │
      │ PRODUCT │
      │ id │ name │ price │ vat │
      │ 1 │ "Brause 361 nib" │ 2.99 │ 19 │
      │ 2 │ "Art paper A4" │ 13.29 │ 19 │
      │ 3 │ "Pen holder, blue" │ 5.19 │ 19 │
      │ INVOICE │
      │ id │ invoicenr │ customer_id │ date │
      │ 1 │ "0003417" │ 1 │ "2017-09-01" │
      │ id │ invoice_id │ product_id │ count │
      │ 1 │ 1 │ 1 │ 1 │
      │ 2 │ 1 │ 2 │ 1 │
      │ 3 │ 1 │ 3 │ 1 │
      (You may have to copy this somewhere to view correctly.)

      4. In the INVOICE table you can see a customer_id of 1. That means that that particular invoice belongs to the customer with id 1. In this case that's Joseph Smith.
      Can you tell how the INVOICELINE table works?

      5. Perhaps, if you understand the previous answers, you have enough information to answer this yourself. :)
      (170 votes)
  • purple pi purple style avatar for user Js
    What's the different between sub query and "join"? When do we need to use join or sub query? Which is better?
    (27 votes)
    Default Khan Academy avatar avatar for user
  • mr pants teal style avatar for user Zachary Thomas
    What's a scenario where a book detail would change? Seems static to me.
    (1 vote)
    Default Khan Academy avatar avatar for user
  • mr pants teal style avatar for user R Ska
    What are some resources on how good relational databases are structured? I will only be querying databases and not building them. It is good to know some of the basics when it comes to structure, I imagine.
    (9 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user trinah.itadal
    How do the different tables help to organize?
    (3 votes)
    Default Khan Academy avatar avatar for user
    • aqualine ultimate style avatar for user AD Baker
      Typically, in a database system, each table will represent a different entity - students, classrooms, courses - or it will be a a bridge table (which is used to linked tables with many-to-many relationships).

      Keeping the data related to an entity in a single table logically organized and it can reduce the number of rows to a minimum. It also keeps data easier to maintain.

      Imagine you had a table that listed a row with a teacher's name, the course the teacher is teaching, and the classroom number. Now, imagine that you have one teacher who teaches eight different courses. This means you have rows with the teacher's name repeated eight times.

      A few days into the term, you discover that the teacher's name has been misspelled. If you had a single table with teacher names, you could just update the name in one row in the teachers table. However, in this case, you need to change the name eight different times.

      That may not sound so bad. Now, imagine that you are working for a large company that does business with other large companies. One of the companies that supplies you with hundreds of parts decides to rebrand and change its name. Now, imagine having to change the name across hundreds of parts and thousands of documents. Imagine trying to remember every place that the name is used. Changing the name in one table is much faster.
      (15 votes)
  • female robot amelia style avatar for user Shruti Jalihal
    When I have multiple Tables can I maintain a directory of sorts to help me relate which table stands for what?
    What happens in an Organization where there are hundreds of rows and columns and so many tables all relating to one another. And the person who created that database is no longer working there. How will I know what table is for what? And how does one maintain a directory, is there such a thing for SQL called "directory"?
    (5 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user Jondar4565
      There is a "SYSTEM" database that is maintained by database system itself, that is a collection of tables containing metadata about the "USER" databases. For example you can get a list of tables in your database by entering SELECT TABLE_NAME FROM SYSTEM.TABLES where DATABASE_NAME = "MY_MUSIC"; The actual syntax and SYSTEM tables is dependent on the specific database vendor. The column names, datatypes, primary keys, default values, foreign keys and other constraints plus much more are also in system tables
      (8 votes)
  • winston baby style avatar for user Matthew Vaughan
    By breaking down information into these multiple tables would that slow down the software or application because it is having to sift through so many tables? I guess what I'm asking is if it can be beneficial to require more detailed queries rather than writing in depth tables.
    (2 votes)
    Default Khan Academy avatar avatar for user
    • spunky sam blue style avatar for user Dalendrion
      Searching for it might be a bit slower, but writing the data may be quicker because you can omit duplicate data.
      Updating data will be even quicker and easier, because you don't have to search everywhere. This is the main reason to separate data into tables: if you do it right, you only have to enter a piece of data once.
      (12 votes)
  • blobby green style avatar for user Xiao Wang
    while it did not specify which SQL in this article that was used. I am learning SQL server and when I copied the code into SQL server, it does not work. Have you considered if the code would work on different platforms?
    (6 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user mimithechat
      KhanAcademy says in the introduction to this unit that they use SQLite, which is one particular version. The similarities between versions are greater than the differences, so the concepts carry over, but for your particular use you might have slightly different syntax so you should carefully read the documentation for the specific system you're trying to copy into.
      (33 votes)
  • starky seedling style avatar for user xcamacho
    when do games come in khan academy?
    (4 votes)
    Default Khan Academy avatar avatar for user