- Splitting data into related tables
- JOINing related tables
- Challenge: Bobby's Hobbies
- Joining related tables with left outer joins
- Challenge: Customer's orders
- Joining tables to themselves with self-joins
- Challenge: Sequels in SQL
- Combining multiple joins
- Challenge: FriendBook
- Project: Famous people
- More efficient SQL with query planning and optimization
More efficient SQL with query planning and optimization
Now that you've learned many ways of selecting data and are starting to do
SELECTs across multiple tables, it's a good time to talk about the efficiency of your SQL queries - how quickly do they execute, and could they execute faster?
SQL is a declarative language - each query declares what we want the SQL engine to do, but it doesn't say how. As it turns out, the how -- the "plan" -- is what affects the efficiency of the queries, however, so it's pretty important.
Why do SQL queries need a plan?
For example, let's say we had this simple query:
SELECT * FROM books WHERE author = "J K Rowling";
For this query, these are 2 different ways that SQL could find the results:
- Do a "full table scan": look at every single row in the table, return the matching rows.
- Create an "index": Make a copy of the table sorted by author, then do a binary search to find the row where the author is "J K Rowling", find the matching IDs, then do a binary search on the original table that returns the rows that match the ID.
Which one is faster? It depends on the data, and on how often the query will be executed. If the table is only 10 rows long, then a full table scan only requires looking at 10 rows, and the first plan would work out well.
If the table was 10 million rows long, then that full table scan would require looking at 10 million rows. It would be faster to do a binary search on a sorted table - we only need 23 lookups to find a value in 10 million rows. However, creating the sorted table would take a while (~230 million operations, depending on our engine). If we were executing that query many times (more than 23 times) or if we already had that table created, then that second plan would be better.
How does a SQL engine decide which plan to pick? That's an important step that we haven't talked about yet, because we've been focused on the syntax of our queries, not their implementation. As you get into more advanced SQL usage on big databases, the planning step is increasingly important.
The lifecycle of a SQL query
We can think of the SQL engine going through these steps for each query we give it:
Parse, then Optimize, then Execute
- The query parser makes sure that the query is syntactically correct (e.g. commas out of place) and semantically correct (i.e. the tables exist), and returns errors if not. If it's correct, then it turns it into an algebraic expression and passes it to the next step.
- The query planner and optimizer does the hard thinking work. It first performs straightforward optimizations (improvements that always result in better performance, like simplifying 5*10 into 50). It then considers different "query plans" which may have different optimizations, estimates the cost (CPU and time) of each query plan based on the number of rows in the relevant tables, then it picks the optimal plan and passes it on to the next step.
- The query executor takes the plan and turns it into operations for the database, returning the results back to us if there are any.
Where do humans come in?
The query planning and optimization happens for every query, and you could go your whole life issuing SQL queries and not realize it. However, once you start dealing with bigger data sets, you'll start to care more about the speed of your queries, and you might find yourself wondering if there's any way you can improve the performance of your queries.
Many times, especially for complex queries, there are indeed ways you can help optimize a query, and that's known as "query tuning".
The first step is to identify what queries you want to tune, which you can figure out by looking at which of your database calls are taking the longest or using the most resources, like with a SQL profiler. Sometimes, you might discover a badly performing query after it takes so long that it takes down your whole database. Hopefully, you find that out earlier.
The next step is to understand how a particular SQL engine is executing a query, and all SQL systems come with a way to ask the engine. In SQLite, you can stick
EXPLAIN QUERY PLANin front of any SQL to see what it's doing behind the scenes. If you use that, be prepared to dig deep into the
EXPLAIN QUERY PLANreference, because the "explanation" is pretty detailed and implementation specific. If you're using another SQL engine, you can search for "how do I get an execution plan in X".
Now comes the hard part: manual optimization to improve that execution plan. This is also the part that is often dependent on the particularities of the SQL engine you're using, and the particularities of your own data.
For example, remember that query we discussed at the top? If we knew ahead of time that we would want to do hundreds of queries that restricted
WHEREon the author column, then we could explicitly create the index, using
CREATE INDEX. Then the SQL engine would be able to use that index to efficiently find the matching rows. You can read this guide about SQLite query planning to help you understand when indexes would help.
Creating indexes can often make repeated queries more efficient. But there are many other approaches as well. For SQLite, you can get more insight in their query planner overview and take careful note of the "manual" sections.
We can't cover all the complexities of query optimization and query tuning here, so I recommend that you dive deeper into it when you need it.
(Here are some deep dives into different SQL query planners that I found interesting: SQL Server Query Optimizer, Oracle SQL Tuning, MSSQL Execution Plan Basics)
Want to join the conversation?
- Under "Why Do SQL Queries Need A Plan? You said that "we only need 23 lookups to find a value in 10 million rows" if we do a binary search on a sorted table. Why is it 23 lookups? I do not see a correlation.(88 votes)
- log base 2 of 10 million is 23. When using a binary search you want to know, 2 to the power of what will equal the number of data points? 2 to the 23rd power is about 10 million.(242 votes)
- What SQL Query Plan does Khan Academy (SQLlite) use ?(14 votes)
- The query plan is different for each SQL query and it is really easy to see any of the query plans used by Khan Academy's SQLite instance.
You can see them by going back to any of the queries we have been working on and simply paste
EXPLAIN QUERY PLANon a line that is directly on top of a functioning
SELECTstatement. You should immediately see the query plan for that
SELECTstatement in the results window.(42 votes)
- If there are SQL profilers to help you identify queries that run longer, are there other similar products that suggest modifications to SQL queries to make them run faster? Are there optimization metric tools that measure the differences between the two models (scanning the whole table vs. creating the index)? - Trying to understand why the manual methods are needed since there are some query planning tools listed, I feel like this could be entirely automated.(11 votes)
- In my experience as a software analyst for a large software company, sometimes investigatory queries are needed to drill on data. usually one time runs that generate a highly tailored report for a client. Manually tailoring your query is necessary as the query will not be used over and over again in the software but still has a deliverable. Sometimes we are working with a few thousand entries, other times we are north of 50 million. we don't like to leave it to chance as the software has users present 24 hours a day. Taking the system down with a poorly designed query is a big 'no-no', and usually is identified after the incident. The company I worked for had a proprietary query optimizer, but sometimes we needed to break from it's logic to complete a complex report with minimal user impact. Does that help as to why we would manually tailor a query vs letting an automation decide ?(19 votes)
- I don't really understand the difference between the "full table scan" and Create an "index"; what does a binary search entail?(2 votes)
- The article describes Do a "full table scan" as "look at every single row in the table, return the matching rows." Remember that your data was likely entered in a random order, not in order by author or title. (It may be somewhat chronological if books entered were published after the database was created.)
A full table scan would mean that the system would start with the first record and check the criteria. If the record matches, it is included in the query result. If the record does not match, it is not included in the query record. The system would then check the second record. If this record matches the criteria, it is included in the query result. If the record does not match, it is not included in the query record. The system would then check the third record. If this record matches the criteria, it is included in the query result. If the record does not match, it is not included in the query record. The system would then check the fourth record and so on.
When we create an "index". we make a copy of the original data table. We can then sort that table sorted by author, by title, or by date (or by any other field we choose). Then we could do a binary search. as Jim E. has described.
If we had sorted the data in our copied table by
author, a binary search will start with the middle row of the table and check the
authorfield. Because we know that our data is sorted by
author, if the `author in the middle row is "Tolkien,", we know that "Rowling" comes, alphabetically, before "Tolkien" and the records for "Rowling" must be in the first half of the table.
The system will then check the middle row of the first half of the data. If the author is "Gappah," we know that, alphabetically, "Rowling" comes after "Gappah" and the "Rowling" records must be in the second half of the first half of the table. A binary search will continue dividing the data in half and until the records are found or there is no more data to check.(30 votes)
- Please help me im not clear on how to do JOIN and ON.(4 votes)
- You use a
JOINclause when the data you are trying to select is stored in more than one table. In your
FROMclause, you would have code like
tablename1 JOIN tablename2
You use the keyword
ONto indicate the field that relates the two tables, like
ON tablename1.id = tablename2.table1_id(19 votes)
- Let's say there are five students, each of them took five tests during the entire semester. Now I want to not only see their average scores of the five tests, but also want to label their average scores using letters. So how can I combine the "avg" command with the "case...when..."? I tried many times myself, but it didn't work. Some help? Thanks!(3 votes)
- Not sure if you still need help, but if you do can you please post what you have got so far so we can provide better help(5 votes)
- Doesn't a binary search just return a true or false value based on whether or not an element exists within an array and would not return the specific row where the author is 'jk rowling.'(3 votes)
- that depends on the code! The most basic version of a binary search algorithm will do as you say, and return a True or False. But many search algorithms use the logic of a basic binary search to return the location or value of the element.
Here if we wanted every book where author is 'jk rowling' it would be pretty simple to write a search algorithm that takes in the sorted index and the value "jk rowling", does a binary search to find the location of any row where the author is "jk rowling", and then keeps looking up and down until it finds a row where author is not "jk rowling", and return the range of rows where author is "jk rowling". This is more than a basic binary search, but it uses the ordered table and binary search to find the queried data quickly.(4 votes)
- This may be a dumb question. When you create an index, is it permanent? As I understand it, the table is indexed by default using the primary key. So if you are using autoincrement, which I assume is the only way to manage a large tables primary key, what happens if you index that table to another column. The primary key would be out of order, so in the above case, wouldn't an insert result in potentially 10 million more lookups for autoincrement to identify the next integer available? I apologize in advance if this is answered elsewhere.
One further question would be if a row had been removed, is that integer reused as the next available primary key? At my job, our CRM is a database with a large number of tables. Ive always been frustrated that we cannot purge the database of customer accounts that have been inactive. But learning about how the relational information would need to match I can see a missed table in the purge could definitely destabilize the relationships between tables.(2 votes)
- Hello Steve,
A primary key will always have an index. In fact, it is an index with a unique constraint.
If you add another index, that means your table will have two indices. They don't interfere much with each other. So your lookup by the PK will not suffer.
It will suffer when you have indices on many columns, and you're querying all those columns. Overusing indices can work against you in some cases.
A table may have an auto_increment or not. If it does have auto_increment, you can leave the PK out of your insert statements. It will calculate the next value on its own.
It will memorize a value, add 1, and give that value to the next record.
When you delete a record, it will not decrease that memorized value. It will continue counting as though nothing happened.
It is very common for companies to keep customer data even if they're inactive. There are a number of reasons.
One is billing information and taxes. By law, you have to keep records of customers for a number of years. Or at least their invoices.
Keeping the records also allows you to get statistics.
It is possible to keep a separate archive database, but that opens a whole other can of worms.(4 votes)
- On this discussion you talked about typing "EXPLAIN QUERY PLAN" to see which kind of plan an engine uses, my question is where do we write this statement?(2 votes)
- Here is an example using the code at the end of the "Combine Multiple Joins" video:
You will have to type it yourself to see the results.
EXPLAIN QUERY PLAN SELECT a.title, b.title FROM project_pairs
JOIN student_projects a
ON project_pairs.project1_id = a.id
JOIN student_projects b
ON project_pairs.project2_id = b.id;(4 votes)
- CREATE TABLE persons (
id INTEGER PRIMARY KEY AUTOINCREMENT,
INSERT INTO persons (fullname, age) VALUES ("Bobby McBobbyFace", "12");
INSERT INTO persons (fullname, age) VALUES ("Lucy BoBucie", "25");
INSERT INTO persons (fullname, age) VALUES ("Banana FoFanna", "14");
INSERT INTO persons (fullname, age) VALUES ("Shish Kabob", "20");
INSERT INTO persons (fullname, age) VALUES ("Fluffy Sparkles", "8");
CREATE table hobbies (
id INTEGER PRIMARY KEY AUTOINCREMENT,
INSERT INTO hobbies (person_id, name) VALUES (1, "drawing");
INSERT INTO hobbies (person_id, name) VALUES (1, "coding");
INSERT INTO hobbies (person_id, name) VALUES (2, "dancing");
INSERT INTO hobbies (person_id, name) VALUES (2, "coding");
INSERT INTO hobbies (person_id, name) VALUES (3, "skating");
INSERT INTO hobbies (person_id, name) VALUES (3, "rowing");
INSERT INTO hobbies (person_id, name) VALUES (3, "drawing");
INSERT INTO hobbies (person_id, name) VALUES (4, "coding");
INSERT INTO hobbies (person_id, name) VALUES (4, "dilly-dallying");
INSERT INTO hobbies (person_id, name) VALUES (4, "meowing");
CREATE table friends (
id INTEGER PRIMARY KEY AUTOINCREMENT,
INSERT INTO friends (person1_id, person2_id)
VALUES (1, 4);
INSERT INTO friends (person1_id, person2_id)
VALUES (2, 3);
SELECT persons.fullname, hobbies.name as hobbies FROM persons
ON hobbies.person_id = persons.id;
SELECT a.fullname, b.fullname, friends.person1_id, friends.person2_id FROM persons, friends
JOIN persons a
ON friends.person1_id = a.id
JOIN persons b
ON friends.person2_id = b.id;
I need some help with fixing this code.(3 votes)
- In Step 2 of the Friendbook Challenge, we are asked to
You only need the two
use another SELECT with a JOIN to show the names of each
pair of friends, based on the data in the
fullnamefields in the
SELECTclause. Also, you don't need the
personstable in the
FROMclause - only in the