Main content
AP®︎/College Computer Science Principles
Computing basic statistics
Once we have data stored in a text file, spreadsheet, or database, we can compute statistics describing the data set.
There are many tools we can use for data analysis, depending on our needs and skills. We'll step through our analysis here in two of the most popular tools, spreadsheets and SQL, so that you can choose the one that works best for you. Our goal is to gain insights about the data, so any tool that can yield insights is equally helpful.
Using statistical functions
First things first: we need data. We always hear that fast food is bad for us, so let's analyze some fast food data and find out for ourselves. 🍔
I've loaded nutritional information about fast food menu items into a Google Spreadsheet and SQL database.
Now let's try to answer some questions about the data...
How much data is there?
The simplest statistic about a data set is the number of rows. That's the first thing we should find out, since it helps us understand how comprehensive our data set is (or is not), and make better sense of the results we draw from it.
- In a spreadsheet, we can look at the row numbers on the left side or use the
COUNTA
function. - In SQL, we can use the
COUNT
function.
What are the averages for the numeric columns?
To understand the range of values for a column, we can compute statistical metrics like the average, as well as more sophisticated metrics like median, mode, and standard deviation.
- In a spreadsheet, use the
AVERAGE
function on the desired column. - In SQL, use the
AVG
function.
What are the largest and smallest values for a column?
Another way of understanding numeric data is to compute the minimum and maximum.
- In a spreadsheet, use the
MIN/MAX
functions on the desired column. - In SQL, also use the
MIN/MAX
functions.
What is the total value of a column?
It can be useful to sum up the values in a column. For example, many companies track metrics that relate to their ability to succeed financially, like purchases or page views, and totaling those metrics helps them see how well they're doing.
- In a spreadsheet, use the
SUM
function on the desired column. - In SQL, use the
SUM
function.
Filtering data
When we want to compute statistics on a subset of a data set, we filter the data. It might be that we don't care much about burgers, but we really want to know all about the milkshakes.
The simplest filter is to look at only the rows where a column is exactly equal to a particular value. For example, we can filter down the fast food data set to only the rows where the "type" is "Milkshake".
- In a spreadsheet, use the
IF
functions, likeCOUNTIF
to count the number of rows that are equal to a value. Related functions areAVERAGEIF
,SUMIF
,MINIFS
,MAXIFS
. - In SQL, use the same functions as before, but add a
WHERE
clause with a condition.
Filters can get much more sophisticated. A filter can use conditions based on whether a column is less than or greater to a value, like
calories > 500
. A filter can also combine conditions on multiple columns, like calories > 500 AND serving_size < 200
. It all depends on how you want to slice and dice the data.Summarizing by group
Our earlier strategies all compute a single statistic, either of the entire data set or a subset. Sometimes we want to view a summary of statistics according to some grouping of the data, like the total number of items for each restaurant or the average calories for each food type.
- In a spreadsheet, create a pivot table that groups by a particular column and shows the desired statistic for that column.
- In SQL, use
GROUP BY
on the column.
Now that we know how to make a table summarizing statistics by group, we can answer multiple questions at once.
🙋🏽🙋🏻♀️🙋🏿♂️Do you have any questions about this topic? We'd love to answer—just ask in the questions area below!
Want to join the conversation?
- how does this affect the economy(4 votes)
- real question is how does this affect lebrons legacy(17 votes)
- Will this be on the AP exam?(6 votes)
- nope, just extra practice(2 votes)
- I opened the google spreadsheets link for summarising by group but got nowhere. Where is the pivot table editor? What do you mean by "specify the column to group by under "Rows"". I'm guessing this has something to do with the fact that I opened the link in a browser instead of an actual software that would give me this function?(3 votes)
- The pivot table editor is a feature within Google Sheets that allows you to create and customize pivot tables. To access the pivot table editor in Google Sheets, you can follow these steps:
1. Open your Google Sheets document.
2. Click on the tab at the bottom of the page that contains the data you want to summarize.
3. Select the range of cells you want to include in your pivot table.
4. Click on "Data" in the top menu bar, and then select "Pivot table" from the drop-down menu.
5. The pivot table editor will open on the right side of your screen.
Once you have the pivot table editor open, you can specify the column to group by under "Rows" by clicking on the "Add" button next to "Rows" and selecting the column you want to group by from the drop-down menu. This will create a new row in your pivot table for each unique value in that column, and the data in the other columns will be summarized based on the aggregation function you choose.
Hope this helps :)(2 votes)
- For the very last question i used this query:
SELECT restaurant, MAX(calories) FROM fast_foods GROUP BY restaurant;
However the order of the rows turns out alphabetically by the name of the restaurant, not by max calories count. How do I fix this in SQL so it turns out as in the example?(0 votes)- From the author:You need to add an ORDER BY at the end. Here's my query: https://www.khanacademy.org/computer-programming/ap-cs-p-example-fast-foods-db-group-by/4650304395640832(3 votes)
- Im confused about if it shows that i completed it or not(1 vote)
- Let’s say we are learning something new and our learning takes more time and is not exponential, but linear relative to time spent. If our learning rate is 20% per half an hour for the topic, how long will it take until we have learned 70% of the topic?(0 votes)
- 20% / half hour = 40% / hr
40% / hr * t = 70%
t = 70/40 hr
t = 1.75 hr(0 votes)
- The First Question has a wrong answer. The Average Calories are 224.
The Average Fat is 532.
Can someone correct this lesson?(0 votes)- From the author:I see an average fat of 28. The average of 532 for calories seems to be correct according to both the spreadsheet and the SQL: https://docs.google.com/spreadsheets/d/1XdrbwwIRasz3fBhXmBS4_b58k6sZNEhPq25ZZlozujY/edit#gid=743229954
https://www.khanacademy.org/computer-programming/ap-cs-p-example-fast-foods-db-average/5434016657932288
Can you show how you got to the 532 number? Thanks!(3 votes)