If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

## AP®︎/College Computer Science Principles

### Course: AP®︎/College Computer Science Principles>Unit 5

Lesson 1: Data tools

# Computing basic statistics

Once we have data stored in a text file, spreadsheet, or database, we can compute statistics describing the data set.
There are many tools we can use for data analysis, depending on our needs and skills. We'll step through our analysis here in two of the most popular tools, spreadsheets and SQL, so that you can choose the one that works best for you. Our goal is to gain insights about the data, so any tool that can yield insights is equally helpful.

### Using statistical functions

First things first: we need data. We always hear that fast food is bad for us, so let's analyze some fast food data and find out for ourselves. 🍔

#### How much data is there?

The simplest statistic about a data set is the number of rows. That's the first thing we should find out, since it helps us understand how comprehensive our data set is (or is not), and make better sense of the results we draw from it.
How many rows are there of fast foods data?

#### What are the averages for the numeric columns?

To understand the range of values for a column, we can compute statistical metrics like the average, as well as more sophisticated metrics like median, mode, and standard deviation.
What's the average number of calories (rounded)?

#### What are the largest and smallest values for a column?

Another way of understanding numeric data is to compute the minimum and maximum.
What's the smallest serving size?

What's the highest amount of sodium?

#### What is the total value of a column?

It can be useful to sum up the values in a column. For example, many companies track metrics that relate to their ability to succeed financially, like purchases or page views, and totaling those metrics helps them see how well they're doing.
If you ate every menu item on the list, how many calories would that be?

### Filtering data

When we want to compute statistics on a subset of a data set, we filter the data. It might be that we don't care much about burgers, but we really want to know all about the milkshakes.
The simplest filter is to look at only the rows where a column is exactly equal to a particular value. For example, we can filter down the fast food data set to only the rows where the "type" is "Milkshake".
• In a spreadsheet, use the IF functions, like COUNTIF to count the number of rows that are equal to a value. Related functions are AVERAGEIF, SUMIF, MINIFS, MAXIFS.
• In SQL, use the same functions as before, but add a WHERE clause with a condition.
How many milkshakes are in the data set?

Filters can get much more sophisticated. A filter can use conditions based on whether a column is less than or greater to a value, like calories > 500. A filter can also combine conditions on multiple columns, like calories > 500 AND serving_size < 200. It all depends on how you want to slice and dice the data.

### Summarizing by group

Our earlier strategies all compute a single statistic, either of the entire data set or a subset. Sometimes we want to view a summary of statistics according to some grouping of the data, like the total number of items for each restaurant or the average calories for each food type.
Now that we know how to make a table summarizing statistics by group, we can answer multiple questions at once.
What's the highest calorie count of the menu items from White Castle?

Which restaurant offers the menu item with the highest calorie count?

## Want to join the conversation?

• how does this affect the economy
• real question is how does this affect lebrons legacy
• Will this be on the AP exam?
• nope, just extra practice
• I opened the google spreadsheets link for summarising by group but got nowhere. Where is the pivot table editor? What do you mean by "specify the column to group by under "Rows"". I'm guessing this has something to do with the fact that I opened the link in a browser instead of an actual software that would give me this function?
• The pivot table editor is a feature within Google Sheets that allows you to create and customize pivot tables. To access the pivot table editor in Google Sheets, you can follow these steps:

2. Click on the tab at the bottom of the page that contains the data you want to summarize.
3. Select the range of cells you want to include in your pivot table.
4. Click on "Data" in the top menu bar, and then select "Pivot table" from the drop-down menu.
5. The pivot table editor will open on the right side of your screen.

Once you have the pivot table editor open, you can specify the column to group by under "Rows" by clicking on the "Add" button next to "Rows" and selecting the column you want to group by from the drop-down menu. This will create a new row in your pivot table for each unique value in that column, and the data in the other columns will be summarized based on the aggregation function you choose.

Hope this helps :)
• For the very last question i used this query:

SELECT restaurant, MAX(calories) FROM fast_foods GROUP BY restaurant;

However the order of the rows turns out alphabetically by the name of the restaurant, not by max calories count. How do I fix this in SQL so it turns out as in the example?
• Im confused about if it shows that i completed it or not
(1 vote)
• Let’s say we are learning something new and our learning takes more time and is not exponential, but linear relative to time spent. If our learning rate is 20% per half an hour for the topic, how long will it take until we have learned 70% of the topic?