If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

## AP®︎/College Computer Science Principles

### Course: AP®︎/College Computer Science Principles>Unit 5

Lesson 1: Data tools

# Computing basic statistics

Once we have data stored in a text file, spreadsheet, or database, we can compute statistics describing the data set.
There are many tools we can use for data analysis, depending on our needs and skills. We'll step through our analysis here in two of the most popular tools, spreadsheets and SQL, so that you can choose the one that works best for you. Our goal is to gain insights about the data, so any tool that can yield insights is equally helpful.

### Using statistical functions

First things first: we need data. We always hear that fast food is bad for us, so let's analyze some fast food data and find out for ourselves. 🍔

#### How much data is there?

The simplest statistic about a data set is the number of rows. That's the first thing we should find out, since it helps us understand how comprehensive our data set is (or is not), and make better sense of the results we draw from it.
How many rows are there of fast foods data?

#### What are the averages for the numeric columns?

To understand the range of values for a column, we can compute statistical metrics like the average, as well as more sophisticated metrics like median, mode, and standard deviation.
What's the average number of calories (rounded)?

#### What are the largest and smallest values for a column?

Another way of understanding numeric data is to compute the minimum and maximum.
What's the smallest serving size?

What's the highest amount of sodium?

#### What is the total value of a column?

It can be useful to sum up the values in a column. For example, many companies track metrics that relate to their ability to succeed financially, like purchases or page views, and totaling those metrics helps them see how well they're doing.
If you ate every menu item on the list, how many calories would that be?

### Filtering data

When we want to compute statistics on a subset of a data set, we filter the data. It might be that we don't care much about burgers, but we really want to know all about the milkshakes.
The simplest filter is to look at only the rows where a column is exactly equal to a particular value. For example, we can filter down the fast food data set to only the rows where the "type" is "Milkshake".
• In a spreadsheet, use the IF functions, like COUNTIF to count the number of rows that are equal to a value. Related functions are AVERAGEIF, SUMIF, MINIFS, MAXIFS.
• In SQL, use the same functions as before, but add a WHERE clause with a condition.
How many milkshakes are in the data set?

Filters can get much more sophisticated. A filter can use conditions based on whether a column is less than or greater to a value, like calories > 500. A filter can also combine conditions on multiple columns, like calories > 500 AND serving_size < 200. It all depends on how you want to slice and dice the data.

### Summarizing by group

Our earlier strategies all compute a single statistic, either of the entire data set or a subset. Sometimes we want to view a summary of statistics according to some grouping of the data, like the total number of items for each restaurant or the average calories for each food type.
Now that we know how to make a table summarizing statistics by group, we can answer multiple questions at once.