# Finding patterns in data sets

AP.CSP: DAT‑2 (EU), DAT‑2.A (LO), DAT‑2.A.2 (EK), DAT‑2.A.3 (EK), DAT‑2.D (LO), DAT‑2.D.1 (EK), DAT‑2.D.5 (EK), DAT‑2.E.3 (EK)
We often collect data so that we can find patterns in the data, like numbers trending upwards or correlations between two sets of numbers.
Depending on the data and the patterns, sometimes we can see that pattern in a simple tabular presentation of the data. Other times, it helps to visualize the data in a chart, like a time series, line graph, or scatter plot.
Let's explore examples of patterns that we can find in the data around us.

### Spotting trends

A trending quantity is a number that is generally increasing or decreasing.
Consider this data on babies per woman in India from 1955-2015:

| Year | Babies per woman |
|------|------------------|
| 1960 | 5.91 |
| 1970 | 5.59 |
| 1980 | 4.83 |
| 1990 | 4.05 |
| 2000 | 3.31 |
| 2010 | 2.60 |

In this case, the numbers are steadily decreasing decade by decade, so this is a downward trend.
Now consider this data about US life expectancy from 1920-2000:

| Year | Life expectancy |
|------|-----------------|
| 1920 | 55.38 |
| 1930 | 59.57 |
| 1940 | 63.24 |
| 1950 | 68.07 |
| 1960 | 69.86 |
| 1970 | 70.86 |
| 1980 | 73.91 |
| 1990 | 75.4 |
| 2000 | 76.9 |
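
One rough way to check a trend in tabular data is to look at the sign of each step-to-step change. The sketch below does that for the two tables above. The helper name `describe_trend` is hypothetical, and the all-or-nothing rule it uses is a simplifying assumption; real data with fluctuations usually needs a more forgiving test.

```python
# Decade-by-decade values copied from the two tables above.
babies_per_woman_india = {
    1960: 5.91, 1970: 5.59, 1980: 4.83,
    1990: 4.05, 2000: 3.31, 2010: 2.60,
}
life_expectancy_us = {
    1920: 55.38, 1930: 59.57, 1940: 63.24, 1950: 68.07, 1960: 69.86,
    1970: 70.86, 1980: 73.91, 1990: 75.4, 2000: 76.9,
}


def describe_trend(series):
    """Hypothetical helper: label a series by the sign of its changes.

    Returns "upward" if every step increases, "downward" if every step
    decreases, and "mixed" otherwise.
    """
    values = [series[year] for year in sorted(series)]
    changes = [later - earlier for earlier, later in zip(values, values[1:])]
    if all(change > 0 for change in changes):
        return "upward"
    if all(change < 0 for change in changes):
        return "downward"
    return "mixed"


print("Babies per woman (India):", describe_trend(babies_per_woman_india))
print("Life expectancy (US):", describe_trend(life_expectancy_us))
```

With decade-level data, both series move in a single direction at every step, which is why the trends are easy to spot in the tables.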

#### Visualizing with charts

Let's try identifying upward and downward trends in charts, like a time series graph.
This graph from GapMinder visualizes the babies per woman in India, based on data points for each year instead of each decade:
A line graph with years on the x axis and babies per woman on the y axis. The x axis goes from 1960 to 2010 and the y axis goes from 2.6 to 5.9. The line starts at 5.9 in 1960 and slopes downward until it reaches 2.5 in 2010.
There is a clear downward trend in this graph, and it appears to be nearly a straight line from 1968 onwards.
📉 Chart choices: The x axis goes from 1960 to 2010, and the y axis goes from 2.6 to 5.9. Would the trend be more or less clear with different axis choices? Experiment with the options on GapMinder to see for yourself.
This is a graph of life expectancy from GapMinder, again based on data points for each year instead of each decade:
A line graph with years on the x axis and life expectancy on the y axis. The x axis goes from 1920 to 2000, and the y axis goes from 55 to 77. A line starts at 55 in 1920 and slopes upward (with some variation), ending at 77 in 2000.
The trend isn't as clearly upward in the first few decades, when the line fluctuates up and down, but it becomes obvious in the decades after that.
📉 Chart choices: The x axis goes from 1920 to 2000, and the y axis starts at 55. How do those choices affect our interpretation of the graph? Try changing the options on GapMinder to see for yourself.
Google Analytics is used by many websites (including Khan Academy!) to track user behavior.
This Google Analytics chart shows the page views for our AP Statistics course from October 2017 through June 2018:
A line graph with months on the x axis and page views on the y axis. The x axis goes from October 2017 to June 2018. The y axis goes from 0 to 1.5 million. The chart starts at around 250,000 and stays close to that number through December 2017. It then slopes upward until it reaches 1 million in May 2018. After that, it slopes downward for the final month.
What trends are apparent in this chart?

#### Statistical fluctuations

Google Trends is a site that visualizes the popularity of Google search terms over time.
We can use Google Trends to research the popularity of "data science", a new field that combines statistical data analysis and computational skills.
This is their graph for "data science" from April 2014 to April 2019:
A line graph with time on the x axis and popularity on the y axis. The x axis goes from April 2014 to April 2019, and the y axis goes from 0 to 100. A very jagged line starts around 12 and increases until it ends around 80.
That graph shows a large amount of fluctuation over the time period (including big dips at Christmas each year). Yet, it also shows a fairly clear increase over time.
When we're dealing with fluctuating data like this, we can calculate the "trend line" and overlay it on the chart (or ask a charting application to add it for us). A trend line smooths out the data and makes the overall trend clearer, if there is one to be found.
Here's the same graph with a trend line added:
A line graph with time on the x axis and popularity on the y axis. The x axis goes from April 2014 to April 2019, and the y axis goes from 0 to 100. A very jagged line starts around 12 and increases until it ends around 80. A straight line is overlaid on top of the jagged line, starting and ending near the same places as the jagged line.
The trend line shows a very clear upward trend, which is what we expected. It helps that we chose to visualize the data over such a long time period, since this data fluctuates seasonally throughout the year.
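
A straight trend line is often computed with an ordinary least-squares fit. The sketch below shows one way to do that by hand. The Google Trends values aren't listed in this article, so it reuses the US life expectancy figures from the earlier table as stand-in data. The function name `fit_trend_line` is hypothetical, and a charting application may use a different method for its trend lines.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Stand-in data: US life expectancy by decade, from the table earlier on.
years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
life_expectancy = [55.38, 59.57, 63.24, 68.07, 69.86, 70.86, 73.91, 75.4, 76.9]

slope, intercept = fit_trend_line(years, life_expectancy)
print(f"Trend: {slope:+.3f} years of life expectancy per calendar year")

# Points on the trend line, which could be overlaid on a chart of the raw data.
trend_points = [(year, slope * year + intercept) for year in years]
```

A positive slope corresponds to an upward trend and a negative slope to a downward trend; the closer the raw points sit to the line, the more the line tells you.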
Whenever you're analyzing and visualizing data, consider ways to collect the data that will account for fluctuations. For time-based data, there are often fluctuations across the week (due to the difference between weekdays and weekends) and fluctuations across the seasons.

### Making predictions

One reason we analyze data is to come up with predictions.
Consider this data on average tuition for 4-year private universities:

| School year | Tuition |
|-------------|---------|
| 2011-12 | $30,210 |
| 2012-13 | $30,970 |
| 2013-14 | $31,570 |
| 2014-15 | $32,140 |
| 2015-16 | $33,180 |
| 2016-17 | $34,100 |

We can see clearly that the numbers are increasing each year from 2011 to 2016. To make a prediction, we need to understand the rate at which the numbers are increasing.
One way to do that is to calculate the percentage change year-over-year. Here's the same table with that calculation as a third column:
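
| School year | Tuition | One year % change |
|-------------|---------|-------------------|
| 2011-12 | $30,210 | |
| 2012-13 | $30,970 | 2.5% |
| 2013-14 | $31,570 | 1.9% |
| 2014-15 | $32,140 | 1.8% |
| 2015-16 | $33,180 | 3.2% |
| 2016-17 | $34,100 | 2.8% |
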
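
The year-over-year percentage change is the difference between two consecutive values divided by the earlier value. As a minimal sketch, the calculation for the tuition figures above could look like this (the function name `percent_changes` is hypothetical):

```python
# Tuition figures copied from the table above.
tuition = [
    ("2011-12", 30210),
    ("2012-13", 30970),
    ("2013-14", 31570),
    ("2014-15", 32140),
    ("2015-16", 33180),
    ("2016-17", 34100),
]


def percent_changes(values):
    """Return the percentage change between each pair of consecutive values."""
    return [
        (later - earlier) / earlier * 100
        for earlier, later in zip(values, values[1:])
    ]


rates = percent_changes([amount for _, amount in tuition])

# The first school year has no earlier value to compare against, so skip it.
for (school_year, _), rate in zip(tuition[1:], rates):
    print(f"{school_year}: {rate:.1f}%")
```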
It can also help to visualize the increasing numbers in graph form:
A line graph with years on the x axis and tuition cost on the y axis. The x axis goes from 2011 to 2016, and the y axis goes from 30,000 to 35,000. There are 6 dots, one for each year on the axis, and the dots get higher as the years increase. A line connects the dots.
If the rate was exactly constant (and the graph exactly linear), then we could easily predict the next value. However, in this case, the rate varies between 1.8% and 3.2%, so predicting is not as straightforward.
Let's try a few ways of making a prediction for 2017-2018:

| Strategy | Predicted change | Predicted tuition |
|----------|------------------|-------------------|
| Most recent rate | 2.8% | $35,054.80 |
| Average last 3 rates | 2.6% | $34,986.60 |
| Average all rates | 2.44% | $34,932.04 |

Which strategy do you think is the best?
As it turns out, the actual tuition for 2017-2018 was $34,740. It increased by only 1.9%, less than any of our strategies predicted. The closest was the strategy that averaged all the rates.
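
The three strategies can be compared with a few lines of arithmetic. The sketch below assumes the rounded one-year rates from the table above as inputs, so its results may differ by a few cents from a calculation based on unrounded rates. The variable names are illustrative only.

```python
last_tuition = 34100    # 2016-17 tuition from the table
actual_tuition = 34740  # actual 2017-18 tuition, as reported above

# Rounded one-year percentage changes from the table above (an assumption:
# unrounded rates would give slightly different predictions).
rates = [2.5, 1.9, 1.8, 3.2, 2.8]

strategies = {
    "Most recent rate": rates[-1],
    "Average last 3 rates": sum(rates[-3:]) / 3,
    "Average all rates": sum(rates) / len(rates),
}

# Apply each predicted rate to the most recent tuition figure.
predictions = {
    name: last_tuition * (1 + rate / 100)
    for name, rate in strategies.items()
}

for name, predicted in predictions.items():
    error = predicted - actual_tuition
    print(f"{name}: predicted ${predicted:,.2f}, off by ${error:,.2f}")

# The strategy whose prediction lands nearest to the actual value.
closest = min(predictions, key=lambda name: abs(predictions[name] - actual_tuition))
print("Closest strategy:", closest)
```

Comparing each prediction against the actual value, as the last lines do, is a simple way to judge a prediction strategy once the real number is known.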
Statisticians and data analysts typically use a technique called linear regression, which finds the line that best fits the data so we can make predictions based on that line. With this data, a linear regression also predicts 2.44%.
How could we make more accurate predictions? We could try to collect more data and incorporate that into our model, like considering the effect of overall economic growth on rising college tuition.
Ultimately, we need to understand that a prediction is just that, a prediction. More data and better techniques help us to predict the future better, but nothing can guarantee a perfectly accurate prediction.

### Finding correlations

Another goal of analyzing data is to compute the correlation, the statistical relationship between two sets of numbers.
A correlation can be positive, negative, or not exist at all. A scatter plot is a common way to visualize the correlation between two sets of numbers.
There's a positive correlation between temperature and ice cream sales:
A scatter plot with temperature on the x axis and sales amount on the y axis. The x axis goes from 0 degrees Celsius to 30 degrees Celsius, and the y axis goes from $0 to $800. 19 dots are scattered on the plot, with the dots generally getting higher as the x axis increases.
As temperatures increase, ice cream sales also increase.
There's a negative correlation between temperature and soup sales:
A scatter plot with temperature on the x axis and sales amount on the y axis. The x axis goes from 0 degrees Celsius to 30 degrees Celsius, and the y axis goes from $0 to $800. 19 dots are scattered on the plot, with the dots generally getting lower as the x axis increases.
As temperatures increase, soup sales decrease.
There's no correlation between temperature and salt sales:
A scatter plot with temperature on the x axis and sales amount on the y axis. The x axis goes from 0 degrees Celsius to 30 degrees Celsius, and the y axis goes from $0 to $800. 19 dots are scattered on the plot, all between $350 and $750. There is no particular slope to the dots; they are equally distributed in that range for all temperature values.
The increase in temperature isn't related to salt sales.
Statisticians and data analysts typically express the correlation as a number between −1 and 1, where −1 is a strong negative correlation, 1 is a strong positive correlation, and 0 is no correlation. You can learn more about correlation coefficients on Khan Academy.
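
One widely used correlation coefficient is the Pearson coefficient. The sketch below computes it by hand. The data points behind the scatter plots aren't listed in this article, so it reuses the two tables from earlier, pairing each year with its value, as stand-in data. The function name `pearson_r` is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)


# Stand-in data: babies per woman in India, by decade.
india_years = [1960, 1970, 1980, 1990, 2000, 2010]
babies_per_woman = [5.91, 5.59, 4.83, 4.05, 3.31, 2.60]

# Stand-in data: US life expectancy, by decade.
us_years = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000]
life_expectancy = [55.38, 59.57, 63.24, 68.07, 69.86, 70.86, 73.91, 75.4, 76.9]

r_babies = pearson_r(india_years, babies_per_woman)
r_life = pearson_r(us_years, life_expectancy)

print(f"Year vs. babies per woman: r = {r_babies:.2f}")
print(f"Year vs. life expectancy:  r = {r_life:.2f}")
```

The first coefficient comes out close to −1 and the second close to 1, matching the downward and upward trends seen in those tables.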
A variation on the scatter plot is a bubble plot, where the dots are sized based on a third dimension of the data.
Here's a bubble plot from GapMinder that compares income to life expectancy, with each dot representing a country and its population:
A bubble plot with income on the x axis and life expectancy on the y axis. The x axis goes from 400 to 128,000, using a logarithmic scale that doubles at each tick. The y axis goes from 19 to 86. Bubbles of various colors and sizes are scattered across the middle of the plot, getting generally higher as the x axis increases.
📉 Chart choices: The dots are colored based on the continent, with green representing the Americas, yellow representing Europe, blue representing Africa, and red representing Asia. The y axis goes from 19 to 86, and the x axis goes from 400 to 96,000, using a logarithmic scale that doubles at each tick. A logarithmic scale is a common choice when a dimension of the data changes so extremely.
As countries move up on the income axis, they generally move up on the life expectancy axis as well. There's a positive correlation between income and life expectancy.
Here's another bubble plot from GapMinder, this time comparing CO2 emissions to life expectancy:
A bubble plot with CO2 emissions on the x axis and life expectancy on the y axis. The x axis goes from 0 to 100, using a logarithmic scale that goes up by a factor of 10 at each tick. The y axis goes from 19 to 86. Bubbles of various colors and sizes are scattered across the middle of the plot, starting around a life expectancy of 60 and getting generally higher as the x axis increases.
📉 Chart choices: This time, the x axis goes from 0.0 to 250, using a logarithmic scale that goes up by a factor of 10 at each tick.
We once again see a positive correlation: as CO2 emissions increase, life expectancy increases.
Wait a second, does this mean that we should earn more money and emit more carbon dioxide in order to guarantee a long life? No, not necessarily.
Correlation does not imply causation. A correlation tells us that there is some sort of association between two sets of numbers, but it does not tell us why there's an association.
In this case, the correlation is likely due to a hidden cause that's driving both sets of numbers, like overall standard of living.
In other cases, a correlation might be just a big coincidence. There are plenty of fun examples online of spurious correlations.
Finding a correlation is just a first step in understanding data. It can't tell you the cause, but it can point you in the direction of possible causes and experiments to learn more.
A bubble plot with productivity on the x axis and hours worked on the y axis. The x axis goes from $0/hour to $100/hour. The y axis goes from 1,400 to 2,400 hours. Bubbles of various colors and sizes are scattered on the plot, starting around 2,400 hours for $2/hour and getting generally lower on the plot as the x axis increases.