AP®︎/College Computer Science Principles
Finding patterns in data sets
We often collect data so that we can find patterns in the data, like numbers trending upwards or correlations between two sets of numbers.
Depending on the data and the patterns, sometimes we can see that pattern in a simple tabular presentation of the data. Other times, it helps to visualize the data in a chart, like a time series, line graph, or scatter plot.
Let's explore examples of patterns that we can find in the data around us.
A trending quantity is a number that is generally increasing or decreasing.
Consider this data on babies per woman in India from 1955-2015:
|Year||Babies per woman|
In this case, the numbers are steadily decreasing decade by decade, so this is a downward trend.
Now consider this data about US life expectancy from 1920-2000:
Source: Gapminder, Life expectancy at birth.
In this case, the numbers are steadily increasing decade by decade, so this is an upward trend.
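One rough way to turn "generally increasing or decreasing" into a check is sketched below. The function name `describe_trend`, the 75% threshold, and the sample numbers are all assumptions made for illustration; they are not the Gapminder data, and this is only one of many reasonable definitions of a trend.

```python
def describe_trend(values, min_share=0.75):
    """Label a series as 'upward', 'downward', or 'no clear trend'.

    A series counts as trending when at least `min_share` of its
    step-to-step changes go in one direction AND the last value ends up
    on that side of the first value. The threshold is an arbitrary choice.
    """
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    if not steps:
        return "no clear trend"
    up_share = sum(1 for s in steps if s > 0) / len(steps)
    down_share = sum(1 for s in steps if s < 0) / len(steps)
    if up_share >= min_share and values[-1] > values[0]:
        return "upward"
    if down_share >= min_share and values[-1] < values[0]:
        return "downward"
    return "no clear trend"


# Made-up sample series (not real measurements).
falling = [6, 5, 5.5, 4, 3, 2]
rising = [55, 58, 57, 63, 68, 70, 74, 77]
jumpy = [3, 5, 2, 6, 3]

print(describe_trend(falling))
print(describe_trend(rising))
print(describe_trend(jumpy))
```

Note that a series can still be labeled as trending when it has an occasional step in the opposite direction, which matches the "generally" in the definition.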
Visualizing with charts
Let's try identifying upward and downward trends in charts, like a time series graph.
This graph from GapMinder visualizes the babies per woman in India, based on data points for each year instead of each decade:
A line graph with years on the x axis and babies per woman on the y axis. The x axis goes from 1960 to 2010 and the y axis goes from 2.6 to 5.9. The line starts at 5.9 in 1960 and slopes downward until it reaches 2.5 in 2010.
There is a clear downward trend in this graph, and it appears to be nearly a straight line from 1968 onwards.
📉 Chart choices: The x axis goes from 1960 to 2010, and the y axis goes from 2.6 to 5.9. Would the trend be more or less clear with different axis choices? Experiment with the options on GapMinder to see for yourself.
This is a graph of life expectancy from GapMinder, again based on data points for each year instead of each decade:
A line graph with years on the x axis and life expectancy on the y axis. The x axis goes from 1920 to 2000, and the y axis goes from 55 to 77. A line starts at 55 in 1920 and slopes upward (with some variation), ending at 77 in 2000.
The trend isn't as clearly upward in the first few decades, when the line moves up and down, but it becomes obvious in the decades after that.
📉 Chart choices: The x axis goes from 1920 to 2000, and the y axis starts at 55. How do those choices affect our interpretation of the graph? Try changing the options on GapMinder to see for yourself.
Check your understanding
Google Analytics is used by many websites (including Khan Academy!) to track user behavior.
This Google Analytics chart shows the page views for our AP Statistics course from October 2017 through June 2018:
A line graph with months on the x axis and page views on the y axis. The x axis goes from October 2017 to June 2018. The y axis goes from 0 to 1.5 million. The chart starts at around 250,000 and stays close to that number through December 2017. It then slopes upward until it reaches 1 million in May 2018. After that, it slopes downward for the final month.
What trends are apparent in this chart?
Google Trends is a site that visualizes the popularity of Google search terms over time.
We can use Google Trends to research the popularity of "data science", a new field that combines statistical data analysis and computational skills.
This is their graph for "data science" from April 2014 to April 2019:
A line graph with time on the x axis and popularity on the y axis. The x axis goes from April 2014 to April 2019, and the y axis goes from 0 to 100. A very jagged line starts around 12 and increases until it ends around 80.
That graph shows a large amount of fluctuation over the time period (including big dips at Christmas each year). Yet, it also shows a fairly clear increase over time.
When we're dealing with fluctuating data like this, we can calculate the "trend line" and overlay it on the chart (or ask a charting application to add it for us). A trend line smooths out the data and makes the overall trend clearer, if there is one to be found.
Here's the same graph with a trend line added:
A line graph with time on the x axis and popularity on the y axis. The x axis goes from April 2014 to April 2019, and the y axis goes from 0 to 100. A very jagged line starts around 12 and increases until it ends around 80. A straight line is overlaid on top of the jagged line, starting and ending near the same places as the jagged line.
The trend line shows a very clear upward trend, which is what we expected. It helps that we chose to visualize the data over such a long time period, since this data fluctuates seasonally throughout the year.
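A charting application usually computes a straight trend line with a least-squares fit. The sketch below shows that calculation under some assumptions: the function name `fit_trend_line` is hypothetical, the points are treated as evenly spaced in time (x = 0, 1, 2, ...), and the sample values are invented rather than taken from Google Trends.

```python
def fit_trend_line(ys):
    """Least-squares straight line through evenly spaced points.

    Returns (slope, intercept) for the line y = slope * x + intercept,
    where x is the position of each point in the list (0, 1, 2, ...).
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented, jagged "popularity" values that drift upward overall.
jagged = [12, 20, 15, 30, 22, 40, 35, 50]
slope, intercept = fit_trend_line(jagged)

# A positive slope means the fitted line rises: an upward trend.
print(f"slope per step: {slope:.2f}, starting level: {intercept:.2f}")
```

The sign of the slope gives the direction of the trend, and its size gives the typical change per time step once the fluctuations are averaged out.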
Whenever you're analyzing and visualizing data, consider ways to collect the data that will account for fluctuations. For time-based data, there are often fluctuations across the days of the week (due to the difference between weekdays and weekends) and fluctuations across the seasons.
One reason we analyze data is to come up with predictions.
Consider this data on average tuition for 4-year private universities:
We can see clearly that the numbers are increasing each year from 2011 to 2016. To make a prediction, we need to understand the rate at which the numbers are increasing.
One way to do that is to calculate the percentage change year-over-year. Here's the same table with that calculation as a third column:
|School year||Tuition||One year % change|
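The year-over-year calculation is the change divided by the earlier value, times 100. Here is a minimal sketch; the function name `percent_changes` and the sample amounts are assumptions for illustration and are not the tuition figures from the table.

```python
def percent_changes(values):
    """Percentage change from each value to the next one.

    For consecutive values old and new: (new - old) / old * 100.
    The result has one fewer entry than the input.
    """
    return [(new - old) / old * 100 for old, new in zip(values, values[1:])]


# Invented yearly amounts (not real tuition data).
amounts = [30_000, 30_900, 31_518]
changes = [round(c, 2) for c in percent_changes(amounts)]
print(changes)
```

Six yearly values produce five percentage changes, because the first year has no earlier year to compare against.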
It can also help to visualize the increasing numbers in graph form:
A line graph with years on the x axis and tuition cost on the y axis. The x axis goes from 2011 to 2016, and the y axis goes from 30,000 to 35,000. There are 6 dots, one for each year on the axis, and the dots get higher as the years increase. A line connects the dots.
If the rate were exactly constant (and the graph exactly linear), then we could easily predict the next value. However, in this case, the rate varies between 1.8% and 3.2%, so predicting is not as straightforward.
Let's try a few ways of making a prediction for 2017-2018:
|Strategy||Predicted change||Predicted tuition|
|Most recent rate||2.8%||$35,054|
|Average last 3 rates||2.6%||$34,986.60|
|Average all rates||2.44%||$34,932.04|
Which strategy do you think is the best? As it turns out, the actual tuition for 2017-2018 was $34,740. It increased by only 1.9%, less than any of our strategies predicted. The closest was the strategy that averaged all the rates.
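The comparison of strategies can be sketched in a few lines. The three rates and the actual 2017-2018 tuition come from the text above. The 2016-2017 starting value of $34,100 is an assumption: it is back-computed from the prediction table ($34,986.6 divided by 1.026), not read from the original data. The names `predict_next`, `strategies`, and `closest` are hypothetical.

```python
def predict_next(current, rate_percent):
    """Apply a percentage increase to the current value."""
    return current * (1 + rate_percent / 100)


# Assumed starting tuition, inferred from the prediction table.
base_tuition = 34_100
# Actual 2017-2018 tuition, as stated in the text.
actual_tuition = 34_740

# Predicted rates for each strategy, taken from the table.
strategies = {
    "Most recent rate": 2.8,
    "Average last 3 rates": 2.6,
    "Average all rates": 2.44,
}

predictions = {
    name: predict_next(base_tuition, rate) for name, rate in strategies.items()
}

# The best strategy here is the one with the smallest absolute error.
closest = min(predictions, key=lambda name: abs(predictions[name] - actual_tuition))

for name, value in predictions.items():
    error = value - actual_tuition
    print(f"{name}: ${value:,.2f} (off by ${error:,.2f})")
print("Closest:", closest)
```

Every strategy overshoots because the actual increase was smaller than any of the three predicted rates, so the lowest rate ends up closest.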
Statisticians and data analysts typically use a technique called linear regression, which finds the line that best fits the data so we can make predictions based on that line. With this data, a linear regression also predicts 2.44%.
How could we make more accurate predictions? We could try to collect more data and incorporate that into our model, like considering the effect of overall economic growth on rising college tuition.
Ultimately, we need to understand that a prediction is just that, a prediction. More data and better techniques help us to predict the future better, but nothing can guarantee a perfectly accurate prediction.
Another goal of analyzing data is to compute the correlation, the statistical relationship between two sets of numbers.
A correlation can be positive, negative, or not exist at all. A scatter plot is a common way to visualize the correlation between two sets of numbers.
There's a positive correlation between temperature and ice cream sales:
A scatter plot with temperature on the x axis and sales amount on the y axis. The x axis goes from 0 degrees Celsius to 30 degrees Celsius, and the y axis goes from $0 to $800. 19 dots are scattered on the plot, with the dots generally getting higher as the x axis increases.
There's a negative correlation between temperature and soup sales:
A scatter plot with temperature on the x axis and sales amount on the y axis. The x axis goes from 0 degrees Celsius to 30 degrees Celsius, and the y axis goes from $0 to $800. 19 dots are scattered on the plot, with the dots generally getting lower as the x axis increases.
There's no correlation between temperature and salt sales:
A scatter plot with temperature on the x axis and sales amount on the y axis. The x axis goes from 0 degrees Celsius to 30 degrees Celsius, and the y axis goes from $0 to $800. 19 dots are scattered on the plot, all between $350 and $750. There is no particular slope to the dots; they are equally distributed in that range for all temperature values.
Statisticians and data analysts typically express the correlation as a number between -1 and 1, where -1 is a strong negative correlation, 1 is a strong positive correlation, and 0 is no correlation. You can learn more about correlation coefficients on Khan Academy.
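The most common correlation coefficient is Pearson's r. The sketch below computes it by hand to show what goes into the number. The function name `pearson_r` is hypothetical, and the temperature and sales figures are invented to mimic the three scatter plots above; they are not the plotted data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (2 or more)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two lists vary together, and how much each varies alone.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)


# Invented sample data (temperatures in Celsius, sales in dollars).
temperatures = [5, 10, 15, 20, 25, 30]
ice_cream_sales = [120, 200, 310, 420, 500, 640]
soup_sales = [600, 520, 410, 330, 240, 150]
salt_sales = [500, 400, 600, 600, 400, 500]

print(f"ice cream: {pearson_r(temperatures, ice_cream_sales):.2f}")
print(f"soup:      {pearson_r(temperatures, soup_sales):.2f}")
print(f"salt:      {pearson_r(temperatures, salt_sales):.2f}")
```

With these made-up numbers, ice cream comes out close to the positive end of the range, soup close to the negative end, and salt at zero.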
A variation on the scatter plot is a bubble plot, where the dots are sized based on a third dimension of the data.
Here's a bubble plot from GapMinder that compares income to life expectancy, with each dot representing a country and its population:
A bubble plot with income on the x axis and life expectancy on the y axis. The x axis goes from 400 to 128,000, using a logarithmic scale that doubles at each tick. The y axis goes from 19 to 86. Bubbles of various colors and sizes are scattered across the middle of the plot, getting generally higher as the x axis increases.
📉 Chart choices: The dots are colored based on the continent, with green representing the Americas, yellow representing Europe, blue representing Africa, and red representing Asia. The y axis goes from 19 to 86, and the x axis goes from 400 to 96,000, using a logarithmic scale that doubles at each tick. A logarithmic scale is a common choice when a dimension of the data changes so extremely.
As countries move up on the income axis, they generally move up on the life expectancy axis as well. There's a positive correlation between income and life expectancy.
Here's another bubble plot from GapMinder, this time comparing CO2 emissions to life expectancy:
A bubble plot with CO2 emissions on the x axis and life expectancy on the y axis. The x axis goes from 0 to 100, using a logarithmic scale that goes up by a factor of 10 at each tick. The y axis goes from 19 to 86. Bubbles of various colors and sizes are scattered across the middle of the plot, starting around a life expectancy of 60 and getting generally higher as the x axis increases.
📉 Chart choices: This time, the x axis goes from 0.0 to 250, using a logarithmic scale that goes up by a factor of 10 at each tick.
We once again see a positive correlation: as CO2 emissions increase, life expectancy increases.
Wait a second, does this mean that we should earn more money and emit more carbon dioxide in order to guarantee a long life? No, not necessarily.
Correlation does not imply causation. A correlation tells us that there is some sort of association between two sets of numbers, but it does not tell us why there's an association.
In this case, the correlation is likely due to a hidden cause that's driving both sets of numbers, like overall standard of living.
In other cases, a correlation might be just a big coincidence. There are plenty of fun examples online of spurious correlations.
Finding a correlation is just a first step in understanding data. It can't tell you the cause, but it can point you in the direction of possible causes and experiments to learn more.
Check your understanding
Our World In Data is a non-profit website that collects and visualizes data about world trends.
Their research on Working Hours includes this chart that compares productivity (GDP per hour worked) to the average number of hours worked per person.
A bubble plot with productivity on the x axis and hours worked on the y axis. The x axis goes from $0/hour to $100/hour. The y axis goes from 1,400 to 2,400 hours. Bubbles of various colors and sizes are scattered on the plot, starting around 2,400 hours for $2/hour and getting generally lower on the plot as the x axis increases.
What best describes the relationship between productivity and work hours?