# Finding patterns in data sets

AP.CSP: DAT‑2 (EU), DAT‑2.A (LO), DAT‑2.A.2 (EK), DAT‑2.A.3 (EK), DAT‑2.D (LO), DAT‑2.D.1 (EK), DAT‑2.D.5 (EK), DAT‑2.E.3 (EK)

We often collect data so that we can find patterns in the data, like numbers trending upwards or correlations between two sets of numbers.

Depending on the data and the patterns, sometimes we can see that pattern in a simple tabular presentation of the data. Other times, it helps to visualize the data in a chart, like a time series, line graph, or scatter plot.

Let's explore examples of patterns that we can find in the data around us.

### Spotting trends

A trending quantity is a number that is generally increasing or decreasing.

Consider this data on babies per woman in India from 1960 to 2010:

Year | Babies per woman |
---|---|
1960 | 5.91 |
1970 | 5.59 |
1980 | 4.83 |
1990 | 4.05 |
2000 | 3.31 |
2010 | 2.60 |

In this case, the numbers are steadily *decreasing* decade by decade, so this is a **downward trend**.

Now consider this data about US life expectancy from 1920-2000:

Year | Life expectancy |
---|---|
1920 | 55.38 |
1930 | 59.57 |
1940 | 63.24 |
1950 | 68.07 |
1960 | 69.86 |
1970 | 70.86 |
1980 | 73.91 |
1990 | 75.4 |
2000 | 76.9 |

*Source: Gapminder, Life expectancy at birth.*

In this case, the numbers are steadily increasing decade by decade, so this is an **upward trend**.

#### Visualizing with charts
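Before turning to charts, the decade-by-decade comparison used for the two tables above can be sketched in Python. This is a minimal illustration, not a standard statistical test: the helper name `trend_direction` is made up for this example, and it only reports a trend when every consecutive change has the same sign, which is stricter than the "generally increasing or decreasing" definition given earlier.

```python
def trend_direction(values):
    """Classify a series by the sign of each consecutive change.

    Returns 'upward' if every step increases, 'downward' if every step
    decreases, and 'mixed' otherwise. (Hypothetical helper for illustration.)
    """
    changes = [later - earlier for earlier, later in zip(values, values[1:])]
    if all(change > 0 for change in changes):
        return "upward"
    if all(change < 0 for change in changes):
        return "downward"
    return "mixed"


# Babies per woman in India, one value per decade (first table above)
india_babies = [5.91, 5.59, 4.83, 4.05, 3.31, 2.60]

# US life expectancy, one value per decade (second table above)
us_life_expectancy = [55.38, 59.57, 63.24, 68.07, 69.86, 70.86, 73.91, 75.4, 76.9]

print("India babies per woman:", trend_direction(india_babies))
print("US life expectancy:", trend_direction(us_life_expectancy))
```

Data that fluctuates from point to point would come back as `mixed` under this strict rule, even if it rises or falls overall.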

Let's try identifying upward and downward trends in charts, like a time series graph.

This graph from GapMinder visualizes the babies per woman in India, based on data points for each year instead of each decade:

There is a clear downward trend in this graph, and it appears to be nearly a straight line from 1968 onwards.

📉 Chart choices: The x axis goes from 1960 to 2010, and the y axis goes from 2.6 to 5.9. Would the trend be more or less clear with different axis choices? Experiment with the options on GapMinder to see for yourself.

This is a graph of life expectancy from GapMinder, again based on data points for each year instead of each decade:

The upward trend is less clear in the first few decades, when the values fluctuate up and down, but it becomes obvious in the decades that follow.

📉 Chart choices: The x axis goes from 1920 to 2000, and the y axis starts at 55. How do those choices affect our interpretation of the graph? Try changing the options on GapMinder to see for yourself.

#### Statistical fluctuations

Google Trends is a site that visualizes the popularity of Google search terms over time.

We can use Google Trends to research the popularity of "data science", a new field that combines statistical data analysis and computational skills.

This is their graph for "data science" from April 2014 to April 2019:

That graph shows a large amount of fluctuation over the time period (including big dips at Christmas each year). Yet, it also shows a fairly clear increase over time.

When we're dealing with fluctuating data like this, we can calculate the "trend line" and overlay it on the chart (or ask a charting application to add it for us). A trend line smooths out the data and makes the overall trend clearer, if there is one to be found.

Here's the same graph with a trend line added:

The trend line shows a very clear upward trend, which is what we expected. It helps that we chose to visualize the data over such a long time period, since this data fluctuates seasonally throughout the year.
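One common way to compute a trend line is an ordinary least-squares fit of a straight line. The sketch below assumes that method (a charting application may use a different one) and, because the Google Trends numbers are not reproduced in this article, applies it to the US life expectancy table from earlier instead. The function name `fit_trend_line` is a hypothetical helper for this example.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# US life expectancy by decade, 1920-2000 (from the table earlier in the article)
years = list(range(1920, 2001, 10))
life_expectancy = [55.38, 59.57, 63.24, 68.07, 69.86, 70.86, 73.91, 75.4, 76.9]

slope, intercept = fit_trend_line(years, life_expectancy)
direction = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
print(f"Slope: {slope:.3f} years of life expectancy per calendar year ({direction})")
```

The sign of the slope gives the direction of the trend, and its size gives the average change per unit of time.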

Whenever you're analyzing and visualizing data, consider ways to collect the data that will account for fluctuations. For time-based data, there are often fluctuations across the days of the week (due to the difference between weekdays and weekends) and fluctuations across the seasons.

### Making predictions

One reason we analyze data is to come up with predictions.

Consider this data on average tuition for 4-year private universities:

School year | Tuition |
---|---|
2011-12 | $30,210 |
2012-13 | $30,970 |
2013-14 | $31,570 |
2014-15 | $32,140 |
2015-16 | $33,180 |
2016-17 | $34,100 |

We can see clearly that the numbers are increasing each year from 2011 to 2016. To make a prediction, we need to understand the *rate* at which the numbers are increasing.

One way to do that is to calculate the percentage change year-over-year. Here's the same table with that calculation as a third column:

School year | Tuition | One year % change |
---|---|---|
2011-12 | $30,210 | |
2012-13 | $30,970 | 2.5% |
2013-14 | $31,570 | 1.9% |
2014-15 | $32,140 | 1.8% |
2015-16 | $33,180 | 3.2% |
2016-17 | $34,100 | 2.8% |
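The third column can be reproduced with a short calculation: each percentage is the difference between a year's tuition and the previous year's, divided by the previous year's. The following Python sketch shows one way to do that. The function name `percent_changes` is a hypothetical helper, and the percentages are rounded to one decimal place to match the table.

```python
# Average tuition by school year (from the table above)
tuition = {
    "2011-12": 30210,
    "2012-13": 30970,
    "2013-14": 31570,
    "2014-15": 32140,
    "2015-16": 33180,
    "2016-17": 34100,
}


def percent_changes(values):
    """Return the percentage change between each consecutive pair of values."""
    return [
        (later - earlier) / earlier * 100
        for earlier, later in zip(values, values[1:])
    ]


school_years = list(tuition)
changes = percent_changes(list(tuition.values()))

# The first school year has no previous year, so it has no percentage change.
for school_year, change in zip(school_years[1:], changes):
    print(f"{school_year}: {change:.1f}%")
```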

It can also help to visualize the increasing numbers in graph form:

If the rate were exactly constant (and the graph exactly linear), then we could easily predict the next value. However, in this case, the rate varies between 1.8% and 3.2%, so predicting is not as straightforward.

Let's try a few ways of making a prediction for 2017-2018:

Strategy | Predicted change | Predicted tuition |
---|---|---|
Most recent rate | 2.8% | $35,054.80 |
Average last 3 rates | 2.6% | $34,986.60 |
Average all rates | 2.44% | $34,932.04 |

Which strategy do you think is the best? As it turns out, the actual tuition for 2017-2018 was $34,740. It increased by only 1.9%, less than any of our strategies predicted. The closest was the strategy that averaged all the rates.
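The three strategies and the comparison with the actual figure can be sketched in a few lines of Python. This sketch starts from the rounded one-year percentages shown in the table rather than the unrounded ones, which is an assumption about how the predictions were calculated. The variable names are chosen for this example only.

```python
rates = [2.5, 1.9, 1.8, 3.2, 2.8]  # one-year % changes, rounded as in the table
latest_tuition = 34100             # 2016-17 tuition
actual_tuition = 34740             # 2017-18 tuition, as stated above


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Each strategy produces a predicted percentage change for 2017-18.
strategies = {
    "Most recent rate": rates[-1],
    "Average last 3 rates": mean(rates[-3:]),
    "Average all rates": mean(rates),
}

# Apply each predicted change to the most recent tuition.
predictions = {
    name: latest_tuition * (1 + rate / 100)
    for name, rate in strategies.items()
}

for name, predicted in predictions.items():
    error = predicted - actual_tuition
    print(f"{name}: ${predicted:,.2f} (over by ${error:,.2f})")

closest = min(predictions, key=lambda name: abs(predictions[name] - actual_tuition))
print("Closest strategy:", closest)
```

Comparing every strategy against a known outcome like this is a simple way to judge which one would have worked best on past data, though it does not guarantee the same strategy will win next time.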

Statisticians and data analysts typically use a technique called linear regression, which finds the line that best fits the data so we can make predictions based on that line. With this data, a linear regression also predicts 2.44%.

How could we make more accurate predictions? We could try to collect more data and incorporate that into our model, like considering the effect of overall economic growth on rising college tuition.

Ultimately, we need to understand that a prediction is just that, a prediction. More data and better techniques help us to predict the future better, but nothing can guarantee a perfectly accurate prediction.

### Finding correlations

Another goal of analyzing data is to compute the correlation, the statistical relationship between two sets of numbers.

A correlation can be positive, negative, or not exist at all. A scatter plot is a common way to visualize the correlation between two sets of numbers.

There's a *positive* correlation between temperature and ice cream sales:

There's a *negative* correlation between temperature and soup sales:

There's *no* correlation between temperature and salt sales:

Statisticians and data analysts typically express the correlation as a number between −1 and 1, where −1 is a strong negative correlation, 1 is a strong positive correlation, and 0 is no correlation. You can learn more about correlation coefficients on Khan Academy.
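The most widely used correlation coefficient is Pearson's r, and the sketch below assumes that is the number meant here. The article doesn't give figures for the temperature examples, so this example instead pairs each year with its value from the two tables in the "Spotting trends" section. The function name `pearson_r` is a hypothetical helper.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each varies alone
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(variation_x * variation_y)


# US life expectancy by decade, 1920-2000
us_years = list(range(1920, 2001, 10))
us_life_expectancy = [55.38, 59.57, 63.24, 68.07, 69.86, 70.86, 73.91, 75.4, 76.9]

# Babies per woman in India by decade, 1960-2010
india_years = list(range(1960, 2011, 10))
india_babies = [5.91, 5.59, 4.83, 4.05, 3.31, 2.60]

r_life = pearson_r(us_years, us_life_expectancy)
r_babies = pearson_r(india_years, india_babies)

print(f"Year vs. US life expectancy: r = {r_life:.3f}")
print(f"Year vs. India babies per woman: r = {r_babies:.3f}")
```

The first coefficient comes out close to 1 (a strong positive correlation) and the second close to −1 (a strong negative correlation), matching the upward and downward trends seen in those tables.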

A variation on the scatter plot is a bubble plot, where the dots are sized based on a third dimension of the data.

Here's a bubble plot from GapMinder that compares income to life expectancy, with each dot representing a country and its population:

📉 Chart choices: The dots are colored based on the continent, with green representing the Americas, yellow representing Europe, blue representing Africa, and red representing Asia. The y axis goes from 19 to 86, and the x axis goes from 400 to 96,000, using a logarithmic scale that doubles at each tick. A logarithmic scale is a common choice when a dimension of the data changes so extremely.
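To see why a doubling scale suits this kind of data, it helps to notice that values which double each time are evenly spaced once you take their logarithm. The Python sketch below illustrates that idea. It assumes the ticks start at the low end of the axis (400), which may not match the exact tick values Gapminder draws, and `doubling_ticks` is a hypothetical helper.

```python
import math


def doubling_ticks(start, end):
    """Return tick values that double at each step, from start up to end."""
    ticks = []
    value = start
    while value <= end:
        ticks.append(value)
        value *= 2
    return ticks


# Axis range described above; the starting tick of 400 is an assumption.
ticks = doubling_ticks(400, 96000)

# Position of each tick on the axis, measured in "doublings" from the first tick
positions = [math.log2(tick / ticks[0]) for tick in ticks]

for tick, position in zip(ticks, positions):
    print(f"income {tick:>6,} -> position {position:.0f}")
```

Each tick lands exactly one unit from the last, even though the gap in income between neighboring ticks keeps growing. On a linear axis, most of the lower-income countries would be squeezed into a small strip at the left edge.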

As countries move up on the income axis, they generally move up on the life expectancy axis as well. There's a *positive* correlation between income and life expectancy.

Here's another bubble plot from GapMinder, this time comparing CO2 emissions to life expectancy:

📉 Chart choices: This time, the x axis goes from 0.0 to 250, using a logarithmic scale that goes up by a factor of 10 at each tick.

We once again see a positive correlation: as CO2 emissions increase, life expectancy increases.

Wait a second, does this mean that we should earn more money and emit more carbon dioxide in order to guarantee a long life? No, not necessarily.

Correlation does *not* imply causation. A correlation tells us that there is some sort of association between two sets of numbers, but it does not tell us *why* there's an association.

In this case, the correlation is likely due to a hidden cause that's driving both sets of numbers, like overall standard of living.

In other cases, a correlation might be just a big coincidence. There are plenty of fun examples online of spurious correlations.

Finding a correlation is just a first step in understanding data. It can't tell you the cause, but it *can* point you in the direction of possible causes and experiments to learn more.
