Data and statistics FAQ

Frequently asked questions about data and statistics

What is a statistical question?

A statistical question is a question that we answer by collecting and analyzing data that we expect to vary from one thing or person to another. For example, "How tall are the students in our class?" is a statistical question, because we have to measure the heights of all the students and look at how they vary. "How tall is the teacher?" is not a statistical question, because it has a single answer: one measurement settles it, and there is no variability to analyze.

What are measures of center and why do we need them?

Sometimes we have a lot of data, like test scores, heights, weights, or temperatures, and we want to summarize them with one number that represents the whole group. This number is called a measure of center, because it is supposed to be close to the middle of the data. There are different ways to find the measure of center, depending on what kind of data we have and what we want to know.
The most common measures of center are the mean, the median, and the mode. The mean is the average of all the data values, which we find by adding them all up and dividing by how many there are. The median is the middle value of the data, which we find by putting the values in order from smallest to largest and picking the one in the middle (or the average of the two in the middle, if there is an even number of values). The mode is the most frequent value of the data, which we find by counting how many times each value appears and picking the one that appears the most. A data set can have more than one mode if several values tie for most frequent.
We can use measures of center to compare different groups of data and see which one has higher or lower values overall. For example, we can compare the mean test scores of different classes, or the median heights of different sports teams, or the mode of the favorite colors of different groups of friends.
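
The three calculations described above can be sketched in a few lines of Python using the standard `statistics` module. The test scores here are made up for illustration, and `multimode` is used because it returns every value that ties for most frequent.

```python
from statistics import mean, median, multimode

# Hypothetical test scores, invented for this example.
scores = [70, 85, 85, 90, 100]

# Mean: add the values (430) and divide by how many there are (5).
mean_score = mean(scores)

# Median: the middle value once the scores are in order.
median_score = median(scores)

# Mode: the most frequent value(s), returned as a list.
mode_scores = multimode(scores)

# With an even number of values, the median is the average
# of the two middle values (here 85 and 90).
even_median = median([70, 85, 90, 100])

print(mean_score, median_score, mode_scores, even_median)
```

For these scores the mean is 86, the median is 85, and the only mode is 85.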

What are measures of variation and why do we need them?

Measures of center are useful, but they don't tell us everything about the data. Sometimes we also want to know how spread out the data is, or how much the values differ from each other and from the center. This is called variation, and we can measure it with different numbers, too.
Some of the most common measures of variation are the range, the interquartile range, and the mean absolute deviation. The range is the difference between the highest and the lowest values of the data, which we find by subtracting the minimum from the maximum. The interquartile range is the spread of the middle 50% of the data, which we find by putting the data in order, splitting it into four equal parts at cut points called quartiles, and subtracting the first quartile from the third quartile. The mean absolute deviation is the average of how far each value is from the mean, which we find by subtracting the mean from each value, taking the absolute value (which means ignoring the negative sign), adding them all up, and dividing by how many there are.
We can use measures of variation to compare different groups of data, to see which one has more or less variability, or to see how tightly the data clusters around the center. For example, we can compare the range of temperatures in different seasons, or the interquartile range of incomes in different neighborhoods, or the mean absolute deviation of ages in different families.
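
Here is a minimal Python sketch of the three measures of variation described above. The function names and the temperature values are hypothetical. There are several conventions for finding quartiles; this sketch assumes the "median of each half" method, where the first and third quartiles are the medians of the lower and upper halves of the ordered data (leaving out the middle value when the count is odd). Other tools may use a different convention and give slightly different results.

```python
from statistics import mean, median

def value_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

def interquartile_range(values):
    """Third quartile minus first quartile (median-of-halves method).

    Needs at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]                 # values below the median
    upper_half = ordered[len(ordered) - half:]  # values above the median
    return median(upper_half) - median(lower_half)

def mean_absolute_deviation(values):
    """Average distance of each value from the mean."""
    center = mean(values)
    distances = [abs(v - center) for v in values]
    return mean(distances)

# Hypothetical daily temperatures, invented for this example.
temps = [2, 4, 6, 8, 10, 12, 14, 16]

temp_range = value_range(temps)
temp_iqr = interquartile_range(temps)
temp_mad = mean_absolute_deviation(temps)

print(temp_range, temp_iqr, temp_mad)
```

For these temperatures the range is 14, the interquartile range is 8 (13 minus 5), and the mean absolute deviation is 4.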

How do we choose the best measure of center and variation for our data?

There is no single best measure of center and variation for all data, because different measures have different advantages and disadvantages, depending on the situation. We have to think about what kind of data we have, what we want to learn from it, and what we want to communicate to others.
Some things to consider are:
  • Is the data numerical or categorical? Numerical data can be measured with numbers, like heights, weights, or scores. Categorical data can be grouped into categories, like colors, animals, or genres. We can use mean, median, and mode for numerical data, but only mode for categorical data. We can use range, interquartile range, and mean absolute deviation for numerical data, but not for categorical data.
  • Is the data symmetrical or skewed? Symmetrical data has values that are evenly distributed around the center, like a bell-shaped curve. Skewed data has values that are more clustered on one side of the center and more spread out on the other side, like a tail. We can use mean, median, and mode for symmetrical data, but median and mode are more reliable for skewed data, because they are less affected by extreme values. Similarly, the interquartile range is less affected by extreme values than the range is.
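
To see why the median holds up better than the mean when data is skewed, here is a small Python sketch. The income values are made up for illustration (in thousands of dollars); the second list replaces one value with an extreme one. The last lines show that the mode still works for categorical data, where a mean or median would not make sense.

```python
from statistics import mean, median, multimode

# Hypothetical incomes in thousands of dollars, invented for this example.
incomes = [30, 32, 35, 38, 40]
incomes_with_outlier = [30, 32, 35, 38, 400]  # one extreme value

# Without the extreme value, mean and median agree.
mean_before = mean(incomes)
median_before = median(incomes)

# The extreme value pulls the mean far upward,
# but the middle value stays where it was.
mean_after = mean(incomes_with_outlier)
median_after = median(incomes_with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)

# Categorical data: only the mode applies.
favorite_colors = ["red", "blue", "red", "green"]
color_mode = multimode(favorite_colors)
print(color_mode)
```

Here one extreme income moves the mean from 35 to 107, while the median stays at 35.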

How do we pick an appropriate data display?

There is no one right answer to how we pick an appropriate data display, but there are some things we can consider to help us decide. Some of the factors we can think about are:
  • The type and the size of the data. For example, if we have categorical data, such as favorite colors or types of animals, we might use a frequency table or a bar graph to show the data. If we have numerical data for one variable, such as heights or weights, we might use a dot plot, a histogram, or a box plot to show the data. If we have a lot of data, we might use a graph to make it easier to see the patterns and trends in the data. If we have a small amount of data, we might use a table to show the exact values and frequencies of the data.
  • The purpose and the audience of the data display. For example, if we want to compare the data across different groups or categories, we might use a dot plot, a histogram, or a box plot to show the similarities and differences of the data. If we want to show the relationship between two variables, we might use a scatter plot or a line graph to show the correlation or the trend of the data. If we want to show the distribution or the shape of the data, we might use a histogram or a box plot to show the center, the spread, and the outliers of the data.
Whatever display type we choose, we want it to be clear, easy to read, and attractive, with a title, labels, and scales. That way, our audience can read and interpret our display.
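
As a rough illustration of a display for categorical data, the sketch below builds a frequency table with a simple text bar for each category. The survey answers and the `frequency_table` helper are hypothetical, and a real display would more likely be drawn with a graphing tool.

```python
from collections import Counter

# Hypothetical survey answers, invented for this example.
colors = ["red", "blue", "red", "green", "blue", "red"]

# Count how many times each category appears.
counts = Counter(colors)

def frequency_table(counts):
    """Return one text line per category, most frequent first."""
    lines = []
    for category, count in counts.most_common():
        bar = "#" * count  # one mark per response
        lines.append(f"{category:<6} {count}  {bar}")
    return lines

print("Favorite colors")  # a title, as every display should have
for line in frequency_table(counts):
    print(line)
```

Each printed row shows a category, its frequency, and a bar whose length matches that frequency.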
