Types of statistical studies

Created by Sal Khan.

Video transcript

Voiceover: Let's say that you have a hunch that sugar is somehow causing heart disease. So you want to research this further. You want to see what kinds of statistical studies you can perform to better understand sugar intake in the population generally, and whether that seems to be causing heart disease in some way. Well, the first thing that you might want to do is just try to get a sense of sugar intake in the population as a whole. Now clearly, there's no way of measuring exactly how much sugar every member of, let's say you're talking about the United States, that 300 million population is consuming every day. The way that we try to get a sense of how much sugar is being consumed is by conducting a sample study. So you take your population, your 300 million people, and you sample it. And not only do you sample it, but you randomly sample it. Obviously you don't want to just survey people who are exiting a cupcake store or people who are exiting a gym. You want it to be a random sample of people, where the place you're sampling them shouldn't somehow affect how much sugar they say that they are consuming. They are going to tell you how much sugar they consume, let's say, on an average day, maybe by filling out a survey or through some other way. You would take this data, and obviously the larger your sample the better; we talk about that in depth in other statistics videos, about how you get a better estimate of the actual true population parameter the more people you sample. But you might do that to get a gauge of how much sugar the average American consumes in a given day.
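As a rough illustration of the random sampling just described, here is a minimal Python sketch. Everything in it is an assumption made for the example: the simulated population, the made-up intake distribution (around 80 grams per day), and the function names `simulate_population` and `estimate_mean_intake`. It only shows the idea that a random sample's average is a statistic that estimates the population's true average.

```python
import random
import statistics


def simulate_population(size, seed=0):
    """Hypothetical population: daily sugar intake in grams, one value per person."""
    rng = random.Random(seed)
    # Assumed distribution, for illustration only (not real intake data).
    return [max(0.0, rng.gauss(80, 30)) for _ in range(size)]


def estimate_mean_intake(population, sample_size, seed=1):
    """Draw a simple random sample and return its mean (the statistic)."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return statistics.mean(sample)


population = simulate_population(100_000)
# The true parameter. In a real study this is unknown; we only know it here
# because the population is simulated.
true_mean = statistics.mean(population)

for n in (10, 100, 1_000, 10_000):
    estimate = estimate_mean_intake(population, n)
    print(f"sample size {n:>6}: estimate = {estimate:6.1f} g/day "
          f"(true mean = {true_mean:6.1f} g/day)")
```

Running it with different seeds should show the estimates from larger samples tending to land closer to the true mean, which is the point made above about sample size.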
So this right over here, where you are taking random samples of the population to, essentially, generate a statistic that estimates a true parameter, which is the actual amount of sugar Americans are consuming each day, we call this a sample study. It's a way, once again, of just estimating the actual amount of sugar people are having each day. Let's say you want to go further. The sample study will give you a sense of the amount of sugar that people are likely consuming each day, but you really want to see how that's related to heart disease. So instead, what you do is you go survey people, and once again, when you pick these people, you should be doing it randomly. So let's say you go survey a random sample of 60 year olds. You wouldn't want to sample only people who are in the hospital, and you wouldn't want to sample only people who are at the gym. You would want to find a random sample, or sample them in places where it shouldn't affect which way their answers are going to go. Let's say you surveyed 300 60 year olds and you asked them how much sugar they have consumed over the last 30 years. And you also asked them about the condition of their heart. And what you get is something like this. On the horizontal axis you plot sugar consumption, and on the vertical axis you plot heart disease risk, or their level of heart disease, let's say at age 60. And you find a plot that looks something like this. Each of these points is a person. This is someone who consumed 200 grams of sugar per day and is now at high heart disease risk at age 60. But maybe this is someone who is at low heart disease risk at age 60 even though they consumed a lot of sugar every day. And so we just keep plotting all of these points, and obviously I'm not going to plot all 300.
You say, "Well, actually, it looks like there is a rough correlation right over here." There are definitely some outliers, but if you tried to plot a line, it looks like there is a line that you could fit. And then you might say, "It looks like sugar and heart disease risk at age 60 are correlated, that they're related in this way, that they move together: if someone consumed a lot of sugar over the last 30 years they seem to have a worse heart situation, and if they consumed a lot less sugar, they seem to have a better heart situation." This often happens in medical science: when people see something like this, they often jump to the conclusion that consuming more sugar must drive up heart disease risk. That's very dangerous, because just seeing this data doesn't tell you that sugar is causing heart disease risk. It could go the other way around. It could be that people who end up having a high heart disease risk crave sugar, but actually there's some other underlying cause that's making it happen; maybe they have some other deficiency that's making them crave sugar. So it's not clear which way the causality is happening. Is the sugar consumption driving the unhealthy heart, or is somehow the unhealthy heart driving the sugar consumption, or maybe there's some other factor? Maybe fat consumption is driving the heart disease, and maybe people who have more fat also will have more sugar or vice versa, who knows. All this is telling you is that there is a correlation. So this right over here, you would call this an observational study. You've observed a relationship, but you really can't say what is causing what. You're probably saying, "Alright, then how could you prove or feel better about the idea that sugar is actually a cause, that there's actually causality there?" To do that, you would actually have to run an experiment.
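To make the warning about correlation concrete, here is a small Python sketch of the observational study. It is built entirely on assumptions chosen for the illustration: a simulated survey of 300 people, a hidden factor standing in for fat consumption, made-up coefficients, and the helper names `pearson_r` and `simulate_survey`. In this simulation sugar has no direct effect on heart risk at all, yet the two still come out correlated because the hidden factor drives both.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_survey(n=300, seed=0):
    """Hypothetical survey of n 60-year-olds.

    Assumption for illustration: a hidden factor (think fat consumption)
    raises BOTH sugar intake and heart risk. Sugar itself has no direct
    effect on risk in this simulation.
    """
    rng = random.Random(seed)
    sugar, risk = [], []
    for _ in range(n):
        hidden = rng.gauss(0, 1)
        sugar.append(100 + 30 * hidden + rng.gauss(0, 20))  # grams per day
        risk.append(50 + 10 * hidden + rng.gauss(0, 8))     # arbitrary risk score
    return sugar, risk


sugar, risk = simulate_survey()
r = pearson_r(sugar, risk)
print(f"correlation between sugar intake and heart risk: r = {r:.2f}")
```

Someone looking only at the sugar and risk columns would see a clear positive relationship and could not tell, from this data alone, that the hidden factor is responsible.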
To do an experimental study here, what you would do is try to take two groups of folks. You would have your experimental group. Actually, let me make it a circle here so it's a pool of people. Let's say you have 100 people in your experimental pool. Then you have your control group. What you would do, if you wanted to run this type of experiment, and as we'll see, this type of experiment probably wouldn't be run because some would consider it unethical, and actually I would consider it unethical as well. But what you would do is, let's say, randomly take 30 year olds and put them in one of these two groups. Once again, when we say randomly, you don't want to put all the healthy people in one group and all the unhealthy people in the other group, or vice versa. You don't want to put all of the people of one type, one demographic, one economic status in one group or the other. You want it to be random. So you randomly put people into these two groups, and then in the experimental group you would change one variable. The variable you care about is sugar. So what you might say is, "Okay, all of the people in this group right over here, whatever sugar they would have consumed, on top of that they have to drink, I don't know, a cup of syrup every night, or they have to have a minimum sugar intake." So you essentially force sugar into this group that you're not forcing into this group. So one, this is probably unethical, to force people to have something that is very likely not good for their health, and two, you would have to run it for a long period of time. You would wait 30 years, until they're 60 years old, and you would see what the heart condition of these folks was. How many people maybe had heart attacks?
At age 60, what is their health condition? And then statistically, is it unlikely that the difference would be due to chance alone? For example, if you did this, and let's say that these people had a slightly higher chance of heart disease or heart attacks than these folks after 30 years, it would be a good experiment, but it wouldn't allow you to conclude that sugar is causing it, because that might have happened through chance alone. But if, for example, after 30 years, this group right over here has 10 times the risk of a heart attack, or 10 times the level of whatever the risk factors for heart disease are, you would statistically say that the 100 people in this group having 10 times the chance of a heart attack as this group is unlikely to be due to chance alone. So you would say, "Okay!" We would feel good about our conclusion that this forced sugar is what is causing that. Anyway, we'll dig deeper into each of these three types, but the whole point of this video is to just give you an appreciation that we use statistics a lot, and this gives you a context for how we're using it in different situations when we're performing statistical studies. A sample study is to estimate the true parameter for a population. What is the actual sugar intake for the population? You randomly sample, and then you use that sample data to create a statistic that estimates the true parameter. In an observational study, you observe what's going on: sugar intake versus heart risk. You say, "Hey, there's a relationship; maybe this is worth doing an experiment on." Because only through an experiment could you attempt to find some type of causality.
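The two ingredients of the experiment, random assignment and asking whether a difference could be due to chance alone, can be sketched in Python as follows. This is only one possible way to frame the chance question (a permutation test, which the video does not name), and all of the outcome counts are hypothetical numbers invented for the example, as are the function names `randomly_assign` and `permutation_p_value`.

```python
import random


def randomly_assign(participants, seed=0):
    """Shuffle participants and split them into experimental and control groups."""
    rng = random.Random(seed)
    shuffled = participants[:]  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def permutation_p_value(treated, control, trials=10_000, seed=1):
    """One-sided permutation test on heart-attack rates (1 = attack, 0 = none).

    Returns the fraction of random relabelings whose rate difference is at
    least as large as the observed one, i.e. how often chance alone would
    produce a gap this big.
    """
    rng = random.Random(seed)
    observed = sum(treated) / len(treated) - sum(control) / len(control)
    pooled = treated + control
    n_treated = len(treated)
    n_control = len(pooled) - n_treated
    at_least_as_large = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_treated]) / n_treated - sum(pooled[n_treated:]) / n_control
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / trials


# Randomly split 200 hypothetical participant IDs into two groups of 100.
experimental_ids, control_ids = randomly_assign(list(range(200)))

# Hypothetical outcomes after 30 years (invented numbers, not real data).
# Scenario A: slightly more heart attacks in the sugar group (6 vs 5 out of 100).
slight_p = permutation_p_value([1] * 6 + [0] * 94, [1] * 5 + [0] * 95)
# Scenario B: ten times as many heart attacks in the sugar group (30 vs 3 out of 100).
large_p = permutation_p_value([1] * 30 + [0] * 70, [1] * 3 + [0] * 97)

print(f"slight difference: chance alone explains it in about {slight_p:.0%} of relabelings")
print(f"tenfold difference: chance alone explains it in about {large_p:.2%} of relabelings")
```

In the first scenario a gap that small shows up often under random relabeling, so it would not support a causal conclusion; in the second it almost never does, which matches the reasoning above about a tenfold difference.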