Video transcript

- [Instructor] What we're going to do in this video is think about how to visualize distributions of data, and then to analyze those visualizations, and we will eventually get to something known as a density curve. But let's start with a simple example, just to review some concepts. Let's say I go to 16 students and I ask them to measure how many glasses of water they drink per day for the last 30 days, and then to average it. And so this data point right over here tells us one student drank an average of 0.5 glasses of water per day. That person is probably very dehydrated. This person drank 8.1 glasses of water per day, on average, for the last 30 days, they are better hydrated. If we want to visualize that we can set up a frequency histogram, where we can create some categories. So this first category would be for data points that are greater than or equal to zero, and less than one, and we can see that two data points fall into that category, and that's why the bar right over here for that category is up to two. This category right over here is greater than or equal to three, and less than four. Notice, there are four data points in that category and on this frequency histogram the height of the bar is indeed four. So this is a nice way of looking at a distribution. But you might be more concerned with what percentage of my data falls into each of these categories, and that becomes especially interesting if we have many, many, many data points, and if we had, you know, 1,600,432,507 data points, well just knowing the absolute number that fit into each category isn't so useful, the percent that fits into each category is a lot more useful. And so for that, we could set up a relative frequency histogram. So notice, this is representing the same data. But in that first category, instead of the bar height being two, the bar height is now 12.5%. Why is that? Because two of the 16 data points fall into this category. 2/16 is 1/8, which is 12.5%. 
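The two kinds of histogram can be sketched in a few lines of Python. The transcript only tells us that there are 16 students, that the values 0.5 and 8.1 appear, that two values fall in the zero-to-one category and that four fall in the three-to-four category. The rest of the list below is made up to fit those counts, and the function names are just illustrative.

```python
import math
from collections import Counter

# Hypothetical data set: only 0.5, 8.1, and the counts for the
# [0, 1) and [3, 4) categories come from the transcript.
glasses_per_day = [0.5, 0.8, 1.2, 1.9, 2.3, 2.7, 3.1, 3.4,
                   3.6, 3.9, 4.5, 4.8, 5.2, 6.1, 7.0, 8.1]

def frequency_histogram(values, bin_width=1.0):
    """Count the values in each category [start, start + bin_width)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}

def relative_frequency_histogram(counts):
    """Turn the counts into fractions of the whole data set."""
    total = sum(counts.values())
    return {start: count / total for start, count in counts.items()}

counts = frequency_histogram(glasses_per_day)
shares = relative_frequency_histogram(counts)

for start in counts:
    print(f"[{start:.0f}, {start + 1:.0f}): {counts[start]} points, {shares[start]:.1%}")
```

Passing a smaller `bin_width`, such as 0.5 or 0.25, gives narrower categories for the same data.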
And this one right over here, notice, instead of the height being four for four data points, it's now 25%. But these are saying the same thing. Four out of the 16 data points fall into this category. 4/16 is 1/4, which is 25%. So both of these types of histograms are really useful and you will see them used all of the time. But there are also cases where you have many, many, many more data points, and you want more granular categories. So what you could do, is, well, let's just make our categories a little more granular. So for example, instead of them being one glass of water wide, maybe you make them half a glass of water wide. So this first category could be greater than or equal to zero, and less than 0.5, and that will give you a clearer picture, and I'm now assuming a world where we have more than 16 data points, maybe we have 16 million data points, and this would be percentages on the left hand side. But maybe that isn't good enough for you, maybe you wanna get even more granular. So you make everything, each category, a quarter of a glass. But maybe that doesn't satisfy, you wanna get more and more and more granular. Well, you could imagine where this is going. You could get to a point where you're approaching an infinite number of categories, and each category is infinitely thin, is super, super thin, to a point that if you just connect the tops of the bars you will actually get a curve. And this type of curve is something that we actually use in statistics, and, as promised at the beginning of the video, this is the density curve we talk about. And what's valuable about a density curve is that it is a visualization of a distribution where the data points can take on any value in a continuum. They're not just thrown into these coarse buckets. So how would you interpret something like this?
If you look over the entire interval from zero, let's say, to nine, assuming no one drank more than an average of nine glasses per day, even in our 16 million data points, well then the area under the curve over that interval is going to be 100%, or 1.0. This is going to be true for any density curve, that the entire area under the curve is 100%, it represents all of the data points. A density curve will also never take on a negative value, you won't see the curve dip down and do something strange like that. Now, with that out of the way, let's think about how we would make use of it. If I wanted to know what percentage of my data falls between two and four glasses, well I would look at that interval. I'd go from two to four, I would look at this interval right over here, and I would try to figure out the area under the curve here. And this area is going to be greater than or equal to zero, and less than or equal to 100%. When I eyeball it right over here, it looks like it's about 40% of the entire area under the curve, so just eyeballing it, I would say roughly 40% of my data falls into this interval. If I were to ask you what percentage of the data is greater than three, well then you would be looking at this area, and it looks like it is about 50%, but once again, I am estimating it. But you can start to see how, even with estimation, a density curve could be useful. In the real world, statisticians will often have tables that might represent the information for the density curve, they might have computer programs or some type of automated tool, and there are also well-known density curves, like the famous bell curve that we will study later on, where there's a lot of precise data and a lot of tools to exactly figure out the areas. The last thing I'd like to cover is a key misconception for density curves. If I were to ask you, approximately what percentage of my data is exactly three glasses of water per day?
And when I say exactly, I mean exactly the number 3.000 with zeroes just going on and on forever, the exact number three. Well, you might be tempted to just say okay, this is three. Let me see the corresponding point on the curve. It looks like it is about 0.2, or a little higher than that, so maybe you would say a little bit more than 20%, or approximately 20%. And what I would say to you, is this is wrong. Remember, the percentage of the data in an interval is not the height of the curve, it is the area under the curve in that interval. And if we're just talking about one precise value, like exactly the number three, there is no area under the curve. This vertical line that I just drew over the number three has no width, and this actually makes sense in the real world. Even if you were to look at 16 million people, it is very unlikely that anyone would drink exactly three glasses of water per day. I'm talking about not one atom more or one atom less than three glasses. There might be many people between 2.9 and 3.1, but no one is at exactly three glasses a day. When someone says I'm drinking three glasses of water per day, that would be a rough estimate. They're probably at 3.001, or 2.99999, or 3.15, or whatever else. And so instead, you could say what percentage falls in the interval, maybe, that is greater than or equal to 2.9 and less than or equal to 3.1. And so once you have an interval, then you actually can look at the area, so we're gonna go from 2.9 to 3.1, so now we have an interval that actually has width, and so it'd be roughly the size of this yellow area that I'm shading in right over here, and we can approximate it with a rectangle even though the top of this curve isn't flat, but we can say, look, it's approximately like a rectangle that is 0.2 high, and what's the width?
The width here, if we're going from 2.9 to 3.1, is going to be 0.2, and so we could approximate this area with the area of the rectangle: 0.2 times 0.2, which gives us an area of 0.04. Or we could say approximately 4% of the data falls in this interval.
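The idea that a percentage is an area, not a height, can be checked numerically. The sketch below does not use the curve drawn in the video, whose formula is never given. It assumes a stand-in: a triangular density on the interval from zero to nine that peaks at three glasses, with a height of 2/9 there. The names `density` and `area_under` are made up for this illustration, and the areas are estimated with a simple trapezoid rule.

```python
def density(x):
    """Hypothetical triangular density on [0, 9] with its peak at x = 3.

    This is an assumed stand-in, not the curve from the video.
    """
    if 0 <= x <= 3:
        return 2 * x / 27
    if 3 < x <= 9:
        return (9 - x) / 27
    return 0.0

def area_under(f, lower, upper, steps=1000):
    """Trapezoid-rule estimate of the area under f from lower to upper."""
    if upper == lower:
        return 0.0  # a single value has no width, so it has no area
    width = (upper - lower) / steps
    total = 0.5 * (f(lower) + f(upper))
    for i in range(1, steps):
        total += f(lower + i * width)
    return total * width

total_area = area_under(density, 0, 9)       # should be close to 1, that is 100%
between_2_and_4 = area_under(density, 2, 4)  # share of data from 2 to 4 glasses
exactly_3 = area_under(density, 3, 3)        # share at exactly 3 glasses
near_3 = area_under(density, 2.9, 3.1)       # share from 2.9 to 3.1 glasses
rectangle = density(3) * (3.1 - 2.9)         # height times width shortcut

print(f"total area:      {total_area:.3f}")
print(f"between 2 and 4: {between_2_and_4:.3f}")
print(f"exactly 3:       {exactly_3:.3f}")
print(f"2.9 to 3.1:      {near_3:.4f} (rectangle estimate {rectangle:.4f})")
```

For this stand-in curve, the area between two and four comes out to roughly 39%, the area at exactly three is zero, and the rectangle estimate for 2.9 to 3.1 lands close to the trapezoid estimate. The numbers differ a little from the ones eyeballed in the video because the curve here is only an assumption.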