## Density curves

## Video transcript

- [Instructor] What we're
going to do in this video is think about how to visualize
distributions of data, and then to analyze those visualizations, and we will eventually get to something known as a density curve. But let's start with a simple example, just to review some concepts. Let's say I go to 16 students and I ask them to measure
how many glasses of water they drink per day for the last 30 days, and then to average it. And so this data point right over here tells us one student drank an average of 0.5 glasses of water per day. That person is probably very dehydrated. This person drank 8.1
glasses of water per day, on average, for the last 30
days; they are better hydrated. If we want to visualize that, we can set up a frequency histogram, where we can create some categories. So this first category
would be for data points that are greater than or equal
to zero, and less than one, and we can see that two data points fall into that category, and that's why the bar right over here for that category is up to two. This category right over
here is greater than or equal to three, and less than four. Notice, there are four data
points in that category and on this frequency histogram the height of the bar is indeed four. So this is a nice way of
looking at a distribution. But you might be more concerned with what percentage of my data falls into each of these categories, and that becomes especially interesting if we have many, many, many data points, and if we had, you know,
1,600,432,507 data points, well just knowing the
absolute number that fit into each category isn't so useful, the percent that fits into each category is a lot more useful. And so for that, we could set up a relative frequency histogram. So notice, this is
representing the same data. But in that first category, instead of the bar height being two, the bar height is now 12.5%. Why is that? Because two of the 16 data
points fall into this category. 2/16 is 1/8, which is 12.5%. And this one right over here, notice, instead of the height being four for four data points, it's now 25%. But these are saying the same thing. Four out of the 16 data points
fall into this category. 4/16 is 1/4, which is 25%. So both of these types of
histograms are really useful and you will see them
used all of the time. But there are also cases where you have many, many, many more data points, and you want more granular categories. So what you could do, is, well, let's just make our categories
a little more granular. So for example, instead of them being one glass of water wide, maybe you make them half
a glass of water wide. So this first category could be greater than or equal to
zero, and less than 0.5, and that will give you a clearer picture, and I'm now assuming a world where we have more than 16 data points, maybe we have 16 million data points, this would be percentages
on the left hand side. But maybe that isn't good enough for you, maybe you wanna get even more granular. So you make everything, each category, a quarter of a glass. But maybe that doesn't satisfy, you wanna get more and
more and more granular. Well, you could imagine
where this is going. You could get to a point
where you're approaching an infinite number of categories, and each category is infinitely thin, is super, super thin, to a point that if you just connect
the tops of the bars, you will actually get a curve. And this type of curve is something that we actually use in statistics, and, as promised at the
beginning of the video, this is the density curve we talk about. And what's valuable about a density curve is that it is a visualization of a distribution where the data points can take
on any value in a continuum. They're not just thrown
into these coarse buckets. So how would you interpret
something like this? If you look over the entire interval from zero, let's say, to
nine, assuming no one drank more than an average of
nine glasses per day, even in our 16 million data points, well then the area under
the curve over that interval is going to be 100%, or 1.0. This is going to be true
for any density curve: the entire area under the curve is 100%, because it represents all of the data points. A density curve will also
never take on a negative value, you won't see the curve dip down and do something strange like that. Now, with that out of the way, let's think about how
we would make use of it. If I wanted to know what
percentage of my data falls between two and four glasses, well I would look at that interval. I'd go from two to four, I would look at this
interval right over here, and I would try to figure out
the area under the curve here. And this area is going to be
greater than or equal to zero, and less than or equal to 100%. When I eyeball it right over here, it looks like it's about 40% of the entire area under the curve, so just eyeballing it, I would
say roughly 40% of my data falls into this interval. If I were to ask you what
percentage of the data is greater than three, well then you would be
looking at this area, and it looks like it is about 50%, but once again, I am estimating it. But you can start to see
how, even with estimation, a density curve could be useful. In the real world, statisticians
will often have tables that might represent the
information for the density curve, they might have computer programs or some type of automated tool, and there are also
well-known density curves, like the famous bell curve that
we will study later on, where there's a lot of precise data and a lot of tools to
exactly figure out the areas. The last thing I'd like to cover is a key misconception for density curves. If I were to ask you,
approximately what percentage of my data is exactly three
glasses of water per day? And when I say exactly, I mean exactly the number 3.000 with zeroes
just going on and on forever, the exact number three. Well, you might be tempted to
just say okay, this is three. Let me see the corresponding
point on the curve. It looks like it is about 0.2,
or a little higher than that, so maybe you would say a
little bit more than 20%, or approximately 20%. And what I would say to
you, is this is wrong. Remember, the percentage
of the data in an interval is not the height of the curve, it is the area under the
curve in that interval. And if we're just talking
about one precise value, like exactly the number three, there is no area under the curve. This vertical line that I just drew over the number three has no width, and this actually makes
sense in the real world. Even if you were to look
at 16 million people, it is very unlikely that anyone would drink exactly three
glasses of water per day. I'm talking about not one atom more or one atom less than three glasses. There might be many people
between 2.9 and 3.1, but no one is exactly three glasses a day. When someone says I'm drinking three glasses of water per day, that would be a rough estimate. They're probably 3.001, or 2.99999, or 3.15, or whatever else. And so instead, you could say what percentage falls in the interval, maybe, that is greater
than or equal to 2.9 and less than or equal to 3.1. And so once you have an interval, then you actually can look at the area, so we're gonna go from 2.9 to 3.1, so now we have an interval
that actually has width, and so it'd be roughly the size of this yellow area that I'm
shading in right over here, and we can approximate it with a rectangle even though the top of
this curve isn't flat, but we can say, look, it's approximately like a rectangle that is 0.2 high, and what's the width? The width here, if we're
going from 2.9 to 3.1, the width is going to be 0.2, and so we could approximate this area with
the area of the rectangle. 0.2 times 0.2, that would
give us an area of 0.04. Or we could say
approximately 4% of the data falls in this interval.
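
The ideas in this transcript (total area of 100%, the share of data in an interval as an area, zero area at a single value, and the rectangle approximation) can be checked numerically. The sketch below is only an illustration: the function `density` is a hypothetical triangular curve on the interval from zero to nine with its peak at three, chosen because its height at three is a little above 0.2. It is an assumption, not the curve drawn in the video, and the helper `area` is a simple midpoint-rule estimate rather than any standard library routine.

```python
def density(x: float) -> float:
    """Hypothetical triangular density on [0, 9] with its peak at x = 3.

    This is an assumed stand-in, not the curve drawn in the video.
    """
    if 0.0 <= x <= 3.0:
        return 2.0 * x / 27.0          # rises from 0 up to 2/9 (about 0.22)
    if 3.0 < x <= 9.0:
        return 2.0 * (9.0 - x) / 54.0  # falls from 2/9 back down to 0
    return 0.0                         # never negative; zero outside [0, 9]


def area(lo: float, hi: float, steps: int = 10_000) -> float:
    """Estimate the area under `density` from lo to hi with the midpoint rule."""
    if hi <= lo:
        return 0.0  # an interval with no width has no area
    width = (hi - lo) / steps
    return sum(density(lo + (i + 0.5) * width) for i in range(steps)) * width


total = area(0.0, 9.0)                   # the whole curve
between_2_and_4 = area(2.0, 4.0)         # share of data between 2 and 4 glasses
greater_than_3 = area(3.0, 9.0)          # share of data above 3 glasses
exactly_3 = area(3.0, 3.0)               # a single value has no width
near_3 = area(2.9, 3.1)                  # a narrow interval around 3
rectangle = density(3.0) * (3.1 - 2.9)   # height times width approximation

print(f"total area:          {total:.4f}")
print(f"between 2 and 4:     {between_2_and_4:.4f}")
print(f"greater than 3:      {greater_than_3:.4f}")
print(f"exactly 3:           {exactly_3:.4f}")
print(f"between 2.9 and 3.1: {near_3:.4f}")
print(f"rectangle estimate:  {rectangle:.4f}")
```

With this assumed curve, the total area comes out at 1, the area between 2 and 4 is about 39%, and the narrow interval from 2.9 to 3.1 holds a bit over 4% of the data, close to the height-times-width estimate. The share above three is about 67% rather than the roughly 50% eyeballed in the video, because the shape here is a stand-in and not the video's curve.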