# Density Curves

An introduction to density curves for visualizing distributions, with a brief review of frequency histograms and relative frequency histograms.

## Want to join the conversation?

• What's the unit on the y-axis once you convert the frequency histogram to a density curve? You marked the peak as 0.2 in the density curve. What does 0.2 stand for?
• I was struggling with this as well. The unit on the y-axis of the density curve is not percentage; it is density, that is, percentage per unit of the x-axis. Density values can be greater than 1. In the relative frequency histogram the y-axis was percentage, but in the density curve the y-axis is density and the area gives the percentage. When creating the density curve, the values on the y-axis are scaled so that the total area under the curve is 1. This lets us pick an arbitrary interval on the x-axis and calculate the percentage for that continuous interval from the area, as opposed to reading the percentage for a discrete bucket off the histogram. It can be confusing because the histogram and the density curve have the same visual shape, but the units on their y-axes are different things. I assume this is explained better than I can in a video down the line.
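To make the "density, not percentage" point concrete, here is a minimal sketch that rescales bar heights so the total area is 1. The data values, the bin edges, and the helper name `density_heights` are all made up for illustration; they are not the numbers from the video.

```python
def density_heights(data, edges):
    """Return density-scaled bar heights for bins defined by consecutive edges."""
    n = len(data)
    heights = []
    for i, (left, right) in enumerate(zip(edges, edges[1:])):
        is_last = i == len(edges) - 2
        # Count values in [left, right); the last bin also includes its right edge.
        count = sum(1 for x in data if left <= x < right or (is_last and x == right))
        # Density = relative frequency divided by bin width.
        heights.append(count / n / (right - left))
    return heights


# Hypothetical measurements packed into a narrow range (assumed, not from the video).
data = [2.9, 3.0, 3.1, 3.1, 3.2, 3.2, 3.3, 3.4]
edges = [2.75, 3.0, 3.25, 3.5]

heights = density_heights(data, edges)
widths = [right - left for left, right in zip(edges, edges[1:])]
total_area = sum(h * w for h, w in zip(heights, widths))

print(heights)      # bar heights in "density" units
print(total_area)   # area under all bars
```

Because the bins here are only 0.25 wide, the middle bar ends up taller than 1 even though it holds well under 100% of the data. The percentage for a bar is its area (height times width), not its height.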
• Why, if we become more granular and there are therefore fewer data points per rectangle, doesn't the percentage decrease?
• Because there are more rectangles.

When you have a relative frequency histogram (discrete), you plot the probability that the data falls in each of those rectangles, and each data value falls in exactly one of them. If you add them all up you get 100%.

Imagine that you have a continuous data set (like the quantity of water that people drink every day), and you try to fit the values into your discrete rectangles. Then each rectangle has to be a range, and you plot the probability that someone drinks an amount of water within that range.

What you are thinking makes a lot of sense. If I simply plot the probability of someone drinking an amount of water within a smaller range, then the probability should be smaller, and the shape of the rectangles should change, being squashed.

BUT, then the area under the curve wouldn't be equal to 100%, because the total range is the same but the probabilities are smaller.

What those numbers represent is the probability that some value falls into each range, DIVIDED by the width of the range. Those values are the probability PER UNIT. Now it makes sense to have infinitely many small rectangles.

If you plot the probability divided by the range, then you can take an integral, summing up infinitely many infinitely thin rectangles, each with an area equal to the probability of some value falling into it. That area gets smaller as we make the rectangles narrower, because the width shrinks, but the height does not.

In summary, what you thought was the height of the rectangle is in reality the area of that rectangle. The height is the probability per unit of range, and the width is the range in which we could include values.
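The "probability divided by the range" idea can be checked numerically. The sketch below is only an illustration under assumptions: the data are 10,000 evenly spaced quantiles of a bell curve centered at 3 glasses (made up, not the data from the video), and `histogram` is a hypothetical helper. It compares the tallest bar in relative frequency units and in density units as the bins get narrower.

```python
import math
from collections import Counter
from statistics import NormalDist


def histogram(data, width):
    """Return {bin_index: (relative_frequency, density)} for equal-width bins."""
    n = len(data)
    # Bin k covers the range [k * width, (k + 1) * width).
    counts = Counter(math.floor(x / width) for x in data)
    return {k: (c / n, c / n / width) for k, c in counts.items()}


# Hypothetical data: evenly spaced quantiles of a bell curve (assumed shape).
dist = NormalDist(mu=3, sigma=1)
data = [dist.inv_cdf((i + 0.5) / 10_000) for i in range(10_000)]

summary = {}
for width in (1.0, 0.5, 0.25):
    bins = histogram(data, width)
    tallest_rel_freq = max(rf for rf, _ in bins.values())
    tallest_density = max(d for _, d in bins.values())
    total_area = sum(d * width for _, d in bins.values())
    summary[width] = (tallest_rel_freq, tallest_density, total_area)
    print(f"width={width}: tallest relative frequency={tallest_rel_freq:.3f}, "
          f"tallest density={tallest_density:.3f}, total area={total_area:.3f}")
```

With these assumed data, the tallest relative frequency roughly halves each time the bin width halves, while the tallest density settles near the peak of the curve and the total area stays at 1. That is the squashing the question anticipated, and it is the reason the curve is drawn in density units instead.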
• When we switch from histogram to a density curve, we are told to conceptualize the buckets as getting infinitesimally narrow until they are no longer represented by bars but by a line connecting their tops.

However, the likelihood of any data point matching a given value becomes smaller and smaller as our buckets get narrower and narrower. So if you could zoom in, it would look like a spike followed by zero for a while, then another spike.

Is the best way to conceptualize this as incredibly small buckets connected by a sort of trend line that is "stiff" (i.e., not dropping to zero in the space between buckets)? Or should the x-axis of density curves be conceptualized as legitimately continuous values (i.e., not buckets)? And if so, why isn't the line at 0 most of the time?
• Not to be picky, but it would be a molecule, right?
• "So if there are no values that are exactly 3": this part confuses me a little bit. What if you asked how many data points drank 2.9 glasses of water, which was an original data point? Is there any way to tell what % drank 2.9 glasses of water, or do you just have to assume that it isn't exactly 2.9 glasses?
• He mentioned that you have "more data points" and want "more granular categories", which suggests that the data set used is not the same one as in the first histogram. ( - 16 million data points)

It seems like the lesson assumed that the first data set is discrete, 'cause it's countable (16 students), and the precision is only 1 decimal digit. It takes a huge number of data points to construct a distribution function of a continuous variable, and the variable itself, when described as a curve, is not countable.

In reality, the number of glasses of water drunk per day by, say, 1 billion people, with each data point written with 10 decimal digits, is a continuous variable. The probability that someone drinks exactly 2.9000000000 glasses of water/day is close to 0.
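One way to write this down, as a sketch using the standard definition of a density curve: here $f$ is the density curve and $X$ is the number of glasses per day, both symbols introduced only for illustration.

```latex
% Probability of an interval is the area under the density curve.
P(a \le X \le b) = \int_a^b f(x)\,dx

% A single exact value is an interval of width zero, so its area is zero.
P(X = c) = \int_c^c f(x)\,dx = 0

% A value reported as "2.9" to one decimal place really stands for a small interval.
P(2.85 \le X < 2.95) = \int_{2.85}^{2.95} f(x)\,dx \approx f(2.9) \cdot 0.1
```

So a question like "what percent drank 2.9 glasses?" has a nonzero answer only when 2.9 is read as a rounded value covering a range, not as an exact number.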
• With a discrete (think that's the right word) dataset, making the graph more and more granular wouldn't result in a curve like the one Sal shows, right?** So he's talking about a continuous dataset.

I understand that the area under the curve is the probability of getting an X in that interval. But for a relative frequency histogram, the tops of the rectangles represent that probability. So the analogy of a relative frequency histogram with infinitely thin rectangles being the density curve is incorrect, right?

Even if our dataset wasn't specific values but instead a continuous variable X, and we imagine a relative frequency histogram with very thin rectangles, then all the tops of the rectangles would be very close to zero, right?

So a relative frequency histogram with infinitely thin rectangles is not the density curve?

** If we take Sal's data set and the relative frequency histogram, and we make the histogram's rectangles very thin, we won't get a curve; we'll get a bunch of very thin rectangles with their tops at zero, and then thin rectangles at all the data points with a 1/16 probability (except for 3.2, which has a 2/16 probability since it appears twice in the dataset).
• I think the example of the number of glasses of water doesn't suit the density curve, since it's a discrete variable (integer/30). If you drink 90 glasses of water per 30 days, you are drinking 3.00000 glasses of water per day on average, and one atom less can't make a glass of water not a glass of water. So there should be some probability that somebody drinks exactly 3.00000 glasses of water per day.