Video transcript

What we're going to do in this video is think about how to visualize distributions of data and then how to analyze those visualizations, and we will eventually get to something known as a density curve. But let's start with a simple example, just to review some concepts.

Let's say I go to 16 students and I ask them to measure how many glasses of water they drink per day for the last 30 days, and then to average it. So this data point right over here tells us one student drank an average of 0.5 glasses of water per day. That person is probably very dehydrated. This person drank 8.1 glasses of water per day on average for the last 30 days. They are better hydrated.

If we want to visualize that, we can set up a frequency histogram, where we create some categories. This first category would be for data points that are greater than or equal to 0 and less than 1, and we can see that two data points fall into that category. That's why the bar right over here for that category goes up to 2. This category right over here is greater than or equal to 3 and less than 4. Notice there are four data points in that category, and on this frequency histogram the height of the bar is indeed 4.

So this is a nice way of looking at a distribution, but you might be more concerned with what percentage of the data falls into each of these categories. That becomes especially interesting if we have many, many data points. If we had, you know, 1,600,432,507 data points, well, just knowing the absolute number that fits into each category isn't so useful. The percent that fits into each category is a lot more useful. For that we could set up a relative frequency histogram. Notice this is representing the same data, but in that first category, instead of the bar height being 2, the bar height is now 12.5%. Why is that? Because 2 of the 16 data points fall into that category, and 2/16 is 1/8, which is 12.5%. In this one right over here, notice that instead of the height being 4, for 4 data points,
it's now 25%. But these are saying the same thing: 4 out of the 16 data points fall into this category, and 4/16 is 1/4, which is 25%.

So both of these types of histograms are really useful, and you will see them used all of the time. But there are also cases where you have many, many more data points and you want more granular categories. So what you could do is, well, let's just make our categories a little more granular. For example, instead of them being one glass of water wide, maybe make them half a glass of water wide. So this first category could be greater than or equal to 0 and less than 0.5, and that will give you a clearer picture. I'm now assuming a world where we have more than 16 data points. Maybe we have 16 million data points. These would be percentages on the left-hand side.

But maybe that isn't good enough for you. Maybe you want to get even more granular, so you make each category a quarter of a glass wide. But maybe that doesn't satisfy you, and you want to get more and more granular. Well, you can imagine where this is going. You could get to a point where you're approaching an infinite number of categories, and each category is infinitely thin, super super thin, to the point that if you just connect the tops of the bars, you will actually get a curve. This type of curve is something that we actually use in statistics, and as promised at the beginning of the video, this is the density curve we talked about.

What's valuable about a density curve is that it is a visualization of a distribution where the data points can take on any value in a continuum. They're not just thrown into these coarse buckets. So how would you interpret something like this? If you look over the entire interval, from 0 to, let's say, 9, assuming no one drank more than an average of nine glasses per day, even in our 16 million data points, well, then the area under the curve over that interval is going to be 100%, or 1.0. This is going to be true for any density curve, that
the entire area under the curve is 100%. It represents all of the data points. A density curve will also never take on a negative value. You won't see the curve dip down and do something strange like that.

Now with that out of the way, let's think about how we would make use of it. If I wanted to know what percentage of my data falls between two and four glasses, well, I would look at that interval. I'd go from two to four, I would look at this interval right over here, and I would try to figure out the area under the curve here. This area is going to be greater than or equal to zero and less than or equal to 100%. When I eyeball it right over here, it looks like it's about 40% of the entire area under the curve. So just eyeballing it, I would say roughly 40% of my data falls into this interval. If I were to ask you what percentage of the data is greater than three, well, then you would be looking at this area, and it looks like it is about 50%. But once again, I am estimating it.

But you can start to see how, even with estimation, a density curve could be useful. In the real world, statisticians will often have tables that might represent the information for the density curve. They might have computer programs or some type of automated tool. And there are also well-known density curves, like the famous bell curve that we will study later on, where there's a lot of precise data and a lot of tools to exactly figure out the areas.

The last thing I'd like to cover is a key misconception about density curves. If I were to ask you approximately what percentage of my data is exactly three glasses of water per day, and when I say exactly, I mean exactly the number 3.000, with the zeros going on and on forever, the exact number three, well, you might be tempted to just say: okay, this is three, let me see the corresponding point on the curve. It looks like it is about 0.2, or a little higher than that. So maybe you would say a little bit more than 20%, or approximately 20%, and
what I would say to you is that this is wrong. Remember, the percentage of the data in an interval is not the height of the curve. It is the area under the curve in that interval. And if we're just talking about one precise value, like exactly the number three, there is no area under the curve. This vertical line that I just drew over the number three has no width.

This actually makes sense in the real world. Even if you were to look at 16 million people, it is very unlikely that anyone would drink exactly three glasses of water per day. I'm talking about not one atom more or one atom less than three glasses. There might be many people between 2.9 and 3.1, but no one is at exactly three glasses a day. When someone says "I'm drinking three glasses of water per day," that would be a rough estimate. They're probably at 3.0001 or 2.99999 or 3.15 or whatever else.

So instead you could ask what percentage falls into an interval, maybe one that is greater than or equal to 2.9 and less than or equal to 3.1. Once you have an interval, then you actually can look at the area. So we're going to go from 2.9 to 3.1, and now we have an interval that actually has width. It would be roughly the size of this yellow area that I'm shading in right over here. We can approximate it with a rectangle, even though the top of this curve isn't flat. We could say: look, it's approximately like a rectangle that is 0.2 high. And what's the width? Going from 2.9 to 3.1, the width is going to be 0.2. So we could approximate this area with the area of the rectangle: 0.2 times 0.2 gives us an area of 0.04, or we could say approximately 4 percent of the data falls in this interval.
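
The arithmetic in the transcript can be checked with a short Python sketch. The counts (2 of 16, 4 of 16) and the rectangle (0.2 high, from 2.9 to 3.1) come from the video. Everything else is an assumption made for illustration: the function names are hypothetical, and `triangular_density` is a stand-in curve chosen only because it stays non-negative, has total area 1, and peaks a little above 0.2 at three glasses. It is not the curve drawn in the video, so it will not reproduce the eyeballed 40% and 50% figures.

```python
def relative_frequency(count, total):
    """Share of all data points that fall into one histogram category."""
    return count / total


def triangular_density(x, low=0.0, peak=3.0, high=9.0):
    """Hypothetical stand-in density: zero outside [low, high], tallest at `peak`."""
    top = 2.0 / (high - low)  # peak height that makes the total area exactly 1
    if x < low or x > high:
        return 0.0
    if x <= peak:
        return top * (x - low) / (peak - low)
    return top * (high - x) / (high - peak)


def area_under(density, a, b, steps=10_000):
    """Approximate the area under `density` from a to b with many thin rectangles."""
    if a == b:
        return 0.0  # a single exact value has no width, so it has no area
    width = (b - a) / steps
    # Evaluate the curve at the midpoint of each thin slice and add up the slices.
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width


# Relative frequencies from the 16-student example in the transcript.
first_bar = relative_frequency(2, 16)   # 2/16 = 1/8
fourth_bar = relative_frequency(4, 16)  # 4/16 = 1/4

# Rectangle approximation from the transcript: height 0.2, width 3.1 - 2.9.
rectangle_estimate = 0.2 * (3.1 - 2.9)

# The same questions asked of the stand-in curve.
total_area = area_under(triangular_density, 0.0, 9.0)
exactly_three = area_under(triangular_density, 3.0, 3.0)
near_three = area_under(triangular_density, 2.9, 3.1)

print(f"first bar: {first_bar:.1%}, fourth bar: {fourth_bar:.1%}")
print(f"rectangle estimate for 2.9 to 3.1: {rectangle_estimate:.3f}")
print(f"total area under the stand-in curve: {total_area:.4f}")
print(f"area at exactly 3: {exactly_three}")
print(f"area from 2.9 to 3.1 under the stand-in curve: {near_three:.4f}")
```

The point of the comparison is the last three lines: the whole curve accounts for essentially all of the data, a single exact value accounts for none of it, and a narrow interval around three glasses comes out close to the rectangle estimate of about 4 percent.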