Describing and comparing distributions
- [Voiceover] What I wanna do with this video is look at some examples of data represented in different ways, and think about which representation is best for answering different questions.
So we see this first example. A statistician recorded the length of each of Pixar's first 14 films. The statistician made a dot plot (each dot is a film), a histogram, and a box plot to display the running time data. Which display could be used to find the median?
All right, so let's look at these displays. So over here we see, this is the dot plot. We have a dot for each of the 14 films. So one film had a running time of 81 minutes. We see that there. One film had a running time of 92. One had a running time of 93. We see one had a running time of 95. We see two had running times of 96 minutes, and so on and so forth. So I claim that I could use this to figure out the median, because I could make a list of all of the running times of the films, I could order them, and then I could find the middle value. I could literally make a list. I could write down 81, and then write down 92, then write down 93, then write down 95, then I could write down 96 twice, and then I could write down 98, then I could write down 100. I think you see where this is going. I could write out the entire list, and then I could find the middle values. So the dot plot, I could definitely use to find the median.
Now, what about the histogram? This is the histogram right over here. And the key here is, to figure out a median, I need to be able to figure out the list of numbers. So here, they say I have one film that's between 80 and 85, but I don't know its exact running time. Its running time might have been 81 minutes, its running time might have been 84 minutes.
So I don't know here, and so I can't really make a list of the running times of the films and find the middle values, so I don't think I'm gonna be able to do it using the histogram. So I'm not gonna click histogram.
Now, with the box plot right over here, I might not be able to make a list of all the values, but the box plot explicitly tells us what the median is. This middle line in the middle of the box, that tells us the median. If this is 100, this is 99. So this is 95, 96, 97, 98, 99. It explicitly tells us the median is 99. This is actually the easiest for finding the median. So I'll go with the box plot as well. So the histogram is of no use to me if I wanna calculate the median.
Let's do a couple more of these. Nam owns a used car lot. He checked the odometers of the cars and recorded how far they had driven. He then created both a histogram and a box plot to display the same data. Both diagrams are shown below. Which display can be used to find how many vehicles had driven more than 200,000 kilometers?
So it looks like here in this histogram, I have three vehicles that were between 200 and 250, and then I have two vehicles that are between 250 and 300. So it looks pretty clear that I have five vehicles, three that had a mileage between 200,000 and 250,000, and then two that had a mileage between 250,000 and 300,000. So I can answer the question. Five vehicles had a mileage more than 200,000, and so I would say that the histogram is pretty useful.
But let's verify that the box plot isn't so useful. So I wanna know how many vehicles had a mileage more than 200,000.
Well, I know that if I have a mileage more than 200,000, I'm going to be in the fourth quartile, but I don't know how many values I have sitting there in the fourth quartile just looking at this data over here, so that's not gonna be useful for answering that question.
Let's look at the second question. Which display can be used to find that the median distance was approximately 140,000 kilometers? Well, to calculate the median, you essentially wanna be able to list all of the numbers and then find the middle number. And over here, in the histogram, I can't list all of the numbers. I know that there's three values that are between zero and 50,000 kilometers, but I don't know what they are. Could be 10,000, 10,000, 10,000. It could be 10,000, 15,000, and 40,000. I don't know what they are, and so if I can't list all of these things and put them in order, I really am going to have trouble finding the middle value. The middle value, it's going to be in this range right around here, but I don't know exactly what it's going to be. The histogram is not useful, because it's throwing all the values into these buckets.
While on the box plot, it directly tells me the median value. This line right over here, the middle of the box, this tells us the median value, and we see that the median value here, this is 140,000 kilometers. Right, this is 100, 110, 120, 130, 140,000 kilometers is the median mileage for the cars. And so the box plot clearly gives us that data.
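To make the contrast between the displays concrete, here is a minimal Python sketch. The numbers in it are hypothetical and are not the data from the video, and the function names are made up for illustration. It shows that when every value is known (as in a dot plot) the median can be computed exactly, while bin counts alone (as in a histogram) only tell us which bucket the middle value falls in, though they are enough to count how many values sit at or above a bucket boundary.

```python
from statistics import median

# Hypothetical individual values, as could be read off a dot plot.
values = [12, 47, 53, 58, 71, 88, 94]

# The same hypothetical data as a histogram: only bin edges and counts survive.
bin_edges = [0, 25, 50, 75, 100]   # bins are 0-25, 25-50, 50-75, 75-100
counts = [1, 1, 3, 2]


def median_bin(bin_edges, counts):
    """Return the (low, high) bin holding the middle value.

    For an even number of values this is the bin of the lower of the two
    middle values. The exact median cannot be recovered from counts alone.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no data")
    target = (total - 1) // 2  # 0-based position of the (lower) middle value
    running = 0
    for i, count in enumerate(counts):
        running += count
        if running > target:
            return (bin_edges[i], bin_edges[i + 1])


def count_at_or_above(bin_edges, counts, threshold):
    """Count values in bins that start at or above a threshold.

    The threshold must be a bin edge; otherwise the histogram cannot say
    how a bin's values split around it.
    """
    if threshold not in bin_edges:
        raise ValueError("threshold must coincide with a bin edge")
    return sum(c for low, c in zip(bin_edges, counts) if low >= threshold)


exact_median = median(values)                      # needs the full list
bucket = median_bin(bin_edges, counts)             # histogram: only a range
n_high = count_at_or_above(bin_edges, counts, 50)  # histogram: exact count

print(exact_median, bucket, n_high)
```

With these made-up numbers, the full list gives a single median, the histogram narrows it only to a bucket, and the count of values at or above 50 comes straight from adding the bar heights, which mirrors the reasoning in the used car example.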