
## Random sampling and data collection


# Techniques for random sampling and avoiding bias

AP.STATS: DAT‑2 (EU), DAT‑2.C (LO), DAT‑2.C.3 (EK), DAT‑2.C.4 (EK)

## Video transcript

- [Instructor] Let's say that we run a school, and in that school there is a population of students right over here. And that is our population. And we want to get a sense of how these students feel about the
quality of math instruction at the school, so we construct a survey, and we just need to decide
who are we going to get to actually answer this survey. One option is to just go to
every member of the population, but let's just say it's
a really large school. Let's say we're a college and there are 10,000 people in the college. We say, well, we can't
just talk to everyone. So instead, we say, let's
sample this population to get an indication of how
the entire school feels. So we are going to sample it. We are going to sample that population. Now in order to avoid having bias in our response, in order for it to have the best chance of
it being indicative of the entire population, we want
our sample to be random. So our sample could either be random or not random. And it might seem, at first,
pretty straightforward to do a random sample, but when
you actually get down to it, it's not always as straightforward
as you would think.

So one type of random sample is just a simple random sample. And this is saying, all right, let me assign a number to every person in the school (maybe they already have a student ID number), and I'm just going to get a computer, a random number generator, to pick the students that I'm going to give the survey to. Let's say it's a sample of 100 students. So that would be a simple random sample. We are just going into this whole population and randomly picking people out, and we know it's random because a random number generator, or a string of random numbers or something like that, is what's picking the students. Now that's pretty good, it's
unlikely that you're going to have bias from this
sample, but there is some probability that, just by chance, your random number generator
just happened to select maybe a disproportionate
number of boys over girls, or a disproportionate number of freshmen, or a disproportionate
number of engineering majors versus English majors,
and that's a possibility. So even though you are
taking a simple random sample that is truly random, once
again, there's some probability that it's not indicative
of the entire population. And so to mitigate that,
there are other techniques at our disposal.

One technique is a stratified sample. And so this is the idea of taking our entire population and essentially stratifying it. So we take that same population, I'll draw it as a square here just for convenience, and let's say we're concerned that we get an appropriate sample of freshmen, sophomores, juniors, and seniors. So we'll stratify it by freshmen, sophomores, juniors, and seniors, and then we sample 25 from each of these groups. So these are the strata: freshmen, sophomores, juniors, and seniors. And instead of just sampling 100 out of the entire pool, we sample 25 from each of these. And so that makes sure that you are getting indicative responses from each of the different age groups or levels within your university.

Now there might be another
issue where you say, well, I'm actually more
concerned that we have accurate representation of
males and females in the school, and there is some probability, you know, if I do 100
random people, it's very likely that it's close to
50/50, but there's some chance, just due to randomness,
that the sample is disproportionately male or disproportionately female. And that's even possible in the stratified case. And so what you might say is, well, there's a technique called a clustered sample, and what we do is we sample groups, where we feel confident that each of those groups has a good balance of males and females. So, for example, instead of sampling individuals from the entire population, and as you can tell this is not a trivial thing to do, let's just say that we can split our population into groups, maybe these are classrooms, and each of these classrooms has an even distribution of males and females, or pretty close to an even distribution. And so what we do is we
sample the actual classrooms, so that's why it's called
cluster, or cluster technique, or clustered random
sample, because we're going to randomly sample our
classrooms, each of which has a close or maybe an exact balance of males and females, so we know that we're gonna get good representation. We are still sampling, we are sampling from the clusters, but then we're gonna survey every single person in each of the clusters we pick, every single person in each of those classrooms. So, once again, these are all forms of random samples: you have the simple random sample, you can stratify, or you can cluster, randomly pick the clusters, and then survey everyone in those clusters.

Now if these are all random samples, what are the non-random things like? Well, one case of
non-random, you could have a voluntary survey, or voluntary sample,
and this might just be you tell every student at the school, "Hey, here's a web address. If you're interested, come
and fill out this survey." And that's likely to
introduce bias because maybe the students who really like the math instruction at their school are more likely to fill it out, maybe the students who really don't like it are more likely to fill it out, or maybe it's just the kids who have more time who are more likely to fill it out. So this has a good chance
of introducing bias. The students who fill out the survey might be just more skewed
one way or the other because, you know, they volunteered for it.

Another non-random sample is one where you're introducing bias because of convenience, often called a convenience sample, and this might say, well, let's just sample the first 100 students who show up at school. And that's just convenient for me because I didn't have to use random numbers, or do the stratification, or do any of this clustering, but you can understand how
this also would introduce bias, because the first 100 students
who show up at school, maybe those are the
most diligent students, maybe they all take an
early math class that has a very good instructor where
they're all happy about it. Or it might go the other way, the instructor there
isn't the best one, and so it might introduce bias the other way. So if you let people
volunteer, or you just say, "Oh, let me do the first N students," or you say, "Hey, let me just talk to all of the students who happen to be in front of me right now," they might be in front of you out of convenience, but they might not be a true random sample.

Now there are other reasons why you might introduce bias, and it might not be because of the sampling. You might introduce bias because of the wording of your survey. You could imagine a survey that says, do you consider yourself
lucky to get a math education that very few other people
in the world have access to? Well, that might bias you to say, "Well, yeah, I guess I feel lucky." But if the wording was, do you like the fact that disproportionately more students at your school tend to fail algebra than at our surrounding schools? Well, that might bias you negatively. So the wording really, really, really matters in surveys, and there is a lot that goes into this.

And the other one is called response bias. And this is just people not
wanting to tell the truth or maybe not wanting to respond at all. Maybe they're afraid that somehow their response is gonna show up in front of their math
teacher or the administrators, or if they're too negative, it might be taken out on them in some way. And because of that, they
might not be truthful, and so they might be overly positive or not fill it out at all.

So anyway, this is a very high-level overview of how you could think about sampling. You want to go random because it lowers the probability of introducing some bias into it. And then these are some techniques. And also think about whether you're falling into some of these pitfalls that have a good chance of introducing bias.
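
To make the three random techniques from the transcript concrete, here is a minimal Python sketch of a simple random sample, a stratified sample, and a cluster sample. It is only an illustration under stated assumptions: the 10,000-student population, the `year` and `classroom` fields, the classroom size of 25, and the function names are all made up for this example and do not come from the video.

```python
import random
from collections import defaultdict

# Hypothetical population: 10,000 students, each with an ID, a class year,
# and a classroom of 25. Field names and sizes are illustrative assumptions.
YEARS = ["freshman", "sophomore", "junior", "senior"]
population = [
    {"id": i, "year": YEARS[i % 4], "classroom": i // 25}
    for i in range(10_000)
]

def group_by(students, key):
    """Group student records by the value stored under `key`."""
    groups = defaultdict(list)
    for s in students:
        groups[s[key]].append(s)
    return groups

def simple_random_sample(students, n, rng):
    """Pick n students so that every student is equally likely to be chosen."""
    return rng.sample(students, n)

def stratified_sample(students, key, per_stratum, rng):
    """Split into strata by `key`, then take a simple random sample from each."""
    picked = []
    for members in group_by(students, key).values():
        picked.extend(rng.sample(members, per_stratum))
    return picked

def cluster_sample(students, key, n_clusters, rng):
    """Randomly pick whole clusters, then survey everyone in the chosen ones."""
    clusters = group_by(students, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [s for c in chosen for s in clusters[c]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
srs = simple_random_sample(population, 100, rng)        # 100 individuals
strat = stratified_sample(population, "year", 25, rng)  # 25 from each year
clus = cluster_sample(population, "classroom", 4, rng)  # 4 whole classrooms
```

Each call yields 100 students, but by a different route: the simple random sample draws individuals from the whole pool, the stratified sample guarantees 25 from each class year, and the cluster sample randomly chooses four classrooms and keeps every student in them.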

AP® is a registered trademark of the College Board, which has not reviewed this resource.