Techniques for random sampling and avoiding bias

Want to join the conversation?

  • Aarti Jain
    What is the difference between a cluster sample and a stratified random sample?
    (33 votes)
  • Tim
    Is it possible that the clustering technique itself can introduce bias? Sal's example of sampling by classroom might allow selection of an even male/female sample, but isn't this a bit risky? Factors that affect the outcome (maybe more strongly than gender) may cluster in classrooms, e.g. teacher quality, classroom resources, social groups, or some unknown factor(s). This may return us to the earlier problem of random/fair sampling from among clusters.
    (20 votes)
    • daniella
      Yes, the clustering technique itself can introduce bias if factors that affect the outcome are clustered within the groups being sampled (in this case, classrooms). For example, if classrooms differ significantly in teacher quality, resources, or peer influences, sampling by classroom may not adequately represent the diversity within the school population. To mitigate this risk, careful consideration should be given to how clusters are defined and whether each cluster is internally diverse enough to resemble the population as a whole.
      (2 votes)
  • jacqueline oien
    When would you use non-random sampling?
    (13 votes)
  • Jozefm
    Sal mentions that in a stratified sample he could take 25 students from each year to make up the 100-student sample. But what if there are, say, 50% more seniors than juniors? Wouldn't you have to take more from the seniors in order to reduce the bias?
    (6 votes)
    • daniella
      In a proportionate stratified sample, the number drawn from each stratum (e.g., each year level) matches that stratum's share of the population. If there are more seniors than juniors, you would sample more seniors to keep the sample proportional. Taking 25 from each year, as in the video, is still a stratified sample, but unless every year is the same size, the responses would need to be weighted by year size so that no year is over- or under-represented in the overall estimate.
      (2 votes)
  • felix.erlandson
    Can't you just instead split your age group sample into genders too?
    (3 votes)
    • Jerry Nilsson
      That's definitely a possibility.

      Cluster surveys are quick and effective, though.
      Instead of tracking down people one by one, half of whom will probably say that they don't have time to answer,
      you just go into the classroom and wait for five minutes, and because they are in a group it's much easier to get a response from everyone.
      (4 votes)
  • peterbpesch
    Doesn't the clustered system introduce a lot of bias?

    For instance, in the example in the video they seem to choose a single class in each of the 4 years.
    - among students of a single class, there's a lot more shared history than between randomly chosen students. So that will probably influence the results.
    - if one of the chosen classes is significantly smaller or bigger than one of the other chosen classes, that year will be over- or under-represented ...
    (4 votes)
    • daniella
      Indeed, the clustered sampling method, as described, may introduce biases due to shared experiences within selected classes and variations in class sizes. By choosing only one class from each year level, the survey may inadvertently reflect the unique dynamics and characteristics of those specific classes rather than providing a representative sample of the entire student population. Additionally, unequal class sizes could lead to disproportionate representation of certain year levels, further skewing the results. To mitigate these biases, it's crucial to implement random or systematic approaches for class selection, consider stratification based on relevant factors, increase the number of sampled clusters, and employ robust data analysis techniques to account for discrepancies and ensure the reliability of the survey findings.
      (1 vote)
  • anikapruthi
    When do you use stratified sampling vs clustered sampling besides cluster sampling being more for geographical purposes?
    (3 votes)
  • peterbpesch
    Doesn't the stratified method also introduce some bias?

    For instance, the method used in the example in this video assumes that the school has an equal number of students in each of the 4 years ...
    (2 votes)
    • daniella
      While the stratified method aims to reduce bias by ensuring representation of different subgroups, it can still introduce bias if the stratification criteria are not chosen appropriately or if the sampling within each stratum is not conducted properly. For example, if the strata are defined based on characteristics that are not relevant to the research question or if the sample size within each stratum is not proportional to its size in the population, bias may occur. Additionally, stratification does not eliminate bias entirely but rather aims to minimize its impact by providing more accurate estimates for each subgroup.
      (1 vote)
  • Musical Prodigy
    Is it possible to have a stratified clustered sample?
    (1 vote)
    • daniella
      Yes, it is possible to have a stratified clustered sample. This would involve first stratifying the population based on relevant characteristics (e.g., year level, gender) and then randomly selecting clusters within each stratum. Finally, all individuals within the selected clusters would be included in the sample. This approach combines the benefit of stratification (ensuring representation of important subgroups) with the benefit of clustering (efficient sampling of groups).
      (1 vote)
  • Hannah Banana.
    What is bias, exactly?
    (1 vote)
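
The proportional allocation raised in the thread above (sampling more seniors when there are more seniors than juniors) comes down to a little arithmetic. Here is a minimal Python sketch of it. The year sizes are hypothetical and not from the video, and the largest-remainder rounding is just one assumed way of making the counts add up to exactly 100.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to each stratum's size.

    Assumption: leftover seats from rounding down go to the strata with the
    largest remainders (largest-remainder rounding).
    """
    total = sum(stratum_sizes.values())
    allocation = {}
    remainders = {}
    for name, size in stratum_sizes.items():
        # Integer arithmetic: whole seats plus the remainder of the exact share.
        allocation[name], remainders[name] = divmod(sample_size * size, total)

    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical school with 50% more seniors than juniors.
sizes = {"freshmen": 2000, "sophomores": 2000, "juniors": 2000, "seniors": 3000}
allocation = proportional_allocation(sizes, 100)
print(allocation)
# {'freshmen': 22, 'sophomores': 22, 'juniors': 22, 'seniors': 34}
```

With these made-up sizes, seniors are one third of the population, so they get about one third of the 100 survey slots instead of a flat 25.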

Video transcript

- [Instructor] Let's say that we run a school, and in that school there is a population of students. We want to get a sense of how these students feel about the quality of math instruction at the school, so we construct a survey, and we need to decide who is actually going to answer it. One option is to go to every member of the population, but let's say it's a really large school. Let's say we're a college and there are 10,000 people in the college. We can't talk to everyone, so instead we sample this population to get an indication of how the entire school feels.
Now, in order to avoid bias in our responses, and to give the sample the best chance of being indicative of the entire population, we want our sample to be random. So our sample could either be random or not random. It might seem, at first, pretty straightforward to do a random sample, but when you actually get down to it, it's not always as straightforward as you would think.
One type of random sample is a simple random sample. Here we assign a number to every person in the school (maybe they already have a student ID number), and we use a computer, a random number generator, to pick the 100 students that we're going to give the survey to. That would be a simple random sample. We are going into the whole population and randomly picking people out, and we know it's random because a random number generator, or a string of random numbers or something like that, is what picks the students.
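
As a rough illustration of the simple random sample just described, here is a minimal Python sketch. It assumes the 10,000 students carry ID numbers 1 through 10,000 (a made-up numbering), and the function name `simple_random_sample` is hypothetical. The fixed seed is only there to make the draw repeatable.

```python
import random


def simple_random_sample(student_ids, sample_size, seed=None):
    """Pick sample_size distinct students, each equally likely to be chosen."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no student is picked twice.
    return rng.sample(student_ids, sample_size)


# Hypothetical population: student IDs 1..10000.
population = list(range(1, 10_001))
sample = simple_random_sample(population, 100, seed=42)
print(len(sample), len(set(sample)))  # 100 100
```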
Now that's pretty good. It's unlikely that you're going to have bias from this sample, but there is some probability that, just by chance, your random number generator happens to select a disproportionate number of boys over girls, or a disproportionate number of freshmen, or a disproportionate number of engineering majors versus English majors. So even though you are taking a simple random sample that is truly random, there's some probability that it's not indicative of the entire population. To mitigate that, there are other techniques at our disposal.
One technique is a stratified sample. This is the idea of taking our entire population and stratifying it. We take that same population (I'll draw it as a square here just for convenience), and let's say we're concerned about getting an appropriate sample of freshmen, sophomores, juniors, and seniors. So we stratify by year, and instead of sampling 100 out of the entire pool, we sample 25 from each of these groups. That makes sure that you are getting responses from all of the different age groups or levels within your university.
Now there might be another issue where you say, well, I'm actually more concerned that we have accurate representation of males and females in the school. If I pick 100 random people, it's very likely that the split is close to 50/50, but there's some chance, just due to randomness, that the sample is disproportionately male or disproportionately female. That's even possible in the stratified case. And so what you might say is, well, you know what I'm gonna do?
I'm going to use a technique called a clustered sample, where what we do is sample groups, each of which we feel confident has a good balance of males and females. So instead of sampling individuals from the entire population, let's say we can split our population into groups (maybe these are classrooms), and each of these classrooms has an even, or pretty close to even, distribution of males and females. As you can tell, finding such groups is not a trivial thing to do. Then we sample the actual classrooms. That's why it's called a cluster technique, or clustered random sample: we randomly sample our classrooms, each of which has a close or maybe an exact balance of males and females, so we know that we're gonna get good representation, and then we survey every single person in each of the selected classrooms.
So, once again, these are all forms of random samples: you have the simple random sample, you can stratify, or you can cluster, randomly pick the clusters, and then survey everyone in those clusters.
Now if these are all random samples, what are the non-random ones like? Well, one case of non-random is a voluntary survey, or voluntary sample. This might just be you telling every student at the school, "Hey, here's a web address. If you're interested, come and fill out this survey."
And that's likely to introduce bias, because maybe the students who really like the math instruction at their school are more likely to fill it out, maybe the students who really don't like it are more likely to fill it out, or maybe it's just the kids who have more time who are more likely to fill it out. So this has a good chance of introducing bias. The students who fill out the survey might be skewed one way or the other because they volunteered for it.
Another non-random sample is one where you introduce bias because of convenience, which is the term that's often used. This might say, well, let's just sample the first 100 students who show up at school. That's convenient for me because I don't have to use random numbers, or do the stratification, or do any of this clustering, but you can understand how this also would introduce bias. The first 100 students who show up at school might be the most diligent students, or maybe they all take an early math class that has a very good instructor, so they're all happy about it. Or it might go the other way: the instructor there isn't the best one, and so it might introduce bias the other way. So if you let people volunteer, or you say, "Oh, let me do the first N students," or you say, "Hey, let me just talk to all of the students who happen to be in front of me right now," they might be in front of you out of convenience, but they might not be a true random sample.
Now there are other reasons why you might introduce bias, and it might not be because of the sampling. You might introduce bias because of the wording of your survey. You could imagine a survey that says, do you consider yourself lucky to get a math education that very few other people in the world have access to? Well, that might bias you to say, "Well, yeah, I guess I feel lucky."
Well, if the wording was, do you like the fact that disproportionately more students at your school tend to fail algebra than at our surrounding schools? Well, that might bias you negatively. So the wording really, really matters in surveys, and there is a lot that goes into this.
The other one is called response bias. And, once again, this isn't about... This is just people not wanting to tell the truth, or maybe not wanting to respond at all. Maybe they're afraid that somehow their response is gonna show up in front of their math teacher or the administrators, or that if they're too negative, it might be taken out on them in some way. Because of that, they might not be truthful, and so they might be overly positive or not fill it out at all.
So anyway, this is a very high-level overview of how you could think about sampling. You want to go random because it lowers the probability of introducing bias. These are some techniques for doing that. And also think about whether you're falling into some of these pitfalls that have a good chance of introducing bias.
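
The stratified and clustered techniques from the transcript can be sketched side by side in Python. This is only an illustration under made-up assumptions: four years of 2,500 students each, 400 classrooms of 25 students, and hypothetical helper names `stratified_sample` and `cluster_sample`. None of these figures come from the video apart from the 25-per-year draw.

```python
import random


def stratified_sample(strata, per_stratum, rng):
    """Draw per_stratum students at random from every stratum."""
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, num_clusters, rng):
    """Randomly pick whole clusters, then survey everyone inside them."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    surveyed = [student for name in chosen for student in clusters[name]]
    return chosen, surveyed


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical strata: 2,500 students in each of the four years.
years = ["freshman", "sophomore", "junior", "senior"]
strata = {year: [f"{year}-{i}" for i in range(2500)] for year in years}
by_year = stratified_sample(strata, 25, rng)  # 25 from each year, 100 total

# Hypothetical clusters: 400 classrooms of 25 students each.
classrooms = {f"room-{r}": [f"room-{r}-student-{s}" for s in range(25)]
              for r in range(400)}
chosen_rooms, surveyed = cluster_sample(classrooms, 4, rng)

print({year: len(picked) for year, picked in by_year.items()})
print(len(chosen_rooms), len(surveyed))
```

The difference shows up in where the randomness sits: the stratified draw samples individuals inside every group, while the cluster draw samples the groups themselves and then takes everyone in the chosen ones.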