Learning Objectives

  • Explore the relationships between populations, samples, and the sampling distribution.
  • Practice random-number generation in R.
  • Develop intuition about the sampling distribution.

Populations

Recall that in the Frequentist world, we assume the population is infinitely large and unknowable. In addition, we may assume that it follows a particular parametric distribution. For example, if our scenario consists of looking for bird presence/absence in a set of 20 forest patches, our ‘population’ is simply an infinite set of possible observations of the number of presences we might make in the patches. It may be reasonable to use a binomial distribution to model our population.

Do you recall the assumptions of the binomial distribution?

  • A set of n trials with a binary outcome.
  • The trials are independent.
  • Each trial has the same probability of success.

The binomial distribution has two parameters: n (number of trials) and p (probability of success).

The probability mass function of a binomial distribution with \(n = 20\) and \(p = 0.1\) would look like:

x = 0:20

barplot(
  dbinom(x, size = 20, prob = 0.1),
  names.arg = x, space = 0,
  main = "Binomial PMF: n = 20, p = 0.1",
  ylab = "Pr(x)", xlab = "x = n successes")

Things to note:

  • The distribution is not symmetrical; it’s right-skewed.
  • It’s a discrete distribution.

Create a ‘Population’

We know that in Frequentism we assume the population is infinite, but for the purposes of this activity we need a finite collection of numbers, so we’ll use R’s random number generating capabilities.

Let’s make a ‘Population’ of 1 million binomially-distributed numbers:

set.seed(12345)
sim_population = rbinom(n = 1000000, size = 20, prob = 0.1)

A histogram of our population:
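The rendered figure isn’t included here; a minimal sketch that reproduces it (the half-integer break points are my choice, so each count gets its own bar):

```r
# Histogram of the simulated population.
# Breaks at half-integers keep each integer count in its own bar.
hist(
  sim_population,
  breaks = seq(-0.5, max(sim_population) + 0.5, by = 1),
  main = "Simulated Population: n = 20, p = 0.1",
  xlab = "x = n successes")
```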

It closely resembles our binomial probability mass function above, which is a good thing!

  • Also note that the maximum x value is 12. That means that even though the domain of a binomial distribution with \(n = 20\) and \(p = 0.1\) goes from 0 to 20, the probability of observing counts greater than 12 is so low that our ‘population’ of one million doesn’t contain any!
max(sim_population)
## [1] 12

Samples

We hope that when we create a randomized sampling scheme we will obtain a representative sample.

We also know that, due to sampling error, it’s possible that purely by chance we’ll get an unrepresentative sample. As our sample size grows, the sampling error shrinks.

Let’s take some samples from our population to build some intuition.

Here’s an example of sampling 20 observations with replacement:

set.seed(5431213)
sim_sample = sample(sim_population, size = 20, replace = TRUE)

# the small offset keeps integer values from landing exactly on bin edges
hist(sim_sample + 0.00001, main = "sample size = 20", xlab = "x")

If you didn’t already know, it would be hard to tell that this sample came from a binomially-distributed population.

Now you should modify the code to try different sample sizes.

Things to note:

  • As you increase the sample size, does the distribution of the sample resemble a normal distribution?
  • How big does your sample have to be for the histogram of the sample to resemble the population?
  • Try calculating the standard deviation (R function sd()) and mean of your samples. What do you expect to happen to these values as you increase your sample size?
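To make the last point concrete, here’s a sketch that compares sample statistics at a few sample sizes, using the sim_population created above (the seed and the particular sizes are my choices):

```r
# Compare the sample mean and standard deviation at increasing sample sizes.
set.seed(42)  # seed chosen for this sketch, not from the activity
for (n in c(20, 200, 2000))
{
  samp = sample(sim_population, size = n, replace = TRUE)
  cat("n =", n, " mean =", round(mean(samp), 3),
      " sd =", round(sd(samp), 3), "\n")
}
```

Since the population mean is \(np = 2\) and the variance is \(np(1 - p) = 1.8\), you should see the estimates stabilize near 2 and \(\sqrt{1.8} \approx 1.34\) as the sample size grows.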

Sampling Distribution

Now we can use our population and samples to explore the sampling distribution.

I’ve provided a function you can use to explore the sampling distribution:

mean_sampler = function(pop, sample_size, n_means)
{
  # pre-allocate a results vector
  means = vector(mode = "numeric", length = n_means)
  
  # sampling loop
  for (i in 1:n_means)
  {
    samp = sample(pop, size = sample_size, replace = TRUE)
    means[i] = mean(samp)
  }
  
  return(means)
}

Here’s an example application with sample size 30 and 200 iterations:

sample_means = mean_sampler(
  pop = sim_population, 
  sample_size = 30,
  n_means = 200)

hist(
  sample_means,
  main = "Distribution of Sample Means\nsample size: 30, number of means: 200",
  xlab = "sample mean")

Here’s what the results look like with 1000 iterations:
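The figure itself is omitted here; re-running the simulation with n_means = 1000 reproduces it (the result name is mine):

```r
# Same sample size, more iterations: the histogram of means smooths out.
sample_means_1000 = mean_sampler(
  pop = sim_population,
  sample_size = 30,
  n_means = 1000)

hist(
  sample_means_1000,
  main = "Distribution of Sample Means\nsample size: 30, number of means: 1000",
  xlab = "sample mean")
```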

Here are some additional simulations with different sample sizes and numbers of means:
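Those figures aren’t reproduced here; a sketch that generates a comparable grid of them (the particular sizes and iteration counts are my choices):

```r
# A 2x2 grid of sampling distributions for two sample sizes
# and two numbers of means.
par(mfrow = c(2, 2))
for (sz in c(5, 30))
  for (nm in c(100, 1000))
    hist(
      mean_sampler(sim_population, sample_size = sz, n_means = nm),
      main = paste0("sample size: ", sz, ", number of means: ", nm),
      xlab = "sample mean")
par(mfrow = c(1, 1))
```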

Things to try

  • Set the sample size to a small number, say 2 or 3. What do you notice about the sampling distribution of the mean? Does it look normal? Does it look like the parent distribution?
  • Try using a different distribution to create sim_population. For example, you could try an exponential distribution with rate parameter of 0.2:

  • Check out the rexp() function to make your exponentially-distributed population.
  • A good continuous distribution to experiment with is the gamma. Here’s what the PDF looks like for a gamma distribution with shape parameter = 2 and rate parameter = 0.3:

  • Check out the rgamma() function to make your gamma-distributed population.
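As a sketch of the two alternative populations, using the parameter values from the bullets above (the population size and plotting ranges are my choices):

```r
# Exponential population: rate = 0.2
sim_pop_exp = rexp(n = 1000000, rate = 0.2)

# Gamma population: shape = 2, rate = 0.3
sim_pop_gamma = rgamma(n = 1000000, shape = 2, rate = 0.3)

# The corresponding PDFs, for reference
curve(dexp(x, rate = 0.2), from = 0, to = 30,
      main = "Exponential PDF: rate = 0.2", ylab = "f(x)")
curve(dgamma(x, shape = 2, rate = 0.3), from = 0, to = 30,
      main = "Gamma PDF: shape = 2, rate = 0.3", ylab = "f(x)")
```

Either population vector can be passed to mean_sampler() in place of sim_population.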

Questions

This activity is not graded; it’s for your own intuition-building.

I encourage you to explore and answer the following:

  • What happens to the sample standard deviation as you increase the sample size?
  • What happens to the sample standard error as you increase the sample size?
  • What happens to the histogram of sample means as your sample size gets larger or smaller?
  • Compare the population histogram to a histogram of sample means calculated using a sample size of 1. Notice anything unusual?
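For the standard error question, recall that the standard error of the mean is \(SE = s / \sqrt{n}\). A quick sketch using the sim_population from above (the seed and sample sizes are my choices):

```r
# Standard error of the mean at increasing sample sizes.
set.seed(99)  # seed chosen for this sketch
for (n in c(10, 100, 1000))
{
  samp = sample(sim_population, size = n, replace = TRUE)
  se = sd(samp) / sqrt(n)
  cat("n =", n, " SE =", round(se, 4), "\n")
}
```

Unlike the sample standard deviation, which stabilizes near the population value, the standard error keeps shrinking as n grows.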

Seeing Theory

I encourage you to check out the Central Limit Theorem part of the Probability Distributions page at Seeing Theory. It provides an excellent visualization of the sampling distribution using the beta distribution as the source.

The whole site is excellent.

I especially recommend you check out the confidence intervals page.