Recall that in the Frequentist world, we assume the population is infinitely large and unknowable. In addition, we may assume that it follows a particular parametric distribution. For example, if our scenario consists of looking for bird presence/absence in a set of 20 forest patches, our ‘population’ is simply an infinite set of possible observations of the number of presences we might make in the patches. It may be reasonable to use a binomial distribution to model our population.
Can you recall the assumptions of the binomial distribution?
The binomial distribution has two parameters: n (number of trials) and p (probability of success).
The probability mass function of a binomial distribution with \(n = 20\) and \(p = 0.1\) would look like:
x = 0:20
barplot(
  dbinom(x, size = 20, prob = 0.1),
  names.arg = x, space = 0,
  main = "Binomial PMF: n = 20, p = 0.1",
  ylab = "Pr(x)", xlab = "x = n successes")
Things to note:
We know that in Frequentism we assume the population is infinite, but for the purposes of this activity we need a finite collection of numbers, so we'll use R's random number generating capabilities.
Let’s make a ‘Population’ of 1 million binomially-distributed numbers:
set.seed(12345)
sim_population = rbinom(n = 1000000, size = 20, prob = 0.1)
A histogram of our population:
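The call that produced the histogram isn't shown above; here's a minimal sketch (the title and labels are my choices, and the population is recreated with the same seed as above):

```r
# Recreate the population (same seed as above)
set.seed(12345)
sim_population = rbinom(n = 1000000, size = 20, prob = 0.1)

# Histogram of the finite 'population'
hist(
  sim_population,
  main = "Simulated Population",
  xlab = "x = n successes")
```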
It closely resembles our binomial probability mass function above, which is a good thing!
max(sim_population)
## [1] 12
We hope that when we create a randomized sampling scheme, we will obtain a representative sample.
We also know that due to sampling error, it’s possible that purely by chance we’ll get an unrepresentative sample, and that as our sample size grows, the sampling error will shrink.
Let’s take some samples from our population to build some intuition.
Here’s an example of sampling 20 observations with replacement:
set.seed(5431213)
sim_sample = sample(sim_population, size = 20, replace = TRUE)

hist(
  sim_sample + 0.00001,  # tiny offset keeps integer values off the bin boundaries
  main = "sample size = 20", xlab = "x")
If you didn’t already know, it would be hard to tell that this sample came from a binomially-distributed population.
Now you should modify the code to try different sample sizes.
Things to note:
Calculate the standard deviation (sd()) and mean of your samples. What do you expect to happen to these values as you increase your sample size?

Now we can use our population and samples to explore the sampling distribution.
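One way to build that intuition is a quick loop over increasing sample sizes; this is just a sketch, and the specific sizes and seed are arbitrary choices:

```r
# Recreate the population (same seed as above)
set.seed(12345)
sim_population = rbinom(n = 1000000, size = 20, prob = 0.1)

# Compare the mean and standard deviation of samples of increasing size
for (n in c(20, 100, 1000, 10000))
{
  samp = sample(sim_population, size = n, replace = TRUE)
  cat("sample size:", n, " mean:", mean(samp), " sd:", sd(samp), "\n")
}
```

As the sample size grows, the sample mean and sd settle near the population values (np = 2 and sqrt(np(1 - p)) ≈ 1.34); it's the standard error of the mean, not the sample sd, that shrinks.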
I’ve provided a function you can use to explore the sampling distribution:
mean_sampler = function(pop, sample_size, n_means)
{
  # pre-allocate a results vector
  means = vector(mode = "numeric", length = n_means)

  # sampling loop
  for (i in 1:n_means)
  {
    samp = sample(pop, size = sample_size, replace = TRUE)
    means[i] = mean(samp)
  }
  return(means)
}
Here’s an example application with sample size 30 and 200 iterations:
sample_means = mean_sampler(
  pop = sim_population,
  sample_size = 30,
  n_means = 200)

hist(
  sample_means,
  main = "Distribution of Sample Means\nsample size: 30, number of means: 200",
  xlab = "sample mean")
Here’s what the results look like with 1000 iterations:
Here are some additional simulations with different sample sizes and numbers of means:
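The code for those figures isn't shown; here's a sketch of how such panels could be generated. The sample sizes below are my choices, not necessarily the ones used, and I've used replicate() instead of the mean_sampler() function so the example is self-contained:

```r
# Recreate the population (same seed as above)
set.seed(12345)
sim_population = rbinom(n = 1000000, size = 20, prob = 0.1)

# A 2 x 2 grid of sampling distributions for increasing sample sizes
par(mfrow = c(2, 2))
for (n in c(5, 30, 100, 500))
{
  means = replicate(
    1000, mean(sample(sim_population, size = n, replace = TRUE)))
  hist(
    means,
    main = paste("sample size:", n, "- number of means: 1000"),
    xlab = "sample mean")
}
```

Notice how the spread of the sample means narrows as the sample size increases.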
You could also modify the code to build a population from a different distribution than sim_population. For example, you could try an exponential distribution with a rate parameter of 0.2: use the rexp() function to make your exponentially-distributed population, or the rgamma() function to make a gamma-distributed population.

This activity is not graded; it's for your own intuition-building.
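Here's a sketch with an exponential population (again using replicate() rather than mean_sampler() to keep the example self-contained; the seed and sample size are arbitrary choices):

```r
# An exponentially-distributed 'population' with rate = 0.2
set.seed(12345)
exp_population = rexp(n = 1000000, rate = 0.2)

# 200 sample means, each from a sample of size 30
exp_means = replicate(
  200, mean(sample(exp_population, size = 30, replace = TRUE)))

# Despite the strongly skewed population, the sample means
# look roughly bell-shaped
hist(
  exp_means,
  main = "Sample Means: Exponential Population",
  xlab = "sample mean")
```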
I encourage you to explore and answer the following:
I encourage you to check out the Central Limit Theorem part of the Probability Distributions page at Seeing Theory. It provides an excellent visualization of the sampling distribution using the beta distribution as the source.
The whole site is excellent.
I especially recommend you check out the confidence intervals page.