Data

For this lab, you’ll need the data files:

  • bird.sub.csv
  • hab.sub.csv

You’ll need to read the two data files and use merge() to combine them into a single data.frame object.

You need to merge on the columns ‘basin’ and ‘sub’.

  • Save the merged data frame to a variable called birdhab. Make sure to call it birdhab so that I can run your code!
  • Your merged data.frame should have the following dimensions:
dim(birdhab)
## [1]  30 159
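
Here’s a minimal sketch of one way to read and merge the files. It assumes the CSV files sit in a “data” subdirectory of your project; the paths and the intermediate variable names are just illustrations, so adjust them to match your setup:

dat_bird = read.csv(here::here("data", "bird.sub.csv"))
dat_hab = read.csv(here::here("data", "hab.sub.csv"))

# Merge on the shared 'basin' and 'sub' columns:
birdhab = merge(dat_bird, dat_hab, by = c("basin", "sub"))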

We’ll be working with the standardized Brown Creeper abundance (column BRCR) and late-successional forest (column ls) data within birdhab.

Introduction

The goals and objectives of this lab include

  • Introduce you to simulating environmental patterns.
  • Sharpen your intuition about statistical power.
  • Learn how to use simulation to explore statistical power.
  • Learn how to plot gridded (raster) data.
  • Learn how to create static and interactive 3D plots.

Simulation Modeling

Simulation is sometimes called forward modeling, to emphasize that you pick a model and parameters and work forward to predict patterns in the data. Environmental scientists use simulation to explore what kinds of patterns emerge from proposed models.

Often they use theoretical models without accompanying data, in order to understand qualitative patterns and plan future studies.

Simulation techniques are especially useful when you have data: you can use data both to guide the structure of your simulation and to calibrate/validate your simulation model.

Some powerful uses of simulation modeling include:

  • Explore potential statistical models of your data: You can try out different deterministic and stochastic models and compare them to real data.
    • Finding models and parameter values that produce data resembling your own helps confirm the validity of your choice of models.
  • Test your estimation procedures: Since we use imperfect measurements and models to answer environmental questions, simulation allows us to try out different procedures to get as close to the true model as possible.
    • Simulation allows you to test whether your statistical models can reliably estimate the true parameters of an environmental system.

It’s always a good idea to test a best-case scenario, where you know that the functions and distributions you’re using are correct, before you proceed to real data.

There are many different types of simulation models; we’ll cover one type in this lab.

Power Analysis

Power analysis using simulation models allows you to explore how different experimental designs and sample sizes might affect your ability to get a reasonably precise estimate of your parameters.

We’ll work through an example of a power analysis simulation in the context of a linear regression.

Simulating Static Environmental Processes: Linear Regression

Static environmental processes, where the data represent a snapshot of some environmental system, are relatively easy to simulate.

Keeping in mind the dual model paradigm, we can specify:

  • a deterministic function to simulate the average behavior in our system
  • a stochastic model to simulate the variability in the system.

Our deterministic model might be very simple, for example a linear function.
Alternatively, we could use any of the more complex mechanistic or phenomenological functions we have studied, such as the Ricker or logistic functions.

For our stochastic model, we could use R’s random number generating functions from distributions like the Normal (rnorm()), Poisson (rpois()), or binomial (rbinom()).

Here we will illustrate the process of simulating a static environmental process using a simple linear regression model based on the now familiar Oregon birds data set.

For this example, let’s examine the relationship between brown creeper abundance and the extent of late-successional forest across 30 sub-basins in the central Oregon Coast Range.

Graphical Exploration

  • Create a scatterplot of brown creeper abundance and the extent of late-successional forest (one possible approach is sketched after this list):

  • Does the relationship look linear?
  • Examining the scatterplot, do you see any potential challenges for a Group 1 model?
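
Here’s a minimal sketch of one way to make the scatterplot (assuming the merged birdhab data frame from above):

plot(
  BRCR ~ ls, data = birdhab,
  xlab = "Late-successional forest extent",
  ylab = "Brown Creeper abundance",
  pch = 19)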

Fit a model

Recall that a simple linear regression model can be written symbolically as: \(Y \sim Normal(a + bx, \sigma^2)\)

This means that the ith value of Y, \(y_i\), is equal to \(a + bx_i\) plus a normally distributed error with mean zero and variance \(\sigma^2\).

  • Fit a simple linear regression using lm(). Save your model to a variable: fit_1.
  • Plot the regression line over the scatterplot using abline().
  • Examine the model coefficient table.
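
Here’s a minimal sketch of these steps (assuming the merged birdhab data frame):

# Fit the simple linear regression:
fit_1 = lm(BRCR ~ ls, data = birdhab)

# Plot the regression line over the scatterplot:
plot(BRCR ~ ls, data = birdhab, pch = 19)
abline(fit_1)

# Examine the coefficient table:
summary(fit_1)$coefficients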

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   0.0991      0.0508    1.952     0.061
ls            0.0058      0.0009    6.705     0.000

Does the model suggest a significant relationship between late-successional forest and Brown Creeper abundance?

  • Can we say with confidence that it is real or could it be a sampling artifact or a product of chance?

Recall that the data-collection effort was one realization of the stochastic process of sampling. There are many, many other possible observations that could have been sampled!

Stochastic simulation can help us explore what the data might look like if we could take another snapshot.

We can use simulation to build new datasets using the parameters estimated from the observed data.

We can also vary the parameter values, or even simulate data under alternative models!

Simulator Function

We can simulate the system using our model and see what kind of range of data we might observe.

This can give us great insight into the level of certainty we have in our model.

This is in fact the theoretical basis for the Frequentist approach to statistical inference, in which we evaluate the likelihood of our data given the model; implicitly, we ask how likely we would be to observe our original data if we were to repeatedly sample the system.

This is exactly what simulation allows us to do: repeatedly sample the environmental system under the assumption that the model is the truth.

Components of the model

To generate data based on a Group 1 Simple Linear Regression, what model parts do we need to know?

  • Parameters for the deterministic model:
    • y-intercept
    • slope coefficient
  • Parameter for the stochastic model: \(\sigma\)

Deterministic Model: Linear Function

Let’s build an R function to calculate the value of a linear function given a value of x, a slope parameter, and an intercept parameter.

Recall the equation for a line: \(y = \alpha + \beta x\)

  • Create an R function, linear(), that accepts three named arguments:
    1. x: a vector of x-values
    2. y_int: the y-intercept
    3. slope: the slope of the line
  • Your function needs to return a numeric vector the same length as x (a single number when x is a single number).
  • Your function must be able to handle values of x that are single numbers or numeric vectors.

Have we built a similar function in a prior lab or lecture assignment?
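
If you get stuck, here’s a minimal sketch of one possible implementation; R’s vectorized arithmetic handles single values and vectors of x alike:

linear = function(x, y_int, slope)
{
  # y = intercept + slope * x, vectorized over x
  return(y_int + slope * x)
}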

Self test

To test your implementation, you can compare your function’s outputs to these reference outputs:

Click to show/hide
linear(x = 1, y_int = 1, slope = 1)
linear(x = 3:5, y_int = 1, slope = 1)
linear(x = 3:5, y_int = -1, slope = 1)
linear(x = 3:5, y_int = -1, slope = 0.01)
## [1] 2
## [1] 4 5 6
## [1] 2 3 4
## [1] -0.97 -0.96 -0.95

Stochastic Model: Normal Distribution

This part is easy: it’s just rnorm() with appropriate arguments for the mean and standard deviation!
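
For example, here’s a quick sketch that draws 5 Normally-distributed errors with mean 0 and standard deviation 0.1:

rnorm(n = 5, mean = 0, sd = 0.1)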

Simulation Function

Now you can assemble your deterministic and stochastic functions into a single data simulator.

  • Create a function, linear_simulator(), that will generate random data based on a linear deterministic function with normally-distributed errors.
  • Your function needs to accept the following four named arguments:
    1. x: a vector of x-values
    2. y_int: the y-intercept
    3. slope: the slope of the line
    4. st_dev: the standard deviation of the stochastic model

There are many approaches your implementation could use. A simple plan might be:

  1. Generate the y-values on the line.
  2. Add normally-distributed errors and return.

You are free to implement linear_simulator() any way you like…as long as it works correctly.

  • I’ve provided a self-test below.
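
Here’s a minimal sketch of the simple plan above (it assumes your linear() function from earlier):

linear_simulator = function(x, y_int, slope, st_dev)
{
  # Deterministic part: the y-values on the line.
  y_det = linear(x, y_int = y_int, slope = slope)
  
  # Stochastic part: add Normally-distributed errors.
  return(y_det + rnorm(n = length(x), mean = 0, sd = st_dev))
}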

Test Your Simulator Function

Since linear_simulator() will return different values each time, it’s harder to check your implementation than that of a strictly deterministic function. For now, you can use a graphical validation approach.

Run the code below and compare your plots to the reference plots.

These plots use the following parameter values:

  • y intercept of 1.0
  • slope of 4.5
  • standard deviation of 0.1
n = 200

par(mfrow = c(2, 2), mar = c(1, 1, 1, 1))
for (i in 1:4)
{
  x = runif(n = n)
  plot(
    x,
    linear_simulator(x, y_int = 1, slope = 4.5, st_dev = 0.1),
    main = "", xlab = "x", ylab = "y",
    pch = 16, col = rgb(0, 0.2, 0, 0.2),
    axes = FALSE)
  box()
}

Now compare your implementation to the reference using different slope, intercept, and standard deviation parameters:

  • y intercept of 10
  • slope of -6.5
  • standard deviation of 1.1
n = 400

par(mfrow = c(2, 2), mar = c(1, 1, 1, 1))
for (i in 1:4)
{
  x = runif(n = n)
  plot(
    x, linear_simulator(x, y_int = 10, slope = -6.5, st_dev = 1.1),
    main = "", xlab = "x", ylab = "y",
    pch = 16, col = rgb(0, 0.2, 0, 0.2),
    axes = FALSE)
  box()
}

Your generated data won’t be exactly the same as mine, but your plots should look similar to the reference plots.

  • Highly detailed or multi-panel figures often display poorly in RStudio’s plots pane. Try the following to get a better view of your figures:
    • Press the zoom button on the top bar of the plot panel. You can resize or maximize your figure.
    • Save your figure to an image file and view in an image application.

Build the simulation

Retrieve the model coefficients

To simulate new data, you need to retrieve the intercept, slope, and standard deviation from the model you fit to the data. You can use coefficients() to extract the intercept and slope values. Use str() to examine the output of coefficients() to see how the values are stored.

fit_1_coefs = coefficients(fit_1)
str(fit_1_coefs)
##  Named num [1:2] 0.0991 0.00584
##  - attr(*, "names")= chr [1:2] "(Intercept)" "ls"

Retrieving the standard deviation parameter is slightly more difficult:

  • Create a model summary object using summary().
  • The value you want is stored in an element called sigma. You can extract it using the dollar sign.
fit_1_summary = summary(fit_1)
str(fit_1_summary)
fit_1_summary$sigma
## List of 11
##  $ call         : language lm(formula = BRCR ~ ls, data = birdhab)
##  $ terms        :Classes 'terms', 'formula'  language BRCR ~ ls
##   .. ..- attr(*, "variables")= language list(BRCR, ls)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "BRCR" "ls"
##   .. .. .. ..$ : chr "ls"
##   .. ..- attr(*, "term.labels")= chr "ls"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(BRCR, ls)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "BRCR" "ls"
##  $ residuals    : Named num [1:30] -0.0747 -0.1297 0.0256 0.1255 0.1009 ...
##   ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 0.099104 0.00584 0.050764 0.000871 1.952266 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "ls"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "ls"
##  $ sigma        : num 0.141
##  $ df           : int [1:3] 2 28 2
##  $ r.squared    : num 0.616
##  $ adj.r.squared: num 0.602
##  $ fstatistic   : Named num [1:3] 45 1 28
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.129129 -0.001909 -0.001909 0.000038
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "ls"
##   .. ..$ : chr [1:2] "(Intercept)" "ls"
##  - attr(*, "class")= chr "summary.lm"
## [1] 0.1412668

Store the intercept, slope, and standard deviation parameters from the model into the following variables:

  • int_obs
  • slope_obs
  • sd_obs
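
Here’s a minimal sketch of one way to store them (it assumes fit_1_coefs and fit_1_summary from above):

int_obs = fit_1_coefs[1]
slope_obs = fit_1_coefs[2]
sd_obs = fit_1_summary$sigma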

Choose Predictor Values

The first goal of the simulation is to generate some new “data” using our model.

We have several options:

  • Allow x (the late-successional forest extent) to vary randomly between 1 and 100.
  • Use the observed x values.
  • Resample the observed x values.

Any of these approaches is legitimate; for our purposes, let’s keep the original values of x and let brown creeper abundance vary among simulations.

Simulate Data

Now that you know the model parameter values and have a strategy for choosing x-values, you can create some sampled Brown Creeper data!

  • Try running the following code a few times to view different simulation results:
plot(
  x = birdhab$ls, 
  y = linear_simulator(
    x = birdhab$ls,
    y_int = int_obs,
    slope = slope_obs,
    st_dev = sd_obs
  ),
  main = "Simulated Data",
  xlab = "late-successional forest",
  ylab = "Brown Creeper Abundance")

Alternatively, you could plot the observed data first, then add the simulated data to the existing plot:

plot(
  birdhab$ls, birdhab$BRCR, 
  xlab = "late-successional forest extent",
  ylab = "Brown Creeper abundance",
  pch = 19)

points(
  x = birdhab$ls, 
  y = linear_simulator(
    x = birdhab$ls,
    y_int = int_obs,
    slope = slope_obs,
    st_dev = sd_obs
  ),
  col = adjustcolor("red", alpha = 0.3),
  pch = 16)

legend(
  "topleft",
  legend = c("data", "simulation"),
  pch = 16,
  col = c(1, adjustcolor("red", alpha = 0.3)))

You might want to run the model a few more times to see how variable the results are.

  • Is the new pattern of points the same as the original pattern?
    • Are there any notable discrepancies?
    • Does the pattern of the simulated points vary much among realizations?

After running the model a few times, do you notice any potential problems with the model?

In other words, does the model reproduce the patterns in the original data perfectly or are there issues with the spread of values or with the generation of illogical values?

One problem with the use of the normal distribution is that it is unbounded on the lower limit.

Thus, negative values are possible.

In this case, because the y-intercept is close to 0, the simulation will occasionally produce negative values when x is at or near 0.

Since brown creeper abundance cannot be negative, this is an undesirable behavior of the model.

  • For now, we’ll ignore this discrepancy, but you should think about how you could modify linear_simulator() to correct the illogical values.
    • Hint: how could you use a logical test to help you?
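
For reference, here’s one possible (hypothetical) modification that uses a logical test to truncate negative simulated values at zero. It’s a sketch of one approach, not the only way to handle the issue:

linear_sim_truncated = function(x, y_int, slope, st_dev)
{
  y = linear_simulator(x, y_int = y_int, slope = slope, st_dev = st_dev)
  
  # Logical test: replace any negative abundances with 0.
  y[y < 0] = 0
  return(y)
}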

Power analysis for the linear regression model

Power analysis in the narrowest sense means figuring out the (frequentist) statistical power: the probability of correctly rejecting the null hypothesis when it is false.

While we are generally less concerned with power analysis in the conventional sense of hypothesis testing, we are very interested in the role of power analysis in addressing a much broader question:

  • How do the quality and quantity of the data, and the true properties (parameters) of the environmental system, affect the quality of the answers to our questions about environmental systems?

For any real experiment or observation situation, we don’t know what is really going on (the “true” model or parameters), so we don’t have the information required to answer these questions from the data alone.

Historically, questions about statistical power could only be answered by sophisticated analyses, and only for standard statistical models and experimental designs such as one-way ANOVA or linear regression.

Increases in computing power have extended power analysis to many new areas, and R’s capability to run repeated stochastic simulations is a great help.

Here, we will illustrate the use of stochastic simulation for power analysis using the linear regression model above.

Single Simulation

Let’s start by finding out whether we can reject the null hypothesis in a single experiment.

To do this we can use the following procedure:

  • Choose a critical p-value (alpha) to use as a rejection criterion. This is often 0.05.
  • Simulate a dataset with given intercept, slope, and number of observations.
  • Create a simple linear regression model and extract the p-value.
    • Recall that the p-value represents the probability of observing a pattern as extreme, or more extreme, as that observed in our data if the null hypothesis of no relationship were true.
  • Null hypothesis: brown creeper abundance is independent of late-successional forest, i.e. there is no habitat preference.
  • Compare the p-value from the model of simulated data to the critical p-value you chose above.
y_sim = linear_simulator(
  x = birdhab$ls,
  y_int = int_obs,
  slope = slope_obs,
  st_dev = sd_obs
)

fit_sim = lm(y_sim ~ birdhab$ls)
summary(fit_sim)
## 
## Call:
## lm(formula = y_sim ~ birdhab$ls)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.245329 -0.082111  0.002297  0.075614  0.263888 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0897077  0.0463911   1.934   0.0633 .  
## birdhab$ls  0.0057097  0.0007961   7.172 8.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1291 on 28 degrees of freedom
## Multiple R-squared:  0.6475, Adjusted R-squared:  0.635 
## F-statistic: 51.44 on 1 and 28 DF,  p-value: 8.33e-08

Extracting p-values from R analyses can be tricky.

In this case, the coefficients of the summary() of the linear fit are a matrix including the standard error, t statistic, and p-value for each parameter. We can use matrix indexing to pull out the specific value of interest.

Here’s the matrix of coefficients:

sum_1 = summary(fit_sim)
sum_1$coefficients
##                Estimate  Std. Error  t value     Pr(>|t|)
## (Intercept) 0.089707726 0.046391146 1.933725 6.331403e-02
## birdhab$ls  0.005709735 0.000796068 7.172422 8.330308e-08
  • Which matrix cell contains the p-value of interest?
Click to show/hide hint
sum_1$coefficients[2, 4]
## [1] 8.330308e-08

Repeated Simulations

To estimate the probability of successfully rejecting the null hypothesis when it is false (the power), we have to repeat this procedure many times and calculate the proportion of the time that we reject the null hypothesis.

First, we specify the number of simulations to run and set up a vector to hold the p-value for each simulation.

Then, we repeat what we did above saving the p-values to the storage vector.

  • We will re-use the same x-values, intercept, slope, and population standard deviation.

We can calculate the statistical power as the number of times we correctly rejected the null hypothesis divided by the total number of simulations.

  • In this case, we know the null hypothesis is false because we’re using a slope value different from zero.
  • How could you use this idea to calculate the Type II (false negative) error rate?

Here’s one possible implementation:

n_sims = 1000
p_vals = numeric(n_sims)
alpha = 0.05
for(i in 1:n_sims)
{
  y_sim = linear_simulator(
    x = birdhab$ls,
    y_int = int_obs,
    slope = slope_obs,
    st_dev = sd_obs
  )
  fit_sim = lm(y_sim ~ birdhab$ls)
  
  p_vals[i] = summary(fit_sim)$coefficients[2, 'Pr(>|t|)']
}
sum(p_vals < alpha) / n_sims
## [1] 1

Based on the simulation output:

  • How many simulations found a significant slope coefficient?
  • What is our statistical power?

This is the rate at which our chosen statistical model, a simple linear regression, can detect a slope of roughly b = 0.006 with a sample size of N = 30.

We’re going to be building a lot of models, so let’s write a quick function to save some time and make our simulation loops a little simpler:

linear_sim_fit = function(x, slope, y_int, st_dev)
{
  y_sim = linear_simulator(
    x = x,
    y_int = y_int,
    slope = slope,
    st_dev = st_dev
  )
  fit_sim = lm(y_sim ~ x)
  return(fit_sim)
}

Simulating Effect Sizes

Since we don’t know the true population values, we often want to estimate the statistical power over a range of model parameter values (or even different models).

We can repeat the simulation process multiple times while varying some parameters of the simulation such as the model slope or sample size.

  • Coding this in R usually involves nested for-loops.

This example simulation estimates statistical power as a function of the slope (the effect size).

  • Note that this may take a minute or two to run on your computer.
  • You can use a smaller value for n_sims to speed it up if needed.
alpha = 0.05
n_sims = 1000
p_vals = numeric(n_sims)

n_effect_sizes = 20
effect_sizes_1 = seq(-.01, .01, length.out = n_effect_sizes)

effect_size_powers = numeric(n_effect_sizes)

for(j in 1:n_effect_sizes)
{
  for(i in 1:n_sims)
  {
    fit_sim = linear_sim_fit(
      x = birdhab$ls,
      y_int = int_obs,
      slope = effect_sizes_1[j],
      st_dev = sd_obs
    )
    
    p_vals[i] = summary(fit_sim)$coefficients[2, 'Pr(>|t|)']
  }
  effect_size_powers[j] = sum(p_vals < alpha) / n_sims
}

sim_effect_size = 
  data.frame(
    effect_size = effect_sizes_1,
    power       = effect_size_powers)

Note that this code is just an elaboration of the code we used to perform the simulation above:

  • We created a sequence of effect sizes to try.
  • We nested the original code inside an outer loop that iterates over each of the effect sizes.
  • Instead of saving p-values directly, we saved a vector of statistical powers:
  • The p-values were calculated in the inner loop for a specific effect size.
  • The statistical powers were calculated in the outer loop and stored in effect_size_powers.
  • We saved the results as a data.frame with columns for effect size and statistical power.

We can plot the result and add a vertical line to show the slope of our original data set.

plot(
  power ~ effect_size, data = sim_effect_size,
  type = 'l', xlab = 'Effect size', ylab = 'Power')
abline(v = slope_obs, lty = 2, col = 'red')

What is the power for our observed effect size of 0.0058?
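
One way to read an approximate value off the simulated curve is linear interpolation with approx() (a sketch, assuming the sim_effect_size data frame and slope_obs from above):

approx(
  x = sim_effect_size$effect_size,
  y = sim_effect_size$power,
  xout = slope_obs)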

Simulating Sample Sizes

We can do the same thing for a gradient of sample sizes, as follows:

alpha = 0.05
n_sims = 1000
p_vals = numeric(n_sims)

sample_sizes = seq(5, 100)
sample_size_powers = numeric(length(sample_sizes))

# The maximum x value in the simulation.
# Use the maximum observed x-value in the data
max_x = max(birdhab$ls)

for(j in 1:length(sample_sizes))
{
  # A sequence of equally-spaced x-values:
  x_vals = seq(0, max_x, length.out = sample_sizes[j])
  
  for(i in 1:n_sims)
  {
    fit_sim = linear_sim_fit(
      x = x_vals,
      y_int = int_obs,
      slope = slope_obs,
      st_dev = sd_obs
    )
    p_vals[i] = summary(fit_sim)$coefficients[2, 'Pr(>|t|)']
  }
  sample_size_powers[j] = sum(p_vals < alpha) / n_sims
}


sim_sample_size = 
  data.frame(
    sample_size = sample_sizes,
    power       = sample_size_powers)
plot(
  power ~ sample_size, data = sim_sample_size,
  type = 'l', xlab = 'Sample size', ylab = 'Power')
abline(v = nrow(birdhab), lty = 2, col = 'red')

How much power is lost if we reduce the sample size from 30 to 20?
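
A quick way to check (a sketch, assuming the sim_sample_size data frame above) is to compare the simulated power values directly:

sim_sample_size$power[sim_sample_size$sample_size == 30] -
  sim_sample_size$power[sim_sample_size$sample_size == 20]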

Bivariate Power Analysis

We could repeat this process for other parameters such as the error component of the model, but you get the idea.

While we can do these power analysis simulations for one parameter at a time, it might be more interesting to vary combinations of parameters, say of slope and sample size, using yet another loop, saving the results in a matrix, and using contour() or persp() to plot the results.

Effect Size and Sample Size

Try the following:

alpha = 0.01
n_sims = 50

p_vals = numeric(n_sims)

n_effect_sizes = 20
effect_sizes = seq(-.01, .01, length.out = n_effect_sizes)

# The maximum x value in the simulation.
# Use the maximum observed x-value in the data
max_x = max(birdhab$ls)

sample_sizes = seq(10, 50)

sim_output_2 = matrix(nrow = length(effect_sizes), ncol = length(sample_sizes))

for(k in 1:length(effect_sizes))
{
  effect_size = effect_sizes[k]
  for(j in 1:length(sample_sizes))
  {
    x_vals = seq(0, max_x, length.out = sample_sizes[j])
    
    for(i in 1:n_sims)
    {
      fit_sim = linear_sim_fit(
        x = x_vals,
        y_int = int_obs,
        slope = effect_size,
        st_dev = sd_obs
      )
      p_vals[i] = summary(fit_sim)$coefficients[2, 'Pr(>|t|)']
    }
    sim_output_2[k, j] = sum(p_vals < alpha) / n_sims
  }
  print(paste0("computing effect size ", k," of ", length(effect_sizes)))
}

sim_n_effect_size = 
  list(
    power = sim_output_2,
    effect_size = effect_sizes,
    sample_size = sample_sizes
  )

Note that the only differences in this code are that we added a third, outer loop and created a matrix to store the results, since we have a power result for each combination of slope and sample size.

Our output power data are now stored in a matrix (instead of a vector), so we need a way to visualize 2-dimensional gridded data, i.e. raster data.

I’ve packaged the output power matrix, the effect sizes, and the sample sizes into a list for future use.

Plotting a Matrix

  • image() is a quick way to plot a matrix as if it were raster data, that is to plot a grid in which the pixel color is determined by the value of the matrix element.
image(
  sim_n_effect_size$power,
  xlab = "Effect size",
  ylab = "Sample Size",
  axes = FALSE)

# add x-axis labels
axis(
  1, 
  at = c(0, 0.5, 1), 
  labels = c(-.01, 0.0, .01))

# add y-axis labels
axis(
  2, 
  at = c(0, 1), 
  labels = c(sample_sizes[1], tail(sample_sizes, 1)))

I used larger values for n_sims and n_effect_sizes, and I used a sample size range from 2 to 200.

  • My simulation took about an hour to run.

Your plot will look different because you’re using smaller values.

Plotting 3-Dimensional Data

NOTE: your plots may look rougher or more pixellated than the figures in the lab walkthrough.

  • To make the walkthrough plots, I ran simulations with many reps and small increments in parameter values.
    • Some of my simulations took multiple hours to run.

I suggest you set n_sims and n_effect_sizes to small values as you are experimenting with the code. You can set them to higher values later when you are confident your code works.

You’ll want to make sure you save your simulation output to a file!

  • I’ve included some instructions below.

Contour plotting

Contour plots are similar to topographic maps.

The isolines (contour lines) show interpolated curves along which the value is the same.

Let’s plot the result using the contour() function:

  • contour() expects arguments for x, y, and z.
  • The x- and y- coordinates set up the axes. In our case we can set the x-axis to be effect size and the y-axis to be sample size.
  • The z-coordinates are a matrix whose cells contain the values for which to draw the contours.

Use contour() to re-create the following figure:

contour(
  x = sim_n_effect_size$effect_size,
  y = sim_n_effect_size$sample_size,
  z = sim_n_effect_size$power,
  xlab = "effect size",
  ylab = "sample size",
  main = "Contour Plot of Statistical Power",
  levels = seq(0, 1, length.out = 9),
  drawlabels = TRUE,
  # method = "simple")
  method = "edge")

3D plotting

There are several functions and packages for creating 3D plots in R.

Perspective Plots

Perspective plots can plot a 3D surface.

Static Plot

Let’s try a static perspective plot using the persp() function, as follows:

persp(
  x = sim_n_effect_size$effect_size,
  y = sim_n_effect_size$sample_size,
  z = sim_n_effect_size$power,
  xlab = "beta", ylab = "n", zlab = "power",
  col = 'lightblue',
  theta = 30, phi = 30, expand = .75,
  ticktype = 'detailed')

Interactive Plot

The R package rgl allows you to create similar plots that are interactive.

  • Install rgl now so that you can use the persp3d() function.

You can use the same syntax as for persp() to create a basic interactive plot.

Try it out and see what you get!
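
Here’s a minimal sketch (assuming the sim_n_effect_size list from above and that rgl is installed):

library(rgl)

persp3d(
  x = sim_n_effect_size$effect_size,
  y = sim_n_effect_size$sample_size,
  z = sim_n_effect_size$power,
  xlab = "beta", ylab = "n", zlab = "power",
  col = "lightblue")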

Click here for an example

Click here for an example using a LOESS smoother.

Saving an Interactive Plot

rgl also has a function for saving an interactive 3D plot to an HTML file.

For example, I used the following code to save the linked 3D plot above:

require(htmlwidgets)
saveWidget(
  rglwidget(),
  file = here::here(
    "docs", "webGL",
    "n_effect_size_power_sim_plot.html"),
  selfcontained = TRUE
)

Saving R Data Objects

You may have noticed that some of the simulations take a long time to run!

When you do a simulation study, it is often convenient to save your simulation results so that you don’t have to re-run your simulation every time.

You can save R objects to files using save().

You can save any R object to a .Rdata file. These are binary files that can only be read by R.

If you have data stored in a data.frame or matrix, you can write it to a text file. Comma-separated value (CSV) files are commonly used.

An advantage of text files is that you can open them in other applications.
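
For example, here’s a sketch of writing one of the simulation result data frames to a CSV file (the file name is just an illustration):

write.csv(
  sim_sample_size,
  file = here::here("data", "sample_size_power_sim.csv"),
  row.names = FALSE)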

  • Here’s how I saved the results of my effect size and sample size simulation to a file:
save(
  sim_n_effect_size,
  file = here::here("data", "lab_11_n_effect_sizes.Rdata"))

R saved it to a binary file with the extension “.Rdata”.

When I want to load the data again I can use:

load(file = here::here("data", "lab_11_n_effect_sizes.Rdata"))

For the lab exercises, you should save your simulation results to RData files so that you can use them again later without having to re-run the simulations.

  • Note: In the walkthrough examples and code templates I used relatively small numbers of simulations (the n_sim variable). I suggest you use small values as you build your code.

When your simulations are working as expected, you should increase the number of simulations to something large, 1000 or greater.

You can stage your simulations to run and save the results when you plan to be away from your computer for a while. Simulation results using larger numbers of iterations will make much smoother curves.

Conclusions and Exercises

What does the power surface reveal about the relationship between slope and sample size?

If you wanted, say, a power of at least 0.8 to detect a slope of b = 0.002, how large would your sample size need to be?

As you can see, stochastic simulation is an extremely powerful tool for examining power, even in this simple linear regression example where canned approaches exist.

For more complex models, the coding is more complex, but the process is the same. The same basic tools learned in this example can be extended to more complex situations.

Dispersion Simulations

We varied the sample and effect sizes to examine statistical power; however, we didn’t try varying the dispersion in the data.

After you complete the lab walkthrough, you will carry out additional power analysis on the data dispersion, i.e. the population standard deviation.

  • NOTE: You’ll need to work through the lab walkthrough and build the walkthrough simulations to complete the following exercises.
  • NOTE: You’ll be conducting a related (but different) set of simulations from those in the walkthrough. You’ll only need to make slight modifications to your walkthrough code to complete the exercises.

Population Dispersion Analysis

Review the effect size simulation above. You will carry out a similar simulation on the population standard deviation instead.

To do the simulation you’ll need to:

  • Decide the range of population standard deviations to test.
  • What was the standard deviation of the residuals from our model?
  • I suggest a range of values starting close to zero (make sure your starting value is less than the observed model standard deviation) and going up to 3 times the observed standard deviation. You will want to experiment with different ranges in order to capture the full picture.

Here’s a template:

Click to show/hide
alpha = 0.05
n_sims = 100
p_vals = ...

# What was the observed standard deviation?
sd_obs

# specify the number of different standard deviation values to simulate:
n_sds = 20
pop_sds = seq(from = 0.01, to = 1.5, length.out = n_sds)

pop_sd_powers = numeric(...)

for(j in 1:length(pop_sds))
{
  pop_sd_j = ...
  for(i in 1:n_sims)
  {
    fit_sim = linear_sim_fit(...)
    p_vals[i] = ...
  }
  pop_sd_powers[j] = ...
}

sim_output_dispersion = data.frame(
  sd = ...,
  power = ...)

# You should save your simulation results so you don't have to run it every time.
save(
  sim_output_dispersion, 
  file = here::here("data", "lab_ll_dat_dispersion_sim.RData"))

# Line plot of standard deviation (x-axis) and statistical power (y-axis)
plot(...)

# Add a dotted vertical red line at the observed population standard deviation value.
abline(v = ...)

Population Dispersion and Sample Size Analysis

It looks like statistical power drops off pretty quickly as the population variability increases.

  • Could we improve our statistical power with larger sample sizes?

You’ll now modify your simulation of population standard deviation to include sample size. Review the simulation of both sample size and effect size above.

Here’s a template for the dispersion and sample size simulation:

Click to show/hide
alpha = 0.05

# Start with a small number
n_sims = 10
p_vals = ... 

# What was the observed standard deviation?
sd_obs

# specify the number of different standard deviation values to simulate:
# Start with a small number
n_sds = 20
pop_sds = seq(from = 0.05, to = , length.out = n_sds)

# The maximum x value in the simulation.
# Use the maximum observed x-value in the data
max_x = max(birdhab$ls)


pop_sd_powers = numeric(...)

sample_sizes = seq(5, 100)

sim_output_3 = matrix(...)

for(k in 1:length(pop_sds))
{
  pop_sd_k = pop_sds[k]
  
  for(j in 1:length(sample_sizes))
  {
    x_vals = seq(0, max_x, length.out = sample_sizes[j])
    
    for(i in 1:n_sims)
    {
      fit_sim = ...
      p_vals[i] = ...
    }
    
    sim_output_3[k, j] = ...
  }
  print(paste0("Testing standard deviation ", k, " of ", n_sds))
}

image(sim_output_3)

sim_3_dat = 
  list(
    power       = sim_output_3,
    sample_size = sample_sizes,
    pop_sd      = pop_sds)


# You should save your simulation results so you don't have to run it every time.
save(
  sim_3_dat, 
  file = here::here("data", "lab_ll_sim_output_dispersion_n_1000.RData"))

Plots

You should try out some of the plots from the walkthrough.

Here’s an example of an interactive 3D perspective plot I created from my simulation data.

For the lab questions you’ll have to create plots using your dispersion simulation results.

Report Questions

Q1-2: Dispersion and Statistical Power

Create a line plot of dispersion (x-axis) and statistical power (y-axis) with a dotted vertical red line at the observed population standard deviation value.

  • Q1 (2 pts.): Include a figure of your line plot in your report.
  • Q2 (2 pts.): Why do you think that statistical power decreases as population dispersion increases?

Q3-4: Dispersion, Sample Size, and Statistical Power

Create a contour plot of the sample size and population dispersion power simulation.

  • Q3 (2 pts.): Include a figure of your contour plot in your report.
  • Q4 (2 pts.): Qualitatively describe the patterns you see in the contour plot. Make sure you discuss the effects of sample size and population dispersion on statistical power.

Q5-6: Dispersion and Statistical Power

Create an interactive 3D perspective plot, using persp3d() in package rgl, of the sample size and population dispersion power simulation.

You need to save this to an html file using saveWidget() in package htmlwidgets.

  • Check out my example in the lab walkthrough if you need a hint.

  • To save the 3D plot, first you’ll need to run the persp3d() function; the plot will open in a separate window. Next, run saveWidget() while the 3D plot is open to save it to an HTML file.

  • Your surface must be a different color than my example here.

  • Q5 (5 pts.): Upload your plot as an interactive HTML file. NOTE: some Mac users are not able to use RGL. You may also upload a static plot created with persp() if you can’t get RGL to work on your computer.
  • Q6 (2 pts.): Describe how you could use the information shown in your plot when designing an experiment.