This lab covers an assortment of [hopefully fun] techniques.
Random walks are useful in many fields. In ecology, along with diffusion models, they serve as standard models of animal movement.
The basic idea behind random walks is that you have a walker (an agent) who takes a series of discrete steps. At each step, the walker must choose a random direction and step length.
One specific use of random walk concepts is in so-called Brownian bridge models. These models are helpful for quantifying the range of possible intermediate locations when we have location data at specific time points.
The multiple trajectories between the two known points represent possible paths the animal could have taken. As a whole, the set of possible paths helps us quantify the uncertainty in position between the two known points.
You can also use random walks as a naive path-finding method. Here’s an optional fun video about using random walks to find your way through a maze.
We can use the basic idea of random walks to formulate a procedure for modeling them in two dimensions. Consider the items that we need to keep track of:

- the walker's position at each step
- the direction of each step
- the length of each step
We can be proactive in how we think about our random walk components.
Since our walker will be taking multiple steps, we'll eventually need to use a for-loop.
We will also likely want to carry out multiple simulations, perhaps with different numbers of steps. That means we’ll want to package our random walk loop into a function.
In a random walk simulation, we’d like to keep track of the location of all the walker’s steps. A simple way to do this is to create a matrix or data.frame to store the x- and y-coordinates at each time step. A possible data.frame structure might be:
# How many steps?
# This is a variable... could be useful as a function argument later on.
n_steps = 1000
# Create the empty data.frame:
walk = data.frame(
step = numeric(n_steps + 1),
x = numeric(n_steps + 1),
y = numeric(n_steps + 1)
)
Note that I made the number of rows n_steps + 1 so that we can store the initial position in the first row.
Choosing a random direction is a little bit more difficult, and we need a little bit of trigonometry. Don’t worry, we’ll just use the simple sin() and cos() functions to figure out the x- and y-coordinates corresponding to an angle!
We might (or might not) recall from trigonometry that the cosine function gives us the x-coordinate of the position on the unit circle given an angle theta. The sine function gives us the y-coordinate.
We might also remember that angles can be expressed in degrees or radians. We’re usually more familiar with degrees, but radians are computationally easier to work with.
A complete circle, i.e. 360 degrees, is \(2\pi\) radians. Therefore some common angles in radians are: 90 degrees is \(\pi/2\), 180 degrees is \(\pi\), and 270 degrees is \(3\pi/2\).
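If you find it more natural to think in degrees, a tiny helper function (my own convenience, not part of the lab's required code) makes the conversion explicit:

# Convert degrees to radians: radians = degrees * pi / 180
deg_to_rad = function(deg) deg * pi / 180
deg_to_rad(c(90, 180, 270))
## [1] 1.570796 3.141593 4.712389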
Our walker needs to choose a random direction for each step. One option is to create a discrete set of possible angles from which to choose. If we choose a set of four angles, pointing in the cardinal directions, we would end up with a walk along a grid.
# Discrete set of angles.
# R works better with radians than degrees.
thetas = c(0, 0.5 * pi, pi, 1.5 * pi)
Next, we could use sample() to pick a random direction and calculate the x- and y-distances traveled:
# Choose a random angle
theta = sample(thetas, size = 1)
print(theta, digits = 3)
## [1] 1.57
# Calculate the x- and y- distances
# Round to 3 decimal points
delta_x = round(cos(theta), 3)
delta_y = round(sin(theta), 3)
print(data.frame(delta_x = delta_x, delta_y = delta_y))
## delta_x delta_y
## 1 0 1
Notice that the delta_x term is zero. This makes intuitive sense because the x-coordinate of a 90 degree angle is zero. The delta_y term is one.
In other words, the walker will move one unit up, and zero units in the horizontal direction.
Another option is to let the angle vary continuously. In this case, we can choose a random angle from a uniform distribution, recalling that the range of 0 to 360 degrees is equivalent to 0 to \(2\pi\) in radians:
# Choose a random angle
theta = runif(
n = 1,
min = 0,
max = 2 * pi
)
print(theta, digits = 3)
## [1] 3.68
# Calculate the x- and y- distances
# Round to 3 decimal points
delta_x = round(cos(theta), 3)
delta_y = round(sin(theta), 3)
print(data.frame(delta_x = delta_x, delta_y = delta_y))
## delta_x delta_y
## 1 -0.858 -0.513
This time, the walker travels to the left and down.
Step length is much easier!
In the simplest case, we can use a fixed step length:
step_length = 1
If we want to vary the step length randomly, we have many options.
For example, we could let the step length vary uniformly between 1 and 2:
step_length = runif(n = 1, min = 1, max = 2)
Another useful technique is to let the step follow a right-skewed distribution. For example, we could have exponentially-distributed step lengths:
step_length = rexp(n = 1, rate = 1)
Random walks with step lengths that follow a right-skewed distribution (often a distribution with “fat” tails, such as a Lévy or power-law distribution) are called Lévy flights, after the mathematician Paul Lévy. They have famously been used to describe foraging behavior in albatrosses.
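To build intuition for what exponentially-distributed step lengths look like, you could plot a quick histogram. This is a throwaway check, not part of the walk code:

# Draw many exponential step lengths: most are short, but
# occasional steps are much longer (the right skew).
hist(
  rexp(n = 1000, rate = 1),
  breaks = 30,
  main = "Exponential step lengths",
  xlab = "step length")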
If we’ve been clever in constructing our angle and step length code, calculating the next position is simple.
A straight-line movement from the origin to any position on the unit circle has, by definition, a distance of 1.
That means we can simply multiply the delta-x and delta-y we calculated from theta by the step length to get the overall change in position.
Then we just need to add these changes in the x- and y-coordinates to the current position!
For example, to add the coordinates of our first step to our data.frame of positions, we could use something like:
# the step number.
# This syntax might be useful in the context of a loop...
i = 1
# Current position:
current_x = walk$x[i]
current_y = walk$y[i]
# Change in position in the x- and y-direction:
delta_x = step_length * round(cos(theta), 3)
delta_y = step_length * round(sin(theta), 3)
# Record the new position.
# Note we use i + 1 because the first row contains the starting point:
walk$x[i + 1] = current_x + delta_x
walk$y[i + 1] = current_y + delta_y
# Record the step number:
walk$step[i + 1] = i
Since we were thoughtful in how we set up our code for the different components, it should be straightforward to combine them into a simple random walk loop.
Here’s a skeleton:
n_steps = ...
# Create our empty data frame
walk = data.frame(
...
)
for (i in 1:n_steps)
{
# Choose a step length:
step_length = ...
# Choose a random direction:
theta = ...
# What is the current position?
current_x = ...
current_y = ...
# Figure out the change in the x- and y-directions:
delta_x = ...
delta_y = ...
# Record the new position
...
}
You’ve got the basic components, now it’s time to combine them and do some walking!
Use the skeleton and the random walk components we developed above to create a grid-based random walk with a fixed step length of 1.
You should end up with a data.frame called walk.
Create a test plot of your walk data using this code:
plot(y ~ x, data = walk, type = "l", asp = 1)
# Plot a blue-filled point at the starting point
points(
y ~ x, data = head(walk, 1),
pch = 21, col = 1, bg = "steelblue")
# Plot a red-filled point at the end of the walk
points(
y ~ x, data = tail(walk, 1),
pch = 21, col = 1, bg = "red")
Some things to notice:

- We've used the head() function before, but you might not be familiar with the tail() function. tail() gives you a glimpse of the last few rows of a data.frame object. The first argument (x) is the data.frame you want to preview; the optional second argument (n) is the number of rows to retrieve.
- A convenient way to retrieve the last row of a data.frame, without having to calculate the number of rows, is to simply use tail() with a value of 1 for the n argument.
- I used the following arguments to plot():
  - pch: this stands for 'plotting character'. Plotting character #21 is a circle with an optional fill color.
  - col: the color argument determines the outline color of character #21.
  - bg: the background argument determines the fill color.
  - asp: this sets the aspect ratio of the plot window. Since we need our walks to plot without being stretched vertically or horizontally, we can use a value of 1.

That was a nice way to plot our walk. Since we'll be plotting numerous walks, let's build a function so we don't have to re-write the plotting code every time:
plot_walk = function(dat_walk, ...)
{
plot(y ~ x, data = dat_walk, type = "l", asp = 1, ...)
# Plot a blue-filled point at the starting point
points(
y ~ x, data = head(dat_walk, 1),
pch = 21, col = 1, bg = "steelblue")
# Plot a red-filled point at the end of the walk
points(
y ~ x, data = tail(dat_walk, 1),
pch = 21, col = 1, bg = "red")
}
Let’s test it out on our walk data:
plot_walk(walk)
Our plotting function expects a data.frame with the following columns: step, x, and y.
You may have noticed the ... argument that I used. You might also notice that I used it in my call to plot() within the function. I won't go into detail about the use of ellipses in functions, but using this argument allows us to pass additional arguments along to plot().
If you’re interested, here is a nice blog entry that describes the use of ellipses in custom functions.
For now we're not using the step column in our walk plots, but we'll need it later.
To paraphrase the old saying:
“If you have to perform a task more than once, you should write a function!”
A single realization of a random walk is cool, but the real power of random walks comes from simulating multiple walks.
We’d like to be able to execute our grid-based walk multiple times using a compact syntax. By now you’re a whiz at creating your own functions and you can probably guess what I’m going to ask you to do:
Package your random-walk loop into a function!
- It should accept a single argument, n_steps, which is the number of steps to take.
- It should return a data.frame that contains the step number, the x-, and the y-coordinates.
- The columns must be named step, x, and y.
- Name your function grid_walk().
Hint: Don’t overthink this! You only need a couple of extra lines of code to package your loop into a function.
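If you want to check your work, here is one possible implementation that follows the skeleton above. Treat it as a sketch; your own version may differ in the details:

grid_walk = function(n_steps)
{
  # The four cardinal directions, in radians
  thetas = c(0, 0.5 * pi, pi, 1.5 * pi)
  # Empty data.frame; the extra row holds the starting position
  walk = data.frame(
    step = numeric(n_steps + 1),
    x = numeric(n_steps + 1),
    y = numeric(n_steps + 1))
  for (i in 1:n_steps)
  {
    # Fixed step length of 1
    step_length = 1
    # Choose a random cardinal direction
    theta = sample(thetas, size = 1)
    # Update the position; round to avoid floating-point drift
    walk$x[i + 1] = walk$x[i] + step_length * round(cos(theta), 3)
    walk$y[i + 1] = walk$y[i] + step_length * round(sin(theta), 3)
    # Record the step number
    walk$step[i + 1] = i
  }
  return(walk)
}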
Once you’ve got your walk function working, you can test it out. Use the following as a self-check.
set.seed(123)
par(mfrow = c(2, 2), mar = c(3, 3, 1, 1))
{
plot_walk(grid_walk(100))
plot_walk(grid_walk(100))
plot_walk(grid_walk(100))
plot_walk(grid_walk(100))
}
Your walkers’ trajectories might not match mine exactly since there is randomness built into the random walk function.
For the next random walk, we’ll allow the walker to turn in any direction, i.e. it won’t be constrained to a grid. In this walk, we’ll keep the step length fixed again.
Build a random walk function, call it omni_walk, that accepts a single argument n_steps. This function should perform a random walk in which the walker can turn in any direction.
Hint: If you implemented grid_walk() using the skeleton and template components above, it’s possible to build omni_walk() by changing only a single line of code.
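In case the hint is too cryptic: the line that changes is the direction choice. A sketch, assuming your grid_walk() follows the skeleton above:

# Instead of sampling from the discrete set of angles:
# theta = sample(thetas, size = 1)
# ...draw a continuous angle between 0 and 2*pi:
theta = runif(n = 1, min = 0, max = 2 * pi)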
Test out your omni_walk() function:
set.seed(123)
par(mfrow = c(2, 2), mar = c(3, 3, 1, 1))
{
plot_walk(omni_walk(100))
plot_walk(omni_walk(100))
plot_walk(omni_walk(100))
plot_walk(omni_walk(100))
}
A natural question to ask is: how far from its starting point does the walker end up, on average, after n steps?
Random walk theory has closed-form predictions for this average distance. We'll perform a simulation experiment to explore the variability in distance traveled.
First we need a function to calculate the walker's straight-line distance from the origin given a set of x- and y-coordinates. You may remember the formula for Euclidean distance from the origin in 2 dimensions:
\(d = \sqrt{x^2 + y^2}\)
We can take advantage of R’s vectorization to calculate the distances for a set of x- and y-coordinates from a random walk simulation:
set.seed(1234)
dat_sim = omni_walk(1000)
dist_sim = sqrt(dat_sim$x^2 + dat_sim$y^2)
# Plot both the walk and the distance:
par(mfrow = c(1, 2))
plot_walk(dat_sim)
title("Random Walk")
plot(
x = dat_sim$step,
y = dist_sim,
xlab = "step", ylab = "distance",
type = "l", main = "Distance Traveled"
)
Since we’ll want to calculate distance for multiple simulations, we should make a distance function:
euc_dist = function(x, y)
{
return(sqrt(x^2 + y^2))
}
sim_grid = grid_walk(5000)
par(mfrow = c(1, 2))
plot_walk(sim_grid, main = "Grid Walk")
plot(
x = sim_grid$step,
y = euc_dist(sim_grid$x, sim_grid$y),
type = "l")
Individual random walks are idiosyncratic. We can conduct a simulation experiment to understand the behavior of many walks.
n_sims = 200
n_steps = 1000
dat_sim = matrix(
0,
nrow = n_steps + 1,
ncol = n_sims)
for(i in 1:n_sims)
{
sim_dat_i = omni_walk(n_steps)
dist_i = euc_dist(sim_dat_i$x, sim_dat_i$y)
dat_sim[, i] = dist_i
}
We can plot the distance traveled in each simulation with matplot():
matplot(
dat_sim,
type = "l",
col = gray(0, 0.1),
xlab = "step", ylab = "distance")
It looks like some of our walkers traveled a long distance while others ended up near the origin.
We can create a 90% null envelope and use it to characterize a range of distances within which we expect most random walkers to travel. The quantile() function will be our friend here:
envelope_90 = t(apply(
dat_sim,
1,
function(x) quantile(x, c(0.05, 0.95))))
dim(envelope_90)
## [1] 1001 2
head(envelope_90)
## 5% 95%
## [1,] 0.0000000 0.000000
## [2,] 0.9995958 1.000488
## [3,] 0.2017680 1.998388
## [4,] 0.4125787 2.806290
## [5,] 0.4692512 3.509248
## [6,] 0.5960585 3.765909
Now we can add the envelope, and the mean, to our plot:
matplot(
dat_sim,
type = "l",
col = gray(0, 0.05),
xlab = "step", ylab = "distance")
# Lower edge of envelope
lines(
x = 1:(n_steps + 1),
y = envelope_90[, 1],
lwd = 1, col = 2)
# Upper edge of envelope
lines(
x = 1:(n_steps + 1),
y = envelope_90[, 2],
lwd = 2, col = 2)
lines(
x = 1:(n_steps + 1),
y = apply(dat_sim, 1, mean),
lwd = 2, col = "blue"
)
What do you notice about the mean distance traveled? Does it seem linear, monotonic, logarithmic, or something else?
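As a point of comparison with theory: for a 2D random walk with unit-length steps, the root-mean-square displacement after n steps is \(\sqrt{n}\). If you like, you can overlay that reference curve on the envelope plot (a sketch, assuming dat_sim and n_steps from above are still in your workspace):

# Overlay the theoretical root-mean-square displacement, sqrt(n)
lines(
  x = 1:(n_steps + 1),
  y = sqrt(0:n_steps),
  lwd = 2, lty = 2, col = "darkorange")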
If you are interested in random walks, you can try implementing some optional extensions to the code you’ve already built.
Here are some suggestions:
Ordinations are focused on finding interpretable trends in data; in community ecology, these are often changes in species composition that may reflect underlying environmental gradients.
Ordinations fall under the umbrella of multivariate statistics. An aim of many ordination techniques is dimensionality reduction: distilling the information contained in a complex dataset with many variables into a small number of new variables that capture as much of the variation in the data as possible.
Other types of ordinations are based on a dissimilarity matrix. A dissimilarity matrix represents pairwise distances (measured in various ways) between samples. A common dissimilarity metric is the Bray-Curtis distance.
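For a concrete feel for Bray-Curtis dissimilarity, here is a minimal sketch using the vegdist() function from the vegan package (we'll use vegdist() again later). The tiny species matrix is made up for the example:

library(vegan)

# Three hypothetical sites, three hypothetical species (abundances)
toy_sites = data.frame(
  sp1 = c(5, 0, 3),
  sp2 = c(2, 2, 0),
  sp3 = c(0, 4, 1))

# Pairwise Bray-Curtis dissimilarities between the sites
vegdist(toy_sites, method = "bray")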
We’ll use the Oregon bird survey data again, but this time we’ll use the data aggregated by subbasin.
# The here() function builds file paths relative to the project root
library(here)

# Bird abundance data
dat_bird_sub = read.csv(
here("data", "bird.sub.csv"))
# Remove the non-species columns.
# Note the use of the select argument with the negative.
dat_bird_sub_2 = subset(
dat_bird_sub,
select = -c(
b.total, b.rich, b.sidi, basin, sub, tsta))
# Site habitat covariates
dat_hab_sub = read.csv(
here("data", "hab.sub.csv"))
We’ll be grouping observations by subbasin, so we’ll need a vector of colors for plotting.
# Create a vector of colors, coded by basin
basin_col_all = as.numeric(
factor(dat_bird_sub$basin))
# A prime rule in colorblind design is to avoid red/green combos.
# Change the green to blue
basin_col_all = replace(
basin_col_all,
basin_col_all==3,
4)
NMDS attempts to plot points in 2D or 3D space, such that the distances between the points in the plot correspond as closely as possible to the distances in the dissimilarity matrix.
We’ll use the vegan package to perform NMDS. The metaMDS() function creates the dissimilarity matrix and performs the NMDS algorithm, returning a metaMDS object.
We’ll create 2- and 3-dimensional ordinations:
library(vegan)

mds_b_sub_2 = metaMDS(dat_bird_sub_2, k = 2)
mds_b_sub_3 = metaMDS(dat_bird_sub_2, k = 3)
class(mds_b_sub_2)
The MDS algorithm uses an iterative simulation method to find an optimal way to arrange the points in multidimensional space (that’s a very simplified explanation!).
The stress values are a measure of how well the algorithm was able to position the points in ordination space. Lower stress values are better; a common rule of thumb is that stress below 0.1 indicates a good ordination, stress below 0.2 is usable, and values much above 0.2 are suspect.
Let’s check the stress values for our fits:
| axes | stress |
| --- | --- |
| 2 | 0.201 |
| 3 | 0.118 |
It looks like the 2-axis ordination did an acceptable job, but the 3-axis version did much better.
We can use the ordiplot() function from the vegan package to plot the 2D ordination:
ordiplot(mds_b_sub_2)
Ordination plots are notoriously difficult to interpret, but we can make some simplifications to improve readability:
# How many observations are in each basin?
table(dat_bird_sub$basin)
##
## D L N
## 10 10 10
# Create a vector of colors, coded by basin
basin_col = as.numeric(factor(dat_hab_sub$basin))
# A prime rule in colorblind design is to avoid red/green combos.
# Change the green to blue
basin_col = replace(
basin_col,
basin_col==3,
4)
# Open a blank plot window:
ordiplot(
mds_b_sub_2,
display = "sites",
type = "n")
# Draw the points with custom colors
points(
mds_b_sub_2,
display = "sites",
col = basin_col,
lwd = 2)
It looks like there might be some patterning. How would you qualitatively describe the arrangement of the three colors of points?
Let’s look at the 3-axis ordination using an interactive 3D scatterplot:
# ordirgl() is in the vegan3d package; rgl provides the 3D graphics device
library(vegan3d)
library(rgl)

# Make the points big using cex = 3
ordirgl(
mds_b_sub_3,
col = basin_col,
cex = 3, axes = F)
# Make sure the background is white
bg3d("white")
rglwidget()
Make sure you try rotating the plot with your mouse so you can get a better idea of how the points are distributed in 3D space.
The 3rd axis seems to contribute to the clustering.
Spider plots are another way to visualize the clusters (or hypothesized clusters) in ordination space.
They work by first calculating the centroid (center of gravity) of the proposed group. Next, line segments are connected from the centroid to each point in the group.
Here’s what a 3D spider plot looks like with the MDS object:
ordirgl(
mds_b_sub_3,
col = basin_col,
cex = 3, axes = F)
orglspider(
mds_b_sub_3,
display = "sites",
col = basin_col,
groups = dat_bird_sub$basin,
ax.col = 0)
bg3d("white")
rglwidget()
Another common ordination method is Principal Coordinates Analysis (PCoA).
Like NMDS, PCoA does not operate on the original data, but rather on a dissimilarity matrix.
Dissimilarity may be measured in various ways, including Euclidean distance. A useful dissimilarity measure in ecology is the Bray-Curtis dissimilarity.
You can use the vegdist() function from package vegan to create the matrix. By default it creates a Bray-Curtis dissimilarity matrix. Check the help entry for details on how to calculate different types of dissimilarity matrices.
Using the function documentation and the code above as a template, perform a Principal Coordinates Analysis on the bird species data.

Hint: use the wcmdscale() function.
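If you get stuck, here is a minimal sketch of one way to proceed. It assumes vegan is loaded and reuses dat_bird_sub_2 and basin_col from above; your approach may differ:

# 1. Build a Bray-Curtis dissimilarity matrix from the bird data
dist_bray = vegdist(dat_bird_sub_2)

# 2. Perform the PCoA; eig = TRUE keeps the eigenvalues with the result
pcoa_b_sub_2 = wcmdscale(dist_bray, k = 2, eig = TRUE)

# 3. Plot the sites, colored by basin, as we did for the NMDS
plt = ordiplot(pcoa_b_sub_2, display = "sites", type = "n")
points(plt, "sites", col = basin_col, lwd = 2)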
Your 2D plots should look something like this:
Are there any qualitative similarities in the plots of both ordinations?
Your 3D Spider plots should resemble mine:
A moderately detailed, but readable overview of ordination can be found at the Ordination Methods for Ecologists site.
Another good resource for information on community ecology methods in R is the aptly named Analysis of community ecology data in R site.
Characterizing the distribution of tree stem diameters is useful to forest managers. Understanding tree stand structure is important for economic purposes (e.g. estimating the amount of marketable timber) and for meeting various management goals.
Many techniques exist for describing tree size distributions; an especially useful one is fitting a Weibull distribution.
To illustrate, we’ll use tree diameter data from a 2010 survey of the Tapajos National Forest in Brazil.
The data were retrieved from the Oak Ridge National Laboratory Distributed Active Archive Center (ORNL DAAC).
The inventory data is in the file Tapajos_inventory_data_2010.csv.
You can view the metadata here.
For the walkthrough, we’ll use a single species from the dataset: Inga cylindrica.
You know how to use the subset() function to retrieve a subset of a data.frame that meets a logical condition.

Create a new data.frame object called dat_inga containing only the rows for which the scientific_name column is Inga cylindrica.
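A minimal sketch, assuming you read the inventory file into an object called dat_tree (that object name, and the data/ folder location, are my assumptions; adjust to match your own script):

# Read the inventory data
dat_tree = read.csv(
  here("data", "Tapajos_inventory_data_2010.csv"))

# Keep only the Inga cylindrica rows
dat_inga = subset(
  dat_tree,
  scientific_name == "Inga cylindrica")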
Create a histogram of the I. cylindrica dbh. As a self-test, it should look like the example below. I used the breaks = 8 argument to hist().
As a first approximation, we can try to fit a Weibull curve by guessing values for the shape and scale parameters.
First, let’s create a density plot of the I. cylindrica dbh:
plot(
density(dat_inga$dbh),
main = "Mike's Kernel Density Plot\nInga cylindrica")
Next, I can use the curve() function with dweibull() to draw a Weibull curve:
plot(
density(dat_inga$dbh),
main = "Mike's Kernel Density Plot\nInga cylindrica",
lwd = 2)
curve(
dweibull(x, shape = 3, scale = 25), add = T,
col = "steelblue", lwd = 2, lty = 3)
legend(
"topright",
legend = c("data", "guess"),
col = c(1, "steelblue"),
lty = c(1, 3), lwd = 2,
bty = "n")
The general shape looks promising, but the position in the plot doesn’t quite match the data.
You should tinker with the shape and scale parameters to see if you can eyeball a better fit.
We’ll use Maximum Likelihood to find a “best fit” below, but for now try the guess and check method to see how close you can get.
My best guess looked like:
It’s not quite a perfect fit, but it’s better than my first guess.
Before you proceed, see if you can do better than I did!
We’ll use a graphical approach to find an ML estimate of the Weibull parameters.
Fortunately, the Weibull distribution has only 2 parameters, so we can visualize a likelihood surface. No 4-dimensional plots are needed!
The general procedure is:

1. Choose sets of candidate values for the shape and scale parameters.
2. Calculate the sum of log-likelihoods of the data for every pair of candidate values.
3. Plot the resulting likelihood surface and find the pair with the highest log-likelihood.
To create our maximum likelihood surface, we can make a regularly-spaced ‘grid’ of parameter value pairs for which to test the likelihood.
This may need some further explanation:
The first step is to choose a gradient of values over which to vary the shape parameter. Let’s try 1 to 3 with 5 values:
w_shape = seq(1, 3, length.out = 5)
Next, we can choose a range for the scale parameter:
w_scale = seq(12, 25, length.out = 5)
To create our set of parameter value pairs, we’ll use the expand.grid() function:
param_pairs = expand.grid(shape = w_shape, scale = w_scale)
head(param_pairs)
## shape scale
## 1 1.0 12.00
## 2 1.5 12.00
## 3 2.0 12.00
## 4 2.5 12.00
## 5 3.0 12.00
## 6 1.0 15.25
tail(param_pairs)
## shape scale
## 20 3.0 21.75
## 21 1.0 25.00
## 22 1.5 25.00
## 23 2.0 25.00
## 24 2.5 25.00
## 25 3.0 25.00
Now the fun part: Calculate the sum of log-likelihoods of the data for all pairs of parameter values.
First, we need to figure out a few things, for example:
How do I calculate the sum of log-likelihoods for a Weibull distribution?
Here’s how I can calculate the log-likelihood value for the first set of parameter values:
x = dat_inga$dbh
shape = param_pairs$shape[1]
scale = param_pairs$scale[1]
sum(dweibull(
x,
shape = shape,
scale = scale,
log = T)
)
## [1] -897.288
You’ll need to do lots of likelihood calculations. By now, you know the routine: I’m asking you to create a Weibull Likelihood Function.
You can use the above code as inspiration for your implementation.
The function specs are:

- Name your function weib_lik().
- The first argument (x) is the data (for example DBH).
- The second and third arguments are the shape and scale parameters.
- The function should return the sum of log-likelihoods of the data.
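If you want to check yourself, a minimal sketch of such a function, built directly from the calculation above, could look like:

# Sum of Weibull log-likelihoods of the data x, given shape and scale
weib_lik = function(x, shape, scale)
{
  return(sum(dweibull(x, shape = shape, scale = scale, log = TRUE)))
}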
Now let’s put it together and calculate a likelihood surface.
We can use a loop along with our new function:
# Create an empty column in the parameter
# data frame to hold the likelihoods
param_pairs$log_lik = 0
for (i in 1:nrow(param_pairs))
{
param_pairs$log_lik[i] =
weib_lik(
dat_inga$dbh,
param_pairs$shape[i],
param_pairs$scale[i])
}
For the plotting functions below, we need a matrix of z-values. We can create one from the log_lik column of param_pairs:
# expand.grid() varies the first variable (shape) fastest, so filling
# column-wise (byrow = FALSE) puts shape along the rows and scale along
# the columns.
z_matrix = matrix(
param_pairs$log_lik,
nrow = length(w_shape),
byrow = FALSE)
We can create a 2D contour plot:
contour(w_shape, w_scale, z_matrix)
The contour plot looks a little grainy. Let’s try a 3D perspective plot.
You’ve already used persp3d() in a previous lab. You can use it again here to plot the likelihood surface.
# persp3d() and the other 3D plotting functions come from the rgl package
library(rgl)

rgl.viewpoint(zoom = 1)
persp3d(
w_shape, w_scale, z_matrix,
col = "darkgreen",
xlab = "shape",
ylab = "scale",
zlab = "sum(log-lik)",
alpha = 0.2,
smooth = FALSE)
bg3d("white")
rglwidget()
This plot looks angular too. Try re-running the code with smooth = TRUE for a slight improvement.
Now it’s your turn to experiment.
To make your contour and perspective plots higher resolution, you’ll want to increase the number of parameter values you test. Recall that for the example, we used only 5 discrete values of each parameter.
To create the following high-resolution plots, for example, I used sequences of 50 discrete values for each parameter.
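For instance, a finer grid over the same parameter ranges might look like this (feel free to adjust the ranges):

# 50 values per parameter: a 50 x 50 = 2500-pair grid
w_shape = seq(1, 3, length.out = 50)
w_scale = seq(12, 25, length.out = 50)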
Make your way through the lab walkthrough and answer the following questions:
Create the code for the omnidirectional walk function (omni_walk()) from the walkthrough.
Q1 (1 pt.): Include a plot of a random walk using your omni_walk() and plot_walk() functions. Your walk may be any length between 100 and 10000 steps.
Q2 (2 pts.): Using either the grid or omnidirectional walk, perform a distance simulation (like in the walkthrough). Your simulation should use 200 random walks of 1000 steps. Use the code from the walkthrough as a template to create a plot of the distance envelope. Your envelope plot must show:

- the distance curves of the individual walks
- the lower and upper edges of the 90% envelope
- the mean distance
Q5 (1 pt.): Use my code to create the kernel density plot of Inga cylindrica DBH. Try different values for the two Weibull parameters (shape and scale) to see if you can get a better-fitting curve than my example. Include a plot of your best curve and list the values you used for the shape and scale parameters.
Q6 (2 pts.): Include a high-resolution contour plot of your likelihood surface. Use the code in the Maximum Likelihood Fit section as a template.