Concepts

In this lab you will work through an example of kriging, in the process you’ll practice:

  • Plotting spatial objects with ggplot() and sf.
  • Create a variogram using package gstat.
  • Create a regression model of ozone pollution and air quality index, and examine the spatial autocorrelation in the residuals.
  • Calculate Moran’s I and create correlograms for the model variables and residuals.

You should work through the example from the slides using the CA ozone data. You will then apply the same procedure to the Air Quality Index data (don’t worry, it’s just a different column in the data frame!)

The data

You’ll be working with the same data set that I used in the lecture slides kriging example (deck 4b).
You’ll need the CA_ozone_2017.GPKG file and the county borders for California. Links to both files are available in the Data Files section of the Course Materials part of the course GitHub page.


As always, save yourself some headaches when you first import the data and make sure that your two spatial data sets are in a common projection.

You should use the projection of the CA Counties.

Lecture note code.

I encourage you to re-type the code from the lecture notes in order to practice, however you can use the reference code below for guidance as needed.

Read the data

require(rgdal)

ca_cnty = readOGR(here("data", "ca_counties.GPKG"))
## OGR data source with driver: GPKG 
## Source: "C:\Users\michaelnelso\git\spatial_data_r\data\ca_counties.GPKG", layer: "ca_counties"
## with 58 features
## It has 17 fields
ca_ozone = spTransform(readOGR(here("data", "CA_ozone_2017.GPKG")), proj4string(ca_cnty))
## OGR data source with driver: GPKG 
## Source: "C:\Users\michaelnelso\git\spatial_data_r\data\CA_ozone_2017.GPKG", layer: "ozone"
## with 80 features
## It has 3 fields
# This saves some headaches later:
# It copies the all of the coordinate reference info, including comments.
crs(ca_ozone) = crs(ca_cnty)

Plot map of ozone using ggplot

First make sf objects from the sp object:

ca_sf = st_as_sf(ca_cnty)
oz_sf = st_as_sf(ca_ozone)  

Note the use of the color aesthetic. I set it equal to ‘ozone’ so that it would scale the colors to that column.

  • Try to modify the code to plot the air quality index (AQI) instead.
ggplot() +
  geom_sf(data = ca_sf, lwd = 0.1, fill = gray(0.5, 0.5)) +
  geom_sf(data = oz_sf, mapping = aes(colour = ozone), cex = 2) +
  scale_color_gradientn(colours = heat.colors(10), name = "Ozone") +
  theme_minimal() +
  ggtitle("Annual ozone levels: 2017")

Things to note (and tinker with) in the ggplot code:

  • The fill color for the California counties: gray(0.5, 0.5).
  • The mapping of color to the ozone data column.
  • The heat colors palette for the ozone pollution.

Create Template Grid

# Create a template raster
temp_rast = raster(
  ca_cnty, nrow = 200, ncol = 180)

# Use the crs() trick to avoid any potential headaches
crs(temp_rast) = crs(ca_cnty)

# Convert the raster to a SpatialPointsDataFrame
temp_grid_sp = as(temp_rast, "SpatialPoints")

# Crop to the outline of California
temp_grid_sp = crop(temp_grid_sp, ca_cnty)

Variograms

# Create gstat object
oz_gs = gstat(
  formula = ozone ~ 1,
  locations = ca_ozone)

# Build empirical variogram:
vgm_emp = variogram(oz_gs)
plot(vgm_emp, pch = 16, cex = 1.2)

# Parameterize initial spherical and exponential variograms.
# These contain the starting parameters for optimizations.
vgm_mod_exp = vgm(
  model = "Exp",
  nugget = 1e-5,
  range = 2e5)
vgm_mod_sph = vgm(
  model = "Sph",
  nugget = 1e-5,
  range = 2e5)

# Use fit.variogram() to optimize the variogram parameters...
vgm_fit_sph = fit.variogram(
  vgm_emp, vgm_mod_sph)

vgm_fit_exp = fit.variogram(
  vgm_emp, vgm_mod_exp)

# and plot them!
par(mfrow = c(2, 1))
plot(vgm_emp, vgm_fit_exp, main = "Exponential Variogram")  

plot(vgm_emp, vgm_fit_sph, main = "Spherical Variogram")  

Plot with ggplot

Here’s a way to plot variograms in ggplot. Attribution is the answer from user aosmith on this stackoverflow entry.

preds_exp = variogramLine(
  vgm_fit_exp,
  maxdist = max(vgm_emp$dist),
  n = 200)

preds_sph = variogramLine(
  vgm_fit_sph,
  maxdist = max(vgm_emp$dist),
  n = 200)

g_exp = ggplot(vgm_emp, aes(x = dist, y = gamma)) +
  geom_point() +
  geom_line(data = preds_exp) +
  ggtitle("Mike's awesome exponential variogram", "feat. ggplot2")

g_sph = ggplot(vgm_emp, aes(x = dist, y = gamma)) +
  geom_point() +
  geom_line(data = preds_sph) +
  ggtitle("Mike's awesome spherical variogram", "feat. ggplot2")

g_exp

g_sph

Arranging ggplot objects in a grid

# For arranging ggplot objects into a multi-panel figure:
require(cowplot)

plot_grid(g_exp, g_sph, nrow = 2)

Krige!

Once everything is set up, kriging is easy:

oz_krig_exp =  gstat::krige(
  ozone ~ 1,
  locations = ca_ozone,
  newdata = temp_grid_sp,
  model = vgm_fit_exp)

oz_krig_sph =  gstat::krige(
  ozone ~ 1,
  locations = ca_ozone,
  newdata = temp_grid_sp,
  model = vgm_fit_sph)

Interpolation ggplot objects

The ggplot() code for plotting the interpolation is:

NOTE: this is slightly streamlined from the lecture note code

# CA county borders:
g_ca = geom_sf(
  data = ca_sf,
  lwd = 0.1, fill = "transparent")

# Fill gradients for the interpolated values and the sample variance

# Modify the name and limits for AQI
g_fill_oz_pred = scale_fill_gradientn(
  colours = heat.colors(10),
  name = "Ozone", limits = c(0, 0.05))

# Modify the name and limits for AQI
g_fill_oz_var =   scale_fill_gradientn(
  colours = rainbow(10),
  name = "Sample\nVariance",
  limits = c(0, 5e-5))

# For plotting the raster layers for interpolated values and sample variances:
g_r_pred = geom_raster(mapping = aes(x = x, y = y, fill = var1.pred))
g_r_var = geom_raster(mapping = aes(x = x, y = y, fill = var1.var))

# Ggplot objects
g_k_exp_oz = ggplot(as.data.frame(oz_krig_exp)) +
  g_r_pred + g_fill_oz_pred + g_ca +
  ggtitle("Kriged Ozone Values", "Exponential Variogram")
g_k_exp_var = ggplot(as.data.frame(oz_krig_exp)) +
  g_r_var + g_fill_oz_var + g_ca +
  ggtitle("Sample Variance", "Exponential Variogram")

g_k_sph_oz = ggplot(as.data.frame(oz_krig_sph)) +
  g_r_pred + g_fill_oz_pred + g_ca +
  ggtitle("Kriged Ozone Values", "Spherical Variogram")
g_k_sph_var = ggplot(as.data.frame(oz_krig_sph)) +
  g_r_var + g_fill_oz_var + g_ca +
  ggtitle("Sample Variance", "Spherical Variogram")

Plot Interpolations

I used plot_grid() to plot all the interpolations at once:

cowplot::plot_grid(
  g_k_sph_oz, g_k_sph_var, g_k_exp_oz, g_k_exp_var,
  nrow = 2)

Kriging for AQI

You should follow along with the code for kriging the ozone pollution data. Now, modify it to work with the air quality index (AQI) data.

Self Tests

Here are some plots of my implementation. You may use these to test your work.

AQI Map

Here’s how my basic map came out:

AQI Variograms

My AQI variograms looked like:

AQI Krige Interpolations

My Kriged interpolation models looked like:

## [using ordinary kriging]
## [using ordinary kriging]

Spatial Statistics: Moran’s I and Correlograms

Above we worked with variograms and Kriging. Recall that these are part of a geostatistical approach to spatial analysis.

We can also examine the dataset from a spatial statistics point of view using Moran’s I and correlograms.

We know that Moran’s I is a spatial version of a Pearson correlation coefficient for pairs of points within a distance class, d, and that a correlogram is a plot of Moran’s I against d.

Before working on the CA Ozone data, you should work your way through the book sections 5.3.1 to 5.3.3. You should use the verified book code for chapter 5 in Moodle as a reference.

The authors provide code to plot the cactus matrix data using base R.

You may also want to practice plotting sf objects with ggplot using the cactus data.

Here is some code I used to convert the matrix data into a sf object for easy plotting

require(sf)
require(ggplot2)
cactus_matrix = read.csv(file.path(data_dir, "cactus_matrix.csv"))
cactus_coords <- cbind(cactus_matrix$x, cactus_matrix$y)

# note the names of the x- and y-coordinate columns
head(cactus_matrix, 3)
##   x y Height
## 1 0 0     35
## 2 0 2     65
## 3 0 4     75
# use the st_as_sf() function to convert a matrix with into an sf object.

cactus_matrix_sf = 
  st_as_sf(
    cactus_matrix, 
    # provide the names of the x- and y-coordinate columns:
    coords = c("x", "y"))

# Note there's no coordinate reference (see the book's comment on this).
# st_as_sf() knew to use the Height column as the data corresponding
# to the coordinates in the x and y columns
head(cactus_matrix_sf)
## Simple feature collection with 6 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0 ymin: 0 xmax: 0 ymax: 10
## CRS:           NA
##   Height     geometry
## 1     35  POINT (0 0)
## 2     65  POINT (0 2)
## 3     75  POINT (0 4)
## 4     60  POINT (0 6)
## 5     80  POINT (0 8)
## 6     20 POINT (0 10)

Now I can plot using ggplot() and geom_sf(). See if you can re-create a plot similar to mine before you peek at my code.

Click to show/hide hint
gg_cactus = ggplot(cactus_matrix_sf) +
  geom_sf(aes(colour = Height), cex = 5) +
  # I used the topo.colors color gradient:
  scale_colour_gradientn(colours = topo.colors(10)) +
  ggtitle("Mike's plot of cactus height matrix")
gg_cactus

Correlogram with spdep

For reference, here’s my adaptation of the book code to build and correlogram using spdep:

require(spdep)

distmat = as.matrix(dist(cactus_matrix[, 1:2]))
maxdist = (2/3) * max(distmat)

# Spdep
correlog.sp <- data.frame(
  dist=seq(5, maxdist, by=5),
  Morans.i=NA,
  Null.lcl=NA,
  Null.ucl=NA,
  Pvalue=NA)

#inspect (it should be empty)
head(correlog.sp)
##   dist Morans.i Null.lcl Null.ucl Pvalue
## 1    5       NA       NA       NA     NA
## 2   10       NA       NA       NA     NA
## 3   15       NA       NA       NA     NA
## 4   20       NA       NA       NA     NA
## 5   25       NA       NA       NA     NA
## 6   30       NA       NA       NA     NA
#then do a for loop to calculate Moran's I for lag distances
for (i in 1:nrow(correlog.sp))
{
  d.start <- correlog.sp[i,"dist"]-5
  d.end <- correlog.sp[i,"dist"]
  
  neigh <- dnearneigh(
    x=cactus_coords,
    d1=d.start,
    d.end,
    longlat=F)
  
  wts <- nb2listw(
    neighbours=neigh,
    style='W',
    zero.policy=T)
  
  mor.i <- moran.mc(
    x=cactus_matrix$Height,
    listw=wts,
    nsim=99,
    alternative="greater",
    zero.policy=T)
  
  #summarize results from spdep
  correlog.sp[i, "dist"] <- (d.end+d.start)/2
  #mean dist
  correlog.sp[i, "Morans.i"] <- mor.i$statistic 
  #observed Moran's I
  correlog.sp[i, "Null.lcl"] <- 
    quantile(
      mor.i$res,
      probs = 0.025,
      na.rm = TRUE)  #lower null envelope
  correlog.sp[i, "Null.ucl"] <- 
    quantile(
      mor.i$res,
      probs = 0.975,
      na.rm = TRUE)  #upper null envelope
  correlog.sp[i, "Pvalue"] <- mor.i$p.value                                                      #p-value for Moran's I at that distance category
}

#plot
plot(
  y=correlog.sp$Morans.i,
  x=correlog.sp$dist,
  xlab="Lag Distance(m)",
  ylab="Moran's I",
  ylim=c(-0.3,0.3))
#ylim provides limit on y-axis between -1 and 1

abline(h=0)                                                              #0 reference
lines(correlog.sp$dist, correlog.sp$Null.lcl,col = "red")                  #add the null lcl to the plot
lines(correlog.sp$dist, correlog.sp$Null.ucl,col = "red")                  #add the null ucl to the plot

Ggplot Correlogram

And a ggplot version of the correlogram:

ggplot(data.frame(correlog.sp)) +
  geom_smooth(aes(x = dist, y = Morans.i), se = FALSE) +
  geom_point(aes(x = dist, y = Morans.i)) +
  geom_ribbon(aes(
    x = dist,
    ymin = Null.lcl,
    ymax = Null.ucl),
    fill = adjustcolor("steelblue", 0.2)) +
  ggtitle("Mike's spdep correlogram", "cactus vegetation matrix")

A Simple Regression Model

A natural question we could ask with the California ozone dataset is whether the Air Quality Index is a good predictor for ozone pollution.

Since SpatialPointsDataFrame objects can be used just like a data.frame in many circumstances, we can build a simple linear model:

fit_1 = lm(ozone ~ AQI, data = ca_ozone)
summary(fit_1)
## 
## Call:
## lm(formula = ozone ~ AQI, data = ca_ozone)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0092823 -0.0019994  0.0000795  0.0019795  0.0117655 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.492e-02  1.877e-03   7.946 1.20e-11 ***
## AQI         3.969e-04  3.878e-05  10.235 4.46e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003988 on 78 degrees of freedom
## Multiple R-squared:  0.5732, Adjusted R-squared:  0.5677 
## F-statistic: 104.8 on 1 and 78 DF,  p-value: 4.461e-16
plot(fit_1, which = 1)

plot(fit_1, which = 2)

There may be some issues with this model, but for the moment we are going to proceed anyway

Variogram of model residuals

Whenever we build a regression model, we need to check whether spatial autocorrelation may be a problem. We might encounter spatial autocorrelation in the:

  • predictor variables
  • response variable
  • model residuals

In the regression world, we’re often most concerned about spatial autocorrelation in the model residuals.

Using the code you’ve written above for inspiration, create and plot a variogram of the model residuals.

To make things easier, remember that you can extract a vector of model residuals using the residuals() function. Since a SPDF behaves like a data.frame, you can easily add a column of model residuals to your ca_ozone object.

Click to show/hide hint
ca_ozone$resids = residuals(fit_1)

My spherical variogram (fit with gstat) looked like:

The variogram gives us lots of info, but it doesn’t really tell us about the significance of spatial autocorrelation.

To characterize autocorrelation, and significance, let’s use a spatial statistics approach with Moran’s I and correlograms.

Model Residuals Moran’s I

The spdep provided a nice way to calculate neighborhoods and Moran’s I statistics.

First you should use the code above as a template to create a neighborhood and test for significant spatial autocorrelation in the ozone residuals.

Here’s some hint code to get started

# Calculate the maximum distance for ozone points in CA
distmat_ca = dist(as.matrix(coordinates(ca_ozone)))

# Since CA is an elongated shape, we'll use 1/2 the max distance
maxdist_ca = 0.5 * max(distmat_ca)

# What's the minimum distance?
# This will be helpful for neighborhood building.
mindist_ca = min(distmat_ca)

# make a sequence of 10 distance classes between the min and max distances:
n_dist_class = 10
ca_nb_dists = seq(mindist_ca, maxdist_ca, length.out = n_dist_class)

# dnearneigh() accepts a sp or sf object as its x argument
ca_nbh_1 = dnearneigh(
  x = ca_ozone,
  #d1 is minimum distance, d2 is max distance
  d1 = ca_nb_dists[1],
  d2 = ca_nb_dists[2],
  longlat=F)

# Make a weights object
wts_ca = nb2listw(
  neighbours=ca_nbh_1,
  #W = row-standardized weights
  style='W',
  zero.policy=T)

#Moran's I test with normal approximation versus Monte Carlo permutation test
#Monte Carlo
mor_mc_ca = moran.mc(
  ca_ozone$resids,
  listw =wts_ca,
  nsim=999,
  zero.policy=T)

#normal approximation
mor_norm_ca = moran.test(
  ca_ozone$resids,
  listw =wts_ca,
  randomisation=F,
  zero.policy=T)

#inspect
mor_mc_ca
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  ca_ozone$resids 
## weights: wts_ca  
## number of simulations + 1: 1000 
## 
## statistic = 0.17122, observed rank = 992, p-value = 0.008
## alternative hypothesis: greater
mor_norm_ca
## 
##  Moran I test under normality
## 
## data:  ca_ozone$resids  
## weights: wts_ca  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 2.3351, p-value = 0.009769
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.171224571      -0.013698630       0.006271484

Do you conclude there’s significant spatial autocorrelation?

Model Residuals Correlogram

We can modify the book code to crate a correlogram. It will take a bit of coding effort, so here’s a code template you can fill in.

The template shows how to make a correlogram for AQI, you need to modify it to create correlograms for ozone and the model residuals.

# make a sequence of 10 distance classes between the min and max distances:
n_dist_class = 10
ca_nb_dists = seq(mindist_ca, maxdist_ca, length.out = n_dist_class)

# Create data frames for storing output
ca_correlog_aqi =
  data.frame(
    dist=ca_nb_dists[-n_dist_class],
    Morans.i=NA,
    Null.lcl=NA,
    Null.ucl=NA,
    Pvalue=NA)

# then use a for loop to calculate Moran's I for lag distances
for (i in 1:(length(ca_nb_dists) - 1))
{
  d.start = ca_nb_dists[i]
  d.end = ca_nb_dists[i + 1]
  
  # dnearneigh() accepts a sp or sf object as its x argument
  neigh_i = dnearneigh(x = ca_ozone, d1 = d.start, d2 = d.end)
  
  # Make a weights object
  wts_i = nb2listw(
    neighbours = neigh_i,
    #W = row-standardized weights
    style='W',zero.policy=T)
  
  mor.i_aqi <- moran.mc(ca_ozone$AQI, listw =wts_i,nsim=999, zero.policy=T)
  
  #summarize results from spdep
  ca_correlog_aqi[i, "Morans.i"] = mor.i_aqi$statistic
  ca_correlog_aqi[i, "Null.lcl"] = quantile(mor.i_aqi$res, 0.025, na.rm = TRUE)
  ca_correlog_aqi[i, "Null.ucl"] = quantile(mor.i_aqi$res, probs = 0.975, na.rm = TRUE)
  ca_correlog_aqi[i, "Pvalue"] = mor.i_aqi$p.value
}

Self-test

Here are my correlograms, using the ggplot method. You may use base plotting or ggplot for yours.

Report

Q 1-4 Variograms and Kriging

Repeat the Kriging exercise using the AQI data (rather than the ozone as in the example).

Your variogram should use a Matérn model.

  • Q1 (2 pts.): Include a map of the AQI sample points. Use the ggplot example code as a guide. Your map must include:
    • The outlines of the CA county borders.
    • The sampling locations, color coded by AQI level. They must use a different palette than heat.colors().
  • Q2 (2 pts.): Include a plot of your variogram. You can use the base plotting or ggplot code above as inspiration. Your plot needs must show:
    • The empirical variogram points.
    • The curve of your fitted variogram (Matérn model).
  • Q3 (1 pt.): Include a map showing your (Kriged) interpolation of AQI.
  • Q4 (1 pt.): Include a map showing your interpolation sample variance.

Q 5-7 Regression Model Autocorrelation

Modify the supplied code to create correlograms for ozone and the model residuals. You can use either base or ggplot to create your figures. Use the example code for the cactus correlograms as a starting point.

  • Q5 (1 pt.): Include a plot of your correlogram for ozone. Your plot must include:
    • Points showing Moran’s I at the various distance classes.
    • Upper and lower confidence envelope values.
  • Q6 (1 pt.): Include a plot of your correlogram for model residuals. Your plot must include:
    • Points showing Moran’s I at the various distance classes.
    • Upper and lower confidence envelope values.
  • Q7 (2 pts.): Do you conclude that there is spatial autocorrelation in the model variables or residuals? Why or why not?