In an effort not to re-invent the wheel, this exercise is taken from the excellent open-source RSpatial group's e-book.
It is a very close adaptation of the materials in their Chapter 4: Interpolation.
Their site is licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, so I have reproduced some of their materials verbatim.
From rspatial:
Almost any variable of interest has spatial autocorrelation. That can be a problem in statistical tests, but it is a very useful feature when we want to predict values at locations where no measurements have been made; as we can generally safely assume that values at nearby locations will be similar. There are several spatial interpolation techniques. We show some of them in this chapter.
We will be working with precipitation data for California. If you have not yet done so, first install the rspatial package to get the data. You may need to install the devtools package first.
if (!require("rspatial")) devtools::install_github('rspatial/rspatial')
Now get the data:
require(rspatial)
d <- sp_data('precipitation')
What kind of object is d?
We haven't used sp_data() before, so we probably don't know what output it will produce. It's always a good idea to check the help entry of a new (to you) function.
You should also check the class of an object. If you don’t know an object’s class, you can run into problems like unexpected behavior later on.
class(d)
## [1] "data.frame"
It's a plain old data.frame!
What is in d?
head(d)
## ID NAME LAT LONG ALT JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
## 1 ID741 DEATH VALLEY 36.47 -116.87 -59 7.4 9.5 7.5 3.4 1.7 1.0 3.7 2.8 4.3 2.2 4.7 3.9
## 2 ID743 THERMAL/FAA AIRPORT 33.63 -116.17 -34 9.2 6.9 7.9 1.8 1.6 0.4 1.9 3.4 5.3 2.0 6.3 5.5
## 3 ID744 BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1 1.9 9.2 6.5 5.0 4.8 9.7
## 4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0 2.4 2.6 8.3 5.4 7.7 7.3
## 5 ID754 NILAND 33.28 -115.51 -18 9.0 8.0 9.0 3.0 0.0 1.0 8.0 9.0 7.0 8.0 7.0 9.0
## 6 ID758 EL CENTRO/NAF 32.82 -115.67 -13 9.8 1.6 3.7 3.0 0.4 0.0 3.0 10.8 0.2 0.0 3.3 1.4
Compute annual precipitation
d$prec <- rowSums(d[, c(6:17)])
plot(
sort(d$prec),
main = "CA annual precipitation",
ylab = 'Annual precipitation (mm)', xlab = 'Stations',
las = 1 )
Can you explain why they used rowSums(d[, c(6:17)]) to calculate annual precipitation, and why it worked?
What the heck is shown on the x-axis in the plot above?
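As an aside (a sketch, not part of the original code): since columns 6 through 17 are exactly the month columns JAN through DEC, you could also select them by name, which is more failsafe if the column order ever changes.
# select the month columns by name rather than by position;
# toupper(month.abb) gives "JAN", "FEB", ..., "DEC"
d$prec <- rowSums(d[, toupper(month.abb)])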
Now make a quick and dirty (and ugly) map:
require(sp)
dsp <- SpatialPoints(d[,4:3], proj4string = CRS("+proj=longlat +datum=NAD83"))
dsp <- SpatialPointsDataFrame(dsp, d)
CA <- sp_data("counties")
# define groups for mapping
cuts <- c(0,200,300,500,1000,3000)
# set up a palette of interpolated colors
blues <- colorRampPalette(c('yellow', 'orange', 'blue', 'dark blue'))
pols <- list("sp.polygons", CA, fill = "lightgray")
spplot(dsp, 'prec', cuts = cuts, col.regions = blues(5), sp.layout = pols, pch = 20, cex = 2)
What did the cuts argument do?
Transform longitude/latitude to planar coordinates, using the commonly used coordinate reference system for California (“Teale Albers”) to assure that our interpolation results will align with other data sets we have.
TA <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +ellps=GRS80 +towgs84=0,0,0")
require(rgdal)
dta <- spTransform(dsp, TA)
cata <- spTransform(CA, TA)
We are going to interpolate (estimate for unsampled locations) the precipitation values. The simplest way would be to take the mean of all observations. We can consider that a null-model that we can compare other approaches to. We’ll use the Root Mean Square Error (RMSE) as evaluation statistic.
RMSE <- function(observed, predicted) {
sqrt(mean((predicted - observed)^2, na.rm = TRUE))
}
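As a quick sanity check of this helper (a made-up example, not from the precipitation data):
# residuals are 0, 1 and 2, so RMSE = sqrt(mean(c(0, 1, 4))) = sqrt(5/3), about 1.29
RMSE(observed = c(1, 2, 3), predicted = c(1, 3, 5))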
What are some advantages of using the RMSE instead of the Sum of Squared Errors?
Get the RMSE for the Null-model
# this variable name could lead to ambiguity.
null <- RMSE(mean(dsp$prec), dsp$prec)
null
## [1] 435.3217
Proximity polygons can be used to interpolate categorical or numerical variables.
We have point data, but we’d really like a continuous surface, such as contiguous polygons. We can create a Voronoi tessellation from points:
require(dismo)
v <- voronoi(dta)
plot(v)
The R-Spatial people say:
> Looks weird.
Do you agree that the tessellation is weird?
We can, however, all agree that we need to confine the polygons to the California borders:
require(rgeos)
ca <- aggregate(cata)
vca <- intersect(v, ca)
spplot(vca, 'prec', col.regions = rev(get_col_regions()))
What did aggregate() do?
What happens when you try to look up the help entry for a function called aggregate()?
How could you write your code in a failsafe way to mitigate the risk of calling the wrong function (with the same name)?
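One common safeguard (a sketch, not part of the original code) is to qualify the call with the package namespace via the :: operator, so there is no ambiguity about which aggregate() is meant; here the method for Spatial* objects is presumably the one provided by the raster package (loaded along with dismo).
# explicitly call the raster package's aggregate() method for
# Spatial* objects, which dissolves the county polygons into one
ca <- raster::aggregate(cata)
# (the unrelated base-R stats::aggregate() summarizes data frames by group)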
What are the ca and vca objects? How is ca different from cata?
Take a look at the classes of each of these objects and see what data they contain (or don’t contain): dsp, dta, cata, v, ca, vca.
It’s important, but not always easy, to keep track of your intermediate variables/objects and to know what each of them is/does.
ca
## class : SpatialPolygons
## features : 1
## extent : -373976.1, 539719.6, -604512.6, 450022.5 (xmin, xmax, ymin, ymax)
## crs : +proj=aea +lat_0=0 +lon_0=-120 +lat_1=34 +lat_2=40.5 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +no_defs
cata
## class : SpatialPolygonsDataFrame
## features : 68
## extent : -373976.1, 539719.6, -604512.6, 450022.5 (xmin, xmax, ymin, ymax)
## crs : +proj=aea +lat_0=0 +lon_0=-120 +lat_1=34 +lat_2=40.5 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +no_defs
## variables : 5
## names : STATE, COUNTY, NAME, LSAD, LSAD_TRANS
## min values : 06, 001, Alameda, 06, County
## max values : 06, 115, Yuba, 06, County
par(mfrow = c(1, 2)); plot(cata, main = "cata"); plot(ca, main = "ca")
Much better. These are polygons. We can ‘rasterize’ the results like this.
r <- raster(cata, res = 10000)
vr <- rasterize(vca, r, 'prec')
plot(vr)
Make sure you understand what the res = 10000 argument did. Why use the specific quantity 10000?
Note that the raster r serves as a template raster for the rasterize() call.
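A quick check you could run (a sketch): the Teale Albers CRS defined above uses metres, so res = 10000 means 10 km by 10 km cells, and r is only an empty template until rasterize() fills it.
res(r)        # cell size in CRS units (metres for this projection)
extent(r)     # same extent as cata
hasValues(r)  # FALSE: r holds no cell values yet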
We can use cross-validation to compare this approach with our null model, in which the best "interpolation" is simply to set all locations to the statewide mean precipitation.
Cross-validation is a method in which we build a model using one portion of the data (the training set) and then calculate the discrepancies between predicted and observed values for the remaining portion (the testing set).
set.seed(5132015)
kf <- kfold(nrow(dta))
rmse <- rep(NA, 5)
for (k in 1:5) {
test <- dta[kf == k, ]
train <- dta[kf != k, ]
v <- voronoi(train)
p <- extract(v, test)
rmse[k] <- RMSE(test$prec, p$prec)
}
rmse
## [1] 199.0686 187.8069 166.9153 191.0938 238.9696
mean(rmse)
## [1] 196.7708
1 - (mean(rmse) / null)
## [1] 0.5479875
The R-Spatial book asks:
> Question 1: Describe what each step in the code chunk above does.
> Question 2: How does the proximity-polygon approach compare to the NULL model?
> Question 3: You would not typically use proximity polygons for rainfall data. For what kind of data would you use them?
Remember that we chose RMSE as our model fit criterion.
Here we do nearest neighbor interpolation considering multiple (5) neighbors.
We can use the gstat package for this. First we fit a model. ~1 means "intercept only"; in the case of spatial data, that means only the 'x' and 'y' coordinates are used. We set the maximum number of points to 5 and the "inverse distance power" idp to zero, so that all five neighbors are weighted equally.
require(gstat)
gs <- gstat(
formula = prec~1,
locations = dta,
nmax = 5,
set = list(idp = 0))
nn <- interpolate(r, gs)
## [inverse distance weighted interpolation]
nnmsk <- mask(nn, vr)
plot(nnmsk)
NOTE: gstat() creates an object of class gstat.
Cross-validate the result. Note that we can use the predict method to get predictions for the locations of the test points.
rmsenn <- rep(NA, 5)
for (k in 1:5) {
test <- dta[kf == k, ]
train <- dta[kf != k, ]
gscv <- gstat(
formula = prec~1,
locations = train,
nmax = 5,
set = list(idp = 0))
p <- predict(gscv, test)$var1.pred
rmsenn[k] <- RMSE(test$prec, p)
}
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
rmsenn
## [1] 200.6222 190.8336 180.3833 169.9658 237.9067
mean(rmsenn)
## [1] 195.9423
1 - (mean(rmsenn) / null)
## [1] 0.5498908
A more commonly used method is "inverse distance weighted" (IDW) interpolation. The only difference from the nearest-neighbor approach is that points that are further away get less weight in predicting the value at a location.
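To make the weighting concrete, here is a minimal sketch of the IDW idea (not gstat's actual implementation): the prediction at a location is a weighted mean of the observed values, with weights proportional to 1 / distance^idp (gstat's default idp is 2; the nearest-neighbor model above used idp = 0, i.e. equal weights).
# toy IDW predictor: obs are observed values, dists their distances
# to the prediction location, idp the inverse distance power
idw_predict <- function(obs, dists, idp = 2) {
  w <- 1 / dists^idp
  sum(w * obs) / sum(w)
}
# a nearby wet station dominates a distant dry one:
idw_predict(obs = c(800, 100), dists = c(5, 50))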
require(gstat)
gs <- gstat(formula = prec~1, locations = dta)
idw <- interpolate(r, gs)
## [inverse distance weighted interpolation]
idwr <- mask(idw, vr)
plot(idwr)
Question 4: IDW generated rasters tend to have a noticeable artefact. What is that?
Cross-validate. We can predict to the locations of the test points:
rmse <- rep(NA, 5)
for (k in 1:5) {
test <- dta[kf == k, ]
train <- dta[kf != k, ]
gs <- gstat(formula = prec~1, locations = train)
p <- predict(gs, test)
rmse[k] <- RMSE(test$prec, p$var1.pred)
}
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
## [inverse distance weighted interpolation]
rmse
## [1] 215.3319 211.9383 190.0231 211.8308 230.1893
mean(rmse)
## [1] 211.8627
1 - (mean(rmse) / null)
## [1] 0.5133192
Question 5: Inspect the arguments used for the IDW model below and make a map of it. What other name could you give to this method (IDW with these parameters)? Why?
gs2 <- gstat(formula = prec~1, locations = dta, nmax = 1, set = list(idp = 1))
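To make the map, you could mirror the earlier IDW steps (a sketch; r and vr are the raster objects defined above):
idw2 <- interpolate(r, gs2)
idw2 <- mask(idw2, vr)
plot(idw2)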
We use California air pollution data to illustrate geostatistical (kriging) interpolation.
We use the airqual dataset to interpolate ozone levels for California (averages for 1980-2009). Use the variable OZDLYAV (unit is parts per billion). Original data source.
Read the data file. To get numbers that are easier to read, I multiply OZDLYAV by 1000.
require(rspatial)
x <- sp_data("airqual")
x$OZDLYAV <- x$OZDLYAV * 1000
Create a SpatialPointsDataFrame and transform to Teale Albers. Note the units=km, which was needed to fit the variogram.
require(sp)
require(rgdal)
coordinates(x) <- ~LONGITUDE + LATITUDE
proj4string(x) <- CRS('+proj=longlat +datum=NAD83')
TA <- CRS("+proj=aea +lat_1=34 +lat_2=40.5 +lat_0=0 +lon_0=-120 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=km +ellps=GRS80")
aq <- spTransform(x, TA)
Create a template raster to interpolate to, based on the SpatialPolygonsDataFrame of California ('ca'), and then coerce that raster to a 'SpatialGrid' object (a different representation of the same idea).
cageo <- sp_data('counties.rds')
ca <- spTransform(cageo, TA)
r <- raster(ca)
res(r) <- 10 # 10 km if your CRS's units are in km
g <- as(r, 'SpatialGrid')
Use gstat to create an empirical variogram 'v':
require(gstat)
gs <- gstat(formula = OZDLYAV~1, locations = aq)
v <- variogram(gs, width = 20)
head(v)
## np dist gamma dir.hor dir.ver id
## 1 1010 11.35040 34.80579 0 0 var1
## 2 1806 30.63737 47.52591 0 0 var1
## 3 2355 50.58656 67.26548 0 0 var1
## 4 2619 70.10411 80.92707 0 0 var1
## 5 2967 90.13917 88.93653 0 0 var1
## 6 3437 110.42302 84.13589 0 0 var1
plot(v)
Now, fit a model variogram
fve <- fit.variogram(v, vgm(85, "Exp", 75, 20))
fve
## model psill range
## 1 Nug 21.96600 0.00000
## 2 Exp 85.52957 72.31404
plot(variogramLine(fve, 400), type = 'l', ylim = c(0,120))
points(v[,2:3], pch = 20, col = 'red')
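For reference, gstat's exponential variogram model has the form semivariance(h) = nugget + psill * (1 - exp(-h / range)). As a sketch (assuming that parameterization and plugging in the fitted values printed above), the curve can be reproduced by hand and should essentially match variogramLine(fve, 400):
# hand-coded exponential variogram model
exp_vgm <- function(h, nugget, psill, range) {
  nugget + psill * (1 - exp(-h / range))
}
curve(exp_vgm(x, nugget = 21.97, psill = 85.53, range = 72.31),
      from = 0, to = 400, ylim = c(0, 120),
      xlab = 'distance', ylab = 'semivariance')
points(v[,2:3], pch = 20, col = 'red')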
Try a different model type (spherical instead of exponential):
fvs <- fit.variogram(v, vgm(85, "Sph", 75, 20))
fvs
## model psill range
## 1 Nug 25.57019 0.0000
## 2 Sph 72.65881 135.7744
plot(variogramLine(fvs, 400), type = 'l', ylim = c(0,120) ,col = 'blue', lwd = 2)
points(v[,2:3], pch = 20, col = 'red')
Both look pretty good in this case.
Another way to plot the variogram and the model
plot(v, fve)
Use variogram fve in a kriging interpolation:
k <- gstat(formula = OZDLYAV~1, locations = aq, model = fve)
# predicted values
kp <- predict(k, g)
## [using ordinary kriging]
spplot(kp)
# variance
ok <- brick(kp)
ok <- mask(ok, ca)
names(ok) <- c('prediction', 'variance')
plot(ok)
Let’s use gstat again to do IDW interpolation. The basic approach first.
require(gstat)
idm <- gstat(formula = OZDLYAV~1, locations = aq)
idp <- interpolate(r, idm)
## [inverse distance weighted interpolation]
idp <- mask(idp, ca)
plot(idp)
We can find good values for the IDW parameters (distance decay and number of neighbors) through optimization. For simplicity's sake I do not do that k times here. The optim function may be a bit hard to grasp at first, but the essence is simple: you provide a function that returns a value you want to minimize (or maximize) given a number of unknown parameters, you provide initial values for those parameters, and optim then searches for the optimal values (those for which your function returns the lowest number).
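To see optim in isolation before the spatial use below, here is a tiny made-up example: minimize a function of two parameters whose minimum is at (3, -2).
# optim searches from the initial values c(0, 0) for the parameter
# values that minimize the returned number
optim(c(0, 0), function(p) (p[1] - 3)^2 + (p[2] + 2)^2)$par
# returns approximately c(3, -2)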
RMSE <- function(observed, predicted) {
sqrt(mean((predicted - observed)^2, na.rm = TRUE))
}
f1 <- function(x, test, train) {
nmx <- x[1]
idp <- x[2]
if (nmx < 1) return(Inf)
if (idp < .001) return(Inf)
m <- gstat(formula = OZDLYAV~1, locations = train, nmax = nmx, set = list(idp = idp))
p <- predict(m, newdata = test, debug.level = 0)$var1.pred
RMSE(test$OZDLYAV, p)
}
set.seed(20150518)
i <- sample(nrow(aq), 0.2 * nrow(aq))
tst <- aq[i,]
trn <- aq[-i,]
opt <- optim(c(8, .5), f1, test = tst, train = trn)
opt
## $par
## [1] 9.2594442 0.6817524
##
## $value
## [1] 7.861426
##
## $counts
## function gradient
## 35 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Our optimal IDW model
m <- gstat(formula = OZDLYAV~1, locations = aq, nmax = opt$par[1], set = list(idp = opt$par[2]))
idw <- interpolate(r, m)
## [inverse distance weighted interpolation]
idw <- mask(idw, ca)
plot(idw)
A thin plate spline model
NOTE: We haven't talked about these in class. They are used here only for comparison with the other interpolation methods.
require(fields)
m <- Tps(coordinates(aq), aq$OZDLYAV)
tps <- interpolate(r, m)
tps <- mask(tps, idw)
plot(tps)
Cross-validate the three methods (IDW, ordinary kriging, TPS) and add an RMSE-weighted ensemble model.
require(dismo)
nfolds <- 5
k <- kfold(aq, nfolds)
ensrmse <- tpsrmse <- krigrmse <- idwrmse <- rep(NA, 5)
for (i in 1:nfolds)
{
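# note: each model below is trained on a single fold (k == i) and
# evaluated on the remaining folds (k != i), the reverse of the
# earlier cross-validation loops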
test <- aq[k != i,]
train <- aq[k == i,]
m <- gstat(formula = OZDLYAV~1, locations = train, nmax = opt$par[1], set = list(idp = opt$par[2]))
p1 <- predict(m, newdata = test, debug.level = 0)$var1.pred
idwrmse[i] <- RMSE(test$OZDLYAV, p1)
m <- gstat(formula = OZDLYAV~1, locations = train, model = fve)
p2 <- predict(m, newdata = test, debug.level = 0)$var1.pred
krigrmse[i] <- RMSE(test$OZDLYAV, p2)
m <- Tps(coordinates(train), train$OZDLYAV)
p3 <- predict(m, coordinates(test))
tpsrmse[i] <- RMSE(test$OZDLYAV, p3)
w <- c(idwrmse[i], krigrmse[i], tpsrmse[i])
weights <- w / sum(w)
ensemble <- p1 * weights[1] + p2 * weights[2] + p3 * weights[3]
ensrmse[i] <- RMSE(test$OZDLYAV, ensemble)
}
rmi <- mean(idwrmse)
rmk <- mean(krigrmse)
rmt <- mean(tpsrmse)
rms <- c(rmi, rmt, rmk)
rms
## [1] 8.041305 8.307235 7.930799
rme <- mean(ensrmse)
rme
## [1] 7.858051
Question 6: Which method performed best?
We can use the RMSE scores to make a weighted ensemble. Let's look at the maps:
weights <- ( rms / sum(rms) )
s <- stack(idw, ok[[1]], tps)
ensemble <- sum(s * weights)
And compare maps.
s <- stack(idw, ok[[1]], tps, ensemble)
names(s) <- c('IDW', 'OK', 'TPS', 'Ensemble')
plot(s)