Introduction

For this in-class exercise, we’ll practice working with raster data again. Specifically, we’ll learn:

  • How to extract raster cells that fall underneath polygon data and calculate aggregate statistics.

Data

You will need the following datasets, which you can download from the course GitHub site:

  • temperature.zip A raster of average annual temperatures for the Contiguous United States (CONUS)
  • wy_countes.zip A vector data set of Wyoming county borders.

Download and unzip them to subdirectories of your data folder.

Load the data into an R session

temperature = raster(here::here("data", "temperature", "temperature.tif"))
wy_cnty = spTransform(
  readOGR(here("data", "wy_counties")),
  projection(temperature))
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\michaelnelso\git\spatial_data_r\data\wy_counties", layer: "DOR_-_Counties"
## with 23 features
## It has 8 fields
projection(wy_cnty)
## [1] "+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs"

As usual, once you´ve loaded your data, you should plot it as a quick reality check to verify projections etc.

plot(temperature)
plot(wy_cnty, add = T)

For convenience, you should crop the temperature raster to the extent of the Wyoming counties dataset.

temp_wy = crop(temperature, wy_cnty)
par(mar = c(0.1, 0.1, 0.1, 1))
plot(temp_wy, axes = F); plot(wy_cnty, add = T)

The extract() function

You can use the function extract() to retrieve the values of raster cells that fall within the boundaries of polygons.

We’ll most often use extract() in conjunction with one raster and one Spatial*DataFrame object. It also works with sf vector data objects. The raster object will be the first argument, and the Spatial*DataFrame object the second.

You have several options for the output type of the extract() function:

  • Raw cell values: if you don’t specify an aggregate statistics function (such as mean() or sd()) the function will return a list whose elements are vectors of the raw cell values. The order of the list elements will be the same as the order in which the features appear in the Spatial*DataFrame object.
  • A column matrix of values: if you specify an aggregate function, extract() can output a matrix vector containing the values of the statistic corresponding
  • You can use the sp=TRUE argument along with an aggregate function to return a version of the Spatial*DataFrame with an extra column containing the values of the aggreagate statistic.

We’ll examine examples of each type of output.

matrix output

Let’s start with the simplest output: a matrix with a single column of aggregate statistics.

I’ll calculate the mean of the cells in the temperature raster for each of the Wyoming counties:

wy_counties_temp_mean = raster::extract(temp_wy, wy_cnty, fun = mean)
class(wy_counties_temp_mean)
## [1] "matrix" "array"

Spatial*DataFrame output

Next, let’s do the same calculation, but append the vector to a column of the original SpatialPolygonsDataFrame:

wy_counties_temp_spdf = raster::extract(
  temp_wy, wy_cnty, fun = mean, sp = TRUE)
class(wy_counties_temp_spdf)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
names(wy_counties_temp_spdf)
## [1] "OBJECTID"    "Shape_Leng"  "Shape_Area"  "COUNTYNAME" 
## [5] "YEARCOLLEC"  "TAXYEAR"     "TYPENAME"    "RuleID"     
## [9] "temperature"

Note the new data column called temperature.

List output

You may have noticed that extract() takes a little time to run. On larger datasets it may take minutes or longer to run.

If you plan to calculate several aggregate statistics, you can speed up your computation time by first extracting the raw values of the raster cells as a list, then use the sapply() function to quickly calculate the aggregate statistics.

wy_counties_temp_list = raster::extract(
  temp_wy, wy_cnty)
sapply(wy_counties_temp_list, mean)
##  [1] 6.736080 7.734480 6.217723 5.408657 4.857212 5.281912 8.495762
##  [8] 8.077408 6.294061 4.745921 3.028894 3.532247 7.210908 8.021009
## [15] 6.073369 6.377311 7.428445 6.952354 4.832284 6.958162 7.500885
## [22] 1.643057 1.800145

Note that if your raster has missing data in any cells, you’ll need to specify how the aggregate statistic function should handle them.

If you want to ignore them, just supply sapply() with the na.rm = TRUE argument (it will pass on the na.rm argument to the aggregating function via ...).

sapply(wy_counties_temp_list, mean, na.rm = T)
##  [1] 6.736080 7.734480 6.217723 5.408657 4.857212 5.281912 8.495762
##  [8] 8.077408 6.294061 4.745921 3.028894 3.532247 7.210908 8.021009
## [15] 6.073369 6.377311 7.428445 6.952354 4.832284 6.958162 7.500885
## [22] 1.643057 1.800145

Alternatively, you could write a custom function that deals with NA values in a user-defined way.

We can use the class() and str() functions to examine the output object to verify that it is a list:

class(wy_counties_temp_list)
## [1] "list"
str(wy_counties_temp_list, 1)
## List of 23
##  $ : num [1:40637] 5.95 5.99 5.7 6.14 6.12 ...
##  $ : num [1:19389] 5.95 5.91 5.96 5.96 6.01 ...
##  $ : num [1:20350] 1 1.34 1.37 1.31 1.28 ...
##  $ : num [1:76077] 2.68 2.63 2.59 2.57 2.58 ...
##  $ : num [1:31147] 3.43 3.46 3.51 3.36 3.36 ...
##  $ : num [1:57872] 6.8 6.85 6.83 6.83 6.86 ...
##  $ : num [1:16456] 6.9 6.87 6.87 6.9 6.9 ...
##  $ : num [1:15526] 8.05 8.01 7.97 7.91 7.91 ...
##  $ : num [1:25162] 7.58 7.59 7.56 7.56 7.54 ...
##  $ : num [1:70340] -2 -1.49 -1.67 -1.61 -0.27 ...
##  $ : num [1:55499] -0.06 0.06 1.08 1.31 1.63 ...
##  $ : num [1:30269] 2.76 3.01 2.92 1.69 1.64 ...
##  $ : num [1:32265] 7.64 7.64 7.61 7.65 7.7 ...
##  $ : num [1:19983] 8.34 8.38 8.38 8.46 8.5 ...
##  $ : num [1:15538] 6.26 6.33 6.39 6.41 6.28 ...
##  $ : num [1:32649] 7.04 7.13 7.28 7.28 7.28 ...
##  $ : num [1:37924] 7.77 7.74 7.76 7.91 7.91 ...
##  $ : num [1:22854] 7.13 7.11 7.06 6.94 6.82 ...
##  $ : num [1:14860] 4.62 4.75 4.73 4.87 4.76 ...
##  $ : num [1:17546] 7.26 7.26 7.25 7.25 7.25 ...
##  $ : num [1:18569] 6.51 6.36 6.39 6.3 6.13 ...
##  $ : num [1:32910] 2.15 2.15 2.15 2.14 2.14 ...
##  $ : num [1:37079] 0.6 0.56 0.5 0.47 0.44 ...

It’s easy to append columns to our original SpatialPolygonsDataFrame with the output from sapply():

# Mean of annual temperature.
wy_cnty$mean_temp = sapply(wy_counties_temp_list, mean, na.rm = TRUE)

# Standard deviation of annual temperature:
wy_cnty$sd_temp = sapply(wy_counties_temp_list, sd, na.rm = TRUE)

The spplot() Function

When we calculate aggregate statistics for polygons, it’s often desirable to create a choropleth map.

The spplot() is an easy way to create such a map.

Use the zcol argument to specify which column the choropleth should be based upon:

spplot(
  wy_cnty, zcol = "mean_temp")

spplot(wy_cnty, zcol = "sd_temp")

Note that spplot() does not use base-R plotting methods, so our usualy techniques for multi-panel figures won’t work.

You can check the help entry for more details, and some suggestions for plot customizations.

Practice

Let’s practice with some data on Iowa counties.

Data

Prism Climate Data

Navigate to the site for PRISM climate data:

https://prism.oregonstate.edu/

and locate the Normals data page.

Download the 30 year normals for precipitation and average temperature. You should download the .bil versions of the data.

You may choose to download either the annual average rasters, or the average for a specific month.

  • For the examples below, I used the average annual data.
  • If you choose the monthly data, your histograms and scatterplot may look slightly different than my examples.

When you read the raster files, note the file extension.

Iowa Counties

Navigate to the Iowa Spatial Hub

https://open-iowa.opendata.arcgis.com/

and find the Iowa counties shapefile dataset. Save it to your data subdirectory.

Note that the name of the unzipped directory and the shapefiles don’t match. This means you will have to use slightly more complex syntax to read the files into R.

Specifically you’ll need to use the following arguments to readOGR()

  • dsn = here("data", "Iowa_County_Boundaries-shp")
  • layer = "IowaCounties"

Data self-test

  • Check the projection information for the PRISM data and your Iowa Counties layer.
    • Perform any transformations on the counties that may be needed before proceeding.

Don’t forget to crop!

You should probably crop your rasters to the extent of Iowa, perhaps you may even want to buffer the Iowa borders.

Make sure you check the units of the coordinate system. The units will inform the amount by which you buffer

  • Is it a Geographic Coordinate System, or a Projected Coordinate System?

Once you’re done, you can plot your data:

County Means

Now you can use extract() to calculate aggregate statistics for the Iowa counties!

Try calculating mean temperature and precipitation values. You can plot histograms as a self-check:

Now try plotting a scatterplot of mean annual temperature and precipitation:

plot( 
  mean_prec ~ mean_temp, data = ia_cnty@data,
  main = "IA Counties\n Mean Annual Precipitation and Temperature",
  xlab = "Mean annual temperature", ylab = "Mean annual precipitation"
)