NR 512 – Spatial Statistical Modeling
Exercise #9 – Regression Kriging, Cokriging, and Cross-Validation
Most properties of the environment could be measured at any of an infinite number of places, but in practice they are measured at rather few, mainly for reasons of economy. If we wish to know their values elsewhere then we must estimate them from the data that we can obtain. Kriging is simply a linear interpolation through space. Ordinary kriging uses a model of spatial correlation to calculate a weighted linear combination of the available samples which is then used to predict values at unmeasured location. Alternatively, regression kriging or universal kriging is a modification to this technique that incorporates spatial trends to recognize both non-stationary deterministic and random components in a variable. Additionally, cokriging is an extension of ordinary kriging that brings in more than one variable and exploits the spatial correlation between variables for making predictions.
Objectives: The goal of today’s lab is to learn how to perform regression Kriging in R.
The R Package: Today we will be using R packages geoR and gstat, which provide functions for geostatistical data analysis using the software R.
The datasets: Today you be working with a dataset called meuse, which contains locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the river Meuse, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15 m x 15 m.
Exercise 1 – Regression-kriging (RK)
Employing RK as a spatial interpolation technique combines a regression of the dependent variable on auxiliary variables (such as terrain parameters, remote sensing imagery and thematic maps) with kriging of the regression residuals. It is mathematically equivalent to interpolation methods variously called Trend Kriging, Universal Kriging, and Kriging with External Drift, where auxiliary predictors are used directly to solve the kriging weights. We utilize RK to deal with geostatistical fields that are non-stationary.
Today’s lab will be conducted using various R packages.
- Start R
- Load the lattice, sp, geoR, and gstat packages
> require(geoR)
> require(gstat)
> require(lattice)
> require(sp)
- We may also need additional packages for today’s lab. Type the following to access them. Note if you get an error, you may need to download and install the packages from CRAN.
> require(ggplot2)
- In earlier labs we modeled the spatial structure of meuse dataset by fitting variograms and performing basic kriging. As you will recall, there was a trend in the dataset that we did not account for, namely soil containment concentrations decreased with increasing distance from the river.
To reconfirm this, use the following code to plot zinc concentrations in the study area.
> data(meuse)
> data(meuse.grid)
Create a categorical variable:
> meuse$zinc_cat <- cut(meuse$zinc, breaks = c(0, 200, 400, 800, 1200, 2000))
Initialize the plot, with data and axes.
> zinc.plot <- ggplot(aes(x = x, y = y), data = meuse)
Tell it to plot points, with a colour for zinc_cat zinc
> zinc.plot <- zinc.plot + geom_point(aes(colour = zinc_cat))
Tell it the scaling of the x and y coordinates are equal
> zinc.plot <- zinc.plot + coord_equal()
Plot!
> zinc.plot
Do you notice any trends in the data that might be useful for us in explaining the process that derived these observations?
_____________________________________________________________________________
_____________________________________________________________________________
There appears to be a spatial trend within the data. These data are in a bend in the Meuse River, and the zinc concentrations are higher where it floods regularly. We can use regression kriging to account for the trend… maybe.
- We will now fit a regression kriging model to the meuse data. The distance from the river is appropriate for fitting the trend.
First, it is good to verify that distance is an appropriate predictor of log zinc, by fitting a linear model that predicts the log of zinc concentrations as a function of the square root of distance from the river.
> plot(log(zinc)~sqrt(dist), meuse)
> abline(lm(log(zinc) ~ sqrt(dist), meuse))
> summary(lm(log(zinc) ~ sqrt(dist), data = meuse))
How much of the variation in zinc concentrations does the regression of distance from the river explain? Is this a strong relationship?
_____________________________________________________________________________
_____________________________________________________________________________
To evaluate if there is a spatial structure to the regressions residuals, we will plot an isotropic variogram of the residuals:
> vgm <- variogram(log(zinc) ~ sqrt(dist), locations = ~x + y, data = meuse)
> plot(vgm)
Instead of the constant mean as denoted by ~ 1, we were able to specify a mean function, e.g. using ~ sqrt(dist) as a predictor variable.
Additionally, we can plot the anisotropic variogram for the distance trend:
> vgm.aniso <- variogram(log(zinc) ~ sqrt(dist), locations = ~x + y , data = meuse,
+ alpha = c(0, 45, 90, 135))
> plot(vgm.aniso)
Maybe it’s still anisotropic, but it’s not clear to me. Any difference could just be noise. When in doubt, keep it simple.
Fit a variogram model to the isotropic variogram
> fit.reg.sph <- fit.variogram(vgm, vgm(0.25, “Sph”, 1500, 0.1))
> plot(vgm, fit.reg.sph)
In this case, the variogram of residuals with respect to a fitted mean function are shown. Residuals were calculated using ordinary least squares.
With a model fit to the data we can perform regression kriging:
> pred.reg <- krige(log(zinc) ~ sqrt(dist), ~x + y, meuse, meuse.grid, model = fit.reg.sph)
Note: In the above code that ‘log(zinc) ~ sqrt(dist)’ is the portion of the code that fits the trend. This is different than in lab 7 where the syntax ‘log(zinc) ~ 1’ was used to krige without a trend. The ‘~ 1’ specifies this.
From here, you should be able to replot the prediction and variance maps using the following code.
Plot the prediction:
> gridded(pred.reg) <- ~x + y
> spplot(pred.reg [“var1.pred”], main = “Regression Kriging Predictions”)
Plot the variance:
> spplot(pred.reg[“var1.var”], main = “Regression Kriging Variance”)
If you remember back to our ordinary kriging of this data, do we seem to have improved the prediction?
______________________________________________________________________________
______________________________________________________________________________
Exercise 2 – Cokriging
Cokriging is an extension of kriging that utilizes multiple input variables, assuming there is a known correlation between the primary and secondary variables. This is common practice in the presence of sparse primary data with exhaustive secondary data. In the case of ecological or environmental data this may be appropriate for primary species data and an extensive set of secondary data such as topography or ecosystem data.
Cokriging uses information on several variable types. The main variable of interest is Z1, and both autocorrelation for Z1 and cross-correlations between Z1 and other variables are used to make better predictions. It is appealing to use information from other variables to help make predictions, but it comes at a price. Cokriging requires much more estimation, including estimating the autocorrelation for each variable as well as all cross-correlations. Theoretically, you can do no worse than kriging because if there is no cross-correlation, you can fall back on autocorrelation for Z1. However, each time you estimate unknown autocorrelation parameters, you introduce more variability, so the gains in precision of the predictions may not be worth the extra effort.
First, we need to determine if we have a strong degree of correlation between our parameters. We will test the zinc concentrations from the meuse data against the other elements in the soil samples.
> data(“meuse”)
> cor(meuse[,3:6])
Which element has the strongest correlation with zinc? ______________________
We can also visualize this relationship to determine if it might breakdown at any point in their distributions:
> plot(meuse$lead, meuse$zinc)
Because the values seem to be a little skewed we can also plot the log of each variable as that seemed to make the zinc appear closer to normally distributed:
> plot(log(meuse$lead), log(meuse$zinc))
Do the two variables seem to have a fairly constant precision throughout their relationship? How could we test this? ______________________________________________________________
______________________________________________________________________________
Now that we have identified that lead has the strongest correlation with zinc we will build a gstat structure containing the two sample sets. The gstat function does nothing else than ordering (and actually, copying) information into a single object.
> data(“meuse”)
> coordinates(meuse) <- ~x + y
> zinc.lead <- gstat(NULL, id = “log.zinc”, form = log(zinc) ~ 1, data=meuse)
> zinc.lead <- gstat(zinc.lead, id = “log.lead”, form = log(lead) ~ 1, data=meuse)
Compute and display the two direct experimental variograms and one cross-variogram:
> v.cross <- variogram(zinc.lead)
> plot(v.cross)
By changing this slightly, we can get a better picture of what is going on:
> plot(v.cross, pl=T)
Now we need to add the variogram models to the gstat object and then fit a linear model of co-regionalisation to them:
> zinc.lead <- gstat(zinc.lead, id = “log(zn)”, model = vgm(0.5, “Sph”, 800, 0.1),
+ fill.all = TRUE)
> zinc.lead
To fit the linear model we will use the fit.lmc method to fit this across all three variograms together. This will fit a linear model of coregionalization which is a particular model that needs to have identical model components, which we are passing to fit.lmc from the step above.
> zinc.lead <- fit.lmc(v.cross, zinc.lead)
> zinc.lead
> plot(variogram(zinc.lead), model=zinc.lead$model)
What do you notice from the outputs? ________________________________________
We can now conduct the cokriging using the predict command in the gstat package:
> data(“meuse.grid”)
> coordinates(meuse.grid) = ~x + y
> cokrig.model <- predict(zinc.lead, meuse.grid)
Summarize the predictions and their errors:
> summary(cokrig.model)
Display the predictions and their errors for log(zinc) and log(cadmium)
> spplot(cokrig.model[“log.zinc.pred”], main = “log(zn) coKriging Predictions”)
> spplot(cokrig.model[“log.zinc.var”], main = “log(zn) coKriging Variance”)
> spplot(cokrig.model[“log.lead.pred”], main = “log(pb) coKriging Predictions”)
> spplot(cokrig.model[“log.lead.var”], main = “log(pb) coKriging Variance”)
Again, do we seem to have a better prediction here than we did with the ordinary kriging?
______________________________________________________________________________
______________________________________________________________________________
Exercise 3 – Kriging Cross-Validation
It’s a good idea to cross-validate the estimated results with true data points. While there are different levels of validation that can be performed and having an independent dataset to validate against is the most ideal approach, other approaches such as leave-one-out can serve in many instances. Leave-one-out cross-validation removes a single data point, one at a time, and estimates the result at the now missing location. We can then make a cross plot between the estimated results and the true data points.
We will run the leave-one-out cross-validation by feeding our variogram model and dataset into the cross-validation command (krige.cv). By default, this will run leave-one-out cross-validation, but this could be modified to run n-fold cross validation.
> coordinates(meuse) <- ~x+y
> uk.val <- krige.cv(log(zinc) ~ sqrt(dist), meuse, fit.reg.sph)
We can visualize the quality of our kriging surface by looking at the residuals or by comparing the predicted and observed values at each point. One of doing this is through a scaled bubble chart of the residuals, this will show us if there is a spatial trend to the residuals.
> bubble(uk.val, “residual”, main = “log(zinc) ~ sqrt(dist): cross validation residuals”)
Do you see any kind of spatial trend? _______________________________________
We could also look at the distribution of residuals to determine if they appear normally distributed.
> hist(uk.val$residual, xlab = “Residual”, main = “Leave-one-out Cross Validation”)
Finally, we could make a cross plot between the estimated results and the true data points. With this we could look at the correlation and find out if we are consistent throughout our predictions.
> plot(uk.val$observed, uk.val$var1.pred, data = uk.val, xlab = “Estimated Value”,
+ ylab = “True Value”, main = “Cross Validation”, xlim = c(0, 8),
+ ylim = c(0, 8), cex=uk.val$residual)
> abline(lm(uk.val$var1.pred ~ uk.val$observed), lty = 2)
> p <- cor(uk.val$observed, uk.val$var1.pred)
> p <- round(p, digits = 4)
> legend(5, 3.5, c(“COR = “, p))
Looking at the cross-validation results, how strong is the correlation between the observed and predicted values? _______________________________________________________________
By scaling the points by the residual they produced, we can see that a couple of these are excessively larger than the rest. Why do you think this is? _______________________________
______________________________________________________________________________
______________________________________________________________________________