## NR 512 – Spatial Statistical Modeling

## Exercise #6 – Exploring Geostatistical data with Variograms

As discussed in class, a geostatistical dataset contains a set of sample points that characterize a continuously varying phenomenon occurring in a defined study region. These samples could represent anything that varies continuously and has a measurable location including soil samples, mineral deposits, vegetation cover, temperature, etc. **The primary goal of geostatistics is to leverage spatial structure (as determined by the sample points) to infer the values of the continuously varying phenomena in unsampled areas (i.e. across the entire area of interest). **

Geostatistical datasets have ** first-order properties**, which are related to the mean value of the attribute across the study region, and

**, which are related to the spatial dependence (i.e. spatial arrangement) of the continuously varying phenomena across the study area.**

*second-order properties***Objectives: **The goal of today’s lab is to learn how to use variograms and other graphical techniques (e.g., a proportional symbol map) to explore *first-order and*** second-order properties **of a geostatistical dataset.

**The R Package:
**

Today we will be using R packages geoR and gstat, which provide functions for geostatistical data analysis using the R statistical language.

**The datasets: **Today you will be working with a geostatistical dataset called ca20, which contains the calcium content measured in soil samples taken from the 0-20 cm surface layer within a certain study area, which is divided into three sub-areas. The elevation at each location was also recorded. The ca20 dataset is included with geoR.

You will also be working with a dataset called meuse, which contains locations and topsoil heavy metal concentrations, along with a number of soil and landscape variables at the observation locations, collected in a flood plain of the Meuse river, near the village of Stein (NL). Heavy metal concentrations are from composite samples of an area of approximately 15 m x 15 m.** **

**Exercise 1 – Exploring first-order properties of a geostatistical dataset via proportional symbol maps.**

Today’s lab will be conducted using the geoR and gstat packages in R.

- Start R
- Download the geoR and gstat packages and required dependencies from CRAN (ask the instructor for assistance).
- Load the geoR and gstat packages:

> require(geoR)

> require(gstat)

We may also need additional packages for today’s lab. Type the following to access them.

> require(RandomFields)

> require(lattice)

> require(RColorBrewer)

> require(sp)

**Note:** If you get an error, you may need to download and install the packages from CRAN.

We will be working with the ca20 dataset. To inspect the ca20 dataset type:

> data(ca20)

> names(ca20)

As you can tell, the geoR data object is structured quite differently than the spatstat data objects that we have been working with in the last few labs.

Based on the output above, we can see that information is stored in numerous subfields within the ca20 dataset. For example, coordinates (Easting and Northing in this case) are stored in the east and north subfields within the coordinates field.

We can access the subfield information in different ways. For example, if we wanted to see the coordinates we would type:

> ca20$coords

If we wanted to see the east coordinates only, we could type:

> ca20$coords[,1]

If we wanted to look at the attribute information (which are stored in the data subfield) we would type:

> ca20$data

Additional information in the ca20 data object includes boundaries ($borders), 2 covariates ($covariate), and additional information depicting sub-regions within the study area ($other).

Using your sharply honed R skill, answer the following question:

How many samples are in the dataset? _____________________________________________

What is the average value of the observed data? ______________________________________

To plot the data, type the following:

> plot(ca20$coords[,1], ca20$coords[,2])

We can add the study area border by typing the following:

> polygon(ca20$borders)

We will now make a proportional symbol map that plots the locations of the sample points with symbols that are scaled to match the attribute values at each location.

> m <- max(ca20$data)

> plot(ca20$coords[,1], ca20$coords[,2], cex = ca20$data*2/m)

> polygon(ca20$borders)

Using what you know about R, explain what the first 2 lines of code are doing with regard to the max and cex parameters:

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

Based on this map, explain any spatial trend in the attribute values that you see:

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________

The ca20 dataset is also subdivided into 3 sub-regions that have different irrigation levels. A faster way to plot a proportion symbol map and add in the sub-region borders can be achieved by typing:

> points(ca20, borders = reg1)

> points(ca20, borders = reg2, add = T)

> points(ca20, borders = reg3, add = T)

Based on this map, explain any spatial trend in the attribute values that you see. Is there a difference between sub-regions?

______________________________________________________________________________

______________________________________________________________________________

______________________________________________________________________________** **

**Exercise 2 – Exploring second-order properties with variograms**

We will be working with the meuse dataset for Exercise 2.

To inspect the meuse dataset, type:

> data(meuse)

> names(meuse)

What soil contaminates are contained within the meuse dataset?

______________________________________________________________________________

______________________________________________________________________________

To make a color scatter plot of zinc concentrations, we will need to link together several functions.

To do this, you will need to copy and paste in the scatter.color function from the lab appendix.

Then type the following:

> scatter.color(meuse$x, meuse$y, log(meuse$zinc), “Purples”, quantile(log(meuse$zinc),

+ c(0, 0.2, 0.4, 0.6, 0.8, 1)), legend=T)

Do you see any spatial trend in zinc concentrations?

______________________________________________________________________________

______________________________________________________________________________

What do you think the c(0, 0.2, 0.4, 0.6, 0.8, 1) in the previous command is doing?

______________________________________________________________________________

You should notice some trend in the data. These data are in a bend in the river, and the zinc concentrations are higher where it floods regularly.

There is another dataset called meuse.grid which has an attribute called dist that depicts the distance to the river. This dataframe has the x and y coordinates for distances from the river and other observations. Let’s look at this plotted out:

> data(meuse.grid)

> plot(meuse.grid$x, meuse.grid$y, cex = meuse.grid$dist)

In its native structure this shows us the location of the observations, but it is hard to discern the differences. However, we can convert these values to a continuous colored raster with just a couple lines of code:

> gridded(meuse.grid) <- ~x + y

> spplot(meuse.grid, ‘dist’, main = ‘Distance to River’)

Comparing the map of distance to river confirms that the spatial trend in the dataset is partly related to proximity to the river.

The following code could also be used to confirm this trend is related to proximity to the river:

> trellis.par.set(sp.theme()) # sets color ramp to bpy.colors()

> data(meuse)

> coordinates(meuse)=~x + y

> data(meuse.riv)

> meuse.sr <- SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)), “meuse.riv”)))

> rv <- list(“sp.polygons”, meuse.sr, fill = “lightblue”)

> scale <- list(“SpatialPolygonsRescale”, layout.scale.bar(),

+ offset = c(180500, 329800), scale = 500, fill=c(“transparent”, “black”), which = 1)

> text1 <- list(“sp.text”, c(180500, 329900), “0”, which = 1)

> text2 <- list(“sp.text”, c(181000, 329900), “500 m”, which = 1)

> arrow <- list(“SpatialPolygonsRescale”, layout.north.arrow(),

+ offset = c(178750, 332500), scale = 400)

## plot with north arrow and text outside panels

## (scale can, as of yet, not be plotted outside panels)

> spplot(meuse[“zinc”], do.log = TRUE, key.space = “bottom”,

+ sp.layout = list(rv, scale, text1, text2), main = “Zinc (top soil)”,

+ legend = list(right = list(fun = mapLegendGrob(layout.north.arrow()))))

Try positioning the scale bar and associated text in a better place and turn in this map with your lab.

Since we will eventually be kriging (in lab 7), we need to assume that the data are Gaussian (i.e. normal).

To check this, type:

> hist(meuse$zinc)

Are the data Gaussian? ___________________________________________________

Does a Log transform improve this or not?

> hist(log(meuse$zinc))

Is the log transformed data Gaussian? _______________________________________

In geostatistics, the theoretical **variogram** is a function describing the degree of spatial dependence of a spatial process. It is defined as the variance of the difference between field values at two locations across a study region.

In geostatistics, **empirical variograms** can be calculated and plotted to explore second-order properties of a dataset. There are several different ways to use **empirical variograms** to summarize geostatistical data.

The **variogram cloud** is a plot of the average dissimilarity between any two observations as a function of their separation in geographical space (i.e. distance). It essentially plots the variance across all intrasample distances.

The **variogram cloud** can be used in exploratory spatial data analysis to identify data pairs that differ more than what is observed on average over the entire study area. If several “outlying” data pairs involve the same observation, this indicates that this observation is substantially different from the neighboring ones and may be a spatial outlier.

> data(meuse)

> coordinates(meuse) = ~x + y

This specifies the x and y coordinates for the sample locations:

> vgm.cloud <- variogram(log(zinc)~1, meuse, cloud = T)

Creates a variogram cloud object called vgm.cloud based on the zinc information and the x and y coordinates. The cloud = T parameter indicates you want to plot a variogram cloud as opposed to a smoothed or binned variogram. The ~1 indicates you are fitting a variogram with no trend. To plot the variogram cloud, use:

> plot(vgm.cloud)

We can use the identify command to identify potential outliers interactively. To achieve this, follow these two lines of code to replot the variogram cloud:

> plot(vgm.cloud$dist, vgm.cloud$gamma)

> vgm.out <- identify(vgm.cloud$dist, vgm.cloud$gamma)

Then click on extreme points on the upper bounds of the variogram cloud plot. These points should be those that have substantially more variance between the data pairs.

When you have finished clicking on points, press Esc or right click on the variogram cloud to exit the interactive interface.

See which observations are repeating by typing:

> vgm.cloud[vgm.out,]

**Hint:** To identify outlying observations look for repeating numbers in the last two columns of the output from the above command.

Based on this interactive analysis, did you find any potential spatial outliers?

______________________________________________________________________________

When I looked at the data I saw that all the identified points had number 54 in it.

You can inspect this observation by typing:

> meuse[54,]

You can see that this is a very high zinc value. It is in fact the highest value.

We will be removing this outlier in a few steps, so be sure to remember the observation number of this potential outlier.

The variogram cloud is useful for inspecting datasets and identifying potential outliers. However, the graph is very difficult to interpret in general. Typically, it is more appropriate to plot a **binned variogram**. Essentially, a **binned variogram **groups distances between samples into distance categories (i.e. bins) and plots the variance as a function of each bin. Essentially, it plots the variance across categories of intrasample distances.

To calculate and plot a binned variogram, we will remove the cloud = T parameter we used early since a binned variogram is the default in the variogram command. Do this by typing:

> data(meuse)

> vgm <- variogram(log(zinc)~1, ~x + y, meuse)

> plot(vgm)

Turn this plot in with your lab.

Remember, we previously detected a potential outlier (observation 54). To remove this outlier and calculate and plot a new binned variogram type the following:

> vgm.2 <- variogram(log(zinc)~1, ~x + y, meuse[-54,])

> plot(vgm.2)

Turn this plot in with your lab.

Did removing this outlier change the binned variogram plot much?

______________________________________________________________________________

Do you really think this sample is an outlier?

______________________________________________________________________________

So far, we have been plotting omnidirectional variograms (i.e. semivariance is calculated between sample points in all directions).

Based on the color scatterplot, the data were clearly anisotropic, so let’s plot directional variograms along the following azimuth directions (0, 45, 90, 135).

> vgm.aniso <- variogram(log(zinc)~1, ~x+y, meuse, alpha = c(0, 45, 90, 135),

+ cutoff = 2000, width = 100)

> vgm.aniso

> plot(vgm.aniso)

Typically, when there is a trend, you will find that the sill appears to be different in different directions. This is known as “zonal anisotropy”. There are a few ways to deal with this, 1) Fit a nested semivariogram model with an infinite range in one direction, 2) Fit a semivariogram just along a direction with no trend, or 3) Fit a trend to the data, and then fit a semivariogram with a sill to the residual (however, optimal fitting of a trend requires a semivariogram!). We have not discussed trends yet, so I will start with 2). For the most part, the data appear to be stationary in the 45 degree direction, and the trend appears to be largest in the 135 degree direction. Let’s plot the variogram in the 45 degree direction by finding rows in the semivariogram that used data pairs separated along a 45 degree direction:

> dir.45 <- vgm.aniso$dir.hor == 45

> plot(vgm.aniso[dir.45,])

Based on the last 45 degree directional variogram plot, visually estimate the following:

Sill: _______, Range: _______, Nugget: ________

The next step is to fit the variogram model and perform kriging to interpolate between our samples… but this will have to wait until lab 7.

**Lab Exercise 6 Appendix**

#### color scatterplot function (paste this code into the R command line)

scatter.color <- function(X, Y, Z, brewer.name, breaks, legend=F, …){

### scatter.color(X, Y, Z, brewer.name, breaks, legend, …)

### Input:

# X: X coordinate

# Y: Y coordinate

# Z: value for coloring

# brewer.name: string for color palette (e.g. “Blues”); see below

# breaks: cutpoint for colors (e.g. quantile(Z, c(0, 0.33, 0.66, 1)))

# legend: add legend to plot? default is False.

# if true, then click on plot to add a legend using locator()

# …: extra parameters to be passed to plot(X, Y, …)

### Output:

# $colors: hexadecimal colors for every element in Z

# $levels: numeric ranges of levels

# $palette: hexadecimal colors of palette

### Palette Names:

# The sequential palette names are:

# Blues BuGn BuPu GnBu Greens Greys Oranges OrRd PuBu PuBuGn PuRd Purples RdPu Reds YlGn YlGnBu YlOrBr YlOrRd

# The diverging palette names are:

# BrBG PiYG PRGn PuOr RdBu RdGy RdYlBu RdYlGn Spectral

num.cuts <- length(breaks)-1

palette <- brewer.pal(num.cuts, brewer.name)

if (num.cuts==2) {palette <- c(palette[1], palette[3])}

Z.level <- cut(Z, breaks)

# Should probably base palette on actual number of breaks!!!

colors <- palette[as.numeric(Z.level)]

levels <-levels(Z.level)

plot(X, Y, pch=16, type=”p”, col=colors, asp=1, …)

points(X, Y)

if (legend) {

legend(“bottomright”, “(x, y)”, levels, fill=palette)

}

return(list(colors=colors, levels=levels, palette=palette))

}