NR 512 – Exercise #1

NR 512 – Spatial Statistical Modeling

Exercise #1 – Introduction to the R Environment

What is R? R is an open source, freely available statically software package. R is both powerful and flexible. It can do nearly any type of statistical analysis. R does have a steep learning curve. R is awesome. Your lives will be much better if you learn it. R you ready?
This goal of today’s lab is not to teach you R. Thus, I’ve provided all the code you will need to develop the regression in R. Just copy the necessary text from this document (where indicated) and paste it to the R command line. Some tweaking may be required.

The dataset: Today you will be working a sample dataset that is described in Chapters 2 and 3 of the Isaaks book, which was handed out in class. The data are spatially referenced across a 10 X 10 grid and contacting measurements of two covariates: V and U.
The database is named ‘lab1_data.csv’. You should open it in Excel and look at the different variables it contains.

Exercise 1 – Being Organized:

  1. Think about where on your computer you would like to store your project materials. Create a class directory there, with sub directories for associated lab materials.
  2. Make sure that the lab1_data.csv is stored in your newly created lab1 directory within your spatial stats class directory.

Exercise 2 – R as a Calculator:

Most introductory R manuals start by teaching how to use R as a calculator. However, as we will discover throughout the semester R is much more powerful than that.

  1. Figure out how to start R on your computer and do so.
  2. Type 1 + 1 and then tap Enter/Return. What do you get? _____________________________
  3. Type 3 * 3 and then tap Enter/Return. What do you get? _____________________________
  4. Type pi and then tap Enter/Return. What do you get? ________________________________
  5. Tap the up arrow on your keyboard. What happens? ________________________________
  6. Try editing and re-executing a previous command. Did it work? _______________________
  7. Type x = sqrt(25) + 2 and then tap Enter/Return. What happens? ______________________
  8. Type x and then tap Enter/Return. What happens? __________________________________

Essentially R evaluates commands and returns the output or results from the associated computations. In the above command (x = sqrt(25) + 2) creates a new variable (or object) called x and assigns it the calculated value from the equation.

  1. R can also perform calculations on vectors or strings of numbers. Use the following syntax to create a vector of data. Type y = c(1, 4, 7, 2, 4, 5) then tap Enter/Return.
  2. Type y then tap Enter/Return. What is the result? ___________________________________
  3. Type x + y then tap Enter/Return. Explain the result? ________________________________
  4. Type z = x + y then tap Enter/Return
  5. Type z then tap Enter/Return. Explain the result? ___________________________________
  6. Type mean(z) then tap Enter/Return. Explain the result? _____________________________
  7. Quit R by typing q().

Exercise 3 – Reading Data into R

R is a very powerful package for analyzing and manipulating data. However, before we analyze data we need to learn how to read databases into R. There are several options for doing this depending on the type of database you are attempting to read in. A Comma Separated Values (.csv) file is perhaps the most flexible format. The dataset you are working with today is stored in .csv format. Execute the following commands to become comfortable reading in and manipulating data in R. Note the commands you type in are proceeded by the R prompt ‘>’ throughout the rest of the document. Type in each command then tap Enter/Return to execute it.

Start R

The read.csv command can be used to read in the data set we are working with today. The R help document can be used to help us figure out how to use specific commands. Use the command below to access the R help information for the read.csv command:

> ? read.csv

Looking through the document we can see that the read.csv command required several bits of information (termed arguments) in order to properly execute.

The usage portion of the help file says:

read.csv(file, header = TRUE, sep = “,”, quote=”\””, dec=”.”, fill = TRUE, comment.char=””, …)

If we continue to read the document, we can figure out what is required for each of the requirements.

  1. What pieces of information are required to read in the lab1_data.csv file? ___________________________________________________________________
  1. What does the sep argument do? _____________________________________________

Read in the lab1_data.csv file to a new R object called lab1_data. Note the path to the lab1_data.csv file must be enclosed in quotation marks.

> lab1_data = read.csv(“TYPE IN THE PATH TO THE FILE HERE”, header = TRUE, sep = “,”)

If the above command worked there should be a new R object called lab1_data that contains all the information stored in the lab1_data.csv file.

Check to make sure the data read in properly.

> print(lab1_data)

or a simple alternative is:

> lab1_data

  1. How many variables does this dataset contain? _________
  2. How many observations does this dataset contain? _________

To see the column names (i.e., variable names) in the dataset type:

> names(lab1_data)

  1. What are the column names? ___________________________________________________

To access the information in a specific column, V in this case, type:

> lab1_data$V

  1. What is printed on the screen? __________________________________________________

Similarly, we could simply type:

> lab1_data[3]

  1. What does the 3 indicate in the above command? ___________________________________

Similarly, we can access data stored in a specific row of the object (row 3 in this case) by typing:

> lab1_data[3,]

The data in each column can be manipulated:

> lab1_data$V + 1000

  1. What is printed on the screen? __________________________________________________

> lab1_data$V + lab1_data$U

  1. What is printed on the screen? __________________________________________________

We can create new objects from the lab1_data object by:

> VplusU = lab1_data$V + lab1_data$U
> VplusU

This creates a new vector of data the stores in result of the equation lab1_data$V + lab1_data$U.

We can add this vector of data to the lab1_data object vie the cbind command which stands for column bind:

> lab1_data = cbind(lab1_data, VplusU)

  1. What happens to the lab1_data object? ___________________________________________

Hint: use the name and print commands to figure this out.

The subset command is also very useful for manipulating datasets.

  1. What would you type to access the help information for the subset command? ____________

Using the information in the subset command help file determine how to create a new r object called lab1_data_Vgt100 that only contains observations the have V values greater than 100.

  1. What would you type to accomplish this? _________________________________________

Using the information in the subset command help file determine how to create a new r object called lab1_data_orig that only contains the original columns in the lab1_data.csv file.

  1. What would you type to accomplish this? _________________________________________

We have now created several R objects. Sometimes it can be difficult to remember each of them. To see the names of all R objects, type the following:

> ls()

  1. What are the names of the R objects currently stored in R? ____________________________________________________________________________

Sometimes we may want to get rid of specific objects. Type the following to remove the VplusU object:

> rm (VplusU)

To see what is left type:

> ls()

To remove all object type:

> rm(list = ls())

  1. What objects are left in the R environment? _______________________________________

Exercise 4 – R for Basic Statistics and Graphing – Univariate Examples

R is a very powerful statistical analysis package. In this exercise you will work with the lab1_data.csv file and perform basic univariate and bivariate statistics. Note the following exercises will follow Chapters 2 and 3 in the Issaks book very closely. Use the following commands in conjunction with the text in the chapter handout as out work through this exercise.

You will need to read the lab1_data.csv data in R again. However, this time we will use a slightly easier command.

> lab1_data = read.csv(file.choose())

Use the print and name commands to make sure the data read in properly.

First, we will create a histogram of the V data that looks like Figure 2.2 in the Issaks book.

> hist(lab1_data$V)

  1. Does the histogram produced look like Figure 2.2? _________________________________
  2. What is different? ____________________________________________________________

There are several arguments within the hist command that we need to alter to make our histogram look like figure 2.2. Use the help command to see the arguments for the hist command.

  1. What are some of the arguments we need to alter? __________________________________

First of all, we need to alter the number of bars in the histogram using the breaks argument.

  1. Home many breaks will we need to reproduce figure 2.2? ____________________________

> hist(lab1_data$V, breaks = 20)

Next, we need to change the range for the x and y axes to match figure 2.2 via the xlim and ylim arguments:

> hist(lab1_data$V, breaks = 20, xlim = c(-10,200), ylim = c(0,20))

We can change the labels for the x and y axes via the xlab and ylab arguments:

> hist(lab1_data$V, breaks = 20, xlab = ‘V (ppm)’, ylab = ‘Frequency (%)’, xlim = c(-10,200),
+ ylim = c(0,20))

Finally, we can change the title of the graphs via the main argument:

> hist(lab1_data$V, breaks = 20, xlab = ‘V (ppm)’, ylab = ‘Frequency (%)’, xlim = c(-10,200),
+ ylim = c(0,20), main = ‘Figure 2.2’)

Now we are going to reproduce the cumulative histogram in figure 2.3. To achieve this we will need to reorder the individual histogram bars by their cumulative sum. First we need to create a new R object that stores all the histogram information.

> h = hist(lab1_data$V, breaks = 20, plot = FALSE)

Note the histogram does not plot because the plot argument is set to FALSE. This creates a new object called h that stores all the histogram information. Use the name command to see what information is contained in h.

  1. What information is contained in h? _______________________________________________________________________________________________

We can see the number of observations within each histogram class by typing the following:

> h$counts

  1. How many observations are in each individual histogram class? _______________________

To create a cumulative histogram, we need to calculate a cumulative sum of these numbers vie the cumsum command:

> cumsum(h$counts)

  1. Describe what this command does? ______________________________________________

To use the result of the cumsum command we will need to store it in the histogram object (h) that was created above:

> h$counts <- cumsum(h$counts)

To see the new counts values stored in h type:

> h$counts

  1. How is this different from the info you described in question 5? ________________________________________________________________________

To plot the cumulative histogram type:

> plot(h)

Insert the xlim, ylim, xlab, ylab, and main arguments into the plot command to make the graph look more similar to figure 2.3.

  1. Print this graph and turn it in with your lab assignment.

We will assess normality of the data a bit differently than the Isaaks book (figure 2.3). We will us a the qqnorm command to make a Quantile-Quantile plot:

> qqnorm(lab1_data$V); qqline(lab1_data$V)

  1. Does this dataset follow a normal distribution? Why or Why not? _______________________________________________________________________

Hint: You may need to review basic statistics to interpret the results. Consult the internet to determine how to interpret a Q-Q plot.

To see if the data follow a log normal distribution type:

> qqnorm(log(lab1_data$V+1)); qqline(log(lab1_data$V+1))

Note a constant of 1 was added to the V data because there is a 0 in the dataset and the log of 0 in undefined.

  1. Does this dataset follow a log normal distribution why or why not? ____________________________________________________________________

You will now calculate the summary statistics for the V data. See the text starting on Issaks page 16. To calculate the mean, minimum, and maximum type:

> mean(lab1_data$V)
> min(lab1_data$V)
> max(lab1_data$V)

  1. Report the mean, min and, max: ________________________________________________

To determine the quartiles we will use the qauntile command in R and type:

> quantile(lab1_data$V)

  1. What are the lower and upper quantiles? __________________________________________

To determine the deciles type:

> quantile(lab1_data$V, prob = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0))

We will now use R to calculate measures of spread and shape from the V data (starting on page 20 in Isaaks). To calculate the variance of V type:

> var(lab1_data$V)

  1. What is the variance of V? ____________

Use R to calculate the inner quartile range of V

  1. What is the inner quartile range of V? ____________________ and how did you calculate it? ______________________________________________

To calculate the skewness of V type:

> skewness(lab1_data$V)

Hint: You may need to install and load the fBasics package. This can be done by using the Install Packages… function in the Tools drop down menu. Then you will want to use the command library(PackageName).

  1. What is the skewness of V? _________________
  2. What does this number mean? __________________________________________________
  3. What is the coefficient of variation for V? ________________________________________
  4. How did you calculate the coefficient of variation for V? ______________________________________________________________________________

We will now transition into working with bivariate data in R. In this case we will be using the V and U information in the lab1_data.csv file.

To create a Q-Q plot similar to Figure 3.3 on page 28 type:

> qqplot(lab1_data$V, lab1_data$U)

to add in the 1:1 line type:

> abline(0,1)

  1. Interpret this Q-Q plot: ______________________________________________________________________________________________________________________

Note: Use the text on pages 26 and 27 to assist you in your interpretation.

To create a scatter plot similar to the left pane of figure 3.4 type:

> plot(lab1_data$V, lab1_data$U)

Use xlim, ylim, ylab, xlab, and main arguments to make the scatter plot more similar to the left pane of figure 3.4.

  1. Print this figure and turn it in with your lab.

To calculate the correlation and covariance of V and U (page 30) type:

> cor(lab1_data$V, lab1_data$U)
> cov(lab1_data$V, lab1_data$U)

  1. Report the covariance and correlation: ___________________________________________

Finally, we will wrap up todays lab by running a simple linear regression on the V and U data. This will be achieved via the lm command which stands for linear model. To calculate a linear regression for V given U type:

> lm1 = lm(lab1_data$V~lab1_data$U)

To determine the estimates and statistics for this linear regression model type:

> summary(lm1)

  1. What is the R-squared of this regression? _________ is this strong? ____________________

Use this information as well as previous commands to recreate both graphs in figure 3.5.

  1. Print and turn in both graphs.