This is the second module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

## Introduction

- Typing commands each time you need them is not very efficient
- Control statements and user-defined functions help us to move forward and write more productive and complex programs

### Import, overview and analyse simple data

#### Reading Data

- load iris dataset

1 2 |
data("iris") dt <- iris |

#### Overview Data

- overview data using
**str()**and**summary()** - What’s the maximum value of Petal.Width?
- How many levels has Species?
- Have a look at the first 10 and last 3 rows using
**head()**and**tail()**

Remind:

**head()**and**tail()**display 6 rows by default, but you can choose how many with argument n

1 2 |
# Have an overview of the data str(dt) |

1 2 3 4 |
summary(dt) # Have a look at the first 10 and last 3 rows (use head and tail) head(dt, n = 10) tail(dt, 3) |

#### Spot missing values

- Check for missing values using
**any()**and**is.na()** - Start doing that in two steps
- Then nest the two functions to obtain same result

Remind:

**is.na()**is vectorized, as most of R functions- Vectorized functions are used same way with scalars, vectors, matrices, etc.
- What changes is the value returned which depend on input provided…
- …What kind of object do you expect to be returned by is.na applied to a dataframe?

1 2 3 4 |
# Check if there is at least one missing value idx <- is.na(dt) any(idx) any(is.na(dt)) # same as above but in a single line |

#### Explore factors

- Discover how many different levels there are in Species variable using
**unique()**or**table()**functions

1 |
unique(iris$Species) |

1 |
table(iris$Species) |

#### Summarize numerical variables

- Calculate mean petal width in this dataset using
**mean()**

Remind:

- you can access variables in a dataframe with $ operator
- Or with the [] operator specifying in the colum slot either…
- …an integer index indicating which column to pick
- …or a string indicating the exact name of variable to pick

1 2 |
# Calculate the average petal width mean(dt$Petal.Width) |

#### Summarize numerical variables from pre-filtered data

- Calculate average petal width only for setosa iris
- Calculate average petal length only for setosa iris

Tips:

- Anytime you have a doubt on the exact spell (or position) of a variable consider having a quick view at them with
**names()**or**str()** - You could create a filtered dataset first and summarize it after…
- …but this is not very efficient
- Rather get advantage of [] operator which allows you to query rows and columns at same time

1 2 3 |
# Calculate same indicators, but only for setosa iris mean( dt[ dt$Species=="setosa", "Petal.Width" ] ) mean( dt[ dt$Species=="setosa", "Petal.Length" ] ) |

#### Storing analysis’ results

- Store last results in a vector mystats using function
**c()** - Visualize the results by printing the vector

Tips:

- Consider store separately the two results…
- …then use them to fill the vector, our final output
- To visualize the value of an object you can use
**print()**(explicit printing) or simply the name of the object (auto-printing)

1 2 3 4 |
# Save these last two results in a vector called mystats avg_set_wid <- mean( dt[ dt$Species=="setosa", "Petal.Width" ] ) avg_set_len <- mean( dt[ dt$Species=="setosa", "Petal.Length" ] ) mystat <- c(avg_set_wid, avg_set_len) |

#### Change variable names

Change these names this way:

- Petal.Width -> pwidth
- Petal.Length -> plength

…using **names()** function

Remind:

**names()**allow you to access names attribute of the data frame

1 |
names(dt)[3:4] <- c("plength", "pwidth") |

#### Explore numerical data with histograms

- Create a histogram of the variable plength using
**hist()** - Create another one only for setosa iris

Make it prettier by:

- changing number of bins (nclass=30)
- title (main=“yourTitle”)
- x-axis label (xlab=“yourLabel”)

Store the plot in a variable called myhist

1 2 3 4 5 6 |
# Create a histogram of the variable mcost hist(dt$plength) # create a better histogram of variable mcost only for rows where there was actually at least a material claim hist(dt[dt$Species=="setosa", "pwidth"]) # create you final output by changing the number of bins, the title and the x-axis label hist(dt[dt$Species=="setosa", "pwidth"], nclass = 30, main = "Petal Length Setosa Iris", xlab = "") |

1 2 |
# save this plot in a variable called myhist myhist <- hist(dt[dt$Species=="setosa", "pwidth"], nclass = 30, main = "Petal Length Setosa Iris", xlab = "", col = "yellow") |

### Lists

#### Presenting lists

- Lists are ordered collection of objects (called list’s components)
- Components can be of any type (logical vector, matrix, function,…, whatever!)
- Lists’ components are numbered and may be referred to as such
- When components are named they can be referred to by name
- Subscripting operator […] can be used to obtain sublists of a list… -…but to select single components you need [[. . . ]] or $
- Let’s now create some objects we will then store in a list

#### Create a list

- Create an object named ‘x’ taking value ‘MyReport’
- Store the objects mystats, myhist and x in a list named mylist using
**list()** - Give a name to each component in the list using the form list(namecomponent=component,. . . )
- Check the number of components in mylist with function
**lenght()**

1 2 3 4 |
# Create a simple character object taking value "My analysis" myreport <- "my analysis" myreport2 <- c("my analysis") myreport2 == myreport # they are the same! |

1 2 3 4 5 6 |
# Store the objects you have created (mystats, myhist and myreport) in a list mylist <- list(mystat, myhist, myreport) # give a name to each object of the list mylist <- list(mystat = mystat, myhist = myhist, myreport = myreport) # check the number of components in mylist with function lenght() length(mylist) |

#### Play with sublists

- Sublist mylist keeping only the first entry
- Sublist mylist keeping only the first and third entries
- Play a bit with […] operator filled with positive (and negative) integer indexes as well as variables’ names (since our list has named components) …Do you think is possible to extract the second element of mystat component with []?

1 2 3 4 |
# Play with [] operator and sublists mylist[1] mylist[c(1,3)] # sum(mylist[1]) # This trigger an error! you need [[]] |

#### Play with extraction of lists’ components

- Extract first component using [[. . . ]], [[“. . .”]] and $ operators
- Experiment partial matching with $ operator
- Extract the second element of first component
- Sum mystat vector contained in mylist Remind: – […] for lists returns sub-lists – [[…]] returns lists’ components

1 2 3 4 5 6 |
# Play with [[]] operator to extract components mylist[[1]] mylist[["mystat"]] mylist$mystat mylist$myr mylist[["mystat"]][1] |

### Grouped expressions

### Presenting grouped expressions

- In R every command is a function returning some value
- Commands may be grouped together in braces: {expr_1 expr_2 }
- …in which case the value of the group is the result of the last expression in the group evaluated
- Grouped expressions are very useful together with control statements available in the R language

### Control statements: conditional execution

#### if-else statement

- If-else conditional construction takes the form

if (expr_1) expr_2 else expr_3

where:

- expr_1 must evaluate to a single logical value
- expr_2 is returned when expr_1 returns TRUE
- expr_3 is returned elsewhere

#### Play with if-else stataments

- Use an if-else statement to test whether or not data contain at least one missing values
- then trigger a warning in the first case (something like ‘Watch out! there are missing values…’)
- or simply print a message otherwise (something like ‘everything’s fine, go ahead…’)

Tips:

- Use the following functions:
**any()**,**is.na()**,**print()**,**warning()**

1 2 3 4 5 |
if(any(is.na(dt))) { warning("Watch out missing values!") } else { print("No missing values here") } |

#### Play further with if-else statements

- Use an if-else statement to test whether data is of class dataframe
- Then create a variable taking value the number of observation in the data when condition is met and NA otherwise

Tips:

- use functions:
**class()**,**nrow()**

1 2 3 4 5 6 |
y <- if(class(dt)=="data.frame") { nrow(dt) } else { NA } y |

1 2 3 4 5 6 7 |
# change the condition from data.frame to factor and contorl you get a NA in this case y <- if(class(dt)=="factor") { nrow(dt) } else { NA } y |

#### Vectorized if/else: ifelse() function

**ifelse()**is a vectorized version of the if/else construct- Create a variable named plength_high taking value 1 if plength>5 and 0 otherwise using
**ifelse()**

Tip:

- ifelse() has this structure ifelse(condition, if_true, if_false)

1 2 |
# Use the vectorized version of if/else construct, ifelse() function plength_high <- ifelse(dt$plength>5, 1, 0) |

### Control statement: repetitive execution

#### looping

- A for loop statment has the form:

for (name in expr_1) expr_2

where: – name is the loop variable – expr_1 is a vector expression, (often a sequence like 1:20) – expr_2 is often a grouped expression with its sub-expressions written in terms of the loop variable

For loops make that expr_2 is repeatedly evaluated as name ranges through the values in the vector result of expr_1

#### Play with loops

- Use a for statment to loop over integer sequence 1:10 and print iteratively the loop variable

1 2 3 |
for(i in 1:10) { print(i) } |

…not very useful, right?

- Use for statement to loop over columns of dataset and print the class of each of it

1 2 3 |
for(i in 1:ncol(dt)) { print(class(dt[,i])) } |

- Repeat it but store the classes in a vector

1 2 3 4 5 |
myclasses <- NULL for(i in 1:ncol(dt)) { myclasses <- c(myclasses, class(dt[,i])) } myclasses |

1 2 3 4 5 6 |
# or like this myclasses <- NULL for(i in 1:ncol(dt)) { myclasses[i] <- class(dt[,i]) } myclasses |

Tips:

- in grouped expressions auto-printing do not apply (use explicit printing)
- initialize your target vector before, out of the loop
- You can use NULL to initialize an empty object

#### Play further with loops

- A loop variable can be anything, not only an integer sequence
- Use a for statement to loop over all possible values of Species
- Then use them to calculate average petal length for each Species
- Final output is a numerical vector
- Bind result vector with the values of Species to make the result readable
- Order the matrix by Species (alphabetically)
- Round average lengths to have no decimal place

Tips:

- Use functions
**cbind()**,**order()**,**round()**

1 2 3 4 5 6 7 8 9 10 11 12 13 |
out <- NULL for(i in unique(dt$Species)) { dtsubset <- dt[dt$Species==i,] out <- c(out, mean(dtsubset$plength)) } out lspecies <- cbind(unique(dt$Species), out) ord <- order(lspecies[,1]) lspecies <- lspecies[ord,] lspecies lspecies[,2] <- round(lspecies[,2], digits = 0) lspecies |

#### Alternative (better) ways to loop

- for looping is not usually the best way to obtain a result in R
- Code that take a whole object view is likely to be both clearer and faster
- More advanced ways to obtain that same result are available in R
- Previous loops can be obtained with:

1 2 3 |
sapply(dt, class) tapply(X = dt$plengthExposure, INDEX = dt$Species, FUN = mean) |

#### Apply family of functions

- apply functions implement looping
- You can imagine apply functions as doing the following behind the scenes:
- SPLIT up some data into smaller pieces
- APPLY a function to each piece
- then COMBINE the results

1 |
?sapply |

### Functions

#### Presenting functions in R

- Functions represent one of the most powerful tool of R
- Somehow the transition between interactive and developing programming mode
- A function is defined by an assignment of the form:

name <- function(arg_1, arg_2) expression

where:

- expression usually is a grouped expression that uses the arguments to calculate a value
- A call to the function then usually takes the form:

name(arg_1 = value_1, arg_2 = value_2)

- Functions can be treated much like any other object
- They can be nested
- They return last expression evaluated

#### Functions, arguments and defaults

- Arguments in R can be matched by order and by name
- if you mantain the exact order of function definition then there’s no need to specify the names of the arguments
- Arguments can have default values:

name <- function(arg_1, arg_2 = NULL, arg_3 = 1)

- arguments without default values must be always specified when calling the function

#### Built-in functions

Most of the functions supplied as part of the R system are themselves written in R and thus do not differ materially from user written functions

Check the R code behind built-in functions like mean, sd, etc by simply typing their names

1 |
sd |

- Many functions call some primitive functions whose code (written in C) is masked

#### Functions in loaded packages

1 2 |
library(e1071) skewness |

#### Write your functions

- Write a simple function called AvgLenIris taking no arguments and returning average petal length of Iris
- Call the function to see if it works (remember the parenthesis even if no argument is needed…)
- Store the value returned by the function in an object named ‘y’

1 2 3 4 5 6 |
AvgLenIris <- function() { out <- mean(dt$plength) return(out) } y <- AvgLenIris() |

- Generalize the function adding an argument which will be the field to average, call it AvgAnyIris

1 2 |
AvgAnyIris <- function(field) mean(dt[,field]) AvgAnyIris(field = "pwidth") |

1 |
AvgAnyIris(field = "plength") |

#### Functions with control structures

- Write a new function AvgAnyIris identical to previous one but with a control over argument validity
- Control whether field is numeric using
**is.numeric()** - Test the control structures are working

Remind:

- Remind that integers are also numeric
- Remind you can revert logical expressions with ! operator

Tips:

- Use if statement followed by
**stop()**function **stop()**takes as argument a string/message to print whether a condition is met

1 2 3 4 5 6 7 8 |
AvgAnyIrisCtrl <- function(field) { if(!is.numeric(dt[,field])) stop("field must be numeric") out <- mean(dt[,field]) out } AvgAnyIrisCtrl("plength") # AvgAnyIrisCtrl("Species") # this triggers error. It is no sense to ask for the average of a factor variable |

#### Functions with optional arguments

- Write a function GroupedAvgAnyIris similar to last one but with an additional argument indicating a categorical variable to group the results by (univariate analysis)
- argument indicating grouping variable has default to NULL
- When no group value is provided then the same result of previous function (overall average) should be returned

Tips:

- Use a for loop or
**tapply()**as seen before

1 2 3 4 5 6 7 8 9 10 11 12 13 |
GroupedAvgAnyIris <- function(field, group=NULL) { if(!is.numeric(dt[,field])) stop("field must be numeric") if(is.null(group)) { out <- mean(dt[,field]) } else { if(!is.factor(dt[,group])) stop("group must be a factor") out <- tapply(X = dt[,field], INDEX = dt[,group], FUN = mean) } out } GroupedAvgAnyIris("plength") |

1 |
GroupedAvgAnyIris("plength", group = "Species") |

#### Functions returning lists

- Sometimes functions need to return more than one object
- In this case lists come in very handy
- Write a simple function returning a list with the division, multiplication, addition and difference between any two numbers
- Call the function and store the result in a new object
- Check the result with
**str()**function - Extract the division with $ operator

1 2 3 4 5 6 7 |
mystat <- function(a, b) { list(div = a/b, mult = a*b, add = a+b, diff = a-b) } mystatvalue <- mystat(1,2) str(mystatvalue) mystatvalue$div |

### Probability distributions & simulation

#### Probability distributions

- R language implements the tables of main probability distributions (normal, poisson, binomial, etc.)

For each distribution R provide 4 useful functions:

**pnorm()**, evaluate the cumulative distribution function**qnorm()**, quantile function (given q, the smallest x such that P(X <= x) > q)**dnorm()**, the probability density function**rnorm()**, simulate from the distribution- Replace norm with pois, binom, gamma, ecc. and you have same functions for these other distributions

#### Let’s simulate some deviates

- Simulate 1000 random numbers from normal distribution with mean=2 and sd=1 and store them in an object called “x” (use
**rnorm()**) - simulate 1000 random numbers from gamma distribution with scale=1000 and shape=0.8 and store it in an object called “z” (use
**rgamma()**) - Each distribution has its own parameters
- Sometimes they have default values, sometimes not
- Default parameters for normal distribution are those of standard distr. (mean=0, sd=1)

1 2 |
nn <- rnorm(n = 1000, mean = 2, sd = 1) gg <- rgamma(n = 1000, scale = 750, shape = 0.8) |

#### Visualize distributions

- Plot the histogram of simulated data using
**hist()**function - Adjust the number of bins with nclass argument
- Replace the absolute frequency count with the relative one using argument probability=TRUE
- Superimpose the theoretical density on the histogram

1 2 |
hist(nn, nclass = 30, probability = TRUE) curve(dnorm(x, mean = 2, sd = 1), col = "red", add = TRUE) |

1 2 |
hist(gg, nclass = 30, probability = TRUE) curve(dgamma(x, scale = 750, shape = 0.8), col = "red", add = TRUE) |

#### Quantiles & probabilities

- Calculate the 95th quantile of normal deviates using
**quantile()**and compare it with the theoretical one using**qnorm()** - Calculate the probability (relative frequency) of having values above 2000 for gamma deviates (tip: use
**length()**function) and compare it with the theoretical one using**pgamma()**

1 2 3 4 5 6 7 8 |
# Calculate the 95th quantile of nn and compare it with the theoretical one quantile(x = nn, probs = 0.95) qnorm(p = 0.95, mean = 2, sd = 1) # Calculate the probability (relative frequency) of having values above 2000 for gg # and compare it with the theoretical one length(gg[gg>2000])/length(gg) |

#### Sampling

- Use
**sample()**function to sample randomly 4 elements from the sequence 1:10 without replacement: - use the size argument
- replacement argument is FALSE by default, so no need to specify it
- Use
**sample()**to divide PolicyPtf dataset into a training (80%) and a test (20%) sample

1 2 3 4 5 6 7 8 9 10 11 |
# Use sample() function to sample randomly 4 elements from the sequence 1:10 without replacement ss <- 1:10 sample(x = ss, size = 4) # Sample with the following probabilities instead c(1/2, rep(1/18,9)) instead sample(x = ss, size = 4, prob = c(1/2, rep(1/8, 9))) # Use sample() to divide PolicyPtf dataset into a training (80%) and a test (20%) sample idx_train <- sample(1:nrow(dt), size = 0.8*nrow(dt)) train <- dt[idx_train,] test <- dt[-idx_train,] |

That’s it for this module! If you have gone through all this code you should have learnt how to use use functions or programming blocks to develop more advanced programming than simple interactive commands.

When you’re ready, go ahead with the third module: R training – data manipulation.