This is the first module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

**Tips**

- This notebook mixes plain text and R code (highlighted text in the boxes)
- Everything to the right of a ‘#’ symbol is a comment

```r
# this is a comment
x <- 1
y <- 2 # this is also a comment
```

- R is case sensitive!
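- Press CTRL+Enter to run the current line or selection from the RStudio script editor
- To pull up the documentation for a function, run ?functionname
- Remember that R does not accept single backslashes in file paths: use forward slashes ("C:/Users") or doubled backslashes ("C:\\Users")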

### R and your machine

#### Working directory

- Once R is installed on your computer, it can communicate and interact with your system (create folders, read existing files, etc.)
- First, let’s see where we are

```r
getwd()
```

- getwd() is a function called without arguments that returns the file path of your current working directory
- The working directory is the folder R reads from and writes to by default (saving and loading files, etc.)

#### Change working directory

- Create a folder called ‘RFundamentalsWeek1’ with dir.create()

```r
# adapt this path to your machine
# (e.g. "C:/Users/YOUR-USER-NAME/Desktop/RFundamentalsWeek1")
dir.create("C:/Users/pc/Desktop/RFundamentalsWeek1")
```

- Set it as your working directory with setwd()

```r
setwd("C:/Users/pc/Desktop/RFundamentalsWeek1")
```

- Since the working directory is the folder R interacts with by default, guess what this will produce:

```r
dir.create("sub")
```

…exactly, a sub-folder in your working directory

#### Check content folder

- Check what is inside your working directory with dir()

```r
dir() # can you see "sub"?
```

- dir() searches your working directory (WD) because no other path is specified
- But you can check any folder in your system

```r
dir("C:/Users")
```

- Shortcuts ‘.’ and ‘..’ help you navigate in your system

```r
dir("./sub") # "." stands for your working directory
dir("..")    # ".." moves you one level up
```
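The folder functions above can also be tried without touching your real folders. The sketch below works inside R's temporary directory; the name `demo_dir` and the use of `tempfile()` and `file.path()` are choices made for this illustration and are not part of the original module.

```r
# Build a throwaway folder inside R's per-session temporary directory
demo_dir <- tempfile(pattern = "RFundamentalsDemo")
dir.create(demo_dir)

# file.path() joins path pieces with forward slashes
sub_dir <- file.path(demo_dir, "sub")
dir.create(sub_dir)

content <- dir(demo_dir)   # content of demo_dir: the "sub" folder
content

# Clean up: remove the demo folder and everything inside it
unlink(demo_dir, recursive = TRUE)
```

Because every path is spelled out explicitly, the working directory is never changed by this example.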

#### R workspace

- The workspace is the collection of all objects created during an R session
- List all objects in your workspace with the ls() function

```r
ls() # character(0) indicates an empty workspace
```

- Create your first object named ‘x’ taking value 1

```r
x <- 1
```

- The assignment operator ‘<-’ is used to create objects in R
- The top-right box in RStudio represents your workspace (you should see ‘x’ now)
- Re-running ls() now should return the ‘x’ object

#### Remove objects from workspace

- Let’s create a bunch of objects:

```r
y <- 99; msg <- "Hello"; msg2 <- "Hi"
```

- Let’s remove ‘x’ from the workspace with the rm() function

```r
rm("x")
```

- By nesting ls() inside rm() we can clean up the whole workspace

```r
rm(list = ls()) # in R it is very common to nest functions
```

- To understand why we used the list argument, read the documentation

```r
?rm
```
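In short, the `list` argument of rm() takes a character vector of object names, which is exactly what ls() returns. A minimal sketch of this, run inside a separate environment (called `e` here, a name assumed for this example) so that your own workspace is left untouched:

```r
# A scratch environment, so nothing in your own workspace is removed
e <- new.env()
assign("a", 1, envir = e)
assign("b", "two", envir = e)
assign("c", TRUE, envir = e)

ls(envir = e)                       # a character vector of names

# 'list' accepts any character vector of object names
rm(list = c("a", "b"), envir = e)
remaining <- ls(envir = e)          # only "c" is left

# Nesting ls() inside rm() removes whatever is still there
rm(list = ls(envir = e), envir = e)
n_left <- length(ls(envir = e))     # nothing left
```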

### R basic objects and operators

#### Objects’ classes in R

- In R there are four important classes of objects:

```r
"Hola"  # character, any string within quotes
3.14    # numeric, any real number
4L      # integer, any whole number followed by the suffix "L"
TRUE    # logical, TRUE or FALSE (reserved words)
```

- Check the class of these objects with the class() function

```r
class("Hello")
class(3.14)
class(4L)
class(4) # without the suffix "L" all numbers are numeric by default
class(TRUE)
```

#### Arithmetic operators

- Given two numeric objects, R can perform the most common arithmetic operations:

```r
3 + 4
3 - 4
3 * 4
3 / 4
abs(3 - 4)
3^4 # or 3**4
sqrt(4)
```

- In R, expressions are evaluated directly and the result is returned to the console

#### Logical operators

- Given a pair of atomic objects, R can perform logical operations
- Logical operations return a logical value (TRUE, FALSE)…

```r
3 == 4 # equality
"a" == "a"
3 > 4  # greater than
3 <= 4 # less than or equal to
3 != 4 # not equal to
"hello" == "Hello"
```

- …which can be combined using AND (&) and OR (|) operators

```r
4 >= 3 & 3 == 3
4 < 3 | 3 == 3
```

#### Atomic vectors

- Vectors represent the simplest data structure in R
- Even single-element objects are seen as vectors (of length one)

```r
length("Hello")
length(2)
length(TRUE)
```

- A vector is a collection of elements all of the same class (character, logical, etc.), which is why we call them atomic vectors

#### More complex data structures

- More complex data structures can be seen as extensions of vectors

#### Create vectors with ‘combine’ function

- Create vectors of length>1 with c() function

```r
c("Hola", "Ciao", "Hello", "Bonjour") # character vector
c(0.99, 2.4, 1.4, 5.9)                # numeric vector
c(1L, 2L, 3L, 4L)                     # integer vector
c(TRUE, TRUE, FALSE, TRUE)            # logical vector
```

- Check their class:

```r
class(c("Hola", "Ciao", "Hello", "Bonjour"))
class(c(0.99, 2.4, 1.4, 5.9))
class(c(1L, 2L, 3L, 4L))
class(c(TRUE, TRUE, FALSE, TRUE))
```

#### Other ways to create vectors

- Create regular sequences of numbers with the seq() function (or the ‘:’ operator)
- The following four expressions all produce the same values (1, 2, 3, 4):

```r
seq(from = 1, to = 4, by = 1)
seq(from = 1, to = 4) # by = 1 is the default
seq(1, 4)             # arguments in R can be matched by position
1:4                   # common operations in R have shortcuts
```

- Create vectors using rep() function

```r
rep(x = "a", times = 4)     # replicate "a" four times
rep("a", 4)                 # same as above
rep(c("a", "b"), times = 2) # same but for a vector
rep(c("a", "b"), each = 2)  # element-by-element
```
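seq() and rep() accept further arguments that the examples above do not use. The following is a small sketch of `by`, `length.out` and of combining `each` with `times`; it is not an exhaustive list.

```r
seq(from = 0, to = 1, by = 0.25)        # steps of 0.25 between 0 and 1
seq(from = 1, to = 10, length.out = 4)  # 4 equally spaced values from 1 to 10
seq_len(4)                              # same values as 1:4
rep(c("a", "b"), each = 2, times = 2)   # each element twice, whole pattern twice
```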

#### Subsetting vectors

- [logical index]

```r
x <- 1:10
x >= 5
idx <- (x > 5)
x[idx]   # all values of x greater than 5
x[x < 7] # calculate the index directly within brackets
```

- [positive integers index]

```r
x[1]       # 1st element
x[c(1, 5)] # 1st and 5th element
```

- [negative integers vector]

```r
x[-1]        # all but the 1st
x[-c(1, 10)] # all but the 1st and the 10th
```

#### Arithmetic and logical operators are vectorized

- We say that a function is vectorized when it works on vectors (and matrices) as well as on scalars
- What do you expect these expressions will return?

```r
c(1, 2, 3, 4) + c(5, 6, 7, 8)
c(1, 2, 3, 4) / c(5, 6, 7, 8)
sqrt(c(1, 2, 3, 4))
c(1, 2, 3, 4) == c(5, 6, 7, 8)
c(1, 2, 3, 4) != c(5, 6, 7, 8)
```

- R performs the operation element-by-element and returns the vector of results
- Keep in mind that most functions in R are vectorized…

#### Vectorization + Recycling

- We saw operations between vectors of the same length:

```r
c(1, 2, 3) + c(5, 6, 7) # simple element-by-element
```

- But what if the lengths differ?
- When one length is a multiple of the other:

```r
c(1, 2) + c(5, 6, 7, 8)       # the shorter vector is "recycled"
c(1, 2, 1, 2) + c(5, 6, 7, 8) # same result
```

- When one length is not a multiple of the other:

```r
c(1, 2) + c(5, 6, 7)         # recycling + warning
r <- c(1, 2, 1) + c(5, 6, 7) # what R actually computes, without the warning
```
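Recycling is also what happens every time a vector is combined with a single number, since that number is a vector of length one. A minimal sketch (the object names are made up for this example) that also shows one way to capture the warning when the lengths do not divide evenly:

```r
prices <- c(10, 20, 30, 40)

# A length-one vector is recycled to the length of 'prices'
with_tax <- prices * 1.2
above_15 <- prices > 15

# Lengths 2 and 3: R still recycles but signals a warning.
# tryCatch() captures the warning message as a string.
msg <- tryCatch(
  c(1, 2) + c(5, 6, 7),
  warning = function(w) conditionMessage(w)
)
msg
```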

#### Useful functions for numerical objects

- Summarizing a numerical vector

```r
mynum <- c(3.14, 6, 8.99, 10.21, 10, 56.9, 32.1, 2.3)
sum(mynum)
mean(mynum)
sd(mynum) # standard deviation
median(mynum)
```

- What if we want the skewness of this vector?
- We could check the formula and write our own function, or we could search the internet (google ‘skewness function in r’)
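As an illustration of the first route, here is a minimal sketch of a hand-written skewness function. It assumes the moment-based definition (third central moment divided by the second central moment raised to the power 3/2); the name `my_skewness` is made up for this example, and packages may apply slightly different sample corrections, so their results can differ a little.

```r
# Moment-based skewness: m3 / m2^(3/2), where m_k is the k-th central moment
my_skewness <- function(x) {
  x <- x[!is.na(x)]        # drop missing values
  d <- x - mean(x)         # deviations from the mean
  m2 <- mean(d^2)          # second central moment
  m3 <- mean(d^3)          # third central moment
  m3 / m2^(3/2)
}

mynum <- c(3.14, 6, 8.99, 10.21, 10, 56.9, 32.1, 2.3)
my_skewness(mynum)         # positive: the long tail is on the right
my_skewness(c(1, 2, 3))    # symmetric data has zero skewness
```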

#### Install a package and use its functions

- The first Google result mentions an R package called ‘e1071’

```r
install.packages("e1071") # install the package
library(e1071)            # load the package
```

- Now all the functions in this package are available to use:

```r
skewness(mynum)
kurtosis(mynum)
```

- There are almost 10,000 packages on CRAN (and many others off CRAN), so just type your problem into Google with an R tag and odds are you will find a ready-made solution in some package

#### Useful functions for logical objects

- Under the hood, logical values behave as numbers: TRUE = 1 and FALSE = 0

```r
mylogic <- c(FALSE, TRUE, FALSE, rep(TRUE, 3))
sum(mylogic) # counts the TRUE values
```

- Obtain the indices of the TRUE values of a logical object with which()

```r
which(mylogic)
```

- Summarizing logical vectors…

```r
any(mylogic) # is at least one of the values TRUE?
all(mylogic) # are all of the values TRUE?
```

#### Useful functions for character objects

```r
mychar <- c("201510", "201511", "201512", "201601")
```

```r
substr(x = mychar, start = 1, stop = 4) # the ubiquitous substring...
nchar("Hello")                          # number of characters in a string
```

- Concatenate character vectors

```r
paste("I", "m", sep = "'")
paste("N.", 1, sep = "") # 1 is coerced to "1"
```

- Find and replace

```r
gsub(pattern = "20", replacement = "", x = mychar)
```
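These functions are vectorized too, so they can be combined to reshape a whole character vector at once. A small sketch that turns the year-month strings in `mychar` into a "YYYY-MM" format (the target format and the names `yr`, `mo` and `ym` are only examples):

```r
mychar <- c("201510", "201511", "201512", "201601")

yr <- substr(mychar, start = 1, stop = 4)   # first four characters: the year
mo <- substr(mychar, start = 5, stop = 6)   # last two characters: the month

ym <- paste(yr, mo, sep = "-")              # element-by-element concatenation
ym
```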

#### Implicit coercion

Coercion happens when we force an object to belong to a class

- implicit coercion numeric vs CHARACTER

```r
c(1.7, "a")
class(c(1.7, "a"))
```

- implicit coercion logical vs NUMERICAL

```r
c(FALSE, 2)
class(c(TRUE, 2))
```

- implicit coercion CHARACTER vs logical

```r
c("a", TRUE)
class(c("a", TRUE))
```

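What holds here is a principle of least common denominator: all elements are converted to the most flexible class present, in the order logical, integer, numeric, character.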
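The order in which classes give way to one another can be checked directly in the console. A short sketch, using only base R:

```r
class(c(TRUE, 1L))            # "integer":   logical gives way to integer
class(c(1L, 2.5))             # "numeric":   integer gives way to numeric
class(c(2.5, "a"))            # "character": numeric gives way to character
class(c(TRUE, 1L, 2.5, "a"))  # "character": the most flexible class wins

c(TRUE, 2.5, "a")             # every element becomes a string
```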

#### Explicit coercion

- The as.* family of functions coerces R objects explicitly
- Consider the following numeric vector

```r
x <- c(0, 1, 2, 3, 4, 5, 6)
class(x)
```

- Force it to character or logical (what do you expect to happen?)

```r
as.character(x)
as.logical(x) # 0 becomes FALSE, any other number becomes TRUE
```

- Nonsensical coercion returns missing values (NA):

```r
as.numeric(c("a", "b", "c"))
as.logical(c("a", "b", "c"))
```

### Special values in R

#### Missing values

- NA is a reserved word in R indicating a missing value
- Reserved words have a special meaning and cannot be used as identifiers (variable name, function name, etc.)

```r
NA <- 1 # This will trigger an error!
```

- You can use NA as a placeholder for a value that exists but that you don’t know…

```r
year <- c(2012, 2013, 2014)
gwp <- c(NA, 98.7, 32.5)
```

- The class is deduced from the non-missing elements

```r
class(gwp)
is.na(gwp) # indicates which elements are missing
```

#### Other special values

- For a list of reserved words in R type this:

```r
help(reserved)
```

- NULL represents an object which is absent

```r
x <- NULL # useful to initialize objects to be filled later
```

- Inf, -Inf, NaN (special words for mathematical concepts)

```r
1/0   # Inf, infinity
-1/0  # -Inf, minus infinity
0/0   # NaN, an undefined number
```

- Bear in mind that NaN also counts as NA (is.na(NaN) is TRUE), but the reverse is not true (is.nan(NA) is FALSE)
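The relationship between NA and NaN can be verified with is.na() and is.nan(). The sketch below also shows the `na.rm` argument, which many summary functions offer for skipping missing values; the vector `v` is made up for this example.

```r
is.na(NA)      # TRUE
is.na(NaN)     # TRUE:  NaN counts as missing
is.nan(NaN)    # TRUE
is.nan(NA)     # FALSE: NA is not a NaN

is.finite(c(1, Inf, -Inf, NaN, NA))   # only the first element is finite

# Summary functions return NA unless missing values are dropped
v <- c(NA, 98.7, 32.5)
mean(v)                  # NA
mean(v, na.rm = TRUE)    # mean of the two known values
```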

### Matrices

#### Matrix underlying structure

- In R matrices can be seen as vectors with a dimension attribute
- To highlight this idea let’s create a matrix in a not-so-common way:

```r
x <- 1:6           # take a vector
dim(x)             # NULL: vectors do not have a dimension attribute
dim(x) <- c(2, 3)  # impose a 2x3 dimension (2 rows, 3 columns)
class(x)           # now it is a matrix!
x
```

- This tricky way to create a matrix is not so common, but it is useful to understand the underlying structure of objects in R…
- …and so be able to better manipulate them for future needs
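One practical consequence of this structure is that a matrix can still be indexed like the vector it is built on. A minimal sketch, assuming nothing beyond base R (the name `m_demo` is made up for this example):

```r
m_demo <- 1:6
dim(m_demo) <- c(2, 3)

# Elements are stored column by column, so position 4 is row 2 of column 2
m_demo[4]
m_demo[2, 2]

# Removing the dimension attribute turns the matrix back into a plain vector
dim(m_demo) <- NULL
m_demo

# matrix() fills by column too, unless byrow = TRUE is requested
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
```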

#### More common ways to create matrices

- With the matrix() function

```r
m <- matrix(data = 1:6, nrow = 2, ncol = 3)
class(m)
dim(m)
```

- By binding rows or columns with the rbind() or cbind() functions

```r
x <- 1:3
y <- 10:12
m1 <- cbind(x, y)
m2 <- rbind(x, y)
class(m1)
class(m2)
```

#### Subsetting matrices

- Matrices can be subset using (i,j)-style index

```r
m[1, 2] # one single element
m[1, ]  # one full row
m[, 3]  # one full column
m[, -1] # all columns but one
```

- Can you think about another way to obtain the last result?
- Tip: use an integer vector with function c()

### Factors

#### Nominal factors

- Factors are used to describe items that can have a finite number of values (i.e. categories)
- You can see them as positive-integer-sequences with labels

```r
f <- factor(c("f", "m", "m", "f", "f"))
class(f)
```

- Factors have a levels attribute listing their unique categories
- Access the levels attribute with the levels() function

```r
attributes(f)
levels(f)
```

#### Ordered factors

- If a factor has a natural order this should be specified

```r
fo <- factor(c("low", "med", "low", "high"), ordered = TRUE)
```

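- The default order is alphabetical, which is rarely the natural one

```r
# re-order by rebuilding the factor with explicit levels;
# assigning to levels(fo) would rename the categories instead
fo <- factor(fo, levels = c("low", "med", "high"), ordered = TRUE)
```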

- It can sometimes be useful to re-order nominal factors too (e.g. to change the default base level taken by a GLM)

```r
# change the alphabetical default without altering the data
f <- factor(f, levels = c("m", "f"))
```

- Obtain the frequency count of each factor level with table()

```r
table(f)
```
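A point worth checking when re-ordering: assigning a new vector to levels() renames the existing levels in their current order, whereas passing `levels =` to factor() changes the order and leaves the data alone. A minimal sketch of the difference (the object names are made up for this example):

```r
sizes <- factor(c("low", "med", "low", "high"))
levels(sizes)                      # "high" "low" "med" (alphabetical)

# Re-ordering: the values stay the same, only the level order changes
reordered <- factor(sizes, levels = c("low", "med", "high"), ordered = TRUE)
as.character(reordered)            # "low" "med" "low" "high"
reordered < "high"                 # comparisons are allowed on ordered factors

# Renaming: "high" becomes "low", "low" becomes "med", "med" becomes "high"
renamed <- sizes
levels(renamed) <- c("low", "med", "high")
as.character(renamed)              # "med" "high" "med" "low"
```

Comparing table(sizes) with table(renamed) is a quick way to spot that the second approach has changed the data.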

### Data Frames

#### Create a data frame from scratch

- The data frame is the R structure that most closely mimics a SAS data set (i.e. a ‘cases by variables’ matrix of data)
- In R terms, it is a collection of vectors and factors all having the same length
- A data frame generally has names and row.names attributes to label variables and observations respectively
- You create a data.frame with function data.frame()

```r
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  f = factor(c("m", "f", "m"))
)
class(df)
```

- More often, though, you will create a data.frame by reading data from a file (Excel, the internet, SAS, etc.)

#### Read some data

- R provides some example data we can use to practice
- We will use the ‘iris’ data.frame which gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris

```r
data("iris") # load the example data in the workspace
```

- Get an overview of the data using these functions

```r
str(iris)          # compact summary of the structure of an R object
summary(iris)      # a few statistics for each variable
head(iris, n = 20) # show the first 20 observations
tail(iris)         # last 6 observations
```

#### Subset data frames

- [i,j]-index notation is also valid for data.frames

```r
iris[1, 1]
iris[1, 5]
```

- Additionally you can retain one or more variables by name

```r
iris$Sepal.Length                        # using the $ operator
iris[, "Sepal.Length"]                   # quoting the variable's name in the j slot
iris[, c("Sepal.Length", "Sepal.Width")] # several variables at once
```

- Tip: after you type ‘$’, wait for the RStudio auto-completion menu
- Tip: in general, press ‘Tab’ to bring up RStudio’s auto-completion options

#### Analyse data frames

- Use the mean() function to get some overall statistics from this data

```r
mean(iris$Petal.Length)
```

- Calculate the same statistic only for setosa irises:

```r
mean(iris[iris$Species == "setosa", "Petal.Length"])
```
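To get the same statistic for every species at once, one option among several is tapply(), which applies a function to a vector split by a factor. It is not covered in the original module, so take this as a minimal sketch; the names `means_by_species` and `setosa_mean` are made up for this example.

```r
data("iris")

# Mean petal length within each level of Species
means_by_species <- tapply(iris$Petal.Length, iris$Species, mean)
means_by_species

# The "setosa" entry matches the bracket-subsetting approach
setosa_mean <- mean(iris[iris$Species == "setosa", "Petal.Length"])
all.equal(means_by_species[["setosa"]], setosa_mean)
```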

#### Useful functions to analyse data frames

You can see that the syntax becomes convoluted quite rapidly when more complex manipulation is needed (filter rows, select columns, etc.)

- Use with() and subset() to make your program more readable
- with() lets you refer to a data frame’s variables directly

```r
with(iris, sum(Petal.Width) / sum(Petal.Length))
```

- subset() returns a dataframe meeting certain conditions

```r
subset(iris, Species == "setosa")
```

- Try out this more compact code:

```r
with(subset(iris, Species == "setosa"),
     sum(Petal.Width) / sum(Petal.Length))
```

#### Subset data frames to remove missing values

- To better explore missing values in R, let’s create a new data frame like iris but with some missing values
- Don’t worry too much about this setup code now, we’ll see for loops in the second module

```r
# pick 10% of the rows at random, and one of the 4 numeric columns for each
i_na <- sample(x = seq_len(nrow(iris)), size = 0.1 * nrow(iris))
j_na <- sample(1:4, size = 0.1 * nrow(iris), replace = TRUE)
iris_na <- iris
# set the selected cells to NA, one at a time
for (k in seq_along(i_na)) {
  iris_na[i_na[k], j_na[k]] <- NA
}
```

- Check whether there are missing values with the is.na() function

```r
sum(is.na(iris_na)) # 15 missing cells: not surprising, is it?
```

- Find out where the missing values are with the which() function and its arr.ind = TRUE argument

```r
class(is.na(iris_na)) # is.na() on a data frame returns a logical matrix
w <- which(x = is.na(iris_na), arr.ind = TRUE)
head(w)
```

Note: When x has more than one dimension, the arr.ind argument tells R whether array indices (row and column) should be returned.

- Now subset the data frame by eliminating the rows where Petal.Width is missing:

```r
iris_clean <- subset(iris_na, !is.na(Petal.Width))
```

- One way to eliminate all records with at least one missing value (no matter in which variable) is the complete.cases() function
- It returns a logical vector indicating which cases (full records) are complete

```r
good <- complete.cases(iris_na) # "good" is a logical vector
iris_clean <- iris_na[good, ]   # keep only the complete rows of iris_na
```
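An alternative worth knowing is na.omit(), which drops incomplete rows in a single call. The sketch below rebuilds a data frame with missing values so that it runs on its own; set.seed() and the object names (`demo_na`, `via_cases`, `via_omit`) are assumptions of this example, added only to make the random draw reproducible.

```r
data("iris")
set.seed(123)                                    # reproducible random positions

n_na <- 15                                       # 10% of the 150 rows
rows <- sample(seq_len(nrow(iris)), size = n_na) # distinct rows
cols <- sample(1:4, size = n_na, replace = TRUE) # one numeric column per row

demo_na <- iris
for (k in seq_len(n_na)) {
  demo_na[rows[k], cols[k]] <- NA
}

via_cases <- demo_na[complete.cases(demo_na), ]  # logical row index
via_omit  <- na.omit(demo_na)                    # same rows, one call

nrow(via_cases)                                  # 135 complete rows
identical(rownames(via_cases), rownames(via_omit))
```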

#### Add new variables to a data frame

- Add a variable with random values 1:10 using sample() function

```r
iris$new <- sample(1:10, nrow(iris), replace = TRUE)
```

#### Calculating an index of correlation

- Pearson correlation can be easily calculated in R with function cor()

```r
# keep only the columns whose class is "numeric", then correlate them
cor(iris[, sapply(iris, class) == "numeric"])
```
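For a single pair of variables, cor() can be called on two vectors, and the result can be checked against the textbook formula for the Pearson coefficient (covariance divided by the product of the standard deviations). A minimal sketch, with made-up names `r_builtin` and `r_manual`:

```r
data("iris")
x <- iris$Petal.Length
y <- iris$Petal.Width

r_builtin <- cor(x, y)     # Pearson is the default method

# Pearson correlation by hand
r_manual <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_builtin, r_manual)

# The same number appears in the correlation matrix of the four measurements
cor(iris[, 1:4])["Petal.Length", "Petal.Width"]
```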

That’s it for this module! If you have gone through all this code you should have learnt the fundamentals of the R language.

When you’re ready, go ahead with the second module: R training – functions and programming blocks.