R Training – The Basics

This is the first module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Tips

  • This notebook mixes plain text and R code (highlighted text in the boxes)
  • Everything to the right of a ‘#’ symbol is a comment

  • R is case sensitive!
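  • Press CTRL+Enter to run the current line (or the selected lines) from the RStudio script editor
  • To pull up the documentation for a function run ?namefunction (for example ?mean)
  • Remember that R treats a backslash inside a string as an escape character, so write filepaths with forward slashes (or doubled backslashes)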
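
A minimal sketch of three of these tips. The object names and the file path are made up for illustration:

```r
# Everything after '#' is ignored by R
value <- 10      # 'value' and 'Value' are two different names
Value <- 20
value == Value   # FALSE: R is case sensitive

# Forward slashes work on every operating system, Windows included
good_path <- "C:/Users/me/data.csv"      # hypothetical path
also_ok   <- "C:\\Users\\me\\data.csv"   # a backslash must be doubled inside a string
```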

 

R and your machine

Working directory

  • Once R is installed on your computer it is able to communicate and interact with your system (create folders, read existing files, etc.)
  • First, let’s see where we are

  • getwd() is a function without arguments that returns the filepath of your current working directory
  • The working directory is the folder R communicates with by default (to save and load files, etc.)
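
A minimal sketch of this step (the name wd is just for illustration):

```r
wd <- getwd()   # no arguments needed
wd              # prints the path of the current working directory
```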

Change working directory

  • Create a folder called ‘RFundamentalsWeek1’ with dir.create()

  • set it as your working directory with setwd()

  • Since the working directory is the folder R interacts with by default, guess what this will produce:

…exactly, a sub-folder in your working directory
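
The steps above could look roughly like this. Two assumptions of this sketch: the folder is created under tempdir() so that nothing is left behind on your disk, and the sub-folder name ‘data’ is made up:

```r
old_wd <- getwd()                                        # remember where we started

new_dir <- file.path(tempdir(), "RFundamentalsWeek1")    # tempdir() is an assumption of this sketch
dir.create(new_dir, showWarnings = FALSE)                # create the folder
setwd(new_dir)                                           # make it the working directory

dir.create("data", showWarnings = FALSE)                 # relative path: created inside the working directory
file.exists(file.path(new_dir, "data"))                  # TRUE

setwd(old_wd)                                            # go back to the original folder
```

To create the folder in your real working directory instead, simply call dir.create("RFundamentalsWeek1").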

Check content folder

  • Check what is inside your working directory with dir()

  • dir() searches your working directory because no other path is specified
  • But you can check any folder in your system

  • Shortcuts ‘.’ and ‘..’ help you navigate in your system
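
A short sketch of these calls (tempdir() stands in for "any folder in your system"):

```r
dir()            # content of the working directory
dir(".")         # same thing: '.' means "the current folder"
dir("..")        # content of the parent folder
dir(tempdir())   # any other folder, given its path
```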

R workspace

  • Workspace is the collection of all objects created during an R session
  • list all objects in your workspace with ls() function

  • Create your first object named ‘x’ taking value 1

  • The assignment operator ‘<-’ is used to create objects in R
  • The top-right box in RStudio shows your workspace (you should see ‘x’ now)
  • re-running ls() now should return the ‘x’ object
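
Put together, the steps above might look like this:

```r
ls()        # objects currently in the workspace
x <- 1      # create an object named x holding the value 1
x           # typing its name prints its value
ls()        # 'x' is now listed
```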

Remove objects from workspace

  • Let’s create a bunch of objects:

  • let’s remove ‘x’ from the workspace with rm() function

  • by combining rm() and ls() we can clear the whole workspace

  • To understand why we used the list argument, read the documentation (?rm)
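
A minimal sketch of the sequence, with made-up objects. Be careful: the last call really does delete everything in your workspace.

```r
x <- 1; y <- "two"; z <- TRUE   # a bunch of objects (names are illustrative)

rm(x)                # remove a single object
ls()                 # 'x' is gone

rm(list = ls())      # ls() returns the names as a character vector, which is what 'list' expects
ls()                 # nothing left to list
```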

 

R basic objects and operators

Objects’ classes in R

  • In R there are four important classes of objects:

  • Check the class of these objects with function class()
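
As an illustration, here are four commonly used classes. Which four the training had in mind is an assumption of this sketch, and the object names are made up:

```r
a <- "hello"   # character
b <- 3.14      # numeric
d <- 5L        # integer (note the L suffix)
e <- TRUE      # logical

class(a)   # "character"
class(b)   # "numeric"
class(d)   # "integer"
class(e)   # "logical"
```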

Arithmetic operators

  • given two numeric objects R can perform the most common arithmetic operations:

  • In R expressions are directly evaluated and the result is returned to the console

Logical operators

  • given two atomic objects R can perform logical operations
  • logical operations return a logical value (TRUE, FALSE)…

  • …which can be combined using AND (&) and OR (|) operators
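
A quick sketch of both the arithmetic and the logical operators, using two illustrative numbers:

```r
a <- 7
b <- 2

a + b     # 9
a - b     # 5
a * b     # 14
a / b     # 3.5
a ^ b     # 49
a %% b    # 1 (remainder)
a %/% b   # 3 (integer division)

a > b               # TRUE
a == b              # FALSE
a != b              # TRUE
(a > b) & (a < 5)   # FALSE: both sides must be TRUE
(a > b) | (a < 5)   # TRUE: one TRUE side is enough
```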

Atomic vectors

  • Vectors represent the simplest data structure in R
  • Even single-element objects are seen as vectors (of length one)

  • A vector is a collection of elements all of the same class (character, logical, etc.), which is why we call them atomic vectors
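
You can check the length-one idea yourself with a sketch like this:

```r
x <- 1
is.vector(x)   # TRUE
length(x)      # 1
x[1]           # 1: the first (and only) element
```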

More complex data structures

  • More complex data structures can be seen as extensions of vectors

[Figure: R data structures]

 

Create vectors with ‘combine’ function

  • Create vectors of length>1 with c() function

  • Check their class:
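
For example (vector names and values are illustrative):

```r
num_vec <- c(1.5, 2, 10)
chr_vec <- c("a", "b", "c")
lgl_vec <- c(TRUE, FALSE, TRUE)

class(num_vec)   # "numeric"
class(chr_vec)   # "character"
class(lgl_vec)   # "logical"
```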

Other ways to create vectors

  • Create integer vectors with seq() function (or ‘:’ operator)
  • the following four expressions all produce the same result:

  • Create vectors using rep() function
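
One possible set of four equivalent expressions, followed by a few rep() calls. The exact expressions are an assumption of this sketch; all four return the values 1 2 3 4 5, although the storage type (integer or double) may differ:

```r
1:5
seq(1, 5)
seq(from = 1, to = 5, by = 1)
seq_len(5)

rep(1, times = 3)             # 1 1 1
rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```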

Subsetting vectors

  • [logical index]

  • [positive integers index]

  • [negative integers vector]
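
A sketch of the three kinds of index on an illustrative vector:

```r
v <- c(10, 20, 30, 40, 50)

v[c(TRUE, FALSE, TRUE, FALSE, TRUE)]   # logical index: 10 30 50
v[v > 25]                              # logical index built from a condition: 30 40 50
v[c(1, 3)]                             # positive integers keep elements: 10 30
v[-c(1, 3)]                            # negative integers drop elements: 20 40 50
```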

Arithmetic and logical operators are vectorized

  • we say that a function is vectorized when it works both on vectors (and matrices) and scalars
  • What do you expect these expressions will return?

  • R performs the operation element by element and returns the vector of results
  • Keep in mind that most functions in R are vectorized…
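
For instance, with two illustrative vectors of the same length:

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + b      # 11 22 33
a * b      # 10 40 90
a > 1      # FALSE TRUE TRUE
sqrt(b)    # square root of each element
```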

Vectorization + Recycling

  • we saw operations between vectors of same length:

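  • but what if the lengths differ? R recycles the shorter vector until it matches the longer one
  • The case when one length is a multiple of the other:

  • The case when one length isn’t a multiple of the other (R still recycles, but issues a warning):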
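
A sketch of both cases with illustrative numbers:

```r
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24: the shorter vector is used twice
c(1, 2, 3, 4) * 2           # 2 4 6 8: a scalar is a length-one vector, recycled four times

# Length 5 is not a multiple of length 2: R still recycles, but it also warns
# that the longer object length is not a multiple of the shorter one.
# suppressWarnings() is used here only to keep the output clean.
res <- suppressWarnings(c(1, 2, 3, 4, 5) + c(10, 20))
res                         # 11 22 13 24 15
```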

Useful functions for numerical objects

  • summarizing a numerical vector

  • what if we want the skewness of this vector?
  • We could check the formula and write our own function, or we could search the internet (google this: ‘skewness function in r’)
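
A sketch of some common summary functions, plus a home-made skewness. The helper my_skewness() is hypothetical and uses the plain moment-based formula, so packages that apply a sample-size correction may return a slightly different number:

```r
v <- c(2, 4, 4, 5, 7, 9, 21)   # illustrative values

length(v); sum(v); mean(v); median(v)
min(v); max(v); range(v)
var(v); sd(v)
summary(v)                     # min, quartiles, mean and max in one call

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 3/2
my_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}
my_skewness(v)                 # positive: the long tail is on the right
```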

Install a package and use its functions

  • The first Google result mentions an R package called ‘e1071’

  • Now all the functions in this package are available to use:

  • There are almost 10,000 packages on CRAN (and many others off CRAN), so just type your problem into Google with an R tag and odds are you will find a ready-made solution in some package
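
Installing and loading the package might look like this. The requireNamespace() guard is an addition of this sketch, so the code does nothing (instead of failing) when the package is not installed yet:

```r
# install.packages("e1071")   # run once; needs an internet connection

v <- c(2, 4, 4, 5, 7, 9, 21)  # illustrative values

if (requireNamespace("e1071", quietly = TRUE)) {
  library(e1071)              # load the package for this session
  skewness(v)                 # function provided by e1071
}
```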

Useful functions for logical objects

  • underlying structure of logical values is TRUE=1 and FALSE=0

  • obtain the TRUE indices of a logical object with which()

  • summarizing logical vectors…
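
A sketch with an illustrative logical vector:

```r
lgl <- c(TRUE, FALSE, TRUE, TRUE)

as.numeric(lgl)   # 1 0 1 1
TRUE + TRUE       # 2

which(lgl)        # 1 3 4: positions of the TRUE values

sum(lgl)          # 3: how many are TRUE
mean(lgl)         # 0.75: share of TRUE values
any(lgl)          # TRUE: at least one value is TRUE
all(lgl)          # FALSE: not every value is TRUE
```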

Useful functions for character objects

  • concatenate character vectors

  • find and replace
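
For example, with paste() for concatenation and grepl(), sub() and gsub() for find and replace. The choice of these functions and the strings are assumptions of this sketch:

```r
first  <- c("Hello", "Good")
second <- c("world", "morning")

paste(first, second)              # "Hello world" "Good morning"
paste(first, second, sep = "-")   # "Hello-world" "Good-morning"
paste0("id_", 1:3)                # "id_1" "id_2" "id_3"

txt <- "R is hard, really hard"
grepl("hard", txt)                # TRUE: the pattern is found
sub("hard", "fun", txt)           # replaces the first match only
gsub("hard", "fun", txt)          # replaces every match
```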

Implicit coercion

Coercion happens when we force an object to belong to a class.

  • implicit coercion numeric vs CHARACTER

  • implicit coercion logical vs NUMERICAL

  • implicit coercion CHARACTER vs logical

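The principle at work here is a least common denominator: R converts every element to the most flexible class present (logical can become numeric, and anything can become character).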
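
The three cases above can be reproduced with a sketch like this:

```r
c(1, "a")        # "1" "a": numeric becomes character
c(TRUE, 2)       # 1 2: logical becomes numeric
c("a", TRUE)     # "a" "TRUE": logical becomes character

class(c(1, "a"))      # "character"
class(c(TRUE, 2))     # "numeric"
class(c("a", TRUE))   # "character"
```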

Explicit coercion

  • The as.* family of functions explicitly coerces R objects
  • consider the following numeric vector

  • Force it to a character or logical (what do you expect to happen?)

  • nonsensical coercion returns missing values (with a warning):
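
A sketch with an illustrative numeric vector:

```r
num <- c(0, 1, 2.5)

as.character(num)   # "0" "1" "2.5"
as.logical(num)     # FALSE TRUE TRUE: zero is FALSE, anything else is TRUE

# "two" cannot be read as a number, so it becomes NA and R warns about it
# (suppressWarnings() only keeps the output clean here)
suppressWarnings(as.numeric(c("1", "two", "3")))   # 1 NA 3
```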

 

Special values in R

Missing values

  • NA is a reserved word in R indicating a missing value
  • reserved words have a special meaning and cannot be used as identifiers (variable names, function names, etc.)

  • You can use NA as a placeholder for a value that exists but that you don’t know…

  • the class of the vector is deduced from its non-missing elements
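
For example (the data are made up):

```r
heights <- c(1.75, NA, 1.62)   # the second value exists but is unknown

class(heights)                 # "numeric"
class(c("a", NA))              # "character"
is.na(heights)                 # FALSE TRUE FALSE
mean(heights)                  # NA: a missing value propagates
mean(heights, na.rm = TRUE)    # 1.685: NA is dropped before averaging
```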

Other special values

  • For a list of reserved words in R type this:

  • NULL, represents an object which is absent

  • Inf, -Inf, NaN (special words for mathematical concepts)

  • bear in mind that NaN also counts as NA (is.na(NaN) is TRUE), but the reverse is not true (is.nan(NA) is FALSE)
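
A short sketch of these special values:

```r
# ?Reserved        # help page listing the reserved words (run it interactively)

n <- NULL
length(n)          # 0: there is nothing there
c(1, NULL, 2)      # 1 2: NULL simply disappears

1 / 0              # Inf
-1 / 0             # -Inf
0 / 0              # NaN ("not a number")

is.na(NaN)         # TRUE
is.nan(NA)         # FALSE
```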

 

Matrices

Matrix underlying structure

  • In R matrices can be seen as vectors with a dimension attribute
  • To highlight this idea let’s create a matrix in a not-so-common way:

  • This tricky way to create a matrix is not so common, but it is useful to understand the underlying structure of objects in R…
  • …and so be able to better manipulate them for future needs
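
A sketch of the idea: start from a plain vector and attach a dimension attribute to it. That this is the exact trick the training used is an assumption:

```r
m <- 1:6             # a plain integer vector
dim(m) <- c(2, 3)    # give it 2 rows and 3 columns
m                    # now printed as a matrix, filled column by column
is.matrix(m)         # TRUE
attributes(m)        # the only attribute is $dim
```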

More common ways to create matrices

  • with function matrix()

  • by binding rows or columns with functions rbind() or cbind()
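
For example (object names are illustrative):

```r
m1 <- matrix(1:6, nrow = 2, ncol = 3)                 # filled by column (the default)
m2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled by row

r <- rbind(1:3, 4:6)   # each vector becomes a row: same values as m2
k <- cbind(1:2, 3:4)   # each vector becomes a column
```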

Subsetting matrices

  • Matrices can be subset using a [i, j]-style index

  • Can you think of another way to obtain the last result?
  • Tip: use an integer vector with function c()
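
A sketch on an illustrative 3 x 4 matrix, including the c() alternative from the tip:

```r
m <- matrix(1:12, nrow = 3)   # 3 rows, 4 columns

m[2, 3]               # single element: row 2, column 3
m[2, ]                # whole second row
m[, 3]                # whole third column
m[1:2, 3:4]           # rows 1 to 2, columns 3 to 4
m[c(1, 2), c(3, 4)]   # same result, using integer vectors built with c()
```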

Factors

Nominal factors

  • Factors are used to describe items that can have a finite number of values (i.e. categories)
  • You can see them as positive-integer-sequences with labels

  • Factors have a levels attribute listing their unique categories
  • Access levels attribute with levels() function
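
For example, with made-up colour data:

```r
colours <- factor(c("red", "blue", "red", "green", "blue"))

colours               # the values, followed by the list of levels
levels(colours)       # "blue" "green" "red"
as.integer(colours)   # 3 1 3 2 1: the integer codes behind the labels
```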

Ordered factors

  • If a factor has a natural order this should be specified

  • Default order is alphabetical

  • It can sometimes be useful to re-order nominal factors too (e.g. to change the default base level taken by a GLM)

  • Obtain frequency count of factor combinations with table()
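
A sketch covering these points. The data are made up, and relevel() is one possible way to change the base level:

```r
sizes <- c("small", "large", "medium", "small", "large")

f_default <- factor(sizes)
levels(f_default)                  # "large" "medium" "small" (alphabetical)

f_ordered <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
f_ordered < "large"                # comparisons now make sense

f_base <- relevel(f_default, ref = "small")   # 'small' becomes the first (base) level
levels(f_base)                     # "small" "large" "medium"

group <- factor(c("a", "b", "a", "b", "a"))
table(sizes)                       # counts for a single factor
table(sizes, group)                # counts for each combination
```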

 

Data Frames

Create a data frame from scratch

  • The data frame is the R structure which most closely mimics a SAS data set (i.e. a ‘cases by variables’ matrix of data)
  • In R terms, it is a collection of vectors and factors all having the same length
  • A data frame generally has names and row.names attributes to label variables and observations respectively
  • You create a data.frame with function data.frame()

  • More often, though, you will create a data.frame by reading data from a file or another source (Excel, the internet, SAS, etc.)
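
A minimal data frame built from scratch (column names and values are made up):

```r
df <- data.frame(
  name   = c("Anna", "Ben", "Carl"),
  age    = c(28, 35, 41),
  smoker = c(FALSE, TRUE, FALSE)
)

df
names(df)       # "name" "age" "smoker"
row.names(df)   # "1" "2" "3"
dim(df)         # 3 3
```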

Read some data

  • R provides some example data we can use to practice
  • We will use the ‘iris’ data.frame which gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris

  • Have an overview of the data using these functions
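
Some functions that give a quick overview (which ones the training used is an assumption of this sketch):

```r
head(iris)      # first six rows
dim(iris)       # 150 5
names(iris)     # variable names
str(iris)       # compact structure: class of each variable
summary(iris)   # summary statistics per variable
```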

Subset data frames

  • [i,j]-index notation is also valid for data.frames

  • Additionally you can retain one or more variables by name

  • Tip: after you type ‘$’ wait for RStudio auto-completion menu
  • Tip: In general press ‘Tab’ to bring up RStudio’s auto-completion options
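
For example:

```r
iris[1:3, ]                               # first three rows, all columns
iris[1:3, c(1, 5)]                        # first three rows, columns 1 and 5
iris[1:3, c("Sepal.Length", "Species")]   # same columns, selected by name
head(iris$Sepal.Length)                   # one variable, returned as a vector
```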

Analyse data frames

  • Use the mean() function to get some overall statistics from this data

  • Calculate the same statistic for setosa irises only:
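
A sketch using Sepal.Length (the choice of variable is an assumption):

```r
mean(iris$Sepal.Length)                                # all 150 flowers
mean(iris$Sepal.Length[iris$Species == "setosa"])      # setosa only
mean(iris[iris$Species == "setosa", "Sepal.Length"])   # same thing with [i, j] notation
```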

Useful functions to analyse data frames

You can see that the syntax becomes twisted quite rapidly when more complex manipulation is needed (filter rows, select columns, etc.)

  • Use with() and subset() to make your program more readable
  • with() lets you refer to a dataframe’s variables directly by name

  • subset() returns a dataframe meeting certain conditions

  • Try out this more compact code:
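
A sketch of both functions, ending with a compact version of the setosa mean (the examples themselves are illustrative):

```r
with(iris, mean(Sepal.Length))    # no need to write iris$ every time

setosa <- subset(iris, Species == "setosa")
nrow(setosa)                      # 50

# Filter rows and select columns in one call
subset(iris, Sepal.Length > 7.5, select = c(Sepal.Length, Species))

# Compact version: mean sepal length for setosa only
with(subset(iris, Species == "setosa"), mean(Sepal.Length))
```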

Subset data frames to remove missing values

  • To better explore missing values in R let’s create a new dataframe like iris but with some missing values
  • Don’t worry too much about how this code works for now; we’ll see for loops in the second module

  • Check whether there are any missing values with the is.na() function

  • Find which cells have missing values using which() with the argument arr.ind = TRUE

Note: when x has more than one dimension, the arr.ind argument tells R whether array indices (row and column) should be returned.

  • Now subset the dataframe by eliminating rows where Petal.Width is missing:

  • One way to eliminate all records containing at least one missing value (no matter in which variable) is the complete.cases() function
  • It returns a logical vector indicating which cases (full records) are complete
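
The whole sequence could be sketched as follows. The name iris_na, the seed and the number of blanked cells are assumptions made for illustration, and the loop is only a stand-in for the one used in the training:

```r
set.seed(1)                      # the seed only makes the sketch reproducible
iris_na <- iris                  # iris_na is a hypothetical name
for (j in 1:4) {                 # blank out five random cells in each numeric column
  iris_na[sample(nrow(iris_na), 5), j] <- NA
}

any(is.na(iris_na))                            # TRUE: there are missing values
colSums(is.na(iris_na))                        # missing values per variable
head(which(is.na(iris_na), arr.ind = TRUE))    # row and column of the first few NAs

no_na_pw <- iris_na[!is.na(iris_na$Petal.Width), ]   # drop rows where Petal.Width is missing
nrow(no_na_pw)                                       # 145

complete <- iris_na[complete.cases(iris_na), ]       # drop rows with at least one NA
any(is.na(complete))                                 # FALSE
```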

Add new variables to a data frame

  • Add a variable with random values between 1 and 10 using the sample() function
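
A minimal sketch (the copy iris2 and the column name rand are made up):

```r
iris2 <- iris   # work on a copy so the original data stay untouched
iris2$rand <- sample(1:10, size = nrow(iris2), replace = TRUE)   # one random value per row
head(iris2)
```

replace = TRUE is needed here because we draw 150 values from only ten candidates.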

Calculating an index of correlation

  • Pearson correlation can be easily calculated in R with function cor()
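
For example (the choice of variables is illustrative):

```r
r <- cor(iris$Sepal.Length, iris$Petal.Length)   # Pearson is the default method
r

cm <- cor(iris[, 1:4])                           # correlation matrix of the four numeric variables
cm
```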

That’s it for this module! If you have gone through all this code you should have learnt the fundamentals of the R language.

When you’re ready, go ahead with the second module: R training – functions and programming blocks.
