R Training – Functions and programming blocks

example of linear interpolation in R

This is the second module of a five-module R training I designed and taught to my former colleagues in December 2016. RStudio is the suggested IDE for going through this module. For the raw code, example data and resources visit this repo on GitHub.

Introduction

  • Typing commands each time you need them is not very efficient
  • Control statements and user-defined functions help us to move forward and write more productive and complex programs

Import, overview and analyse simple data

Reading Data

  • load iris dataset

Overview Data

  • overview data using str() and summary()
  • What’s the maximum value of Petal.Width?
  • How many levels does Species have?
  • Have a look at the first 10 and last 3 rows using head() and tail()

Remember:

  • head() and tail() display 6 rows by default, but you can choose how many with argument n

Spot missing values

  • Check for missing values using any() and is.na()
  • Start by doing that in two steps
  • Then nest the two functions to obtain the same result

Remember:

  • is.na() is vectorized, like most R functions
  • Vectorized functions are used the same way with scalars, vectors, matrices, etc.
  • What changes is the value returned, which depends on the input provided…
  • …What kind of object do you expect is.na() to return when applied to a data frame?
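As a minimal sketch of the two-step and nested checks described above, using the built-in iris data (the object names na_flags and has_na are illustrative, not from the original material):

```r
# Step 1: is.na() applied to a data frame returns a logical matrix
# with one cell per value in the data
na_flags <- is.na(iris)
dim(na_flags)        # same shape as the data: 150 rows, 5 columns

# Step 2: any() collapses the whole matrix into a single TRUE/FALSE
has_na <- any(na_flags)
has_na

# Same result in one nested call
any(is.na(iris))
```

This also answers the last question: for a data frame, is.na() gives back a logical matrix rather than a vector.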

Explore factors

  • Discover how many different levels there are in the Species variable using the unique() or table() functions

Summarize numerical variables

  • Calculate mean petal width in this dataset using mean()

Remember:

  • you can access variables in a data frame with the $ operator
  • Or with the [] operator, specifying in the column slot either…
  • …an integer index indicating which column to pick
  • …or a string with the exact name of the variable to pick

Summarize numerical variables from pre-filtered data

  • Calculate average petal width only for setosa iris
  • Calculate average petal length only for setosa iris

Tips:

  • Any time you are unsure about the exact spelling (or position) of a variable, take a quick look with names() or str()
  • You could create a filtered dataset first and summarize it afterwards…
  • …but this is not very efficient
  • Rather, take advantage of the [] operator, which lets you query rows and columns at the same time
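One possible way to filter rows and pick a column in a single [] call is sketched below. It uses the original iris column names, and setosa_pw and setosa_pl are illustrative names:

```r
# Row slot: a logical condition; column slot: a variable name
setosa_pw <- mean(iris[iris$Species == "setosa", "Petal.Width"])
setosa_pl <- mean(iris[iris$Species == "setosa", "Petal.Length"])

# The column slot also accepts an integer position
# (Petal.Length is the third column of iris)
mean(iris[iris$Species == "setosa", 3])

setosa_pw
setosa_pl
```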

Storing analysis’ results

  • Store the last two results in a vector mystats using the function c()
  • Visualize the results by printing the vector

Tips:

  • Consider storing the two results separately…
  • …then use them to fill the vector, our final output
  • To visualize the value of an object you can use print() (explicit printing) or simply the name of the object (auto-printing)

Change variable names

Change these names this way:

  • Petal.Width -> pwidth
  • Petal.Length -> plength

…using names() function

Remember:

  • names() allows you to access the names attribute of the data frame
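A minimal sketch of the renaming, done on a copy called df (an illustrative name) so the built-in iris data stay untouched:

```r
df <- iris  # work on a copy

# Replace a name by matching it, so the code does not depend on column position
names(df)[names(df) == "Petal.Width"]  <- "pwidth"
names(df)[names(df) == "Petal.Length"] <- "plength"

names(df)
```

Matching on the old name rather than on a hard-coded index is a design choice: it keeps working if the column order changes.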

Explore numerical data with histograms

  • Create a histogram of the variable plength using hist()
  • Create another one only for setosa iris

Make it prettier by:

  • changing number of bins (nclass=30)
  • title (main="yourTitle")
  • x-axis label (xlab="yourLabel")

Store the plot in a variable called myhist

histogram iris data

Lists

Presenting lists

  • Lists are ordered collections of objects (called the list’s components)
  • Components can be of any type (logical vector, matrix, function,…, whatever!)
  • A list’s components are numbered and may be referred to by number
  • When components are named they can also be referred to by name
  • The subscripting operator […] can be used to obtain sublists of a list…
  • …but to select single components you need [[…]] or $
  • Let’s now create some objects we will then store in a list

Create a list

  • Create an object named ‘x’ taking value ‘MyReport’
  • Store the objects mystats, myhist and x in a list named mylist using list()
  • Give a name to each component in the list using the form list(namecomponent = component, …)
  • Check the number of components in mylist with the function length()

Play with sublists

  • Sublist mylist keeping only the first entry
  • Sublist mylist keeping only the first and third entries
  • Play a bit with the […] operator, filling it with positive (and negative) integer indexes as well as component names (since our list has named components)
  • …Do you think it is possible to extract the second element of the mystats component with […]?

Play with extraction of lists’ components

  • Extract the first component using the [[…]], [["…"]] and $ operators
  • Experiment with partial matching using the $ operator
  • Extract the second element of the first component
  • Sum the mystats vector contained in mylist

Remember:

  • […] applied to a list returns a sublist
  • [[…]] returns a single component of the list
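The list exercises can be sketched as follows. The component names stats, hist and title are assumptions chosen for illustration, and the histogram is built with plot = FALSE only to keep the sketch free of graphics:

```r
# Objects to store (recomputed here so the sketch is self-contained)
setosa  <- iris[iris$Species == "setosa", ]
mystats <- c(mean(setosa$Petal.Width), mean(setosa$Petal.Length))
myhist  <- hist(setosa$Petal.Length, plot = FALSE)
x       <- "MyReport"

mylist <- list(stats = mystats, hist = myhist, title = x)
length(mylist)          # number of components

# Sublists: the result is still a list
mylist[1]
mylist[c(1, 3)]
mylist[-2]              # same as above, by exclusion

# Components: the result is the stored object itself
mylist[[1]]
mylist[["stats"]]
mylist$st               # partial matching: "st" uniquely identifies "stats"

mylist[[1]][2]          # second element of the first component
sum(mylist$stats)
```

Note that mylist[1][2] would not reach inside the vector, because mylist[1] is a one-component list, not the vector it contains.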

Grouped expressions

Presenting grouped expressions

  • In R every command is a function or expression returning some value
  • Commands may be grouped together in braces: {expr_1; expr_2}
  • …in which case the value of the group is the result of the last expression evaluated in the group
  • Grouped expressions are very useful together with control statements available in the R language
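A small illustration of how a grouped expression takes the value of its last expression (a, b and y are arbitrary names):

```r
y <- {
  a <- 2
  b <- 3
  a * b   # last expression: its value becomes the value of the group
}
y
```

Braces only group expressions; they do not create a new scope, so a and b are still available after the block.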

Control statements: conditional execution

if-else statement

  • If-else conditional construction takes the form

if (expr_1) expr_2 else expr_3

where:

  • expr_1 must evaluate to a single logical value
  • expr_2 is returned when expr_1 returns TRUE
  • expr_3 is returned otherwise

Play with if-else statements

  • Use an if-else statement to test whether or not the data contain at least one missing value
  • then trigger a warning in the first case (something like ‘Watch out! there are missing values…’)
  • or simply print a message otherwise (something like ‘everything’s fine, go ahead…’)

Tips:

  • Use the following functions: any(), is.na(), print(), warning()

Play further with if-else statements

  • Use an if-else statement to test whether the data is of class data.frame
  • Then create a variable whose value is the number of observations in the data when the condition is met and NA otherwise

Tips:

  • use functions: class(), nrow()
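The two if-else exercises might look like the sketch below, run against the built-in iris data. The variable name n_obs is an assumption, and the messages are only examples:

```r
# Exercise 1: warn if there is at least one missing value
if (any(is.na(iris))) {
  warning("Watch out! There are missing values...")
} else {
  print("Everything's fine, go ahead...")
}

# Exercise 2: if-else returns a value, so it can be assigned directly
n_obs <- if (class(iris) == "data.frame") nrow(iris) else NA
n_obs
```

The class() comparison works here because class(iris) is a single string; for objects with several classes, inherits(obj, "data.frame") is a safer test.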

Vectorized if/else: ifelse() function

  • ifelse() is a vectorized version of the if/else construct
  • Create a variable named plength_high taking value 1 if plength>5 and 0 otherwise using ifelse()

Tip:

  • ifelse() has this structure ifelse(condition, if_true, if_false)
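A minimal sketch of the vectorized version. Here plength is taken straight from iris$Petal.Length so the example does not depend on the earlier renaming step:

```r
plength <- iris$Petal.Length

# One test per element, one result per element: no loop needed
plength_high <- ifelse(plength > 5, 1, 0)

table(plength_high)
```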

Control statement: repetitive execution

looping

  • A for loop statement has the form:

for (name in expr_1) expr_2

where:

  • name is the loop variable
  • expr_1 is a vector expression (often a sequence like 1:20)
  • expr_2 is often a grouped expression with its sub-expressions written in terms of the loop variable

A for loop evaluates expr_2 repeatedly as name ranges through the values of the vector returned by expr_1

Play with loops

  • Use a for statement to loop over the integer sequence 1:10 and print the loop variable at each iteration

…not very useful, right?

  • Use a for statement to loop over the columns of the dataset and print the class of each one

  • Repeat it but store the classes in a vector

Tips:

  • inside a loop auto-printing does not apply (use explicit printing)
  • initialize your target vector before, out of the loop
  • You can use NULL to initialize an empty object
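Following the tips above, one way to collect the column classes is sketched here (col_classes is an illustrative name):

```r
# Initialize the target vector outside the loop
col_classes <- NULL

for (col in names(iris)) {
  print(class(iris[[col]]))                       # explicit printing
  col_classes <- c(col_classes, class(iris[[col]]))  # grow the result
}

col_classes
```

Growing a vector with c() is fine for five columns; for long loops it is usually better to pre-allocate the result with its final length.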

Play further with loops

  • A loop variable can be anything, not only an integer sequence
  • Use a for statement to loop over all possible values of Species
  • Then use them to calculate average petal length for each Species
  • Final output is a numerical vector
  • Bind result vector with the values of Species to make the result readable
  • Order the matrix by Species (alphabetically)
  • Round average lengths to have no decimal place

Tips:

  • Use functions cbind(), order(), round()
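A possible solution sketch for the Species loop, using the original Petal.Length column name. The names species, avg_len and result are assumptions:

```r
species <- levels(iris$Species)   # the possible values, as strings
avg_len <- NULL

for (sp in species) {
  avg_len <- c(avg_len, mean(iris[iris$Species == sp, "Petal.Length"]))
}

# cbind() of strings and numbers gives a character matrix
result <- cbind(species, avg = round(avg_len))

# Order rows alphabetically by species
result <- result[order(result[, "species"]), ]
result
```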

Alternative (better) ways to loop

  • for looping is not usually the best way to obtain a result in R
  • Code that takes a whole-object view is likely to be both clearer and faster
  • More advanced ways to obtain that same result are available in R
  • Previous loops can be obtained with:

Apply family of functions

  • apply functions implement looping
  • You can imagine apply functions as doing the following behind the scenes:
  • SPLIT up some data into smaller pieces
  • APPLY a function to each piece
  • then COMBINE the results
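As a sketch of how the two earlier loops could be rewritten with the apply family (sapply() and tapply() are base R; whether the original module used exactly these calls is an assumption):

```r
# Class of every column: split the data frame into columns,
# apply class() to each, combine into a named character vector
col_classes <- sapply(iris, class)
col_classes

# Average petal length by species: split Petal.Length by Species,
# apply mean() to each group, combine into a named vector
avg_len <- tapply(iris$Petal.Length, iris$Species, mean)
avg_len
```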

Functions

Presenting functions in R

  • Functions are one of the most powerful tools in R
  • They mark the transition from interactive use to actual program development
  • A function is defined by an assignment of the form:

name <- function(arg_1, arg_2) expression

where:

  • expression usually is a grouped expression that uses the arguments to calculate a value
  • A call to the function then usually takes the form:

name(arg_1 = value_1, arg_2 = value_2)

  • Functions can be treated much like any other object
  • They can be nested
  • They return the value of the last expression evaluated

Functions, arguments and defaults

  • Arguments in R can be matched by order and by name
  • if you keep the exact order of the function definition then there’s no need to specify the names of the arguments
  • Arguments can have default values:

name <- function(arg_1, arg_2 = NULL, arg_3 = 1) expression

  • arguments without default values must always be specified when calling the function
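To illustrate argument matching and defaults, here is a small hypothetical function (power_sum and its arguments are invented for this example):

```r
# x has no default, so it must always be supplied;
# exponent and na.rm fall back to their defaults when omitted
power_sum <- function(x, exponent = 1, na.rm = FALSE) {
  sum(x^exponent, na.rm = na.rm)
}

power_sum(1:3)                    # defaults used: 1 + 2 + 3
power_sum(1:3, 2)                 # matched by position: 1 + 4 + 9
power_sum(exponent = 2, x = 1:3)  # matched by name, order is free
```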

Built-in functions

Most of the functions supplied as part of the R system are themselves written in R and thus do not differ materially from user written functions

Check the R code behind built-in functions like mean, sd, etc by simply typing their names

  • Many functions call some primitive functions whose code (written in C) is masked

Functions in loaded packages

Write your functions

  • Write a simple function called AvgLenIris taking no arguments and returning average petal length of Iris
  • Call the function to see if it works (remember the parentheses even if no argument is needed…)
  • Store the value returned by the function in an object named ‘y’

  • Generalize the function adding an argument which will be the field to average, call it AvgAnyIris

Functions with control structures

  • Write a new function AvgAnyIris identical to previous one but with a control over argument validity
  • Control whether field is numeric using is.numeric()
  • Test the control structures are working

Remember:

  • Integers are also numeric
  • You can negate logical expressions with the ! operator

Tips:

  • Use an if statement followed by the stop() function
  • stop() halts execution and takes as its argument the error message to print when the condition is met
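One way the validated version might be written is sketched below. It assumes field is passed as a column name in quotes, which the original exercise does not specify:

```r
AvgAnyIris <- function(field) {
  # Reject anything that is not a numeric column of iris
  # (a misspelled name yields NULL, which is not numeric either)
  if (!is.numeric(iris[[field]])) {
    stop("field must be the name of a numeric column of iris")
  }
  mean(iris[[field]])
}

AvgAnyIris("Petal.Length")

# Testing the control: try() catches the error instead of halting the script
try(AvgAnyIris("Species"))
```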

Functions with optional arguments

  • Write a function GroupedAvgAnyIris similar to last one but with an additional argument indicating a categorical variable to group the results by (univariate analysis)
  • the argument indicating the grouping variable defaults to NULL
  • When no grouping variable is provided, the same result as the previous function (overall average) should be returned

Tips:

  • Use a for loop or tapply() as seen before
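A sketch of the grouped version using tapply(), again assuming both arguments are passed as column names in quotes:

```r
GroupedAvgAnyIris <- function(field, group = NULL) {
  if (!is.numeric(iris[[field]])) {
    stop("field must be the name of a numeric column of iris")
  }
  if (is.null(group)) {
    # No grouping requested: overall average
    mean(iris[[field]])
  } else {
    # One average per level of the grouping variable
    tapply(iris[[field]], iris[[group]], mean)
  }
}

GroupedAvgAnyIris("Petal.Length")
GroupedAvgAnyIris("Petal.Length", group = "Species")
```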

Functions returning lists

  • Sometimes functions need to return more than one object
  • In this case lists come in very handy
  • Write a simple function returning a list with the division, multiplication, addition and difference between any two numbers
  • Call the function and store the result in a new object
  • Check the result with str() function
  • Extract the division with $ operator
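For instance, a function returning several results in a named list could be sketched like this (FourOps and the component names are illustrative):

```r
FourOps <- function(a, b) {
  list(
    division       = a / b,
    multiplication = a * b,
    addition       = a + b,
    difference     = a - b
  )
}

res <- FourOps(6, 3)
str(res)       # compact overview of the four components
res$division
```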

Probability distributions & simulation

Probability distributions

  • R language implements the tables of main probability distributions (normal, poisson, binomial, etc.)

For each distribution R provides 4 useful functions:

  • pnorm(), evaluates the cumulative distribution function
  • qnorm(), the quantile function (given a probability p, the smallest x such that P(X <= x) >= p)
  • dnorm(), the probability density function
  • rnorm(), simulates from the distribution
  • Replace norm with pois, binom, gamma, etc. and you have the same functions for these other distributions

Let’s simulate some deviates

  • Simulate 1000 random numbers from normal distribution with mean=2 and sd=1 and store them in an object called “x” (use rnorm())
  • Simulate 1000 random numbers from gamma distribution with scale=1000 and shape=0.8 and store them in an object called “z” (use rgamma())
  • Each distribution has its own parameters
  • Sometimes they have default values, sometimes not
  • Default parameters for normal distribution are those of standard distr. (mean=0, sd=1)

Visualize distributions

  • Plot the histogram of simulated data using hist() function
  • Adjust the number of bins with nclass argument
  • Replace the absolute frequency count with the relative one using argument probability=TRUE
  • Superimpose the theoretical density on the histogram
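A possible sketch for the normal case. The seed, title and grid are assumptions added to make the run reproducible and self-contained:

```r
set.seed(123)  # assumption: fixed seed for reproducibility
x <- rnorm(1000, mean = 2, sd = 1)

# probability = TRUE puts the bars on the density scale,
# so they are comparable with the theoretical curve
hist(x, nclass = 30, probability = TRUE,
     main = "Simulated normal deviates", xlab = "x")

# Theoretical density evaluated on a grid spanning the data
grid <- seq(min(x), max(x), length.out = 200)
lines(grid, dnorm(grid, mean = 2, sd = 1))
```

The gamma case follows the same pattern with rgamma() and dgamma() and the shape and scale parameters given above.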

normal distribution

gamma distribution

Quantiles & probabilities

  • Calculate the 95th percentile (0.95 quantile) of the normal deviates using quantile() and compare it with the theoretical one using qnorm()
  • Calculate the probability (relative frequency) of having values above 2000 for gamma deviates (tip: use length() function) and compare it with the theoretical one using pgamma()
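The comparison between empirical and theoretical values can be sketched as follows. The seed and the object names q_emp, q_theo, p_emp and p_theo are assumptions:

```r
set.seed(123)  # assumption: fixed seed for reproducibility
x <- rnorm(1000, mean = 2, sd = 1)
z <- rgamma(1000, shape = 0.8, scale = 1000)

# 0.95 quantile: sample estimate vs theoretical value
q_emp  <- quantile(x, probs = 0.95)
q_theo <- qnorm(0.95, mean = 2, sd = 1)

# P(Z > 2000): relative frequency vs theoretical value.
# pgamma() gives P(Z <= 2000), so take the complement
p_emp  <- length(z[z > 2000]) / length(z)
p_theo <- 1 - pgamma(2000, shape = 0.8, scale = 1000)

c(empirical = unname(q_emp), theoretical = q_theo)
c(empirical = p_emp, theoretical = p_theo)
```

With 1000 draws the two pairs should be close but not identical; the gap shrinks as the sample size grows.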

Sampling

  • Use sample() function to sample randomly 4 elements from the sequence 1:10 without replacement:
  • use the size argument
  • the replace argument is FALSE by default, so there is no need to specify it
  • Use sample() to divide PolicyPtf dataset into a training (80%) and a test (20%) sample
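A sketch of both sampling exercises. The PolicyPtf dataset is not available here, so the built-in iris data stand in for it; the seed and the names train_idx, train and test are assumptions:

```r
set.seed(42)  # assumption: fixed seed for reproducibility

# 4 elements from 1:10, without replacement (the default)
sample(1:10, size = 4)

# 80/20 split: draw row indexes, then use them to select and exclude rows
n         <- nrow(iris)
train_idx <- sample(n, size = round(0.8 * n))  # sample(n, ...) draws from 1:n
train     <- iris[train_idx, ]
test      <- iris[-train_idx, ]

nrow(train)
nrow(test)
```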

 

That’s it for this module! If you have gone through all this code you should have learnt how to use functions and programming blocks to write more advanced programs than simple interactive commands.

When you’re ready, go ahead with the third module: R training – data manipulation.
