R Training – Data Manipulation

example of database join

example of database join

This is the third module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Setup

Load the packages in your workspace

Let’s work with iris data

Convert to data.table

data.table is an efficient package to perform data manipulation. It is often convenient to convert a data.frame into a data.table to leverage the advanced data manipulation features of this package.

the good thing is that data.frame class is not lost in this way.

Be careful when calling functions expecting one class or the other (e.g. ggplot2)

 

Removing NAs

Depending on the study we may want to simply exclude missing values.

 

Coalescing NAs to 0s

Another option is to replace NAs with 0s. This function will make the trick using data.table syntax.

 

data.table presentation

data.table is a package to handle data frames. It is particularly appreciated for its speed in performing operations on large data sets. It also makes it very easy to manipulate data for those knowing the basics of SQL since you can think of a data.table object as being structured this way:

  • DT[WHERE, SELECT, GROUP BY]

 

dplyr presentation

Quoting from CRAN dplyr package is a fast, consistent tool for working with data-frame-like objects. The main methods of this package to manipulate data are the following:

  • Filter(), to select specific rows
  • Arrange() the rows of your data into some order
  • Mutate() your data frame to contain new columns
  • Summarise() groups of your data in some way

but there are also lesser known tricks you can do with this package. Here we’ll cover just a few examples to give you an idea of its functioning.

Another awesome thing about dplyr is that it works seamlessly with the pipeline operator.

The pipeline operator from magrittr package (installed along with dplyr) allows you to create pipes. Pipes are chains which take the output from a function and feed it to the first argument of the next function. In this way it becomes very elegant and fast to query data frames.

 

Filtering rows

data table way

dplyr way

Subsetting columns

Data.table way

dplyr way by name of variables

or using the pipeline operator

or by position of variables

Group by

data frame way

Data.table way

dplyr way

Group by several variables, data.table way

Group by several variables, dplyr way

Merging tables

Typically a complex analysis will involve more than one table and often we will need to merge them to perform our analyses. When the tables have a similar structure by row or by column we may need to bind them.

Binding

Same columns: row binding

Same rows: column binding

Joining

More often what we would like to do is to merge different tables based on the values of some common variable.

  • to explore left_outer_join we make up a dataset with some info (totally fake!) on iris species

  • to explore inner_join we make up another dataset

Joining with base functions

  • with base merge functions the key variable has to have the same name on both tables

  • Inner join

  • Right outer join

Joining, data table way

  • Left outer join

  • Inner join

  • Right outer join

Joining, dplyr way

  • left join

  • inner join

  • right join

dplyr anti join

Suppose there is one full dataset (‘A’) and another one subset of the first (‘B’). A nice way to get only the data of A that it is not on B is the following:

 

That’s it for this module! If you have gone through all this code you should have learnt the basics of two fundamentals package in R for data manipulation, data.table and dplyr.

When you’re ready, go ahead with the fourth module: R training – data visualization.

 

Leave a Reply

Your email address will not be published. Required fields are marked *