R Training – Data Visualization

[Figure: ggplot of the iris example dataset in R]

This is the fourth module of a five-module R training I designed and taught to my former colleagues in December 2016. RStudio is the suggested IDE for working through this module. For the raw code, example data and resources, visit this repo on GitHub.

Graphics are useful in data projects for several tasks, including:

  • understanding data properties
  • finding patterns in data
  • communicating results

First of all, let’s load some useful packages.
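The original chunk is not reproduced on this page. As a minimal sketch, assuming the packages meant are the three add-on packages named later in this module and that they are already installed:

```r
# Assumption: these are the add-on packages used later in the module.
# Install once with install.packages(c("ggplot2", "RColorBrewer", "rworldmap")).
library(ggplot2)       # grammar-of-graphics plotting
library(RColorBrewer)  # colour palettes
library(rworldmap)     # country-level world maps
```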

Understand data properties

We will start with some exploratory graphics to summarize data and highlight broad features. This is useful to explore basic questions and hypotheses, suggest modeling strategies and so on.

plot() is a generic function to plot R objects. It is generic because it adapts to the input provided:

  • if you provide a numeric vector, the default is to plot its values as points on the y axis against an integer index on the x axis
  • if you provide two numeric vectors, the default is to plot the points given by the (x, y) pairs (a scatterplot)
  • if you provide a factor, you get a barplot of the counts per level (a factor paired with a numeric vector gives one boxplot per level instead)
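The three cases can be sketched with simulated data. The vectors x, y and f below are made up for illustration and are not the data used in the original training:

```r
set.seed(1)                      # make the simulated numbers reproducible
x <- rnorm(50)                   # hypothetical numeric vector
y <- 2 * x + rnorm(50)           # second vector, linearly related to x
f <- factor(sample(c("a", "b", "c"), 50, replace = TRUE))  # hypothetical factor

plot(x)      # one numeric vector: values on y against the index 1..50 on x
plot(x, y)   # two numeric vectors: scatterplot of the (x, y) pairs
plot(f)      # a factor on its own: barplot of the counts per level
```

The same call, plot(), produces three different kinds of chart because R dispatches on the class of the first argument.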

[Figures: plot of a single vector; scatterplot; barplot]

Let’s load some example data
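The figures and summary values in this module point to the built-in iris dataset, so a minimal sketch of loading and inspecting it could look like this (the exact chunk used in the training is not shown on this page):

```r
data(iris)      # built-in dataset: 150 flowers, 4 measurements, 1 species factor
str(iris)       # column types and a preview of the values
summary(iris)   # quartiles for the numeric columns, counts for the factor
```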

Once you know your data is clean you may want to explore some features in more detail.

[Figure: barplot of iris]

There is also a dedicated function for barplots in R, barplot(), but the input has to be provided in a slightly different way: as the bar heights (for example a table of counts) rather than the raw factor.
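A small sketch of the difference, assuming the species column of iris is the variable being counted:

```r
counts <- table(iris$Species)   # tabulate the factor: one count per species
plot(iris$Species)              # generic plot() on a factor draws a barplot
barplot(counts)                 # barplot() needs the bar heights, here the counts
```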

[Figure: barplot of iris drawn with barplot()]

Looking at the summary, we see that the minimum sepal length is 4.3, the maximum 7.9 and the median 5.8. The summary also gives the quartiles, but for a more thorough view of the distribution you should draw a histogram.
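A minimal sketch of both steps. The second call, with a finer binning, is only a guess at what the second histogram showed:

```r
summary(iris$Sepal.Length)             # min, quartiles, mean and max
hist(iris$Sepal.Length)                # histogram with the default bins
hist(iris$Sepal.Length, breaks = 20)   # finer bins (breaks is only a suggestion)
```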

[Figures: histograms of iris sepal length (two versions)]

Another way to get a quick visualization of a distribution is to use boxplots.
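A short sketch, again on sepal length. boxplot.stats() is included only to show the numbers the plot is drawn from:

```r
boxplot(iris$Sepal.Length, ylab = "Sepal length (cm)")  # a single boxplot

s <- boxplot.stats(iris$Sepal.Length)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out     # points beyond the whiskers, drawn individually (none for this variable)
```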

[Figure: boxplot of a single variable]

In this case we see clearly that:

  • the middle 50% of the distribution lies roughly between 5 and 6.5
  • the maximum value, excluding outliers, is somewhere between 7.5 and 8.0
  • the right tail is longer than the left tail

In R boxplots the box corresponds to the interquartile range (from the 25th to the 75th percentile) and the black line inside the box is the median. The lines extending vertically from the box (whiskers) show the spread outside the quartiles: by default each whisker reaches the most extreme data point within 1.5 times the interquartile range of the box. Points beyond the whiskers, if any, are plotted individually as outliers.

 

Find patterns

Usually it is a good idea to investigate relations using graphics since we are naturally prone to detect trends, relationships, etc. in a visual way.

When we talk about patterns in data we usually refer to relationships between two or more variables. Options to visualize two dimensions include:

  • draw multiple boxplots in one window
  • scatterplots
  • etc.

To add a 3rd dimension one option is to use different colors, shapes, sizes, etc. (rather than using 3D graphics, which are typically hard to interpret).

Say we want to see if age distribution changes according to car category.
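The age and car category data used in the training is not available on this page, so the sketch below uses a stand-in: sepal length by species from iris. The formula y ~ group asks for one boxplot per group, drawn in the same window:

```r
# Stand-in data: the original age / car category dataset is not available here,
# so sepal length by species from iris is used instead.
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal length (cm)")
```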

[Figure: multiple boxplots in one window]

The hist() function does not support the formula interface, but you can directly modify the global graphical parameters to split the graphics device into multiple slots. Before changing global parameters it is a good idea to save a copy of the original settings, so you can easily go back to the defaults once you are done with the plot.
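A sketch of that save, split and restore pattern, using iris species as a stand-in grouping variable (an assumption, not the original data):

```r
old_par <- par(no.readonly = TRUE)   # save a copy of the current settings
par(mfrow = c(1, 3))                 # split the device: 1 row, 3 columns

for (sp in levels(iris$Species)) {   # one histogram per species
  hist(iris$Sepal.Length[iris$Species == sp],
       main = sp, xlab = "Sepal length (cm)")
}

par(old_par)                         # go back to the saved settings
```

Restoring with par(old_par) means later plots use a single slot again, so the layout change does not leak into the rest of the session.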

[Figure: multiple histograms drawn with par()]

 

Scatterplot

Let’s simulate some numbers and draw scatterplots.
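The simulation used in the training is not shown here. As a minimal sketch, with an invented linear relationship and an invented grouping variable g for the third dimension:

```r
set.seed(42)                                  # reproducible simulation
n <- 100
x <- rnorm(n)                                 # first variable
y <- 1 + 2 * x + rnorm(n, sd = 0.5)           # second variable: linear in x plus noise
g <- factor(ifelse(x > 0, "high", "low"))     # hypothetical grouping variable

plot(x, y, pch = 19)                          # two dimensions

cols <- c("red", "blue")                      # one colour per level of g
plot(x, y, pch = 19, col = cols[g])           # third dimension shown as colour
legend("topleft", legend = levels(g), col = cols, pch = 19)
```

Indexing the colour vector by the factor works because a factor is stored as integer codes, one per level.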

[Figures: scatterplot of two variables; scatterplot with a third variable shown as colour]

 

Spatial analysis

If you are interested in the visualization of a geographical attribute then a map is probably what you need. R can be used as a fast, user-friendly and extremely powerful command-line Geographic Information System (GIS).

In R there is a large and growing number of spatial data packages. Here we will focus on rworldmap, a package for visualising global data referenced by country.

The package stores multiple maps, which can be accessed through the getMap() function.

Maps in R are classified as spatial (sp) objects. Spatial objects are made up of a number of different slots (that can be accessed through the @ operator):

  • bbox (the bounding box, mostly used for setting up plots)
  • data (the attribute table, one row per feature)
  • polygons/lines/points/… (the geometry telling R how to draw the map)
  • proj4string (defines the coordinate reference system)

Inside each slot you may have multiple components which, as usual, can be accessed with the $ operator.

plot() is a generic function, so it also works with spatial objects.
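A short sketch of these steps, assuming rworldmap is installed. The "coarse" resolution and the NAME column are my choices for illustration:

```r
library(rworldmap)

world <- getMap(resolution = "coarse")  # one of the maps stored in the package
class(world)                            # an sp class
slotNames(world)                        # the slots of the object
world@bbox                              # slot access with @
head(world@data$NAME)                   # a column inside the data slot, via $
plot(world)                             # generic plot() draws the polygons
```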

To add information to this map we need some country-level attributes. The rworldmap package itself offers some interesting environmental datasets.

The rworldmap package provides a function, joinCountryData2Map, to join country-level attributes to an internal map. All you need to do is give the name of the column containing the join key (nameJoinColumn = 'ISO3V10') and state which kind of country code that key is (joinCode = 'ISO3').

The mapCountryData function in rworldmap draws a map of country-level data, colouring each country according to the value of a chosen column.
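The join and the map can be sketched as follows. I am assuming the dataset is countryExData, the example data shipped with rworldmap that contains an ISO3V10 column, and that the plotted column is BIODIVERSITY:

```r
library(rworldmap)

data(countryExData)   # example country-level environmental indicators
map_df <- joinCountryData2Map(countryExData,
                              joinCode = "ISO3",           # kind of country code
                              nameJoinColumn = "ISO3V10")  # column holding the code
mapCountryData(map_df, nameColumnToPlot = "BIODIVERSITY")  # colour by this column
```

joinCountryData2Map prints how many codes were matched, which is a quick check that the key column was the right one.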

[Figure: biodiversity map with the heat palette]

 

Using spatial data in R can be challenging because there are many types and formats, and many packages coming from diverse user communities. That said, the ecosystem is increasingly harmonized and the capabilities on offer are vast. A good start is the CRAN tutorial, or one of the many tutorials on GitHub.

Communicate results

Typically the findings of a data analysis are shared with an audience, and visual aids generally help people digest complex messages. In this context sizes, shapes, widths, labels, margins, fonts and so on all become important because they help make the visualization clearer.

 

Additional graphical parameters

Where applicable, the plot function lets you specify many additional graphical parameters. For a list of them type ?par

Let’s take the histogram created before and clean it a bit with additional graphical parameters.
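The parameters used in the training are not shown, so the choices below (colours, title, number of bins) are only one plausible clean-up of the sepal length histogram:

```r
hist(iris$Sepal.Length,
     breaks = 15,                            # suggested number of bins
     col    = "steelblue",                   # bar fill colour
     border = "white",                       # bar outline colour
     main   = "Distribution of sepal length",
     xlab   = "Sepal length (cm)",
     las    = 1)                             # axis labels always horizontal
```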

[Figure: formatted histogram]

Ggplot

All the functions used so far belong to the base plotting system. R has three main plotting systems:

  • base
  • lattice
  • ggplot2

ggplot2 is an implementation of the Grammar of Graphics by Leland Wilkinson, a set of principles for graphics. The grammar describes how graphics can be broken down into abstract components, much as a language is broken down into nouns, adjectives and so on. This abstraction is a powerful way to organize all kinds of graphics, and ggplot2 has become extremely popular in recent years.

Like lattice, ggplot2 is built on the grid package, which gives control over every detail of the graphics system in R. This is why ggplot2 lets you produce visualizations for virtually any need or purpose, and why it is typically the first choice for high-quality, publication-ready work in R.

Briefly, from the ggplot2 book:

"the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system."

Another key feature of ggplot2 graphics is that they are built in layers, which explains the plus sign (+) you will see in the code.
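A minimal sketch of a layered plot, assuming ggplot2 is installed. The density plot of sepal length by species is my reading of the figure, not necessarily the original code:

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +  # data and aesthetic mapping
  geom_density(alpha = 0.5) +                               # geometry layer: density curves
  labs(x = "Sepal length (cm)", y = "Density")              # axis labels
print(p)                                                    # draw the plot
```

Each + adds one component to the plot object, so layers can be added or removed without touching the rest of the call.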

[Figure: ggplot density plot]

 

Colours

A careful choice of colors can help to draw better visualizations. R has 657 built-in color names. Use colors() for a list of all colors known by R.
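A quick way to explore the built-in names (the five colours picked for the plot are arbitrary):

```r
length(colors())                            # number of built-in colour names
head(colors())                              # the first few names
grep("blue", colors(), value = TRUE)[1:5]   # search the names by pattern

my_cols <- c("tomato", "steelblue", "gold", "forestgreen", "orchid")
plot(1:5, pch = 19, cex = 3, col = my_cols) # use the names directly in a plot
```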

When we need to show a range of colors we can use palettes. In the map created earlier the palette was not specified, so the mapCountryData function used its default value (in that case a heat palette, with colors ranging gradually from yellow to red). We can customize palettes to our needs.

A reference package for color palettes is RColorBrewer. The function to create palettes is brewer.pal. It takes two arguments:

  • n –> Number of different colors in the palette, minimum 3, maximum depending on palette
  • name –> a palette name

To have a look at all available palettes you can use:
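The original chunk is missing here. As a minimal sketch, assuming RColorBrewer is installed (the choice of "Blues" with 5 colours is arbitrary):

```r
library(RColorBrewer)

display.brewer.all()                        # show every available palette

pal <- brewer.pal(n = 5, name = "Blues")    # 5 colours from a sequential palette
pal                                         # a character vector of hex codes
display.brewer.pal(n = 5, name = "Blues")   # preview just that palette
```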

[Figures: display of the available colour palettes (two views)]

For an interactive viewer of palettes you can visit this page.
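Returning to the biodiversity map, a custom palette can be passed to mapCountryData through its colourPalette argument. This is a sketch under the same assumptions as before (countryExData and its BIODIVERSITY column), and the "Greens" palette is my choice, not necessarily the one used in the training:

```r
library(rworldmap)
library(RColorBrewer)

data(countryExData)
map_df <- joinCountryData2Map(countryExData,
                              joinCode = "ISO3",
                              nameJoinColumn = "ISO3V10")

greens <- brewer.pal(n = 7, name = "Greens")   # 7 colours, light to dark
mapCountryData(map_df,
               nameColumnToPlot = "BIODIVERSITY",
               colourPalette = greens,         # custom palette instead of the default
               numCats = 7)                    # one category per colour
```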

[Figure: biodiversity map with a custom palette]

 

Graphical devices

Once your plot is complete you may want to export it for reporting purposes. R has many graphics devices. A graphics device is a place where a plot can appear:

  • a window on your computer (screen device)
  • a PDF file (file device)
  • a PNG or JPEG (file device)
  • a scalable vector graphics (SVG) file (file device)

When you make a plot in R it has to be "sent" to a specific graphics device. The most common destination is the screen. On Mac the screen device is launched with quartz(), on Windows with windows(), and on Unix/Linux with x11().

Functions like plot(), hist() and ggplot() all use the screen as their default device. If you want to send the graphics to a device other than the screen you have to:

  • explicitly launch a graphic device
  • call a plotting function to make a plot (note that if you are using a file device no plot will appear on the screen!)
  • annotate plot if necessary (add legends, etc.)
  • explicitly close the graphics device with dev.off()
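The four steps can be sketched with a PDF file device. The output path is hypothetical and points to a temporary directory, so swap in your own file name:

```r
out_file <- file.path(tempdir(), "sepal_hist.pdf")  # hypothetical output path

pdf(out_file, width = 7, height = 5)   # 1. launch a file device (size in inches)
hist(iris$Sepal.Length,                # 2. draw the plot (nothing shows on screen)
     main = "Sepal length", xlab = "cm")
abline(v = median(iris$Sepal.Length),  # 3. annotate: dashed line at the median
       lty = 2)
dev.off()                              # 4. close the device so the file is written

file.exists(out_file)                  # check that the file was created
```

The same pattern works with png(), jpeg() or svg() in place of pdf(), though their size arguments are not all in the same units.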

R's graphical capabilities are enormous and we have only scratched the surface. For inspiration, consider taking a tour of the R graph gallery.

 

That’s it for this module! If you have gone through all this code you should have learnt the basics of R graphical capabilities.

 
