Thu, 04/26/2018 – 09:30
R is the de facto statistical
package in the Open Source world. It’s also quickly becoming the default
data-analysis tool in many scientific disciplines.
R’s core design includes
a central processing engine that runs your code, with
a very simple interface to the outside world. This basic interface
means it’s been easy to build graphical interfaces that wrap the
core portion of R, so lots of options exist that you
can use as a GUI.
In this article, I look at one of the available GUIs:
RStudio. RStudio is a commercial program, with a free community version,
available for Linux, macOS and Windows, so your data analysis
work should port easily regardless of environment.
For Linux, you can install the main RStudio package from the
RStudio download page. From there, you can
download RPM files for Red Hat-based distributions or DEB files for
Debian-based distributions, then use either rpm or dpkg
to do the installation.
For example, in Debian-based distributions,
use the following to install RStudio:
sudo dpkg -i rstudio-xenial-1.1.423-amd64.deb
It’s important to note that RStudio is only the GUI. This
means you need to install R itself as a separate step. Install the core
parts of R with:
sudo apt-get install r-base
There’s also a community repository of available packages, called CRAN,
that can add huge amounts of functionality to R. You’ll want to install
at least some of them in order to have some common tools to use:
sudo apt-get install r-recommended
There are equivalent commands for RPM-based distributions too.
At this point, you should have a complete system to do some data analysis.
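As a quick sanity check after installation, the following sketch can be pasted into an R session to confirm which version of R is running and which recommended packages are present. The variable names are only illustrative, and the exact list printed depends on your distribution's packaging.

```r
# Report the version of R that this session is running.
r_version <- R.version.string
cat(r_version, "\n")

# List the packages that R itself flags as "recommended"
# (on Debian-based systems these typically come from r-recommended).
recommended <- rownames(installed.packages(priority = "recommended"))
print(recommended)
```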
When you first start RStudio, you’ll see a window that looks
somewhat like Figure 1.
Figure 1. RStudio creates a new session, including a console interface to R, where
you can start your work.
The main pane of
the window, on the left-hand side, provides a console interface where
you can interact directly with the running R session.
The right-hand side is divided into two sections, where each
section has multiple tabs. The default tab in the top section
is an environment pane. Here, you’ll see all the objects that
have been created and exist within the current R session.
The other two tabs provide the history of every command given and a list
of any connections to external data sources.
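To see the environment pane in action, you can create a couple of objects from the console. This is a minimal sketch; the object names (sample_values, measurements) are made up for illustration.

```r
# Create a numeric vector and a small dataframe in the current session.
sample_values <- c(1.5, 2.5, 4.0)
measurements <- data.frame(id = 1:3, value = sample_values)

# ls() lists the objects in the session, which is the same
# information the environment pane presents graphically.
print(ls())

# str() gives a compact summary of one object's structure.
str(measurements)
```

Both objects should appear in the environment pane as soon as the assignments run.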
The bottom pane has five
tabs available. The default tab gives you a file listing of the current
working directory. The second tab provides a plot window where any
data plots you generate are displayed. The third tab provides a nicely
ordered view into R’s library system. It shows a list of all of
the currently installed libraries, along with tools to manage updates and
install new libraries. The fourth tab is the help viewer. R includes
a very complete and robust help system modeled on
Linux man pages. The last tab is a general “viewer” pane to view other
types of objects.
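The package and help tabs have console equivalents, which can be handy when you want the same information in a script. The sketch below uses only base R functions; the choice of mean() as the help topic is arbitrary.

```r
# The package tab is a view onto the installed package library;
# installed.packages() returns the same information as a matrix.
pkgs <- rownames(installed.packages())
print(head(pkgs))

# help() looks up a documentation page. In RStudio the page is shown
# in the help tab. The interactive() guard keeps this sketch from
# launching a pager when it is run as a non-interactive script.
if (interactive()) {
  help("mean")
}
```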
One part of RStudio that’s a great help to people managing multiple
areas of research is the ability to use projects. Clicking the menu item
File→New Project pops up a window where you can select how your
new project will exist on the filesystem.
Figure 2. When you create a new project, it can be created in a new directory,
an existing directory or be checked out from a code repository.
As an example, let’s create a new project hosted in a local directory. The file display
in the bottom-right pane changes to the new directory, and you should
see a new file named after the project name, with the filename ending
.Rproj. This file contains the configuration for your new project. Although
you can interact with the R session directly through the console, doing so
doesn’t really lead to easily reproduced workflows. A better solution,
especially within a project, is to open a script editor and write your
code within a script file. This way you automatically have a starting
point when you move beyond the development phase of your research.
If you click File→New File→R Script, a new pane opens in
the top left-hand side of the window.
Figure 3. The script editor allows you to construct more complicated pieces of code
than is possible using just the console interface.
From here, you can
write your R code with all the standard tools you’d expect
in a code editor. To execute this code,
you have two options. The first is simply to click the run
button in the top right of this editor pane. This will run either the
single line where the cursor is located or an entire block of code that
previously had been highlighted.
Figure 4. You can enter code in the script editor and then have it run to
make code development and data analysis a bit easier on your brain.
If you have an entire
script file that you want to run as a whole, you can click the source
button in the top right of the editor pane. This lets you
reproduce analysis that was done at an earlier time.
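The source button has a console equivalent in R's source() function, which runs every line of a script file in order. As a rough sketch, the following writes a tiny script to a temporary location and then runs it as a whole; the filename analysis.R and the variables inside it are invented for this example.

```r
# Write a two-line script to a temporary file (hypothetical contents).
script_path <- file.path(tempdir(), "analysis.R")
writeLines(c(
  "heights <- c(150, 160, 170, 180)",
  "mean_height <- mean(heights)"
), script_path)

# Run the whole file, as the source button would.
source(script_path)

# Objects created by the script are now available in the session.
print(mean_height)
```

Because the script lives in a file, rerunning the analysis later is a single call rather than a sequence of retyped console commands.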
The last item to mention is data visualization in RStudio. Actually,
the data visualization is handled by other libraries within R. There is
a very complete, and complex, graphics ability within the core of R. For
normal humans, several libraries are built on top of this. One of
the most popular, and for good reason, is ggplot2. If it isn’t already
installed on your system, you can get it from CRAN by calling
install.packages("ggplot2") in the console.
Once it’s installed, you can make a simple scatter plot with a single
ggplot() call. ggplot takes dataframes as the data to plot, and you
control the display with
aes() function calls and
geom function calls. For a scatter plot of points, use the
geom_point() function. The plot then is generated
in the bottom-right pane, under the plot tab.
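As an illustration of that pattern, here is a minimal sketch of a scatter plot. It assumes ggplot2 is already installed and uses mtcars, a dataframe that ships with R, purely as stand-in data; any dataframe with two numeric columns would work the same way.

```r
library(ggplot2)

# aes() maps dataframe columns to the x and y axes;
# geom_point() draws one point per row.
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Printing the plot object renders it (in RStudio, in the plot tab).
print(scatter)
```

Storing the plot in a variable is optional, but it makes it easy to add further layers later with the + operator.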
ggplot2 is one of the most powerful and popular graphing tools available in R.
There’s a lot more
functionality available in RStudio, including a server portion that can
be run on a cluster, allowing you to develop code locally and then
send it off to a server for the actual processing.