For those of you unfamiliar with the tidyverse, it is a collection of R packages that share common philosophies and are designed to work together. Most if not all, are created by R-god Hadley Wickham, one of the leads at RStudio. I was introduced to the tidyverse-packages such as ggplot2 and plyr in my second R-course, and they have cleaned and sped up my workflow tremendously ever since.

Although I don’t want to mix in the political debate, I came across such a wonderful example of how the tidyverse has simplified coding in R. On the downside, those unfamiliar with the language (Yes, the tidyverse has its own language) could have trouble understanding what happens in the code the author uses. Therefore, this will be the first of a series of posts that help to explore and understand the language of the tidyverse.

Running the following R-code will install the core packages of the tidyverse:

install.packages(‘tidyverse’)

These consist among others of the following:

  • ggplot2: a more potent way of visualization
  • tibble: an upgrade to the standard data.frame()
  • tidyr: a new way of cleaning and wrangling data
  • readr: a faster way to load in standard data files
  • purr: adds new functionality (map()) similar to apply()
  • dplyr: adds piping functionality to improve readability and speed up workflow

There are several other packages included in the tidyverse (e.g, stringr), but the above are the ones you are most likely to use in everyday projects.

Now, how about dissecting the code in the post. The author (1) loads some functionality in R,  (2) scrapes data on approval rates from the web, (3) cleans it up, and creates a wonderful visualization. S/He does this all in only 35 lines of code! Better even, 2 of these code lines are blank, 3 are setup, 6 have aesthetic purposes, and many others could be combined being only several characters long. Due to the tidyverse, the code is easy to read, transparent, and reproducible (it only consists of two dplyr-chained code blocks, after loading the packages), and takes only 7 seconds to run!

   user  system elapsed 
   5.67    0.85    6.53

In the rest of this article, I walk you through the code of this post to explain what’s happening:

  • hrbrthemes package includes additional ggplot2 themes (plot colors, etc.)
  • rvest package includes functionalities for webscraping
  • tidyverse package we discussed above
library(hrbrthemes) 
library(rvest)
library(tidyverse)

Below, the author then creates a list containing the links to the online data to scrape and run it through a dplyr pipe (%>%) to apply the next bit of code to it.

The map_df() functionality comes from the purr package and applies the subsequent code to every element in the earlier list:

  • Read in the html files specified earlier in the list %>%
  • Convert them to a table %>%
  • Store the name of the list (this is the name of the president) as .id %>%
  • Store that as a data.frame %>%
  • Select columns (and rename them) %>%
  • Use the earlier stored president id and add it as a column (‘who’) %>%
  • Save the output as a dataframe called ratings.
list(
  Obama="http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump="http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>% 
map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date=Date, approve=`Total Approve`, disapprove=`Total Disapprove`)
  }, .id="who") -> ratings

Below, the author then starts a new chained code block. S/He first changes (mutate), from the ratings dataframe, the approval & disapproval data with a custom function (get rid of the % sign and divide by 100), which is then piped through:

  • Change (mutate) dates to a data format (libridate is another tidyverse package) %>%
  • Filter out any missing values %>%
  • Group by the ‘who’-column (President name) %>%
  • Sort the data file by earlier specified date %>%
  • Give every line an id number, from 1 up to the number of records (n() returns the sample size per President due to group_by()) %>%
  • Ungroup the data %>%

For readability, I split the code here, but it actually still continues as depicted by the %>% at the end.

mutate_at(ratings, c("approve", "disapprove"), function(x) as.numeric(gsub("%", "", x, fixed=TRUE))/100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%

The output is now entered into the ggplot2 visualization function below:

  • ggplot() creates a layered plot, where the aes(thetics) (parameters) are defined as
    • x = the id number,
    • y = the approval rate,
    • and the color = the President name

Layers and details to this plot are specified/added using +

  • The first (bottom) layer of the plot is geom_hline() which creates a horizontal line at [x = 0; y = 0.5] with a size = 0.5. +
  • The 2nd layer is a scatterplot as geom_point() adds points with size = 0.25 on the x & y predefined in ggplot(aes()) + 
  • Next the limits of the Y-axis are set to run from 0 to 1 +
  • A custom/manual color scheme is set +
  • Custom titles and labels are applied to the axis +
  • A predefined theme for the plot is used, drawn from hrbrthemes-package loading in at the start +
  • The direction of the legend is set +
  • The position of the legend is set
  ggplot(aes(dnum, approve, color=who)) +
  geom_hline(yintercept = 0.5, size=0.5) +
  geom_point(size=0.25) +
  scale_y_percent(limits=c(0,1)) +
  scale_color_manual(name=NULL, values=c("Obama"="#313695", "Trump"="#a50026")) +
  labs(x="Day in office", y="Approval Rating",
       title="Presidential approval ratings from day 1 in office",
       subtitle="For fairness, data was taken solely from Trump's favorite polling site (Ramussen)",
       caption="Data Source: \nCode: ") +
  theme_ipsum_rc(grid="XY", base_size = 16) +
  theme(legend.direction = "horizontal") +
  theme(legend.position=c(0.8, 1.05))

The ggplot() command at the start automatically prints the plot when it is finished (when no more + is found). The result is just wonderful, isn’t it? With only 35 lines, 2 chained commands, and 7 seconds runtime.

Rplot

Found on https://www.r-bloggers.com.