Tag: tidyverse

David Robinson’s R Programming Screencasts

David Robinson (aka drob) is one of the best known R programmers.

For a couple of years now, David has been sharing his knowledge by streaming screencasts of himself programming. They are basically part of R’s #tidytuesday movement.

Alex Cookson decided to do us all a favor and annotate all these screencasts into a nice overview.

https://docs.google.com/spreadsheets/d/1pjj_G9ncJZPGTYPkR1BYwzA6bhJoeTfY2fJeGKSbOKM/edit#gid=444382177

Here you can search for video material of David using a specific function or method. There are already over a thousand linked fragments!

Very useful if you want to learn how to visualize data using ggplot2 or plotly, how to work with factors in forcats, or how to tidy data using tidyr and dplyr.

For instance, you could search for specific R functions and packages you want to learn about:

Thanks David for sharing your knowledge, and thanks Alex for maintaining this overview!

Anyone other #rstats people find @drob's #TidyTuesday screencasts useful?

I made a spreadsheet with timestamps for hundreds of specific tasks he does: https://t.co/HvJbLk1chd

Useful if, like me, you keep going back and ask, "Where in the video did he do [this thing] again?"
— Alex Cookson (@alexcookson) January 13, 2020

Online Workshop Tidy Data Science in R, by Jake Thompson

Here’s a website hosting a five-day hands-on workshop based on the book “R for Data Science”.

The workshop was originally offered as part of the Stats Camp: Summer Statistical Institute in Lawrence, KS, and was hosted by the Center for Research Methods and Data Analysis and the Achievement and Assessment Institute at the University of Kansas. It is designed for those who want to learn practical applications of R for data analysis.

You can download the Workshop files, but I suggest you do so via the original workshop webpage.

This workshop is designed for those who want to learn how to use R to analyze data. The material is based on Hadley Wickham and Garrett Grolemund’s R for Data Science. We’ll talk about how to conduct a complete data analysis from data import to final reporting in R using a suite of packages known as the tidyverse. The two goals of this workshop are: 1) learn how to use R to answer questions about our data; and 2) write code that is human readable and reproducible. We will also talk about how to share our code and analyses with others.
You should take this workshop if you are new to R, or to the tidyverse, and want to learn how to take advantage of this ecosystem to do data analysis. You’ll get the most from the workshop if you are primarily interested in applying pre-existing R packages and functions to your own data. We will give minimal tutorials on how to write your own functions; however, the main focus will be on using existing tools, rather than building our own.
About this workshop

Tidy Machine Learning with R’s purrr and tidyr

Jared Wilber posted this great walkthrough where he codes a simple R data pipeline using purrr and tidyr to train a large variety of models and methods on the same base data, all in a non-repetitive, reproducible, clean, and thus tidy fashion. Really impressive workflow!
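
As a rough illustration of the idea (this is not Jared’s actual code), the sketch below keeps several candidate models in one tibble and fits them all against the same base data with purrr. The formulas, the column names, and the use of mtcars are assumptions made purely for illustration; tidyr’s nest() could hold train/test splits in the same tibble, which is left out here to keep the example short.

```r
library(dplyr)
library(purrr)

# hypothetical set of candidate models, all fitted to the same data
formulas <- list(
  weight_only = mpg ~ wt,
  weight_hp   = mpg ~ wt + hp,
  weight_hp_cyl = mpg ~ wt + hp + cyl
)

model_results <- tibble(
  model_name = names(formulas),
  formula    = formulas
) %>%
  mutate(
    # fit every formula on the same base data
    fit       = map(formula, ~ lm(.x, data = mtcars)),
    # pull one comparable metric out of each fitted model
    r_squared = map_dbl(fit, ~ summary(.x)$r.squared)
  )

print(select(model_results, model_name, r_squared))
```

Adding another model is then a matter of adding one entry to the list, rather than copying and pasting a block of fitting code.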

Animated vs. Static Data Visualizations

GIFs or animations are rising quickly in the data visualization world (see for instance here).

However, in my personal experience, they are not as widely used in business settings. You might even say animations are frowned upon by, for instance, LinkedIn, which removed the option to even post GIFs on their platform!

Nevertheless, animations can be pretty useful sometimes. For instance, they can display what happens during a process, like an analytical model converging, which can be useful for didactic purposes. Alternatively, they can be great for showing or highlighting trends over time.

I am curious what you think are the pros and cons of animations. Below, I posted two visualizations of the same data. The data consists of simulated workforce trends, including new hires and employee attrition over the course of twelve months.

versus

Would you prefer the static, or the animated version? Please do share your thoughts in the comments below, or on the respective LinkedIn and Twitter posts!

Want to reproduce these plots? Or play with the data? Here’s the R code:

# LOAD IN PACKAGES ####
# install.packages('devtools')
# devtools::install_github('thomasp85/gganimate')
library(tidyverse)
library(gganimate)
library(here)


# SET CONSTANTS ####
# data
HEADCOUNT = 270
HIRE_RATE = 0.12
HIRE_ADDED_SEASONALITY = rep(floor(seq(14, 0, length.out = 6)), 2)
LEAVER_RATE = 0.16
LEAVER_ADDED_SEASONALITY = c(rep(0, 3), 10, rep(0, 6), 7, 12)

# plot
TEXT_SIZE = 12
LINE_SIZE1 = 2
LINE_SIZE2 = 1.1
COLORS = c("darkgreen", "red", "blue")

# saving
PLOT_WIDTH = 8
PLOT_HEIGHT = 6
FRAMES_PER_POINT = 5


# HELPER FUNCTIONS ####
capitalize_string = function(text_string){
  paste0(toupper(substring(text_string, 1, 1)), substring(text_string, 2, nchar(text_string)))
}


# SIMULATE WORKFORCE DATA ####
set.seed(1)

# generate random leavers and some seasonality
leavers <- rbinom(length(month.abb), HEADCOUNT, LEAVER_RATE / length(month.abb)) + LEAVER_ADDED_SEASONALITY

# generate random hires and some seasonality
joiners <- rbinom(length(month.abb), HEADCOUNT, HIRE_RATE / length(month.abb)) + HIRE_ADDED_SEASONALITY 

# combine in dataframe
data.frame(
  month = factor(month.abb, levels = month.abb, ordered = TRUE)
  , workforce = HEADCOUNT - cumsum(leavers) + cumsum(joiners)
  , left = leavers
  , hires = joiners
) -> 
  wf

# transform to long format
wf_long <- gather(wf, key = "variable", value = "value", -month)
# capitalize the names of the variables
wf_long$variable <- capitalize_string(wf_long$variable)


# VISUALIZE & ANIMATE ####
# draw workforce plot
ggplot(wf_long, aes(x = month, y = value, group = variable)) +
    # variable names were capitalized above, so compare against "Workforce"
    geom_line(aes(col = variable, size = variable == "Workforce")) +
    scale_color_manual(values = COLORS) +
    scale_size_manual(values = c(LINE_SIZE2, LINE_SIZE1), guide = "none") +
    guides(color = guide_legend(override.aes = list(size = c(rep(LINE_SIZE2, 2), LINE_SIZE1)))) +
    # theme_PVDL() +
    labs(x = NULL, y = NULL, color = "KPI", caption = "paulvanderlaken.com") +
    ggtitle("Workforce size over the course of a year") +
    NULL ->
    workforce_plot

# ggsave(here("workforce_plot.png"), workforce_plot, dpi = 300, width = PLOT_WIDTH, height = PLOT_HEIGHT)

# animate the plot
workforce_plot +
  geom_segment(aes(xend = 12, yend = value), linetype = 2, colour = 'grey') +
  geom_label(aes(x = 12.5, label = paste(variable, value), col = variable), 
             hjust = 0, size = 5) + 
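  # released gganimate versions take only the `along` variable here;
  # the lines are kept apart by the group aesthetic set in ggplot()
  transition_reveal(as.numeric(month)) +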
  enter_grow() + 
  coord_cartesian(clip = 'off') +
  theme(
    plot.margin = margin(5.5, 100, 11, 5.5)
    , legend.position = "none"
    ) ->
  animated_workforce

anim_save(here("workforce_animation.gif"), 
          animate(animated_workforce, nframes = nrow(wf) * FRAMES_PER_POINT, 
                  width = PLOT_WIDTH, height = PLOT_HEIGHT, units = "in", res = 300))
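
A side note on the reshaping step: gather() still works, but tidyr now recommends pivot_longer() for this. Below is a minimal sketch of the same wide-to-long transformation, using a small stand-in data frame with made-up numbers (wf_example is hypothetical, not the simulated data above).

```r
library(tidyr)

# small stand-in for the `wf` data frame (made-up numbers)
wf_example <- data.frame(
  month = factor(month.abb[1:3], levels = month.abb, ordered = TRUE),
  workforce = c(268, 270, 265),
  left = c(5, 3, 8),
  hires = c(3, 5, 3)
)

# one row per month-variable combination, like gather(..., -month)
wf_example_long <- pivot_longer(
  wf_example,
  cols = -month,
  names_to = "variable",
  values_to = "value"
)

print(wf_example_long)
```

The rows come out in a different order than with gather(), which does not matter for the plots.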

Tidy Missing Data Handling

A recent open access paper by Nicholas Tierney and Dianne Cook, professors at Monash University, deals with simpler handling, exploring, and imputation of missing values in data. They present new methodology building upon tidy data principles, with the goal of making missing value handling an integral part of data analysis workflows. New data structures are defined (like the nabular), along with new functions to perform common operations (like gg_miss_case).

These new methods have been bundled in, among others, the R packages naniar and visdat, which I highly recommend you check out. To put it in the authors’ own words:

The naniar and visdat packages build on existing tidy tools and strike a compromise between automation and control that makes analysis efficient, readable, but not overly complex. Each tool has clear intent and effects – plotting or generating data or augmenting data in some way. This reduces repetition and typing for the user, making exploration of missing values easier as they follow consistent rules with a declarative interface.

Below are some of the highly informative visuals you can easily generate with naniar’s nabulars and the associated functionality.

For instance, these heatmap visualizations of missing data for the airquality dataset. (A) represents the default output and (B) is ordered by clustering on rows and columns. You can see there are only missings in ozone and solar radiation, and there appears to be some structure to their missingness.

Another example is this upset plot of the patterns of missingness in the airquality dataset. Only Ozone and Solar.R have missing values, and Ozone has the most missing values. There are 2 cases where both Solar.R and Ozone have missing values.
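
The numbers behind such an upset plot can be checked quickly in base R. This is only a sanity check on the built-in airquality data, not how naniar computes the plot.

```r
# flag which cells of the built-in airquality data are missing
missing_flags <- is.na(airquality)

# number of missing values per variable
missing_per_variable <- colSums(missing_flags)

# cases in which Ozone and Solar.R are missing at the same time
both_missing <- sum(missing_flags[, "Ozone"] & missing_flags[, "Solar.R"])

print(missing_per_variable[missing_per_variable > 0])
print(both_missing)
```

This confirms that only Ozone (37 missings) and Solar.R (7 missings) are affected, and that the two are missing together in 2 cases.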

You can also generate a histogram using nabular data in order to show the values and missings in Ozone. Values are imputed below the range to show the number of missings in Ozone and colored according to the missingness of Ozone (‘Ozone_NA’). This shows directly that there are approximately 35-40 missings in Ozone.

Alternatively, scatterplots can be easily generated, displaying missings at 10 percent below the minimum of the airquality dataset. Shown are scatterplots of ozone and solar radiation (A), and ozone and temperature (B). These plots demonstrate that there are missings in ozone and solar radiation, but not in temperature.

Finally, this parallel coordinate plot displays the missing values imputed 10% below range for the oceanbuoys dataset. Values are colored by missingness of humidity. Humidity is missing for low air and sea temperatures, and is missing for one year and one location.
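
For a sense of how little code such visuals take, here is a minimal sketch on the airquality data. It is my own reconstruction rather than the code from the paper, and it assumes the function names as I know them from the naniar and visdat documentation (vis_miss, bind_shadow, geom_miss_point); please check the vignettes if your installed versions differ.

```r
library(ggplot2)
library(naniar)
library(visdat)

# heatmap of missingness; cluster = TRUE orders the rows by clustering
missing_heatmap <- vis_miss(airquality, cluster = TRUE)

# nabular data: every variable gets a "shadow" column such as Ozone_NA
airquality_nabular <- bind_shadow(airquality)

# scatterplot in which missing values are placed below the observed range
missing_scatter <- ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point()

print(names(airquality_nabular))
```

Printing missing_heatmap or missing_scatter draws the respective plot.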

Please do check out the original open access paper and the CRAN vignettes associated with the packages!

Simple Correlation Analysis in R using Tidyverse Principles

R’s standard correlation functionality (stats::cor) seems very impractical to the new programmer: it returns a matrix and has some pretty poor defaults. Simon Jackson thought the same, so he wrote a new tidyverse-compatible package: corrr!

Simon wrote some practical R code that has helped me out greatly before (e.g., color palettes), but this new package is just great. He provides an elaborate walkthrough on his own blog, which I can highly recommend, but I copied some teasers below.
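
To make the complaint about the defaults concrete, here is a small base R sketch using the built-in mtcars and airquality data (the object names are my own):

```r
# cor() on a data frame returns a plain matrix, not a data frame
cor_matrix <- cor(mtcars)
print(class(cor_matrix))

# the default use = "everything" returns NA as soon as a value is missing
cor_default <- cor(airquality$Ozone, airquality$Temp)
print(cor_default)

# pairwise deletion has to be requested explicitly
cor_pairwise <- cor(airquality$Ozone, airquality$Temp,
                    use = "pairwise.complete.obs")
print(cor_pairwise)
```

As far as I know, corrr::correlate() uses pairwise complete observations by default and returns a data frame, which is what makes it easier to pipe into other tidyverse functions.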

Diagram showing how the new functionality of `corrr` works.

Apart from corrr::correlate to retrieve a correlation data frame and corrr::stretch to turn that data frame into a long format, the new package includes corrr::focus, which can be used to simultaneously select the columns and filter the rows of the variables focused on. For example:

# install.packages("tidyverse")
library(tidyverse)

# install.packages("corrr")
library(corrr)

# install.packages("here")
library(here)

dir.create(here::here("images")) # create an images directory

mtcars %>%
  corrr::correlate() %>%
  # use mirror = TRUE to not only select columns but also filter rows
  corrr::focus(mpg:hp, mirror = TRUE) %>% 
  corrr::network_plot(colors = c("red", "green")) %>%
  ggplot2::ggsave(
    filename = here::here("images", "mtcars_networkplot.png"),
    width = 5,
    height = 5
    )

With corrr::network_plot you get an immediate sense of the relationships in your data.
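
The corrr::stretch function mentioned earlier deserves a quick example too. The sketch below (object names are my own) turns the mtcars correlations into a long data frame with one row per variable pair and sorts the pairs by strength:

```r
library(dplyr)
library(corrr)

strongest_pairs <- mtcars %>%
  correlate(quiet = TRUE) %>%
  # drop the upper triangle so that every pair appears only once
  shave() %>%
  # long format with columns x, y and r; na.rm drops the shaved cells
  stretch(na.rm = TRUE) %>%
  arrange(desc(abs(r)))

print(head(strongest_pairs, 3))
```

From here, the result is an ordinary data frame that can be filtered, joined, or plotted like any other.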

Let’s try some different visualizations:

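mtcars %>%
  corrr::correlate() %>%
  corrr::focus(mpg) %>% 
  # the first column is called `term` in recent corrr versions
  # (it was called `rowname` in older ones)
  dplyr::mutate(term = reorder(term, mpg)) %>%
  ggplot2::ggplot(ggplot2::aes(term, mpg)) +
  # color each bar based on the direction of the correlation
  ggplot2::geom_col(ggplot2::aes(fill = mpg >= 0)) + 
  ggplot2::coord_flip() ->
  mpg_barplot

# save in a separate step: adding ggsave() to a plot with `+`
# fails in recent ggplot2 versions
ggplot2::ggsave(
  filename = here::here("images", "mtcars_mpg-barplot.png"),
  plot = mpg_barplot,
  width = 5,
  height = 5
)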

The tidy correlation data frames can be easily piped into a ggplot2 function call

corrr also provides some very helpful functionality to display correlations. Take, for instance, corrr::fashion and corrr::shave:

mtcars %>%
  corrr::correlate() %>%
  corrr::focus(mpg:hp, mirror = TRUE) %>%
  # converts the upper triangle (default) to missing values
  corrr::shave() %>%
  # converts a correlation df into clean matrix
  corrr::fashion() %>%
  readr::write_excel_csv(here::here("correlation-matrix.csv"))

Exporting a nice looking correlation matrix has never been this easy.

Finally, there is the great function corrr::rplot, which generates an amazing correlation overview visual in a single line. Here, however, it is combined with corrr::rearrange to make sure that closely related variables are actually located close together on the axes, and again the upper half is shaved away:

mtcars %>%
  corrr::correlate() %>%
  # Re-arrange a correlation data frame 
  # to group highly correlated variables closer together.
  corrr::rearrange(method = "MDS", absolute = FALSE) %>%
  corrr::shave() %>% 
  corrr::rplot(shape = 19, colors = c("red", "green")) %>%
  ggplot2::ggsave(
    filename = here::here("images", "mtcars_correlationplot.png"),
    width = 5,
    height = 5
  )

Generate fantastic single-line correlation overviews with corrr::rplot

For more functionality, please visit Simon’s blog and/or the associated GitHub page. If you copy the code above and play around with it, be sure to work in an RStudio project, or else the here::here() calls might misbehave.