Category: r

Visualizing model uncertainty

Visualizing model uncertainty

ungeviz is a new R package by Claus Wilke, whom you may know from his amazing work and books on Data Visualization. The package name comes from the German word “Ungewissheit”, which means uncertainty. You can install the developmental version via:

devtools::install_github("clauswilke/ungeviz")

The package includes some bootstrapping functionality that, when combined with ggplot2 and gganimate, can produce some seriousy powerful visualizations. For instance, take the below piece of code:

data(BlueJays, package = "Stat2Data")

# set up bootstrapping object that generates 20 bootstraps
# and groups by variable `KnownSex`
bs <- ungeviz::bootstrapper(20, KnownSex)

ggplot(BlueJays, aes(BillLength, Head, color = KnownSex)) +
  geom_smooth(method = "lm", color = NA) +
  geom_point(alpha = 0.3) +
  # `.row` is a generated column providing a unique row number
  # to all rows in the bootstrapped data frame 
  geom_point(data = bs, aes(group = .row)) +
  geom_smooth(data = bs, method = "lm", fullrange = TRUE, se = FALSE) +
  facet_wrap(~KnownSex, scales = "free_x") +
  scale_color_manual(values = c(F = "#D55E00", M = "#0072B2"), guide = "none") +
  theme_bw() +
  transition_states(.draw, 1, 1) + 
  enter_fade() + 
  exit_fade()

Here’s what’s happening:

  • Claus loads in the BlueJays dataset, which contains some data on birds.
  • He then runs the ungezviz::bootstrapper function to generate a new dataset of bootstrapped samples.
  • Next, Claus uses ggplot2::geom_smooth(method = "lm") to run a linear model on the orginal BlueJays dataset, but does not color in the regression line (color = NA), thus showing only the confidence interval of the model.
  • Moreover, Claus uses ggplot2::geom_point(alpha = 0.3) to visualize the orginal data points, but slightly faded.
  • Subsequent, for each of the bootstrapped samples (group = .row), Claus again draws the data points (unfaded), and runs linear models while drawing only the regression line (se = FALSE).
  • Using ggplot2::facet_wrap, Claus seperates the data for BlueJays$KnownSex.
  • Using gganimate::transition_states(.draw, 1, 1), Claus prints each linear regression line to a row of the bootstrapped dataset only one second, before printing the next.

The result an astonishing GIF of the regression lines that could be fit to bootstrapped subsamples of the BlueJays data, along with their confidence interval:

One example of the practical use of ungeviz, original on its GitHub page

Another valuable use of the new package is the visualization of uncertainty from fitted models, for example as confidence strips. The below code shows the powerful combination of broom::tidy with ungeviz::stat_conf_strip to visualize effect size estimates of a linear model along with their confidence intervals.

library(broom)
#> 
#> Attaching package: 'broom'
#> The following object is masked from 'package:ungeviz':
#> 
#>     bootstrap

df_model <- lm(mpg ~ disp + hp + qsec, data = mtcars) %>%
  tidy() %>%
  filter(term != "(Intercept)")

ggplot(df_model, aes(estimate = estimate, moe = std.error, y = term)) +
  stat_conf_strip(fill = "lightblue", height = 0.8) +
  geom_point(aes(x = estimate), size = 3) +
  geom_errorbarh(aes(xmin = estimate - std.error, xmax = estimate + std.error), height = 0.5) +
  scale_alpha_identity() +
  xlim(-2, 1)
Visualizing effect size estimates with ungeviz, via its GitHub page
Very curious to see where this package develops into. What use cases can you think of?

 

Add a self-explantory legend to your ggplot2 boxplots

Add a self-explantory legend to your ggplot2 boxplots

Laura DeCicco found that non-R users keep asking her what her box plots exactly mean or demonstrate. In a recent blog post, she therefore breaks down the calculations into easy-to-follow chunks of code. Even better, she included the source code to make boxplots that come with a very elaborate default legend:

Chloride by month, styled.

As you can see, the above contains much more and easier to understand information than the original ggplot2 boxplot below.

ggplot2 defaults for boxplots.

Laura wrote the custom function ggplot_box_legend() (see source code below and in Laura’s blog), which uses the cowplot package to paste the explanation to the box plot. All you need to do is call the legend function just before you run your ggplot2 boxplot call.

ggplot_box_legend <- function(family = "serif"){
  
  # Create data to use in the boxplot legend:
  set.seed(100)

  sample_df <- data.frame(parameter = "test",
                        values = sample(500))

  # Extend the top whisker a bit:
  sample_df$values[1:100] <- 701:800
  # Make sure there's only 1 lower outlier:
  sample_df$values[1] <- -350
  
  # Function to calculate important values:
  ggplot2_boxplot <- function(x){
  
    quartiles <- as.numeric(quantile(x, 
                                     probs = c(0.25, 0.5, 0.75)))
    
    names(quartiles) <- c("25th percentile", 
                          "50th percentile\n(median)",
                          "75th percentile")
    
    IQR <- diff(quartiles[c(1,3)])
  
    upper_whisker <- max(x[x < (quartiles[3] + 1.5 * IQR)])
    lower_whisker <- min(x[x > (quartiles[1] - 1.5 * IQR)])
      
    upper_dots <- x[x > (quartiles[3] + 1.5*IQR)]
    lower_dots <- x[x < (quartiles[1] - 1.5*IQR)]
  
    return(list("quartiles" = quartiles,
                "25th percentile" = as.numeric(quartiles[1]),
                "50th percentile\n(median)" = as.numeric(quartiles[2]),
                "75th percentile" = as.numeric(quartiles[3]),
                "IQR" = IQR,
                "upper_whisker" = upper_whisker,
                "lower_whisker" = lower_whisker,
                "upper_dots" = upper_dots,
                "lower_dots" = lower_dots))
  }
  
  # Get those values:
  ggplot_output <- ggplot2_boxplot(sample_df$values)
  
  # Lots of text in the legend, make it smaller and consistent font:
  update_geom_defaults("text", 
                     list(size = 3, 
                          hjust = 0,
                          family = family))
  # Labels don't inherit text:
  update_geom_defaults("label", 
                     list(size = 3, 
                          hjust = 0,
                          family = family))
  
  # Create the legend:
  # The main elements of the plot (the boxplot, error bars, and count)
  # are the easy part.
  # The text describing each of those takes a lot of fiddling to
  # get the location and style just right:
  explain_plot <- ggplot() +     stat_boxplot(data = sample_df,                  aes(x = parameter, y=values),                  geom ='errorbar', width = 0.3) +     geom_boxplot(data = sample_df,                  aes(x = parameter, y=values),                   width = 0.3, fill = "lightgrey") +     geom_text(aes(x = 1, y = 950, label = "500"), hjust = 0.5) +     geom_text(aes(x = 1.17, y = 950,                   label = "Number of values"),               fontface = "bold", vjust = 0.4) +     theme_minimal(base_size = 5, base_family = family) +     geom_segment(aes(x = 2.3, xend = 2.3,                       y = ggplot_output[["25th percentile"]],                       yend = ggplot_output[["75th percentile"]])) +     geom_segment(aes(x = 1.2, xend = 2.3,                       y = ggplot_output[["25th percentile"]],                       yend = ggplot_output[["25th percentile"]])) +     geom_segment(aes(x = 1.2, xend = 2.3,                       y = ggplot_output[["75th percentile"]],                       yend = ggplot_output[["75th percentile"]])) +     geom_text(aes(x = 2.4, y = ggplot_output[["50th percentile\n(median)"]]),                label = "Interquartile\nrange", fontface = "bold",               vjust = 0.4) +     geom_text(aes(x = c(1.17,1.17),                    y = c(ggplot_output[["upper_whisker"]],                         ggplot_output[["lower_whisker"]]),                    label = c("Largest value within 1.5 times\ninterquartile range above\n75th percentile",                             "Smallest value within 1.5 times\ninterquartile range below\n25th percentile")),                   fontface = "bold", vjust = 0.9) +     geom_text(aes(x = c(1.17),                    y =  ggplot_output[["lower_dots"]],                    label = "Outside value"),                vjust = 0.5, fontface = "bold") +     geom_text(aes(x = c(1.9),                    y =  ggplot_output[["lower_dots"]],                    label = "-Value is >1.5 times and"), 
              vjust = 0.5) +
    geom_text(aes(x = 1.17, 
                  y = ggplot_output[["lower_dots"]], 
                  label = "<3 times the interquartile range\nbeyond either end of the box"), 
              vjust = 1.5) +
    geom_label(aes(x = 1.17, y = ggplot_output[["quartiles"]], 
                  label = names(ggplot_output[["quartiles"]])),
              vjust = c(0.4,0.85,0.4), 
              fill = "white", label.size = 0) +
    ylab("") + xlab("") +
    theme(axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid = element_blank(),
          aspect.ratio = 4/3,
          plot.title = element_text(hjust = 0.5, size = 10)) +
    coord_cartesian(xlim = c(1.4,3.1), ylim = c(-600, 900)) +
    labs(title = "EXPLANATION")

  return(explain_plot) 
  
}

ggplot_box_legend()

 

ggstatsplot: Creating graphics including statistical details

ggstatsplot: Creating graphics including statistical details

This pearl had been resting in my inbox for quite a while before I was able to add it to my R resources list. Citing its GitHub pageggstatsplot is an extension of ggplot2 package for creating graphics with details from statistical tests included in the plots themselves and targeted primarily at behavioral sciences community to provide a one-line code to produce information-rich plots. The package is currently maintained and still under development by Indrajeet Patil. Nevertheless, its functionality is already quite impressive. You can download the latest stable version via:

utils::install.packages(pkgs = "ggstatsplot")

Or download the development version via:

devtools::install_github(
  repo = "IndrajeetPatil/ggstatsplot", # package path on GitHub
  dependencies = TRUE,                 # installs packages which ggstatsplot depends on
  upgrade_dependencies = TRUE          # updates any out of date dependencies
)

The package currently supports many different statistical plots, including:

?ggbetweenstats
?ggscatterstats
?gghistostats
?ggpiestats
?ggcorrmat
?ggcoefstats
?combine_plots
?grouped_ggbetweenstats
?grouped_ggscatterstats
?grouped_gghistostats
?grouped_ggpiestats
?grouped_ggcorrmat

Let’s take a closer look at the first one:

ggbetweenstats

This function creates either a violin plot, a box plot, or a mix of two for between-group or between-condition comparisons and additional detailed results from statistical tests can be added in the subtitle. The simplest function call looks like the below, but much more complex information can be added and specified.

set.seed(123) # to get reproducible results

# the functions work approximately the same as ggplot2
ggstatsplot::ggbetweenstats(
  data = datasets::iris, 
  x = Species, 
  y = Sepal.Length,
  messages = FALSE
) +   
# and can be adjusted using the same, orginal function calls
  ggplot2::coord_cartesian(ylim = c(3, 8)) + 
  ggplot2::scale_y_continuous(breaks = seq(3, 8, by = 1))
All pictures copied from the GitHub page of ggstatsplot [original]

ggscatterstats

Not all plots are ggplot2-compatible though, for instance, ggscatterstats is not. Nevertheless, it produces a very powerful plot in my opinion.

ggstatsplot::ggscatterstats(
  data = datasets::iris, 
  x = Sepal.Length, 
  y = Petal.Length,
  title = "Dataset: Iris flower data set",
  messages = FALSE
)
All pictures copied from the GitHub page of ggstatsplot [original]

ggcormat

ggcorrmat is also quite impressive, producing correlalograms with only minimal amounts of code as it wraps around ggcorplot. The defaults already produces publication-ready correlation matrices:

ggstatsplot::ggcorrmat(
  data = datasets::iris,
  corr.method = "spearman",
  sig.level = 0.005,
  cor.vars = Sepal.Length:Petal.Width,
  cor.vars.names = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width"),
  title = "Correlalogram for length measures for Iris species",
  subtitle = "Iris dataset by Anderson",
  caption = expression(
    paste(
      italic("Note"),
      ": X denotes correlation non-significant at ",
      italic("p "),
      "< 0.005; adjusted alpha"
    )
  )
)
All pictures copied from the GitHub page of ggstatsplot [original]

ggcoefstats

Finally, ggcoefstats is a wrapper around GGally::ggcoef, creating a plot with the regression coefficients’ point estimates as dots with confidence interval whiskers. Here’s an example with some detailed specifications:

ggstatsplot::ggcoefstats(
  x = stats::lm(formula = mpg ~ am * cyl,
                data = datasets::mtcars),
  point.color = "red",
  vline.color = "#CC79A7",
  vline.linetype = "dotdash",
  stats.label.size = 3.5,
  stats.label.color = c("#0072B2", "#D55E00", "darkgreen"),
  title = "Car performance predicted by transmission and cylinder count",
  subtitle = "Source: 1974 Motor Trend US magazine"
) +                                    
  ggplot2::scale_y_discrete(labels = c("transmission", "cylinders", "interaction")) +
  ggplot2::labs(x = "regression coefficient",
                y = NULL)

All pictures copied from the GitHub page of ggstatsplot [original]
I for one am very curious to see how Indrajeet will further develop this package, and whether academics will start using it as a default in publishing.

 

Generating Book Covers By Their Words — My Dissertation Cover

Generating Book Covers By Their Words — My Dissertation Cover

As some of you might know, I am defending my PhD dissertation later this year. It’s titled “Data-Driven Human Resource Management: The rise of people analytics and its application to expatriate management” and, over the past few months, I was tasked with designing its cover.

Now, I didn’t want to buy some random stock photo depicting data, an organization, or overly happy employees. I’d rather build something myself. Something reflecting what I liked about the dissertation project: statistical programming and sharing and creating knowledge with others.

Hence, I came up with the idea to use the collective intelligence of the People Analytics community to generate a unique cover. It required a dataset of people analytics-related concepts, which I asked People Analytics professionals on LinkedIn, Twitter, and other channels to help compile. Via a Google Form, colleagues, connections, acquitances, and complete strangers contributed hundreds of keywords ranging from the standard (employees, HRM, performance) to the surprising (monetization, quantitative scissors [which I had to Google]). After reviewing the list and adding some concepts of my own creation, I ended up with 1786 unique words related to either business, HRM, expatriation, data science, or statistics.

I very much dislike wordclouds (these are kind of cool though), but already had a different idea in mind. I thought of generating a background cover of the words relating to my dissertation topic, over which I could then place my title and other information. I wanted to place these keywords randomly, maybe using a color schema, or with some random sizes.

The picture below shows the result of one of my first attempts. I programmed everything in R, writing some custom functionality to generate the word-datasets, the cover-plot, and .png, .pdf, and .gif files as output.

canvas.PNG

Random colors did not produce a pleasing result and I definitely needed more and larger words in order to fill my 17cm by 24cm canvas!

Hence, I started experimenting. Using base R’s expand.grid() and set.seed() together with mapply(), I could quickly explore and generate a large amount of covers based on different parameter settings and random fluctuations.

expand.grid(seed = c(1:3), 
            dupl = c(1:4, seq(5, 30, 5)),
            font = c("sans", "League Spartan"),
            colors = c(blue_scheme, red_scheme, 
                       rainbow_scheme, random_scheme),
            size_mult = seq(1, 3, 0.3),
            angle_sd = c(5, 10, 12, 15)) -> 
  param

mapply(create_textcover, 
       param$seed, param$dupl, 
       param$font, param$colors, 
       param$size_mult, param$angle_sd)

The generation process for each unique cover only took a few seconds, so I would generate a few hundred, quickly browse through them, update the parameters to match my preferences, and then generate a new set. Among others, I varied the color palette used, the size range of the words, their angle, the font used, et cetera. To fill up the canvas, I experimented with repeating the words: two, three, five, heck, even twenty, thirty times. After an evening of generating and rating, I came to the final settings for my cover:

  • Words were repeated twenty times in the dataset.
  • Words were randomly distributed across the canvas.
  • Words placed in random order onto the canvas, except for a select set of relevant words, placed last.
  • Words’ transparency ranged randomly between 0% and 70%.
  • Words’ color was randomly selected out of six colors from this palette of blues.
  • Words’ writing angles were normally distributed around 0 degrees, with a standard deviation of 12 degrees. However, 25% of words were explicitly without angle.
  • Words’ size ranged between 1 and 4 based on a negative binomial distribution (10 * 0.8) resulting in more small than large words. The set of relevant words were explicitly enlarged throughout.

With League Spartan (#thisisparta) loaded as a beautiful custom font, this was the final cover background which I and my significant other liked most:

cover_wordcloud_20-League Spartan-4.png

While I still need to decide on the final details regarding title placement and other details, I suspect that the final cover will look something like below — the white stripe in the middle depicting the book’s back.

coverpaul.png

Now, for the finale, I wanted to visualize the generation process via a GIF. Thomas Lin Pedersen developed this great gganimate package, which builds on the older animation package. The package greatly simplifies creating your own GIFs, as I already discussed in this earlier blog about animated GIFs in R. Anywhere, here is the generation process, where each frame includes the first frame ^ 3.2 words:

cover_wordcloud_20-League Spartan_4.gif

If you are interested in the process, or the R code I’ve written, feel free to reach out!

I’m sharing a digital version of the dissertation online sometime around the defense date: November 9th, 2018. If you’d like a copy, you can still leave your e-mailadress in the Google Form here and I’ll make sure you’ll receive your copy in time!

Transitioning from Excel to R: Dictionary of common functions

Transitioning from Excel to R: Dictionary of common functions

Alyssa Columbus published maintains this GitHub repository with a great tutorial on how to move from Excel to R. Very useful for beginning useRs, the repository’s tutorial includes a translation table between Excel and R functions:

Excel Formula R Function Type
ABS abs Arithmetic
ADDRESS assign Essentials
AND &,&&,all Boolean
AVERAGE, AVG, AVERAGEIF mean Arithmetic
BETADIST pbeta Statistics
BETAINV qbeta Statistics
BINOMDIST pbinom when cumulative,dbinom when not Statistics
CEILING ceiling Arithmetic
CELL str has the same idea Essentials
CHIDIST, CHISQDIST pchisq Statistics
CHIINV, CHISQINV qchisq Statistics
CHITEST chisq.test Statistics
CHOOSE switch Essentials
CLEAN gsub Text
COLS, COLUMNS ncol Essentials
COLUMN col,:,seq Essentials
COMBIN choose Essentals
CONCATENATE paste Text
CONFIDENCE -qnorm(alpha/2)*std/sqrt(n) Statistics
CORREL cor Statistics
COUNT, COUNTIF length Arithmetic
COVAR cov Statistics
CRITBINOM qbinom Statistics
DELTA identical Boolean
EXACT == Boolean
EXP exp Arithmetic
EXPONDIST pexp when cumulative,dexp when not Statistics
FACT factorial Arithmetic
FACTDOUBLE dfactorial in the phangorn package Arithmetic
FDIST pf Statistics
FIND regexpr,grepl,grep Text
FINV qf Statistics
FISHER atanh Arithmetic
FISHERINV tanh Arithmetic
FIXED format,sprintf,formatC Essentials
FLOOR floor Arithmetic
FORECAST predict on an lm object Statistics
FREQUENCY cut,table Arithmetic
FTEST var.test Statistics
GAMMADIST pgamma if last argument T,dgamma if last arg. F Statistics
GAMMAINV qgamma Statistics
GAMMALN lgamma Statistics
GAUSS pnorm(x) - 0.5 Statistics
GCD gcd Arithmetic
GEOMEAN exp(mean(log(x))) Arithmetic
GESTEP >= Boolean
HARMEAN harmonic.mean in the psych package Arithmetic
HLOOKUP match,merge Essentials
HYPGEOMDIST dhyper Statistics
IF if,ifelse Essentials
IFERROR try,tryCatch Essentials
INDEX x[y,z] Essentials
INDIRECT get Essentials
INT as.integer(not for negative numbers),floor Arithmetic
INTERCEPT first element of coef of an lm object Statistics
ISLOGICAL is.logical Boolean
ISNA is.na Boolean
ISNUMBER is.numeric Boolean
ISTEXT is.character Boolean
KURT kurtosis in the moments package Statistics
LARGE sort Statistics
LCM scm in the schoolmath package Arithmetic
LEFT substr Text
LEN, LENGTH nchar Text
LINEST lm Statistics
LN, LOG log Arithmetic
LOG10 log10 Arithmetic
LOGINV qlnorm Statistics
LOGNORMDIST plnorm Statistics
LOWER tolower Text
MATCH match,which Essentials
MAX max (sometimes pmax) Arithmetic
MDETERM det Arithmetic
MEDIAN median Arithmetic
MID substr Text
MIN min (sometimes pmin) Arithmetic
MINVERSE solve Arithmetic
MMULT %*% Arithmetic
MOD %% Arithmetic
MODE as.numeric(names(which.max(table(x)))) Arithmetic
MUNIT diag Arithmetic
N as.numeric Arithmetic
NEGBINOMDIST dnbinom Statistics
NORMDIST, NORMSDIST pnorm when cumulative,dnorm when not Statistics
NORMINV, NORMSINV qnorm Statistics
NOT ! Boolean
NOW date,Sys.time Essentials
OR ` ,
PEARSON cor Statistics
PERCENTILE quantile Statistics
PERCENTRANK ecdf Statistics
PERMUT function(n,k) {choose(n,k)*factorial(k)} Arithmetic
PERMUTATIONA n^k Arithmetic
PHI dnorm Statistics
POISSON ppois when cumulatic,dpois when not Statistics
POWER ^ Arithmetic
PROB ecdf Statistics
PRODUCT prod Arithmetic
PROPER toupper Text
QUARTILE quantile Arithmetic
QUOTIENT %/% Arithmetic
RAND runif Arithmetic
RANDBETWEEN sample Arithmetic
RANK rank Essentials
REPLACE sub,gsub Text
REPT rep and paste or paste0 Text
RIGHT substring Text
ROUND round Arithmetic
ROUNDDOWN floor Arithmetic
ROUNDUP ceiling Arithmetic
ROW row,:,seq Essentials
ROWS nrow Essentials
RSQ summary of lm object Statistics
SEARCH regexpr,grep Text
SIGN sign Arithmetic
SKEW skewness in the moments package Statistics
SLOPE in coef of lm object Statistics
SMALL sort Arithmetic
SQRT sqrt Arithmetic
STANDARDIZE scale Statitics
STD, STDEV sd Arithmetic
STEYX predict on an lm object Statistics
STRING format,sprintf,formatC Text
SUBSTITUTE sub,gsub,paste Essentials
SUM, SUMIF sum Arithmetic
SUMPRODUCT crossprod Arithmetic
TDIST pt Statistics
TEXT format,sprintf,formatC Text
TINV abs(qt(x/2,data)) Statistics
TODAY Sys.Date Essentials
TRANSPOSE t Arithmetic
TREND fitted of an lm object Statistics
TRIM sub Essentials
TRIMMEAN mean(x,trim=tr/2) Arithmetic
TRUNC trunc Essentials
TTEST t.test Statistics
TYPE typeof,mode,class Essentials
UPPER toupper Text
VALUE as.numeric Arithmetic
VAR var Essentials
VLOOKUP match,merge Essentials
WEEKDAY weekdays Essentials
WEIBULL pweibull when cumulative,dweibull when not Statistics
ZTEST pnorm Statistics

 

 

(Time Series) Forecasting: Principles & Practice (in R)

(Time Series) Forecasting: Principles & Practice (in R)

I stumbled across this open access book by Rob Hyndman, the god of time series, and George Athanasopoulos, a colleague statistician / econometrician at Monash University in Melbourne Australia.

Hyndman and Athanasopoulos provide a comprehensive introduction to forecasting methods, accessible and relevant among others for business professionals without any formal training in the area. All R examples in the book assume work build on the fpp2 R package. fpp2 includes all datasets referred to in the book and depends on other R packages including forecast and ggplot2.

Some examples of the analyses you can expect to recreate, ignore the agricultural topic for now ; )

Monthly milk production per cow.
One of the example analysis you will recreate by following the book (Figure 3.3)
Forecasts of egg prices using a random walk with drift applied to the logged data.
You will be forecasting price data using different analyses and adjustments (Figure 3.4)

I highly recommend this book to any professionals or students looking to learn more about forecasting and time series modelling. There is also a DataCamp course based on this book. If you got value out of this free book, be sure to buy a hardcopy as well.