Author: Paul van der Laken

Multimapping in R, by Ilya Kashnitsky

Nothing beats a aesthetically-pleasing data visualization in the form of a map (see evidence here, here, here, or here).

Moreover, we’ve already witnessed some great R tutorials by Ilya Kashnitsky before (see Animated Snow in R).

These two come together in Ilya’s recent post on subplots in ggplot2 maps, with which he completely amazed me. The creation process is actually easier than the end result makes it look: make several visualizations and add them as ggplot2::annotation_custom() to your main ggplot2 map — the same as if you are adding a logo to your plot. Enjoy:

Here you can find Ilya’s original blog and the associated R script.

PyData, London 2018

PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

April 2018, a PyData conference was held in London, with three days of super interesting sessions and hackathons. While I couldn’t attend in person, I very much enjoy reviewing the sessions at home as all are shared open access on YouTube channel PyDataTV!

In the following section, I will outline some of my favorites as I progress through the channel:

Winning with simple, even linear, models:

One talk that really resonated with me is Vincent Warmerdam‘s talk on “Winning with Simple, even Linear, Models“. Working at GoDataDriven, a data science consultancy firm in the Netherlands, Vincent is quite familiar with deploying deep learning models, but is also midly annoyed by all the hype surrounding deep learning and neural networks. Particularly when less complex models perform equally well or only slightly less. One of his quote’s nicely sums it up:

“Tensorflow is a cool tool, but it’s even cooler when you don’t need it!”

— Vincent Warmerdam, PyData 2018

In only 40 minutes, Vincent goes to show the finesse of much simpler (linear) models in all different kinds of production settings. Among others, Vincent shows:

how to solve the XOR problem with linear models
how to win at timeseries with radial basis features
how to use weighted regression to deal with historical overfitting
how deep learning models introduce a new theme of horror in production
how to create streaming models using passive aggressive updating
how to build a real-time video game ranking system using mere histograms
how to create a well performing recommender with two SQL tables
how to rock at data science and machine learning using Python, R, and even Stan

R tips and tricks

Below are a dozen of very specific R tips and tricks. Some are valuable, useful, or boost your productivity. Others are just geeky funny.

More general helpful R packages and resources can be found in this list.

If you have additions, please comment below or contact me!

Completely new to R? → Start here!

RStudio tricks
General tips
Base R tricks
R Markdown tricks
Data manipulation tricks
Data visualization tricks

Funny tricks
Easter eggs

Join 385 other subscribers

RStudio

RStudio Addins
RStudio Keyboard Shortcuts
R Studio easy tricks: tearable panes, command history, renaming in scope, outlining, snippets, and more
Working with R projects and here
Working with code snippets
Working with code snippets (video)
Stop RStudio from asking to save workspace
Automatically save workspace in case of a crash / errors
Edit several lines of code at once
Press ALT + left mousebutton to select and write on multiple lines simultaneously.
Press ALT + - to insert a <- operator
Press CTRL + SHIFT + M to insert a %>% operator
Press CTRL + SHIFT + F to search all files in the directory or project
Press CTRL + UP to access navigate your console history
Rename all variables with same name (rename in scope)
Press CMD + ALT + SHIFT + M to rename variable within scope: to rename all/multiple occurrences of a variable in a script
Press TAB inside “” (quotation marks / an empty string) to select a filename from your current directory, or to autocomplete a filename you started typing

Many more shortkeys available here online, and in your RStudio under Tools → Keyboard Shortcuts Help.

General

Disclaimer: This page contains one or more links to Amazon.
Any purchases made through those links provide us with a small commission that helps to host this blog.

Useful base functions

str() – explore structure of R object
trimws() – trim trailing and/or leading whitespaces
dput() – dump an R object in form of R code
cut()– categorize values into intervals
intersect() – returns similar elements in two vectors
union() – find intersecting items in two vectors
setdiff() – returns different elements in two vectors
interaction() – computes a factor which represents the interaction of the given factors
formatC()can be used to round numbers and force trailing zero’s
formatC() and sprintf() can be used to add leading/trailing characters
expand.grid() – create a data frame from all combinations of the supplied vectors or factors
seq_along(myvec) – generates a vector of 1:length(myvec)
Initiate an empty dataframe with header names
Functional programming tricks:
- switch() can replace elaborate ifelse statements (see also)
- match.arg() can check for arguments and values
- The null-default operator (%||%) returns the first value that is not NULL
Convert a vector of strings to title case
Quickly map a new set of values to an existing vector
Calculate the derivative of a function expression
Specify options() in your script:
- Prevent scientific notation using options(scipen = 999)
- Prevent automatic factor columns using options(stringsAsFactors = FALSE)
- Use options(width = 60) to change the default width of console output
- Use options(max.print = 100) to change the default number of values printed in the console

Back to Table of Contents

R Markdown

Pimp my RMD: Overview of many R markdown tricks by Yan Holtz
Save compiled images in folder with markdown
Add caption to compiled tables with markdown
Tabsets in markdown
Foldable html content in markdown
Reuse code chunks in markdown
Generate Word documents with markdown
Open url’s in a new window with[text](url){target = "_blank} in markdown
Use #<< to highlight code
Move to next xaringan slide upon click (or Enter)
Convert an R Markdown file (.Rmd) into an R script (.R) with
knitr::purl(input, output, documentation = 2)
Use CTRL + SHIFT + 1:4 to zoom in on any single of your RStudio panels. Use ALT + CTRL + SHIFT + 0 to zoom back out.
knitr::read_chunk("your_script_name.R") can be used to source in scripts that reside outside your current markdown file
Use animations in your markdown files with the gganimate package and "header-includes: - \usepackage{animate} in your YAML preamble
Create a searchable, sortable HTML table in 1 line of code with DT::datatable(mydf, filter = 'top')

Data manipulation

readr::parse_number extracts the numbers from raw / scraped text
stringr::str_pad can be used to add leading or trailing characters (like zero’s)
dplyr tricks
dplyr::case_when replaces elaborate ifelse statements (Video)
dplyr::everything in combination with dplyr::select to reorder columns
Quickly count / tally observations within groups with dplyr::count, dplyr::tally, and dplyr::add_count and dplyr::add_tally
Quickly filter the top categories / groups based on a variable with forcats::fct_lump
Apply the same filter to multiple columns with dplyr::filter_all or dplyr::filter_if in combination with dplyr::all_vars and dplyr::any_vars
dplyr::group_by_if performs quick conditional grouping
Perform rowwise mutations / calculations using dplyr::rowwise
purrr tricks
purrr::map_df to read in and merge all data files in a folder
Combine purr::map_df and fs::dir_ls to read in and merge all data files following a specific pattern in a folder
Combine list.files and purrr::map_df to read in and merge all data files in a folder
broom::tidy puts your model results in a tidy data frame
Simpler correlation analysis with corrr
df %>% .$column_name or df %$% column_name can retrieve a column from a tibble
dplyr::coalesce finds the one value contained in many columns with missing values
Display a fraction between 0 and 1 as a percentage with scales::percent(myfraction)
Convert numbers that came in as strings with commas to R numbers with readr::parse_number(mydf$mycol)

Data visualization

colors() to see the names of all built-in colors
GGally::ggpairs for beautiful pair-wise correlation plots
tidyr::complete to get barplot spacing right
Quickly visualize your whole dataset
Create custom, corporate, reproducible color palettes and custom discrete color scales
Standardize the colors of groups in your visualizations using named vectors
theme_set to set a default ggplot2 theme
Create your own ggplot2 theme:
Rearranging values and axis within ggplot2 facets
Add line labels at the end of geom_lines by Simon Jackson
Add + NULL to the end of your ggplot2 chain during development
Add clip = "off" to draw outside the plot panel
Remove point borders with stroke = 0
Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
Combine plots using patchwork or cowplot
Add a (corporate) logo to your plot using magick
Use animations in your markdown files with the gganimate package and "header-includes: - \usepackage{animate} in your YAML preamble
If you pass a function to the data-argument in a geom_*, then it applies that function to the data!
Generate distributions in ggplot2 using the stat_function function. Normal distributions, student t-distributions, beta distributions, anything. See also here.

Back to Table of Contents

Fun

Easter eggs

Run ????"", via Reddit
Run example(readLine), via DecisionStats
Run ?.Internal, via DecisionStats

Join 385 other subscribers

Back to Table of Contents

Generating Pusheen with AI

Zack Nado wrote the best machine learning application I’ve seen so far: a neural network architecture that generates new Pusheen pictures.

Image result for pusheen — This is an orginal Pusheen picture.

In his blog, Zack describes his generative adversarial network (GAN) , a special type of machine learning architecture where two neural networks try to fool each other. Zack first gave the discriminator network some real Pusheen images, so it gets an idea of what Pusheen looks like. Next, the generator network gets a bunch of random numbers so it can generate completely new (fake) images. These generated images are then fed back into the discriminator, so it knows what generated images look like. Zack repeated this process several hundred thousand times, so he obtained a generator network that’s great at making new Pusheen images which the discriminator (nearly) can’t dinstinguish from the original, real ones. Below is the learning process of the generator network visualized:

ezgif.com-video-to-gif — Samples output by the generator network. It learns distinctive features of “real” Pusheen (e.g., tail, eyes, ears) over time [original]

In the end, the generated images are very much like the real Pusheen. Zack added an interactive module (using Tensorflow.js) to the blog so you can generate some Pusheens yourself. (it didn’t work for me though…) On a final note, Zack wrote the orginal blog both in plain English, for non-experts, and in jargon, for the more experienced data scientists. I highly recommend you read either one of those versions!

Some of the Pusheen’s generated by Zack’s GAN [original]

Predicting Employee Turnover at SIOP 2018

The 2018 annual Society for Industrial and Organizational Psychology (SIOP) conference featured its first-ever machine learning competition. Teams competed for several months in predicting the enployee turnover (or churn) in a large US company. A more complete introduction as presented at the conference can be found here. All submissions had to be open source and the winning submissions have been posted in this GitHub repository. The winning teams consist of analysts working at WalMart, DDI, and HumRRO. They mostly built ensemble models, in Python and/or R, combining algorithms such as (light) gradient boosted trees, neural networks, and random forest analysis.

Interactive Explanation of Network and Graph Principles

Why do groups of people act smart, dumb, kind, or cruel? People behave in strange ways, particularly when they are able to influence one another. Both good and bad things can happen when people interact and behave in network structures. On the bright side, you must be familiar with the wisdom of the crowd, where the aggregated knowledge of a group is more valuable than its sum? Ensemble algorithms – like random forest analysis – rely on this positive principle.

On the dark side, are you familiar with the phenomenon called the tragedy of the commons, where shared resource-systems collapse because individuals behave in their self-interest? Or psychological phenomena such as groupthink, where groups of people make irrational decisions due to social issues? The recent spread of fake news and misinformation is also stimulated by network interactions. In these cases, we could speak of the madness of the crowd.

Nicky Case made a great interactive walkthrough explaining why and when networks of people become wise or mad. You are tasked to change and simulate network interactions while Nicky explains concepts such as (complex) contagion, the majority illusion paradox, bonding and bridging, and small world networks. In the references, Nicky provides links to scientific papers explaining these concepts in more detail. I highly suggest you check out her website here.