Multimapping in R, by Ilya Kashnitsky

Nothing beats an aesthetically pleasing data visualization in the form of a map (see evidence here, here, here, or here).

Moreover, we’ve already witnessed some great R tutorials by Ilya Kashnitsky before (see Animated Snow in R).

These two come together in Ilya’s recent post on subplots in ggplot2 maps, with which he completely amazed me. The creation process is actually easier than the end result makes it look: make several visualizations and add them as ggplot2::annotation_custom() to your main ggplot2 map — the same as if you are adding a logo to your plot. Enjoy:

Here you can find Ilya’s original blog and the associated R script.

PyData, London 2018

PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

In April 2018, a PyData conference was held in London, with three days of super interesting sessions and hackathons. While I couldn’t attend in person, I very much enjoy reviewing the sessions at home, as all are shared open access on the YouTube channel PyDataTV!

In the following section, I will outline some of my favorites as I progress through the channel:

Winning with simple, even linear, models:

One talk that really resonated with me is Vincent Warmerdam’s “Winning with Simple, even Linear, Models”. Working at GoDataDriven, a data science consultancy firm in the Netherlands, Vincent is quite familiar with deploying deep learning models, but he is also mildly annoyed by all the hype surrounding deep learning and neural networks, particularly when less complex models perform equally well or only slightly worse. One of his quotes sums it up nicely:

“Tensorflow is a cool tool, but it’s even cooler when you don’t need it!”

— Vincent Warmerdam, PyData 2018

In only 40 minutes, Vincent demonstrates the finesse of much simpler (linear) models in many different production settings. Among other things, he shows:

how to solve the XOR problem with linear models
how to win at timeseries with radial basis features
how to use weighted regression to deal with historical overfitting
how deep learning models introduce a new theme of horror in production
how to create streaming models using passive aggressive updating
how to build a real-time video game ranking system using mere histograms
how to create a well performing recommender with two SQL tables
how to rock at data science and machine learning using Python, R, and even Stan
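
To make the first item in the list concrete: one common way to let a linear model solve XOR is to add an interaction feature. The sketch below is not taken from the talk. It assumes a plain perceptron (a linear classifier), a hypothetical helper `train_perceptron`, and the product `x1 * x2` as the extra feature, and it compares the raw inputs with the extended ones.

```python
# XOR truth table: inputs (x1, x2) and the target label.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def raw_features(x1, x2):
    """Bias term plus the two raw inputs."""
    return [1, x1, x2]


def interaction_features(x1, x2):
    """Same as raw_features, plus the product x1 * x2 (assumed extra feature)."""
    return [1, x1, x2, x1 * x2]


def predict(weights, features):
    """Linear decision rule: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * f for w, f in zip(weights, features)) > 0 else 0


def train_perceptron(featurize, epochs=100):
    """Fit a perceptron (a linear classifier) on the XOR table."""
    weights = [0] * len(featurize(0, 0))
    for _ in range(epochs):
        for (x1, x2), label in XOR:
            features = featurize(x1, x2)
            error = label - predict(weights, features)  # -1, 0, or +1
            weights = [w + error * f for w, f in zip(weights, features)]
    return weights


def accuracy(weights, featurize):
    """Share of the four XOR rows that the linear rule classifies correctly."""
    hits = sum(predict(weights, featurize(x1, x2)) == label for (x1, x2), label in XOR)
    return hits / len(XOR)


raw_acc = accuracy(train_perceptron(raw_features), raw_features)
inter_acc = accuracy(train_perceptron(interaction_features), interaction_features)
print(f"raw inputs: {raw_acc:.2f}, with x1*x2: {inter_acc:.2f}")
```

The model stays linear in its weights throughout. Only the feature set changes, which is what makes the four XOR points separable.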

R tips and tricks

Below are dozens of very specific R tips and tricks. Some are valuable or useful, or will boost your productivity. Others are just geeky fun.

More general helpful R packages and resources can be found in this list.

If you have additions, please comment below or contact me!

Completely new to R? → Start here!

RStudio tricks
General tips
Base R tricks
R Markdown tricks
Data manipulation tricks
Data visualization tricks

Funny tricks
Easter eggs

RStudio

RStudio Addins
RStudio Keyboard Shortcuts
R Studio easy tricks: tearable panes, command history, renaming in scope, outlining, snippets, and more
Working with R projects and here
Working with code snippets
Working with code snippets (video)
Stop RStudio from asking to save workspace
Automatically save workspace in case of a crash / errors
Edit several lines of code at once
Press ALT + left mousebutton to select and write on multiple lines simultaneously.
Press ALT + - to insert a <- operator
Press CTRL + SHIFT + M to insert a %>% operator
Press CTRL + SHIFT + F to search all files in the directory or project
Press CTRL + UP to navigate your console history
Rename all variables with same name (rename in scope)
Press CMD + ALT + SHIFT + M to rename a variable within scope, that is, to rename all occurrences of that variable in the current scope at once
Press TAB inside “” (quotation marks / an empty string) to select a filename from your current directory, or to autocomplete a filename you started typing

Many more shortcuts are available online here, and in RStudio under Tools → Keyboard Shortcuts Help.

General

Useful base functions

str() – explore structure of R object
trimws() – trim trailing and/or leading whitespaces
dput() – dump an R object in form of R code
cut() – categorize values into intervals
intersect() – returns the elements that two vectors have in common
union() – returns the unique elements found in either of two vectors
setdiff() – returns the elements of the first vector that are not in the second
interaction() – computes a factor which represents the interaction of the given factors
formatC() can be used to round numbers and force trailing zeros
formatC() and sprintf() can be used to add leading/trailing characters
expand.grid() – create a data frame from all combinations of the supplied vectors or factors
seq_along(myvec) – generates the indices 1, 2, ..., length(myvec); unlike 1:length(myvec), it returns an empty vector when myvec is empty
Initiate an empty dataframe with header names
Functional programming tricks:
- switch() can replace elaborate if / else if chains (see also)
- match.arg() can check for arguments and values
- The null-default operator (%||%) returns the first value that is not NULL
Convert a vector of strings to title case
Quickly map a new set of values to an existing vector
Calculate the derivative of a function expression
Specify options() in your script:
- Prevent scientific notation using options(scipen = 999)
- Prevent automatic factor columns using options(stringsAsFactors = FALSE) (this is the default since R 4.0.0)
- Use options(width = 60) to change the default width of console output
- Use options(max.print = 100) to change the default number of values printed in the console

Back to Table of Contents

R Markdown

Pimp my RMD: Overview of many R markdown tricks by Yan Holtz
Save compiled images in folder with markdown
Add caption to compiled tables with markdown
Tabsets in markdown
Foldable html content in markdown
Reuse code chunks in markdown
Generate Word documents with markdown
Open URLs in a new window with [text](url){target="_blank"} in markdown
Use #<< to highlight code
Move to next xaringan slide upon click (or Enter)
Convert an R Markdown file (.Rmd) into an R script (.R) with knitr::purl(input, output, documentation = 2)
Use CTRL + SHIFT + 1:4 to zoom in on any single one of your RStudio panes. Use ALT + CTRL + SHIFT + 0 to zoom back out.
knitr::read_chunk("your_script_name.R") can be used to source in scripts that reside outside your current markdown file
Use animations in your markdown files with the gganimate package and header-includes: - \usepackage{animate} in your YAML preamble
Create a searchable, sortable HTML table in 1 line of code with DT::datatable(mydf, filter = 'top')

Data manipulation

readr::parse_number extracts the numbers from raw / scraped text
stringr::str_pad can be used to add leading or trailing characters (like zeros)
dplyr tricks
dplyr::case_when replaces elaborate ifelse statements (Video)
dplyr::everything in combination with dplyr::select to reorder columns
Quickly count / tally observations within groups with dplyr::count, dplyr::tally, dplyr::add_count, and dplyr::add_tally
Quickly filter the top categories / groups based on a variable with forcats::fct_lump
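Apply the same filter to multiple columns with dplyr::filter_all or dplyr::filter_if in combination with dplyr::all_vars and dplyr::any_vars (in recent dplyr versions these scoped verbs are superseded by dplyr::filter with dplyr::if_all and dplyr::if_any)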
dplyr::group_by_if performs quick conditional grouping
Perform rowwise mutations / calculations using dplyr::rowwise
purrr tricks
purrr::map_df to read in and merge all data files in a folder
Combine purrr::map_df and fs::dir_ls to read in and merge all data files following a specific pattern in a folder
Combine list.files and purrr::map_df to read in and merge all data files in a folder
broom::tidy puts your model results in a tidy data frame
Simpler correlation analysis with corrr
df %>% .$column_name or df %$% column_name can retrieve a column from a tibble
dplyr::coalesce returns, for each row, the first non-missing value across several columns
Display a fraction between 0 and 1 as a percentage with scales::percent(myfraction)
Convert numbers that came in as strings with commas to R numbers with readr::parse_number(mydf$mycol)

Data visualization

colors() to see the names of all built-in colors
GGally::ggpairs for beautiful pair-wise correlation plots
tidyr::complete to get barplot spacing right
Quickly visualize your whole dataset
Create custom, corporate, reproducible color palettes and custom discrete color scales
Standardize the colors of groups in your visualizations using named vectors
theme_set to set a default ggplot2 theme
Create your own ggplot2 theme
Rearranging values and axes within ggplot2 facets
Add line labels at the end of geom_lines by Simon Jackson
Add + NULL to the end of your ggplot2 chain during development
Add clip = "off" to your coordinate system (e.g., coord_cartesian(clip = "off")) to draw outside the plot panel
Remove point borders with stroke = 0
Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
Combine plots using patchwork or cowplot
Add a (corporate) logo to your plot using magick
Use animations in your markdown files with the gganimate package and header-includes: - \usepackage{animate} in your YAML preamble
If you pass a function to the data-argument in a geom_*, then it applies that function to the data!
Generate distributions in ggplot2 using the stat_function function: normal distributions, Student’s t-distributions, beta distributions, anything. See also here.

Fun

Easter eggs

Run ????"", via Reddit
Run example(readline), via DecisionStats
Run ?.Internal, via DecisionStats

Predicting Employee Turnover at SIOP 2018

The 2018 annual Society for Industrial and Organizational Psychology (SIOP) conference featured its first-ever machine learning competition. Teams competed for several months in predicting employee turnover (or churn) in a large US company. A more complete introduction, as presented at the conference, can be found here. All submissions had to be open source, and the winning submissions have been posted in this GitHub repository. The winning teams consist of analysts working at WalMart, DDI, and HumRRO. They mostly built ensemble models, in Python and/or R, combining algorithms such as (light) gradient boosted trees, neural networks, and random forests.

A Categorical Spatial Interpolation Tutorial in R

Timo Grossenbacher works as reporter/coder for SRF Data, the data journalism unit of Swiss Radio and TV. He analyzes and visualizes data and investigates data-driven stories. On his website, he hosts a growing list of cool projects. One of his recent blogs covers categorical spatial interpolation in R. The end result of that blog looks amazing:

This map was built with data Timo crowdsourced for one of his projects. With this data, Timo took the following steps, which are covered in his tutorial:

Read in the data: first the geometries (Germany’s political boundaries), then the point data on which the interpolation will be based.
Preprocess the data (simplify geometries, convert CSV point data into an sf object, reproject the geodata into the ETRS CRS, clip the point data to Germany, so data outside of Germany is discarded).
Then, a regular grid (a raster without “data”) is created. Each grid point in this raster will later be interpolated from the point data.
Run the spatial interpolation with the kknn package. Since this is quite computationally and memory intensive, the resulting raster is split up into 20 batches, and each batch is computed by a single CPU core in parallel.
Visualize the resulting raster with ggplot2.
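
The core interpolation step can be illustrated in a few lines. Timo’s tutorial uses R and the kknn package; the Python sketch below is only a simplified stand-in that assumes a plain, unweighted k-nearest-neighbors majority vote, made-up sample points, and hypothetical helper names (`knn_category`, `interpolate_grid`). It skips reprojection, clipping, and parallel batches.

```python
from collections import Counter
from math import dist


def knn_category(point, samples, k=3):
    """Return the most common category among the k samples nearest to point."""
    nearest = sorted(samples, key=lambda sample: dist(point, sample[0]))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]


def interpolate_grid(samples, xs, ys, k=3):
    """Assign a category to every (x, y) cell of a regular grid."""
    return {(x, y): knn_category((x, y), samples, k) for x in xs for y in ys}


# Made-up survey points: ((x, y), chosen category).
samples = [
    ((0.0, 0.0), "A"), ((1.0, 0.5), "A"), ((0.5, 1.0), "A"),
    ((4.0, 4.0), "B"), ((5.0, 4.5), "B"), ((4.5, 5.0), "B"),
]

# A 6 x 6 grid without data; each cell is filled in from the nearby points.
grid = interpolate_grid(samples, xs=range(6), ys=range(6), k=3)
print(grid[(0, 0)], grid[(5, 5)])
```

Because every grid cell is computed independently of the others, the work splits naturally into batches that can run on separate CPU cores, as described in the steps above.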

All code for the above process can be accessed on Timo’s GitHub. The georeferenced points underlying the interpolation look like the map below, where each point represents the location of a person who selected a certain pronunciation in an online survey. More details on the crowdsourced pronunciation project can be found here.

Another of Timo’s R maps, before he applied k-nearest neighbors to these crowdsourced data. [original]

If you want to know more, please read the original blog or follow Timo’s new DataCamp course called Communicating with Data in the Tidyverse.

Open Source Visual Inspector for Neuroevolution (VINE)

In optimizing their transportation services, Uber uses evolutionary strategies and genetic algorithms to train deep neural networks through reinforcement learning. A lot of difficult words in one sentence; you can imagine the complexity of the process.

Because it is particularly difficult to observe the underlying dynamics of this learning process in neural network optimization, Uber built VINE, a Visual Inspector for NeuroEvolution. VINE helps to discover how evolutionary strategies and genetic optimization perform under the hood. In a recent article, they demonstrate how VINE works on the MuJoCo Humanoid Locomotion task.

[…] In the Humanoid Locomotion Task, each pseudo-offspring neural network controls the movement of a robot, and earns a score, called its fitness, based on how well it walks. [Evolutionary principles] construct the next parent by aggregating the parameters of pseudo-offspring based on these fitness scores […]. The cycle then repeats.

Uber, March 2018, link
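
The cycle described in the quote can be sketched on a toy problem. This is not Uber’s code. It assumes a simple evolution strategy in which the parent moves along the fitness-weighted sum of the random perturbations that produced its pseudo-offspring, with a made-up two-parameter `fitness` function standing in for the walking score, and with hypothetical names and settings throughout.

```python
import random
import statistics


def fitness(params, target=(3.0, -2.0)):
    """Toy fitness (hypothetical): higher is better, maximal at `target`."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def evolve(parent, generations=200, n_offspring=50, sigma=0.1, lr=0.03, seed=0):
    """Repeatedly perturb the parent, score the offspring, and aggregate."""
    rng = random.Random(seed)
    scale = lr / (n_offspring * sigma)
    for _ in range(generations):
        # Each pseudo-offspring is the parent plus scaled Gaussian noise.
        noises = [[rng.gauss(0.0, 1.0) for _ in parent] for _ in range(n_offspring)]
        scores = [
            fitness([p + sigma * e for p, e in zip(parent, noise)])
            for noise in noises
        ]
        mean, spread = statistics.fmean(scores), statistics.pstdev(scores)
        if spread == 0:
            continue  # all offspring scored the same, so there is no signal
        # Standardized fitness: above-average offspring get positive weight.
        weights = [(score - mean) / spread for score in scores]
        # The next parent aggregates the perturbations by their weights.
        parent = [
            p + scale * sum(w * noise[i] for w, noise in zip(weights, noises))
            for i, p in enumerate(parent)
        ]
    return parent


start = [0.0, 0.0]
final = evolve(start)
print(f"fitness before: {fitness(start):.3f}, after: {fitness(final):.3f}")
```

In a tool like VINE, each generation’s parent and its cloud of pseudo-offspring (here `parent` and the perturbed copies) are the points that get plotted and inspected.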

VINE plots parent neural networks and their pseudo-offspring according to their performance. Users can then interact with these plots to:

visualize parents, top performance, and/or the entire pseudo-offspring cloud of any generation,
compare between and within generation performance,
and zoom in on any pseudo-offspring (points) in the plot to display performance information.

The GIFs below demonstrate what VINE is capable of displaying: