
R tips and tricks

Below are dozens of very specific R tips and tricks. Some are valuable, useful, or boost your productivity. Others are just geeky fun.

More general helpful R packages and resources can be found in this list.

If you have additions, please comment below or contact me!

Completely new to R? → Start here!

RStudio tricks
General tips
Base R tricks
R Markdown tricks
Data manipulation tricks
Data visualization tricks

Funny tricks
Easter eggs


RStudio

RStudio Addins
RStudio Keyboard Shortcuts
RStudio easy tricks: tearable panes, command history, renaming in scope, outlining, snippets, and more
Working with R projects and here
Working with code snippets
Working with code snippets (video)
Stop RStudio from asking to save workspace
Automatically save workspace in case of a crash / errors
Edit several lines of code at once
Press ALT + left mouse button to select and write on multiple lines simultaneously.
Press ALT + - to insert a <- operator
Press CTRL + SHIFT + M to insert a %>% operator
Press CTRL + SHIFT + F to search all files in the directory or project
Press CTRL + UP to navigate your console history
Rename all variables with same name (rename in scope)
Press CMD + ALT + SHIFT + M to rename variable within scope: to rename all/multiple occurrences of a variable in a script
Press TAB inside “” (quotation marks / an empty string) to select a filename from your current directory, or to autocomplete a filename you started typing

Many more keyboard shortcuts are available online here, and in RStudio under Tools → Keyboard Shortcuts Help.

General


Useful base functions

str() – explore structure of R object
trimws() – trim trailing and/or leading whitespaces
dput() – dump an R object in form of R code
cut() – categorize values into intervals
intersect() – returns the elements that two vectors have in common
union() – returns all unique elements from two vectors combined
setdiff() – returns the elements of the first vector that are not in the second
interaction() – computes a factor which represents the interaction of the given factors
formatC() can be used to round numbers and force trailing zeros
formatC() and sprintf() can be used to add leading/trailing characters
expand.grid() – create a data frame from all combinations of the supplied vectors or factors
seq_along(myvec) – generates the index vector of myvec, like 1:length(myvec) but safe when myvec is empty
Initiate an empty dataframe with header names
Functional programming tricks:
- switch() can replace elaborate ifelse statements (see also)
- match.arg() can check for arguments and values
- The null-default operator (%||%) returns the first value that is not NULL
Convert a vector of strings to title case
Quickly map a new set of values to an existing vector
Calculate the derivative of a function expression
Specify options() in your script:
- Prevent scientific notation using options(scipen = 999)
- Prevent automatic factor columns using options(stringsAsFactors = FALSE) (already the default since R 4.0.0)
- Use options(width = 60) to change the default width of console output
- Use options(max.print = 100) to change the default number of values printed in the console
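
Several of the base functions above are easy to mix up, so here is a minimal sketch of how they behave. The vectors `a` and `b` and the helper `describe()` are made up for illustration and are not part of the original list.

```r
# Toy vectors, invented for illustration
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)

common   <- intersect(a, b)  # elements present in both vectors
combined <- union(a, b)      # all unique elements from either vector
only_a   <- setdiff(a, b)    # elements of a that are not in b

# seq_along() is safe for empty vectors, unlike 1:length(x)
empty      <- character(0)
idx_safe   <- seq_along(empty)   # integer(0): a loop over it never runs
idx_unsafe <- 1:length(empty)    # c(1, 0): a loop over it runs twice

# switch() plus match.arg() as a replacement for nested ifelse() calls
# (describe() is a hypothetical helper)
describe <- function(x, stat = c("mean", "median", "range")) {
  stat <- match.arg(stat)  # errors on anything outside the choices above
  switch(stat,
         mean   = mean(x),
         median = median(x),
         range  = diff(range(x)))
}
avg    <- describe(a)           # defaults to the first choice, "mean"
spread <- describe(a, "range")

# Forcing trailing zeros and adding leading zeros
rounded <- formatC(3.1, format = "f", digits = 2)
padded  <- sprintf("%03d", 7L)
```

Calling `describe(a, "variance")` would stop with an error listing the allowed choices, which is the main reason to pair `match.arg()` with `switch()`.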


R Markdown

Pimp my RMD: Overview of many R markdown tricks by Yan Holtz
Save compiled images in folder with markdown
Add caption to compiled tables with markdown
Tabsets in markdown
Foldable html content in markdown
Reuse code chunks in markdown
Generate Word documents with markdown
Open URLs in a new window with [text](url){target="_blank"} in markdown
Use #<< to highlight code
Move to next xaringan slide upon click (or Enter)
Convert an R Markdown file (.Rmd) into an R script (.R) with
knitr::purl(input, output, documentation = 2)
Use CTRL + SHIFT + 1:4 to zoom in on any one of your RStudio panels. Use ALT + CTRL + SHIFT + 0 to zoom back out.
knitr::read_chunk("your_script_name.R") can be used to source in scripts that reside outside your current markdown file
Use animations in your markdown files with the gganimate package and header-includes: - \usepackage{animate} in your YAML preamble
Create a searchable, sortable HTML table in 1 line of code with DT::datatable(mydf, filter = 'top')

Data manipulation

readr::parse_number extracts the numbers from raw / scraped text
stringr::str_pad can be used to add leading or trailing characters (like zeros)
dplyr tricks
dplyr::case_when replaces elaborate ifelse statements (Video)
dplyr::everything in combination with dplyr::select to reorder columns
Quickly count / tally observations within groups with dplyr::count, dplyr::tally, and dplyr::add_count and dplyr::add_tally
Quickly filter the top categories / groups based on a variable with forcats::fct_lump
Apply the same filter to multiple columns with dplyr::filter_all or dplyr::filter_if in combination with dplyr::all_vars and dplyr::any_vars
dplyr::group_by_if performs quick conditional grouping
Perform rowwise mutations / calculations using dplyr::rowwise
purrr tricks
purrr::map_df to read in and merge all data files in a folder
Combine purrr::map_df and fs::dir_ls to read in and merge all data files following a specific pattern in a folder
Combine list.files and purrr::map_df to read in and merge all data files in a folder
broom::tidy puts your model results in a tidy data frame
Simpler correlation analysis with corrr
df %>% .$column_name or df %$% column_name can retrieve a column from a tibble
dplyr::coalesce finds the one value contained in many columns with missing values
Display a fraction between 0 and 1 as a percentage with scales::percent(myfraction)
Convert numbers that came in as strings with commas to R numbers with readr::parse_number(mydf$mycol)
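
To make a few of the dplyr tricks above concrete, here is a small sketch combining `coalesce()`, `case_when()`, and `count()`. The data frame `df` and its column names are invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration
df <- tibble(
  group  = c("a", "a", "b", "b", "c"),
  score  = c(10, 55, 80, NA, 30),
  backup = c(NA, NA, NA, 70, NA)
)

result <- df %>%
  mutate(
    # coalesce() takes the first non-missing value across its arguments
    score_filled = coalesce(score, backup),
    # case_when() evaluates conditions in order; the final TRUE is the fallback
    level = case_when(
      score_filled >= 75 ~ "high",
      score_filled >= 50 ~ "medium",
      TRUE               ~ "low"
    )
  )

# count() is shorthand for group_by() followed by summarise(n = n())
per_group <- count(result, group)
```

Because `case_when()` stops at the first matching condition, the order of the conditions matters: putting the `>= 50` rule first would label the high scores as "medium".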

Data visualization

colors() to see the names of all built-in colors
GGally::ggpairs for beautiful pair-wise correlation plots
tidyr::complete to get barplot spacing right
Quickly visualize your whole dataset
Create custom, corporate, reproducible color palettes and custom discrete color scales
Standardize the colors of groups in your visualizations using named vectors
theme_set to set a default ggplot2 theme
Create your own ggplot2 theme:
Rearranging values and axis within ggplot2 facets
Add line labels at the end of geom_lines by Simon Jackson
Add + NULL to the end of your ggplot2 chain during development
Add clip = "off" to draw outside the plot panel
Remove point borders with stroke = 0
Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
Combine plots using patchwork or cowplot
Add a (corporate) logo to your plot using magick
Use animations in your markdown files with the gganimate package and header-includes: - \usepackage{animate} in your YAML preamble
If you pass a function to the data-argument in a geom_*, then it applies that function to the data!
Generate distributions in ggplot2 using the stat_function function. Normal distributions, student t-distributions, beta distributions, anything. See also here.
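
Three of the ggplot2 tricks above fit in one short sketch: passing a function to the data argument of a geom, ending a chain on `+ NULL`, and drawing a distribution with `stat_function()`. The built-in `mtcars` dataset and the `mpg > 25` cut-off are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "grey60") +
  # A function passed to `data` receives the plot data and returns the rows to draw
  geom_point(data = function(d) d[d$mpg > 25, ], color = "red", size = 3) +
  # Ending on NULL lets you comment out any layer above without breaking the chain
  NULL

# stat_function() evaluates a function over the x range; here a standard normal density
dens <- ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1)) +
  NULL
```

Swapping `dnorm` for `dt` or `dbeta` (with matching `args`) gives the other distributions mentioned above.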


Fun

Easter eggs

Run ????"", via Reddit
Run example(readline), via DecisionStats
Run ?.Internal, via DecisionStats



rstudio::conf 2018 summary

rstudio::conf is the yearly conference when it comes to R programming and RStudio. In 2017, nearly 500 people attended and, last week, 1100 people went to the 2018 edition. Regrettably, I was on holiday in Cardiff and missed out on meeting all my #rstats heroes. Just browsing through the #rstudioconf Twitter feed, I already learned so many new things that I decided to dedicate a page to it!

Fortunately, you can watch the live streams taped during the conference:

Two people have collected the slides of most rstudio::conf 2018 talks, which you can access via the GitHub repos of matthewravey and simecek. People on Twitter have particularly recommended teach the tidyverse to beginners (by David Robinson), the lesser known stars of the tidyverse (by Emily Robinson), the future of time series and financial analysis in the tidyverse (by Davis Vaughan of business-science.io), Understanding Principal Component Analysis (by Julia Silge), and Deploying TensorFlow models (by Javier Luraschi). Nevertheless, all other presentations are definitely worth checking out as well!

One of the workshops deserves an honorable mention. Jenny Bryan presented on What they forgot to teach you about R, providing some excellent advice on reproducible workflows. It elaborates on her earlier blog on project-oriented workflows, which you should read if you haven’t yet. Some best pRactices Jenny suggests:

Restart R often. This ensures your code is still working as intended. Use Shift-CMD-F10 to do so quickly in RStudio.
Use stable instead of absolute paths. This lets you (1) better manage your imports/exports and folders, and (2) move or share your folders without the code breaking. For instance, here::here("data", "raw-data.csv") points to the raw-data.csv file in the data folder of your project directory. If you are not using the here package yet, you are honestly missing out! Alternatively, you can use fs::path_home(). normalizePath() will make paths work on both Windows and Mac. You can use basename() instead of strsplit() to get the name of a file from a path.
To upload an existing git directory to GitHub easily, you can usethis::use_github().
If you include the YAML header below in your .R file, you can easily generate .md files for your GitHub repo.

#' ---
#' output: github_document
#' ---
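
As a rough illustration of the path advice above, the sketch below uses only base R. Note that `file.path()` stands in for `here::here()` here: it builds the same relative path, but unlike `here::here()` it does not locate the project root for you. The file name is the one from the example above and does not need to exist.

```r
# Build a path from its parts instead of pasting strings with separators
rel_path <- file.path("data", "raw-data.csv")

# basename() and dirname() split a path without any strsplit() work
file_name <- basename(rel_path)
folder    <- dirname(rel_path)

# tools::file_path_sans_ext() drops the file extension
stem <- tools::file_path_sans_ext(file_name)

# normalizePath() returns the absolute form of an existing path,
# using the separators of the current platform
proj_dir <- normalizePath(".")
```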

Moreover, Jenny proposed these useful default settings for knitr:

knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
out.width = "100%"
)

Another of Jenny Bryan‘s talks was named Data Rectangling and although you might not get much out of her slides without her presenting them, you should definitely try the associated repurrrsive tutorial if you haven’t done so yet. It’s a poweR up for any useR!

Here’s a Shiny dashboard made by Garrick Aden-Buie including all the #rstudioconf tweets so you can browse the posts yourself. If you want to download the tweets, Mike Kearney (author of rtweet) shares the data here on his GitHub. Some highlights:

Amelia McNamara posted a cheat sheet comparing R’s dollar sign, formula, and tidyverse syntaxes.
Amanda Gadrow shared an RStudio debugging cheat sheet and a facebook of the rstudio::conf 2018 attendees.
Tim Mastny shared how to easily embed slides in blogdown websites.
David Robinson posted a first draft of Hadley Wickham‘s tidy tools manifesto.
Mike Kearney shared some cool analyses he conducted on the #rstudioconf Twitter data.
I can’t remember who shared it, but a very cool trick is to name the viewing tab of any dataframe you pipe into View() using df %>% View("enter_view_tab_name").

These probably only present a minimal portion of the thousands of tips and tricks you could have learned by simply attending rstudio::conf. I will definitely try to attend next year’s edition. Nevertheless, I hope the above has been useful. If I missed out on any tips, presentations, tweets, or other materials, please reply below, tweet me or pop me a message!

Animated Snow in R

Due to the recent updates to the gganimate package, the code below no longer produces the desired animation.
A working, updated version can be found here.

After hearing R play the Jingle Bells tune, I really got into the holiday vibe. It made me think of the snowy image that Ilya Kashnitsky (homepage, twitter) made in R.

– Papa, what are you doing?
…
How I ended up generating #rstats snow for my 3yo daughter Sophia#ggplot2 #dataviz pic.twitter.com/29sk1HpROJ
— Ilya Kashnitsky (@ikashnitsky) 4 December 2017

if(!"tidyverse" %in% installed.packages()) install.packages("tidyverse")

library("tidyverse")

n <- 100 
tibble(x = runif(n),
       y = runif(n),
       s = runif(n, min = 4, max = 20)) %>%
  ggplot(aes(x, y, size = s)) +
  geom_point(color = "white", pch = 42) +
  scale_size_identity() +
  coord_cartesian(c(0,1), c(0,1)) +
  theme_void() +
  theme(panel.background = element_rect("black"))

This fits the Christmas theme we have going here nicely. Inspired by Ilya’s script, I decided to make an animated snowy GIF! Surely R is able to make something like the lively visualizations Daniel Shiffman (Coding Train) usually makes in Processing/JavaScript? It seems so:

### ANIMATED SNOW === BY PAULVANDERLAKEN.COM
### PUT THIS FILE IN AN RPROJECT FOLDER

# load in packages
pkg <- c("here", "tidyverse", "gganimate", "animation")
sapply(pkg, function(x){
  if (!x %in% installed.packages()){install.packages(x)}
  library(x, character.only = TRUE)
})

# parameters
n <- 100 # number of flakes
times <- 100 # number of loops
xstart <- runif(n, max = 1) # random flake start x position
ystart <- runif(n, max = 1.1) # random flake start y position
size <- runif(n, min = 4, max = 20) # random flake size
xspeed <- seq(-0.02, 0.02, length.out = 100) # flake shift speeds to randomly pick from
yspeed <- runif(n, min = 0.005, max = 0.025) # random flake fall speed

# create storage vectors
xpos <- rep(NA, n * times)
ypos <- rep(NA, n * times)

# loop through simulations
for(i in seq(times)){
  if(i == 1){
    # initiate values
    xpos[1:n] <- xstart
    ypos[1:n] <- ystart
  } else {
    # specify datapoints to update
    first_obs <- (n*i - n + 1)
    last_obs <- (n*i)
    # update x position
    # random shift
    xpos[first_obs:last_obs] <- xpos[(first_obs-n):(last_obs-n)] - sample(xspeed, n, TRUE)
    # update y position
    # lower by yspeed
    ypos[first_obs:last_obs] <- ypos[(first_obs-n):(last_obs-n)] - yspeed
    # reset if passed bottom screen
    xpos <- ifelse(ypos < -0.1, runif(n), xpos) # restart at random x
    ypos <- ifelse(ypos < -0.1, 1.1, ypos) # restart just above top
  }
}

# store in dataframe
data_fluid <- cbind.data.frame(x = xpos,
                               y = ypos,
                               s = size,
                               t = rep(1:times, each = n))

# create animation
snow <- data_fluid %>%
  ggplot(aes(x, y, size = s, frame = t)) +
  geom_point(color = "white", pch = 42) +
  scale_size_identity() +
  coord_cartesian(c(0, 1), c(0, 1)) +
  theme_void() +
  theme(panel.background = element_rect("black"))

# save animation
gganimate(snow, filename = here("snow.gif"), title_frame = FALSE, interval = .1)
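
The `frame` aesthetic and the `gganimate()` call above belong to the old gganimate API. As a hedged sketch (not the author's updated script), the same plot can be expressed with the current API through `transition_manual()`, `animate()`, and `anim_save()`. The toy data frame `flakes` is invented here and only mimics the shape of `data_fluid`; rendering is left commented out because it needs a renderer package such as gifski.

```r
# Toy snow data in the same shape as data_fluid: one row per flake per frame
set.seed(1)
n <- 20      # flakes
times <- 10  # frames
flakes <- data.frame(
  x = runif(n * times),
  y = runif(n * times),
  s = rep(runif(n, min = 4, max = 20), times),
  t = rep(seq_len(times), each = n)
)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gganimate", quietly = TRUE)) {
  library(ggplot2)
  library(gganimate)

  snow <- ggplot(flakes, aes(x, y, size = s)) +
    geom_point(color = "white", pch = 42) +
    scale_size_identity() +
    coord_cartesian(c(0, 1), c(0, 1)) +
    theme_void() +
    theme(panel.background = element_rect("black")) +
    # transition_manual() takes over the role of the old `frame = t` aesthetic
    transition_manual(t)

  # Rendering requires a renderer (for example the gifski package):
  # anim <- animate(snow, nframes = times, fps = 10)
  # anim_save("snow.gif", animation = anim)
}
```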

Updates:

21/12/2017: Keith combined sound and image to create this very merry video.
22/12/2017: Ioannis Kosmidis generated snow in base R
25/12/2017: Daniel Shiffman dedicated a coding challenge to the topic.
25/12/2017: Cynthia Siew combined sound and image in this Shiny Christmas card.
17/12/2018: Due to the update to gganimate, I updated the code and general setup to run still in 2018.

Jingle Bells in R

Christmas is here! Keith McNulty called on his LinkedIn network to co-create a script to play Christmas tunes. After adding some notes myself, the R script on this GitHub page now plays Jingle Bells. You can download the final tune here, and I pasted the script below. Any volunteers to make Let it snow or Silent night?

2018/12/19: Keith combined the full Jingle Bells script and my snow animation into a video.
2018/12/20: Mark Burkey posted R Jingle Bells on YouTube as well as versions of Silent Night (R script) and Let It Snow (R script).

if(!"dplyr" %in% installed.packages()) install.packages("dplyr")
if(!"audio" %in% installed.packages()) install.packages("audio")

library("dplyr")
library("audio")

notes <- c(A = 0, B = 2, C = 3, D = 5, E = 7, F = 8, G = 10)

pitch <- paste("E E E",
"E E E",
"E G C D",
"E",
"F F F F",
"F E E E",
"E D D E",
"D G",
"E E E",
"E E E",
"E G C D",
"E",
"F F F F",
"F E E E E",
"G G F D",
"C",
"G3 E D C",
"G3",
"G3 G3 G3 E D C",
"A3",
"A3 F E D",
"B3",
"G G F D",
"E",
"G3 E D C",
"G3",
"G3 E D C",
"A3 A3",
"A3 F E D",
"G G G G A G F D",
"C C5 B A G F G",
"E E E G C D",
"E E E G C D",
"E F G A C E D F",
"E C D E F G A G",
"F F F F F F",
"F E E E E E",
"E D D D D E",
"D D E F G F E D",
"E E E G C D",
"E E E G C D",
"E F G A C E D F",
"E C D E F G A G",
"F F F F F F",
"F E E E E E",
"G C5 B A G F E D",
"C C E G C5")

duration <- c(1, 1, 2,
1, 1, 2,
1, 1, 1.5, 0.5,
4,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
2, 2,
1, 1, 2,
1, 1, 2,
1, 1, 1.5, 0.5,
4,
1, 1, 1, 1,
1, 1, 1, 0.5, 0.5,
1, 1, 1, 1,
4,
1, 1, 1, 1,
3, .5, .5,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
3, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 0.5, 0.5, 1,
1, 0.33, 0.33, 0.33, 1, 0.33, 0.33, 0.33,
1, 1, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.33, 0.33, 0.33, 2)

# tibble() replaces the deprecated data_frame()
jbells <- tibble(pitch = strsplit(pitch, " ")[[1]],
                 duration = duration)

jbells <- jbells %>%
  mutate(octave = substring(pitch, nchar(pitch)) %>%
           {suppressWarnings(as.numeric(.))} %>%
           ifelse(is.na(.), 4, .),
         note = notes[substr(pitch, 1, 1)],
         note = note + grepl("#", pitch) -
           grepl("b", pitch) + octave * 12 +
           12 * (note < 3),
         freq = 2 ^ ((note - 60) / 12) * 440)

tempo <- 250

sample_rate <- 44100

make_sine <- function(freq, duration) {
  wave <- sin(seq(0, duration / tempo * 60, 1 / sample_rate) *
                freq * 2 * pi)
  fade <- seq(0, 1, 50 / sample_rate)
  wave * c(fade, rep(1, length(wave) - 2 * length(fade)), rev(fade))
}

jbells_wave <- mapply(make_sine, jbells$freq, jbells$duration) %>%
  do.call("c", .)

play(jbells_wave)
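
To see what the mutate() step in the script above computes, here is a base R sketch of the same pitch-to-frequency calculation as a standalone function. The name `pitch_to_freq()` is hypothetical; the arithmetic mirrors the script.

```r
# Semitone offsets relative to A, as in the script above
notes <- c(A = 0, B = 2, C = 3, D = 5, E = 7, F = 8, G = 10)

# pitch_to_freq() is a hypothetical helper that mirrors the mutate() call
pitch_to_freq <- function(pitch) {
  # The octave is the trailing digit if there is one, otherwise 4
  num    <- suppressWarnings(as.numeric(substring(pitch, nchar(pitch))))
  octave <- ifelse(is.na(num), 4, num)
  note   <- unname(notes[substr(pitch, 1, 1)])
  # Sharps raise and flats lower by one semitone. Octaves start at C, so A and B
  # (offsets below 3) sit above the C of the same octave and get an extra 12.
  semitone <- note + grepl("#", pitch) - grepl("b", pitch) +
    octave * 12 + 12 * (note < 3)
  # Semitone 60 is A4, tuned to 440 Hz; each semitone multiplies by 2^(1/12)
  2 ^ ((semitone - 60) / 12) * 440
}

freqs <- round(pitch_to_freq(c("A", "C", "C5", "A3")), 2)
```

With these inputs, "A" maps to 440 Hz and "A3" to 220 Hz, which is the expected halving of frequency per octave.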

Vega: The Grammar of Interactive Graphics

If you have ever programmed in R, you are probably familiar with the Grammar of Graphics due to ggplot2. You can read more about the Grammar of Graphics here, but the general idea behind it is that visualizations can be built up through various layers, each of which has certain characteristics (aesthetics in ggplot2).

ggplot(tips, aes(x = total_bill, y = tip)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = 'lm')

This post is not about ggplot2 nor specifically the Grammar of Graphics, but rather a summary of the official release of Vega-Lite 2, a high-level language for rapidly creating interactive visualizations which you might know from the R-package ggvis.

Vega-Lite has four operators to compose charts: layer, facet, concat and repeat. Layer stacks charts on top of each other in an orderly fashion. Facet divides the data into groups and charts each group. Concat combines multiple charts into dashboard layouts and, finally, repeat concatenates copies of one chart template, each showing a different field. Most importantly, these operators can be combined! The example below compares weather data in New York and Seattle, layering data for individual years and averages within a repeated template for different measurements.

A layered and repeated Vega-Lite graph of weather data

Vega-Lite 2 is especially useful because of the included interaction options. Programmers can specify how users can interactively select the data in their visualizations (e.g., a point or interval selection), along with possible transformations. With these interactions, users can for instance filter data, highlight points, or pan or zoom a plot. The plot below uses an interval selection, which causes the chart to include an interactive brush (shown in grey). The brush selection parameterizes the red guideline, which visualizes the average value within the selected interval.

An interactive moving average in Vega-Lite 2. Try it out!

However, this is not all! When multiple visualizations are combined in a dashboard, interactive selections can apply to all. Below, you see an interval selection being applied over a set of histograms. As a viewer adjusts the selection, they can immediately see how the other distributions change in response.

A crossfilter interaction in Vega-Lite 2. Try it out!

According to the developers, Vega and Vega-Lite will be included in JupyterLab (the next generation of Jupyter Notebooks). Please find more details about Vega-Lite in the documentation or view the InfoVis 2016 research paper on the Vega-Lite language design. Moreover, check out the example gallery or these Vega-Lite applications. You can find the source code on GitHub. For updates, follow the Vega project on Twitter at @vega_vis. For an overview of the features, you may watch the OpenVis Conference video with the developers.

Kaggle Data Science Survey 2017: Worldwide Preferences for Python & R

Kaggle conducts industry-wide surveys to assess the state of data science and machine learning. Over 17,000 individuals worldwide participated in the survey, myself included, and 171 countries and territories are represented in the data.

There is an ongoing debate regarding whether R or Python is better suited for Data Science (probably the latter, but I nevertheless prefer the former). The thousands of responses to the Kaggle survey may provide some insights into how the preferences for each of these languages are dispersed over the globe. At least, that was what I thought when I wrote the code below.

View the Kaggle Kernel here.

### PAUL VAN DER LAKEN
### 2017-10-31
### KAGGLE DATA SCIENCE SURVEY
### VISUALIZING WORLD WIDE RESPONSES
### AND PYTHON/R PREFERENCES

# LOAD IN LIBRARIES
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

# OPTIONS & STANDARDIZATION
options(stringsAsFactors = FALSE)
theme_set(theme_light())
dpi <- 600
w <- 12
h <- 8
wm_cor <- 0.8
hm_cor <- 0.8
capt <- "Kaggle Data Science Survey 2017 by paulvanderlaken.com"

# READ IN KAGGLE DATA
# as_tibble() replaces the deprecated as.tibble()
mc <- read.csv("multipleChoiceResponses.csv") %>%
  as_tibble()

# READ IN WORLDMAP DATA
worldMap <- map_data(map = "world") %>% as_tibble()

# ALIGN KAGGLE AND WORLDMAP COUNTRY NAMES
mc$Country[!mc$Country %in% worldMap$region] %>% unique()
worldMap$region %>% unique() %>% sort(decreasing = FALSE)
mc$Country[mc$Country == "United States"] <- "USA"
mc$Country[mc$Country == "United Kingdom"] <- "UK"
mc$Country[grepl("China|Hong Kong", mc$Country)] <- "China"


# CLEAN UP KAGGLE DATA
lvls = c("","Rarely", "Sometimes", "Often", "Most of the time")
labels = c("NA", lvls[-1])
ind_data <- mc %>% 
  select(Country, WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  mutate(WorkToolsFrequencyR = factor(WorkToolsFrequencyR, 
                                      levels = lvls, labels = labels)) %>% 
  mutate(WorkToolsFrequencyPython = factor(WorkToolsFrequencyPython, 
                                           levels = lvls, labels = labels)) %>% 
  filter(!(Country == "" | is.na(WorkToolsFrequencyR) | is.na(WorkToolsFrequencyPython)))

# AGGREGATE TO COUNTRY LEVEL
country_data <- ind_data %>%
  group_by(Country) %>%
  summarize(N = n(),
            R = sum(WorkToolsFrequencyR %>% as.numeric()),
            Python = sum(WorkToolsFrequencyPython %>% as.numeric()))

# CREATE THEME FOR WORLDMAP PLOT
theme_worldMap <- theme(
    plot.background = element_rect(fill = "white"),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    panel.background = element_blank(),
    legend.background = element_blank(),
    legend.position = c(0, 0.2),
    legend.justification = c(0, 0),
    legend.title = element_text(colour = "black"),
    legend.text = element_text(colour = "black"),
    legend.key = element_blank(),
    legend.key.size = unit(0.04, "npc"),
    axis.text = element_blank(), 
    axis.title = element_blank(),
    axis.ticks = element_blank()
  )

After aligning some country names (above), I was able to start visualizing the results. A first step was to look at the responses across the globe. The darker the green, the more responses a country had; grey countries were not represented in the dataset. A nice addition would have been to look at the response rate relative to country population. Any volunteers?

# PLOT WORLDMAP OF RESPONSE RATE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = N),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "green", high = "darkgreen", name = "Response") +
  theme_worldMap +
  labs(title = "Worldwide Response Kaggle DS Survey 2017",
       caption = capt) +
  coord_equal()

Now, let’s look at how frequently respondents use Python and R in their daily work. I created two heatmaps: one including all respondents, and one excluding the majority of respondents who indicated using neither Python nor R, probably because they didn’t complete the survey.

# AGGREGATE DATA TO WORKTOOL RESPONSES
worktool_data <- ind_data %>%
  group_by(WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  count()

# HEATMAP OF PREFERRED WORKTOOLS
ggplot(worktool_data, aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = log(n))) +
  geom_text(aes(label = n), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage",
       subtitle = "Most respondents indicate not using Python or R (or did not complete the survey)",
       caption = capt, 
       fill = "Log(N)")

# HEATMAP OF PREFERRED WORKTOOLS
# EXCLUDING DOUBLE NA'S
worktool_data %>%
  filter(!(WorkToolsFrequencyPython == "NA" & WorkToolsFrequencyR == "NA")) %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = paste0(round(perc,3)*100,"%")), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage (non-users excluded)",
       subtitle = "There is a strong reliance on Python and fewer users focus solely on R",
       caption = capt, 
       fill = "N")

Okay, now let’s map these frequency data onto a world map. Because I’m interested in country-level differences in usage, I look at the usage of Python relative to R. So the redder the country, the more Python is used by data scientists in their workflow, whereas R is the preferred tool in the bluer countries. Interestingly, there is no country where respondents really use R much more than Python.

# WORLDMAP OF RELATIVE WORKTOOL PREFERENCE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = Python/R),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "blue", high = "red", name = "Python/R") +
  theme_worldMap +
  labs(title = "Relative usage of Python to R per country",
       subtitle = "Focus on Python in Russia, Israel, Japan, Ukraine, China, Norway & Belarus",
       caption = capt) +
  coord_equal()
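
The Python/R ratio on the map comes from summing the numeric codes of the frequency factors per country. The toy example below (invented responses, not Kaggle data) sketches that calculation in base R; `score()` is a hypothetical helper.

```r
lvls   <- c("", "Rarely", "Sometimes", "Often", "Most of the time")
labels <- c("NA", lvls[-1])

# Toy responses, invented for illustration (not Kaggle data)
toy <- data.frame(
  Country = c("X", "X", "Y", "Y"),
  R       = c("Often", "", "Rarely", "Sometimes"),
  Python  = c("Sometimes", "Most of the time", "Rarely", "")
)

# as.numeric() on a factor returns the level index:
# "NA" = 1, "Rarely" = 2, ..., "Most of the time" = 5
score <- function(x) as.numeric(factor(x, levels = lvls, labels = labels))

# Sum the scores per country, then take the ratio as in the map above
r_sum  <- tapply(score(toy$R), toy$Country, sum)
py_sum <- tapply(score(toy$Python), toy$Country, sum)
ratio  <- py_sum / r_sum
```

A ratio above 1 means Python is used more than R in that country, a ratio below 1 the reverse. Note that a blank answer still counts as 1, so the ratio is pulled towards 1 in countries with many incomplete responses.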