Author: Paul van der Laken

Game of Thrones: An R Map to Westeros

Gokhan Ciflikli, PhD candidate at LSE, scraped the Game of Thrones scripts and mined their text using the quanteda package. Gokhan showcases some cool features of the package, which may be an alternative to the tidytext package. In the process, Gokhan also downloaded cartographic data of Westeros and build the beautiful map below, using tmap. This map looks like it’s built by a professional! Very well done!

The GoT font you can find here and Gokhan shared the script here on GitHub.

Map of Westeros built in R, original posted on gokhanciflikli.com (pdf)

Animated Snow in R

Due to the recent updates to the gganimate package, the code below no longer produces the desired animation.
A working, updated version can be found here.

After hearing R play the Jingle Bells tune, I really got into the holiday vibe. It made me think of Ilya Kashnitsky (homepage, twitter) his snowy image in R.

– Papa, what are you doing?
…
How I ended up generating #rstats snow for my 3yo daughter Sophia#ggplot2 #dataviz pic.twitter.com/29sk1HpROJ
— Ilya Kashnitsky (@ikashnitsky) 4 december 2017

if(!"tidyverse" %in% installed.packages()) install.packages("tidyverse")

library("tidyverse")

n <- 100 
tibble(x = runif(n),  
y = runif(n),  
s = runif(n, min = 4, max = 20)) %>%
ggplot(aes(x, y, size = s)) +
geom_point(color = "white", pch = 42) +
scale_size_identity() +
coord_cartesian(c(0,1), c(0,1)) +
theme_void() +
theme(panel.background = element_rect("black"))

This greatly fits the Christmas theme we have going here. Inspired by Ilya’s script, I decided to make an animated snowy GIF! Sure R is able to make something like the lively visualizations Daniel Shiffman (Coding Train) usually makes in Processing/JavaScript? It seems so:

### ANIMATED SNOW === BY PAULVANDERLAKEN.COM
### PUT THIS FILE IN AN RPROJECT FOLDER

# load in packages
pkg <- c("here", "tidyverse", "gganimate", "animation")
sapply(pkg, function(x){
if (!x %in% installed.packages()){install.packages(x)}
library(x, character.only = TRUE)
})

# parameters
n <- 100 # number of flakes
times <- 100 # number of loops
xstart <- runif(n, max = 1) # random flake start x position
ystart <- runif(n, max = 1.1) # random flake start y position
size <- runif(n, min = 4, max = 20) # random flake size
xspeed <- seq(-0.02, 0.02, length.out = 100) # flake shift speeds to randomly pick from
yspeed <- runif(n, min = 0.005, max = 0.025) # random flake fall speed

# create storage vectors
xpos <- rep(NA, n * times)
ypos <- rep(NA, n * times)

# loop through simulations
for(i in seq(times)){
if(i == 1){
# initiate values
xpos[1:n] <- xstart
ypos[1:n] <- ystart
} else {
# specify datapoints to update
first_obs <- (n*i - n + 1)
last_obs <- (n*i)
# update x position
# random shift
xpos[first_obs:last_obs] <- xpos[(first_obs-n):(last_obs-n)] - sample(xspeed, n, TRUE)
# update y position
# lower by yspeed
ypos[first_obs:last_obs] <- ypos[(first_obs-n):(last_obs-n)] - yspeed
# reset if passed bottom screen
xpos <- ifelse(ypos < -0.1, runif(n), xpos) # restart at random x
ypos <- ifelse(ypos < -0.1, 1.1, ypos) # restart just above top
}
}

# store in dataframe
data_fluid <- cbind.data.frame(x = xpos,
y = ypos,
s = size,
t = rep(1:times, each = n))

# create animation
snow <- data_fluid %>%
ggplot(aes(x, y, size = s, frame = t)) +
geom_point(color = "white", pch = 42) +
scale_size_identity() +
coord_cartesian(c(0, 1), c(0, 1)) +
theme_void() +
theme(panel.background = element_rect("black"))

# save animation
gganimate(snow, filename = here("snow.gif"), title_frame = FALSE, interval = .1)

Updates:

21/12/2017: Keith combined sound and image to create this very merry video.
22/12/2017: Ioannis Kosmidis generated snow in base R
25/12/2017: Daniel Shiffman dedicated a coding challenge to the topic.
25/12/2017: Cynthia Siew combined sound and image in this Shiny Christmas card.
17/12/2018: Due to the update to gganimate, I updated the code and general setup to run still in 2018.

Jingle Bells in R

Christmas is here! Keith McNulty called on his LinkedIn network to co-create a script to play Christmas tunes. After adding some notes myself, the R script on this github page now plays Jingle Bells. The final tune you can download here and the script I pasted below. Any volunteers to make Let it snow or Silent night?

2018/12/19: Keith combined the full Jingle Bells script and my snow animation into a video.
2108/12/20: Mark Burkey posted R Jingle Bells on YouTube as well as a versions of Silent Night (R script) and Let It Snow (Rscript).

if(!"dplyr" %in% installed.packages()) install.packages("dplyr")
if(!"audio" %in% installed.packages()) install.packages("audio")

library("dplyr")
library("audio")

notes <- c(A = 0, B = 2, C = 3, D = 5, E = 7, F = 8, G = 10)

pitch <- paste("E E E",
"E E E",
"E G C D",
"E",
"F F F F",
"F E E E",
"E D D E",
"D G",
"E E E",
"E E E",
"E G C D",
"E",
"F F F F",
"F E E E E",
"G G F D",
"C",
"G3 E D C",
"G3",
"G3 G3 G3 E D C",
"A3",
"A3 F E D",
"B3",
"G G F D",
"E",
"G3 E D C",
"G3",
"G3 E D C",
"A3 A3",
"A3 F E D",
"G G G G A G F D",
"C C5 B A G F G",
"E E E G C D",
"E E E G C D",
"E F G A C E D F",
"E C D E F G A G",
"F F F F F F",
"F E E E E E",
"E D D D D E",
"D D E F G F E D",
"E E E G C D",
"E E E G C D",
"E F G A C E D F",
"E C D E F G A G",
"F F F F F F",
"F E E E E E",
"G C5 B A G F E D",
"C C E G C5")

duration <- c(1, 1, 2,
1, 1, 2,
1, 1, 1.5, 0.5,
4,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
2, 2,
1, 1, 2,
1, 1, 2,
1, 1, 1.5, 0.5,
4,
1, 1, 1, 1,
1, 1, 1, 0.5, 0.5,
1, 1, 1, 1,
4,
1, 1, 1, 1,
3, .5, .5,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
3, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 0.5, 0.5, 1,
1, 0.33, 0.33, 0.33, 1, 0.33, 0.33, 0.33,
1, 1, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.33, 0.33, 0.33, 2)

jbells <- data_frame(pitch = strsplit(pitch, " ")[[1]],
duration = duration)

jbells <- jbells %>%
mutate(octave = substring(pitch, nchar(pitch)) %>%
{suppressWarnings(as.numeric(.))} %>%
ifelse(is.na(.), 4, .),
note = notes[substr(pitch, 1, 1)],
note = note + grepl("#", pitch) -
grepl("b", pitch) + octave * 12 +
12 * (note < 3),
freq = 2 ^ ((note - 60) / 12) * 440)

tempo <- 250

sample_rate <- 44100

make_sine <- function(freq, duration) {
wave <- sin(seq(0, duration / tempo * 60, 1 / sample_rate) *
freq * 2 * pi)
fade <- seq(0, 1, 50 / sample_rate)
wave * c(fade, rep(1, length(wave) - 2 * length(fade)), rev(fade))
}

jbells_wave <- mapply(make_sine, jbells$freq, jbells$duration) %>%
do.call("c", .)

play(jbells_wave)

So You Think You Can (A/B) Test?

Decision making under uncertainty is complicated. These days, many business rely on real-life experiments, or A/B tests, to reduce that uncertainty and improve their decision-making. For instance, here’s a presentation on how A/B testing helps improve business outcomes at Etsy.

Lukas Vermeer built So You Think You Can Test, an online simulation game in which you are the decision-maker in a company. You control the backlog and running of experiments and each day you have to decide which tasks to prioritize (or deleted entirely). Your decisions affect the sales of the company, so be wise and use the experimental information to your advantage.

You can play the game here.

Sentiment Analysis of Stranger Things Seasons 1 and 2

Jordan Dworkin, a Biostatistics PhD student at the University of Pennsylvania, is one of the few million fans of Stranger Things, a 80s-themed Netflix series combining drama, fantasy, mystery, and horror. Awaiting the third season, Jordan was curious as to the emotional voyage viewers went through during the series, and he decided to examine this using a statistical approach. Like I did for the seven Harry Plotter books, Jordan downloaded the scripts of all the Stranger Things episodes and conducted a sentiment analysis in R, of course using the tidyverse and tidytext. Jordan measured the positive or negative sentiment of the words in them using the AFINN dictionary and a first exploration led Jordan to visualize these average sentiment scores per episode:

The average positive/negative sentiment during the 17 episodes of the first two seasons of Stranger Things (from Medium.com)

Jordan jokingly explains that you might expect such overly negative sentiment in show about missing children and inter-dimensional monsters. The less-than-well-received episode 15 stands out, Jordan feels this may be due to a combination of its dark plot and the lack of any comedic relief from the main characters.

Reflecting on the visual above, Jordan felt that a lot of the granularity of the actual sentiment was missing. For a next analysis, he thus calculated a rolling average sentiment during the course of the separate episodes, which he animated using the animation package:

GIF displaying the rolling average (40 words) sentiment per Stranger Things episode (from Medium.com)

Jordan has two new takeaways: (1) only 3 of the 17 episodes have a positive ending – the Season 1 finale, the Season 2 premiere, and the Season 2 finale – (2) the episodes do not follow a clear emotional pattern. Based on this second finding, Jordan subsequently compared the average emotional trajectories of the two seasons, but the difference was not significant:

Smoothed (loess, I guess) trajectories of the sentiment during the episodes in seasons one and two of Stranger Things (from Medium.com)

Potentially, it’s better to classify the episodes based on their emotional trajectory than on the season they below too, Jordan thought next. Hence, he constructed a network based on the similarity (temporal correlation) between episodes’ temporal sentiment scores. In this network, the episodes are the nodes whereas the edges are weighted for the similarity of their emotional trajectories. In that sense, more distant episodes are less similar in terms of their emotional trajectory. The network below, made using igraph (see also here), demonstrates that consecutive episodes (1 → 2, 2 → 3, 3 → 4) are not that much alike:

The network of Stranger Things episodes, where the relations between the episodes are weighted for the similarity of their emotional trajectories (from Medium.com).

A community detection algorithm Jordan ran in MATLAB identified three main trajectories among the episodes:

Three different emotional trajectories were identified among the 17 Stranger Things episodes in Season 1 and 2 (from Medium.com).

Looking at the average patterns, we can see that group 1 contains episodes that begin and end with neutral emotion and have slow fluctuations in the middle, group 2 contains episodes that begin with negative emotion and gradually climb towards a positive ending, and group 3 contains episodes that begin on a positive note and oscillate downwards towards a darker ending.

– Jordan on Medium.com

Jordan final suggestion is that producers and scriptwriters may consciously introduce these variations in emotional trajectories among consecutive episodes in order to get viewers hooked. If you want to redo the analysis or reuse some of the code used to create the visuals above, you can access Jordan’s R scripts here. I, for one, look forward to his analysis of Season 3!

Robust Effect Sizes for Independent Group Comparisons

Guillaume Rousselet explains how and when group comparisons with Cohen’s d fail, and what robust statistics one could use instead:

basic statistics

When I was an undergrad, I was told that beyond a certain sample size (n=30 if I recall correctly), t-tests and ANOVAs are fine. This was a lie. I wished I had been taught robust methods and that t-tests and ANOVAs on means are only a few options among many alternatives. Indeed, t-tests and ANOVAs on means are not robust to outliers, skewness, heavy-tails, and for independent groups, differences in skewness, variance (heteroscedasticity) and combinations of these factors (Wilcox & Keselman, 2003; Wilcox, 2012). The main consequence is a lack of statistical power. For this reason, it is often advised to report a measure of effect size to determine, for instance, if a non-significant effect (based on some arbitrary p value threshold) could be due to lack of power, or reflect a genuine lack of effect. The rationale is that an effect could be associated with a sufficiently large effect…

View original post 3,911 more words