
Jingle Bells in R

Christmas is here! Keith McNulty called on his LinkedIn network to co-create a script to play Christmas tunes. After adding some notes myself, the R script on this github page now plays Jingle Bells. You can download the final tune here, and I have pasted the script below. Any volunteers to make Let It Snow or Silent Night?

if(!"dplyr" %in% installed.packages()) install.packages("dplyr")
if(!"audio" %in% installed.packages()) install.packages("audio")

library("dplyr")
library("audio")

notes <- c(A = 0, B = 2, C = 3, D = 5, E = 7, F = 8, G = 10) # semitone offsets from A

pitch <- paste("E E E",
"E E E",
"E G C D",
"E",
"F F F F",
"F E E E",
"E D D E",
"D G",
"E E E",
"E E E",
"E G C D",
"E",
"F F F F",
"F E E E E",
"G G F D",
"C",
"G3 E D C",
"G3",
"G3 G3 G3 E D C",
"A3",
"A3 F E D",
"B3",
"G G F D",
"E",
"G3 E D C",
"G3",
"G3 E D C",
"A3 A3",
"A3 F E D",
"G G G G A G F D",
"C C5 B A G F G",
"E E E G C D",
"E E E G C D",
"E F G A C E D F",
"E C D E F G A G",
"F F F F F F",
"F E E E E E",
"E D D D D E",
"D D E F G F E D",
"E E E G C D",
"E E E G C D",
"E F G A C E D F",
"E C D E F G A G",
"F F F F F F",
"F E E E E E",
"G C5 B A G F E D",
"C C E G C5")

duration <- c(1, 1, 2,
1, 1, 2,
1, 1, 1.5, 0.5,
4,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
2, 2,
1, 1, 2,
1, 1, 2,
1, 1, 1.5, 0.5,
4,
1, 1, 1, 1,
1, 1, 1, 0.5, 0.5,
1, 1, 1, 1,
4,
1, 1, 1, 1,
3, .5, .5,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
4,
1, 1, 1, 1,
3, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 1, 1, 1,
1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 0.5, 0.5, 1,
1, 0.33, 0.33, 0.33, 1, 0.33, 0.33, 0.33,
1, 1, 0.5, 0.5, 0.5, 0.5,
1, 1, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
1, 0.5, 0.5, 1, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
1, 0.33, 0.33, 0.33, 2)

# data_frame() is deprecated in current dplyr; tibble() is its replacement
jbells <- tibble(pitch = strsplit(pitch, " ")[[1]],
                 duration = duration)

jbells <- jbells %>%
  mutate(
    # octave: trailing digit of the pitch label, defaulting to octave 4
    octave = substring(pitch, nchar(pitch)) %>%
      {suppressWarnings(as.numeric(.))} %>%
      ifelse(is.na(.), 4, .),
    # semitone number: letter offset, plus sharps/flats and octave shift;
    # A and B sit below C, hence the extra octave when note < 3
    note = notes[substr(pitch, 1, 1)],
    note = note + grepl("#", pitch) -
      grepl("b", pitch) + octave * 12 +
      12 * (note < 3),
    # equal temperament anchored at A4 = 440 Hz
    freq = 2 ^ ((note - 60) / 12) * 440
  )

tempo <- 250          # beats per minute

sample_rate <- 44100  # audio samples per second

make_sine <- function(freq, duration) {
  # sine wave at the given frequency, lasting `duration` beats
  wave <- sin(seq(0, duration / tempo * 60, 1 / sample_rate) *
                freq * 2 * pi)
  # short linear fade-in and fade-out to avoid clicks between notes
  fade <- seq(0, 1, 50 / sample_rate)
  wave * c(fade, rep(1, length(wave) - 2 * length(fade)), rev(fade))
}

jbells_wave <- mapply(make_sine, jbells$freq, jbells$duration) %>%
  do.call("c", .)

play(jbells_wave)
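
The script converts notes to frequencies using equal temperament: each semitone multiplies the frequency by 2^(1/12), anchored at A4 = 440 Hz (note number 60 in the encoding above). A quick sanity check of that mapping, with the formula from the mutate() call factored out into a helper (my addition, not part of the original script):

note_to_freq <- function(note) 2 ^ ((note - 60) / 12) * 440
note_to_freq(60)  # A4: exactly 440 Hz
note_to_freq(55)  # E4: ~329.63 Hz, five semitones below A4
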
Sentiment Analysis of Stranger Things Seasons 1 and 2

Jordan Dworkin, a Biostatistics PhD student at the University of Pennsylvania, is one of the few million fans of Stranger Things, an 80s-themed Netflix series combining drama, fantasy, mystery, and horror. Awaiting the third season, Jordan was curious about the emotional journey viewers go through during the series, and he decided to examine it using a statistical approach. Like I did for the seven Harry Plotter books, Jordan downloaded the scripts of all the Stranger Things episodes and conducted a sentiment analysis in R, of course using the tidyverse and tidytext. He measured the positive or negative sentiment of the words in the scripts using the AFINN dictionary, and a first exploration led him to visualize the average sentiment score per episode:

The average positive/negative sentiment during the 17 episodes of the first two seasons of Stranger Things (from Medium.com)
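
For those who want to try something similar: the tidytext workflow behind such scores boils down to tokenizing the scripts and joining the AFINN lexicon. A minimal sketch, assuming a hypothetical data frame scripts with columns episode and line (in recent lexicon versions the AFINN score column is named value, and a one-time download via the textdata package may be required):

library(dplyr)
library(tidytext)

episode_sentiment <- scripts %>%
  unnest_tokens(word, line) %>%                         # one row per word
  inner_join(get_sentiments("afinn"), by = "word") %>%  # AFINN scores, -5 to 5
  group_by(episode) %>%
  summarise(mean_sentiment = mean(value))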

Jordan jokingly explains that you might expect such overly negative sentiment in a show about missing children and inter-dimensional monsters. The less-than-well-received episode 15 stands out; Jordan feels this may be due to a combination of its dark plot and the lack of any comedic relief from the main characters.

Reflecting on the visual above, Jordan felt that a lot of the granularity of the actual sentiment was missing. For a next analysis, he thus calculated a rolling average sentiment during the course of the separate episodes, which he animated using the animation package:

GIF displaying the rolling average (40 words) sentiment per Stranger Things episode (from Medium.com)
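
A rolling average like the one animated above can be computed with, for instance, zoo::rollmean. A sketch, assuming a hypothetical vector word_scores holding the AFINN values of one episode in order of appearance:

library(zoo)

# 40-word rolling average of sentiment, NA-padded at the edges
rolling_sentiment <- rollmean(word_scores, k = 40, fill = NA)
plot(rolling_sentiment, type = "l",
     xlab = "word position in episode", ylab = "rolling mean sentiment")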

Jordan has two new takeaways: (1) only 3 of the 17 episodes have a positive ending – the Season 1 finale, the Season 2 premiere, and the Season 2 finale – and (2) the episodes do not follow a clear emotional pattern. Based on this second finding, Jordan subsequently compared the average emotional trajectories of the two seasons, but the difference between them was not significant:

Smoothed (loess, I guess) trajectories of the sentiment during the episodes in seasons one and two of Stranger Things (from Medium.com)

Potentially, Jordan thought next, it’s better to classify the episodes based on their emotional trajectory than on the season they belong to. Hence, he constructed a network based on the similarity (temporal correlation) between the episodes’ sentiment time series. In this network, the episodes are the nodes, and the edges are weighted by the similarity of the episodes’ emotional trajectories, so that more distant episodes are less similar in that respect. The network below, made using igraph (see also here), demonstrates that consecutive episodes (1 → 2, 2 → 3, 3 → 4) are not that much alike:

The network of Stranger Things episodes, where the relations between the episodes are weighted for the similarity of their emotional trajectories (from Medium.com).
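
Jordan's construction can be approximated as follows: correlate the episodes' (aligned) sentiment time series and feed the resulting similarity matrix to igraph as a weighted adjacency matrix. A sketch of my own (not Jordan's code), assuming a hypothetical matrix sentiment_matrix with one column per episode:

library(igraph)

sim <- cor(sentiment_matrix)            # pairwise trajectory correlations
sim[sim < 0] <- 0                       # keep positive similarity only
diag(sim) <- 0                          # no self-loops
g <- graph_from_adjacency_matrix(sim, mode = "undirected", weighted = TRUE)
plot(g, edge.width = E(g)$weight * 5)   # thicker edge = more similar episodes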

A community detection algorithm Jordan ran in MATLAB identified three main trajectories among the episodes:

Three different emotional trajectories were identified among the 17 Stranger Things episodes in Season 1 and 2 (from Medium.com).

Looking at the average patterns, we can see that group 1 contains episodes that begin and end with neutral emotion and have slow fluctuations in the middle, group 2 contains episodes that begin with negative emotion and gradually climb towards a positive ending, and group 3 contains episodes that begin on a positive note and oscillate downwards towards a darker ending.

– Jordan on Medium.com

Jordan’s final suggestion is that producers and scriptwriters may consciously introduce such variations in emotional trajectory among consecutive episodes in order to keep viewers hooked. If you want to redo the analysis or reuse some of the code used to create the visuals above, you can access Jordan’s R scripts here. I, for one, look forward to his analysis of Season 3!

Association rules using FPGrowth in Spark MLlib through SparklyR

Great tutorial on how to conduct simple market basket analysis on your laptop, either with association rules through the arules package or with frequent pattern mining (FPGrowth) in Spark via sparklyr! A minimal arules sketch of my own follows below the excerpt.

Longhow Lam's Blog


Introduction

Market Basket Analysis or association rules mining can be a very useful technique to gain insights into transactional data sets, and it can be useful for product recommendation. The classical example is data from a supermarket: for each customer we know which individual products (items) he bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is the arules package to calculate association rules; it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very large data sets, you need to do something else. You can:

  • use more computing power (or cluster of computing nodes).
  • use another algorithm, for example…

View original post 727 more words
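
To get a quick local feel for what the original post does at scale, here is a minimal arules sketch of my own on a toy basket list (not Longhow's code):

library(arules)

# toy transactions: each basket is a character vector of items
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("beer", "chips"),
  c("bread", "milk"),
  c("beer", "chips", "milk")
)
trans <- as(baskets, "transactions")
rules <- apriori(trans, parameter = list(supp = 0.2, conf = 0.6, minlen = 2))
inspect(sort(rules, by = "lift"))  # strongest associations first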

Advanced GIFs in R

Rafa Irizarry is a biostatistics professor and one of the three people behind SimplyStatistics.org (the others are Jeff Leek and Roger Peng). They post ideas that they find interesting, and their blog contributes greatly to the discussion of science and popular writing.

Rafa is the creator of many data visualization GIFs that have recently trended on the web, and in a recent post he provides all the source code behind the beautiful imagery. I sincerely recommend you check out the original blog if you want to find out more, but here are the GIFs:

Simpson’s paradox is a statistical phenomenon where an observed relationship within a population reverses within all subgroups that make up that population. Rafa visualized it wonderfully in a GIF that took only twenty-some lines of R code:
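
To make the paradox concrete, here is a minimal simulation of my own (not Rafa's code): within each group, x and y are positively related, yet the pooled correlation is negative because the group means are offset:

set.seed(1)
n <- 100
group <- rep(1:3, each = n)
x <- rnorm(3 * n) + 2 * group            # group means increase in x...
y <- x - 4 * group + rnorm(3 * n)        # ...but decrease in y
cor(x, y)                                # pooled: negative
by(data.frame(x, y), group, function(d) cor(d$x, d$y))  # within each group: positive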

A different statistical phenomenon is discussed at the end of the original blog: the ecological fallacy. It occurs when correlations observed at the group level are erroneously extrapolated to the individual level. Rafa used the gapminder data included in the dslabs package to illustrate the fallacy: there is a very high correlation at the region level and a lower correlation at the individual country level:
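
A rough way to see the same contrast in those data (a sketch of my own; I picked fertility and life expectancy in one year as example variables):

library(dslabs)
library(dplyr)

d <- gapminder %>%
  filter(year == 2010) %>%
  select(region, fertility, life_expectancy) %>%
  na.omit()

cor(d$fertility, d$life_expectancy)  # correlation across countries

region_means <- d %>%
  group_by(region) %>%
  summarise(fertility = mean(fertility),
            life_expectancy = mean(life_expectancy))
cor(region_means$fertility, region_means$life_expectancy)  # across regions: stronger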

The gapminder data are also used in the next GIF, which mimics Hans Rosling’s famous animation from his talk New Insights on Poverty, but made with R and gganimate by Rafa:

The next visualization demonstrates how the UN voting data (of Erik Voeten and Anton Strezhnev) can be used to examine different voting behaviors. It seems to reduce the voting data to a two-dimensional factor structure, and seemingly there are three distinct groups of voters these days, with the USA and Israel in particular far removed from the other members:

The next GIFs are more statistical. The one below demonstrates how local regression (LOESS) works. Simply put, LOESS fits a regression on a local subset of the data, and iteratively repeating this for all local subsets yields a smoothly fitting LOESS curve, the red line in Rafa’s GIF:
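
R's built-in loess() implements exactly this mechanic; a minimal sketch on simulated data (my illustration, not Rafa's code):

set.seed(42)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
fit <- loess(y ~ x, span = 0.3)  # each local fit uses ~30% of the data
plot(x, y, col = "grey")
lines(x, predict(fit), col = "red", lwd = 2)  # the resulting LOESS curve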

Not quite sure how to interpret the next one, but Rafa explains it visualizes a random forest’s predictions using only one predictor variable. I think that different trees provide different predictions because they leverage different training samples, and that an ensemble of those trees would then improve predictive accuracy?
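
My reading of that GIF, sketched in code: with the randomForest package you can retrieve every tree's prediction via predict.all = TRUE and see how jumpy individual trees average into a smoother ensemble (my illustration of the intuition, not Rafa's code):

library(randomForest)

set.seed(7)
d <- data.frame(x = runif(300, 0, 10))
d$y <- sin(d$x) + rnorm(300, sd = 0.3)

rf <- randomForest(y ~ x, data = d, ntree = 100)
grid <- data.frame(x = seq(0, 10, 0.1))
pred <- predict(rf, grid, predict.all = TRUE)

plot(d$x, d$y, col = "grey")
lines(grid$x, pred$individual[, 1], col = "pink")    # one noisy tree
lines(grid$x, pred$aggregate, col = "red", lwd = 2)  # smoother ensemble mean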

The next one is my favorite, I think. This animation illustrates how even a highly accurate test performs poorly in a population with a low prevalence of true cases (e.g., disease, applicant success). More details are in the original blog or here.
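
The arithmetic behind it is Bayes' rule. A quick worked example with made-up numbers: even at 99% sensitivity and specificity, a 0.5% prevalence means most positives are false positives:

sens <- 0.99   # P(positive test | condition)
spec <- 0.99   # P(negative test | no condition)
prev <- 0.005  # prevalence of the condition

# P(condition | positive test), via Bayes' rule
ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
ppv  # ~0.33: only a third of positive tests are true positives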

The blog ends with a rather funny animation of the only good use of pie charts, according to Rafa:

Hierarchical Linear Models 101

Multilevel models (also known as hierarchical linear models, nested data models, mixed models, random coefficient models, random-effects models, random parameter models, or split-plot designs) are statistical models of parameters that vary at more than one level (Wikipedia). They are very useful in the social sciences, where we are often interested in individuals who reside in nations, organizations, teams, or other higher-level units. Next to these individuals' own characteristics, the characteristics of the units they belong to may also have effects. To take such effects of variables residing at multiple levels into account, we can use multilevel or hierarchical models.
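
In R, the de facto standard for fitting such models is the lme4 package. A minimal sketch with hypothetical variable names: employees (level 1) nested in teams (level 2), with a fixed effect of workload and a random intercept per team:

library(lme4)

# satisfaction ~ workload, allowing each team its own baseline level
fit <- lmer(satisfaction ~ workload + (1 | team), data = employees)
summary(fit)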

Michael Freeman, a faculty member at the University of Washington Information School, made this amazing visual introduction to hierarchical modeling:


If you want to practice hierarchical modeling in R, I recommend the lesson by Page Paccini (first video) or the more elaborate video series by Statistics of DOOM (second):

Network Visualization with igraph and ggraph

Eiko Fried, researcher at the University of Amsterdam, recently blogged about personal collaborator networks. I came across his post on Twitter, discussing how to conduct such an analysis in R, and got inspired.

Unfortunately, my own publication record is quite boring to analyse, containing only a handful of papers. However, my promotors – Prof. dr. Jaap Paauwe and Prof. dr. Marc van Veldhoven – have more extensive publication lists. Although I did not manage to retrieve those using the scholar package, I was able to scrape Jaap Paauwe’s publication list from his Google Scholar page. Jaap has 141 publications listed with one or more citations on Google Scholar. More than enough for an analysis!

While Eiko uses his colleague Sacha Epskamp’s R package qgraph, I found an alternative in the packages igraph and ggraph.

### PAUL VAN DER LAKEN
### 2017-10-31
### COAUTHORSHIP NETWORK VISUALIZATION

# LOAD IN PACKAGES
library(readxl)
library(dplyr)
library(ggraph)
library(igraph)

# STANDARDIZE VISUALIZATIONS
w = 14
h = 7
dpi = 900

# LOAD IN DATA
pub_history <- read_excel("paauwe_wos.xlsx")

# RETRIEVE AUTHORS
pub_history %>%
  filter(condition == 1) %>%
  select(name) %>%
  .$name %>%
  gsub("[A-Z]{2,}|[A-Z][ ]", "", .) %>%
  strsplit(",") %>%
  lapply(function(x) gsub("\\..*", "", x)) %>%
  lapply(function(x) gsub("^[ ]+","",x)) %>%
  lapply(function(x) x[x != ""]) %>%
  lapply(function(x) tolower(x))->
  authors

# ADD JAAP PAAUWE WHERE MISSING
authors <- lapply(authors, function(x){
  if(!"paauwe" %in% x){
    return(c(x,"paauwe"))
  } else{
    return(x)
  }
})

# EXTRACT UNIQUE AUTHORS
authors_unique <- authors %>% unlist() %>% unique() %>% sort(F)

# FORMAT AUTHOR NAMES 
# CAPITALIZE
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  names(s) <- NULL
  paste(toupper(substring(s, 1,1)), substring(s, 2),
        sep="", collapse=" ")
}
authors_unique_names <- sapply(authors_unique, simpleCap)

The code above retrieves the names of every unique author from the Excel file I got from Google Scholar. Now we need to examine to what extent the author names co-occur. We do that with the code below, storing all co-occurrence data in a matrix, which we then transform to an adjacency matrix that igraph can deal with. The output graph data looks like this:

# CREATE COAUTHORSHIP MATRIX
# ROWS = UNIQUE AUTHORS, COLUMNS = PUBLICATIONS (1 = AUTHOR ON THE PAPER)
coauthorMatrix <- do.call(
  cbind,
  lapply(authors, function(x){
    1*(authors_unique %in% x)
  }))

# TRANSFORM TO ADJACENCY MATRIX
# CELL [i, j] COUNTS THE PAPERS AUTHORS i AND j SHARE
adjacencyMatrix <- coauthorMatrix %*% t(coauthorMatrix)

# CREATE NETWORK GRAPH
g <- graph.adjacency(adjacencyMatrix, 
                     mode = "undirected", 
                     diag = FALSE)
V(g)$Degree <- degree(g, mode = 'in') # CALCULATE DEGREE
V(g)$Name <- authors_unique_names # ADD NAMES
g # print network
## IGRAPH f1b50a7 U--- 168 631 -- 
## + attr: Degree (v/n), Name (v/c)
## + edges from f1b50a7:
##  [1]  1-- 21  1--106  2-- 44  2-- 52  2--106  2--110  3-- 73  3--106
##  [9]  4-- 43  4-- 61  4-- 78  4-- 84  4--106  5-- 42  5--106  6-- 42
## [17]  6-- 42  6-- 97  6-- 97  6--106  6--106  6--125  6--125  6--127
## [25]  6--127  6--129  6--129  7--106  7--106  7--150  7--150  8-- 24
## [33]  8-- 38  8-- 79  8-- 98  8-- 99  8--106  9-- 88  9--106  9--133
## [41] 10-- 57 10--106 10--128 11-- 76 11-- 85 11--106 12-- 30 12-- 80
## [49] 12--106 12--142 12--163 13-- 16 13-- 16 13-- 22 13-- 36 13-- 36
## [57] 13--106 13--106 13--106 13--166 14-- 70 14-- 94 14--106 14--114
## + ... omitted several edges

This graph data we can now feed into ggraph:

# SET THEME FOR NETWORK VISUALIZATION
theme_networkMap <- theme(
  plot.background = element_rect(fill = "beige"),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  panel.background = element_blank(),
  legend.background = element_blank(),
  legend.position = "none",
  legend.title = element_text(colour = "black"),
  legend.text = element_text(colour = "black"),
  legend.key = element_blank(),
  axis.text = element_blank(), 
  axis.title = element_blank(),
  axis.ticks = element_blank()
)
# VISUALIZE NETWORK
ggraph(g, layout = "auto") +
  # geom_edge_density() +
  geom_edge_diagonal(alpha = 1, label_colour = "blue") +
  geom_node_label(aes(label = Name, size = sqrt(Degree), fill = sqrt(Degree))) +
  theme_networkMap +
  scale_fill_gradient(high = "blue", low = "lightblue") +
  labs(title = "Coauthorship Network of Jaap Paauwe",
       subtitle = "Publications with more than one Google Scholar citation included",
       caption = "paulvanderlaken.com")

# SAVE TO FILE (ADDING ggsave() INSIDE THE + CHAIN NO LONGER WORKS IN CURRENT ggplot2)
ggsave("Paauwe_Coauthorship_Network.png", dpi = dpi, width = w, height = h)

The resulting coauthorship network of Jaap Paauwe

Feel free to use the code to look at your own coauthorship networks or to share this further.