Category: visualization

Network Visualization with igraph and ggraph

Eiko Fried, researcher at the University of Amsterdam, recently blogged about personal collaborator networks. I came across his post on Twitter, discussing how to conduct such an analysis in R, and got inspired.

Unfortunately, my own publication record is quite boring to analyse, containing only a handful of papers. However, my promotors – Prof. dr. Jaap Paauwe and Prof. dr. Marc van Veldhoven – have more extensive publication lists. Although I did not manage to retrieve those using the scholar package, I was able to scrape Jaap Paauwe’s publication list from his Google Scholar page. Jaap has 141 publications listed with one or more citations on Google Scholar. More than enough for an analysis!

While Eiko uses his colleague Sacha Epskamp’s R package qgraph, I found an alternative in the packages igraph and ggraph.

### PAUL VAN DER LAKEN
### 2017-10-31
### COAUTHORSHIP NETWORK VISUALIZATION

# LOAD IN PACKAGES
library(readxl)
library(dplyr)
library(ggraph)
library(igraph)

# STANDARDIZE VISUALIZATIONS
w = 14
h = 7
dpi = 900

# LOAD IN DATA
pub_history <- read_excel("paauwe_wos.xlsx")

# RETRIEVE AUTHORS
pub_history %>%
  filter(condition == 1) %>%
  select(name) %>%
  .$name %>%
  gsub("[A-Z]{2,}|[A-Z][ ]", "", .) %>%
  strsplit(",") %>%
  lapply(function(x) gsub("\\..*", "", x)) %>%
  lapply(function(x) gsub("^[ ]+","",x)) %>%
  lapply(function(x) x[x != ""]) %>%
  lapply(function(x) tolower(x))->
  authors

# ADD JAAP PAAUWE WHERE MISSING
authors <- lapply(authors, function(x){
  if(!"paauwe" %in% x){
    return(c(x,"paauwe"))
  } else{
    return(x)
  }
})

# EXTRACT UNIQUE AUTHORS
authors_unique <- authors %>% unlist() %>% unique() %>% sort(F)

# FORMAT AUTHOR NAMES 
# CAPITALIZE
simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  names(s) <- NULL
  paste(toupper(substring(s, 1,1)), substring(s, 2),
        sep="", collapse=" ")
}
authors_unique_names <- sapply(authors_unique, simpleCap)

The above retrieves the names of every unique author from the Excel file I got from Google Scholar. Now we need to examine to what extent the author names co-occur. We do that with the code below, storing all co-occurrence data in a matrix, which we then transform into an adjacency matrix that igraph can deal with. The output graph data looks like this:

# CREATE COAUTHORSHIP MATRIX
coauthorMatrix <- do.call(
  cbind,
  lapply(authors, function(x){
  1*(authors_unique %in% x)
}))

# TRANSFORM TO ADJACENCY MATRIX
adjacencyMatrix <- coauthorMatrix %*% t(coauthorMatrix)

# CREATE NETWORK GRAPH
g <- graph.adjacency(adjacencyMatrix, 
                     mode = "undirected", 
                     diag = FALSE)
V(g)$Degree <- degree(g, mode = 'in') # CALCULATE DEGREE
V(g)$Name <- authors_unique_names # ADD NAMES
g # print network
## IGRAPH f1b50a7 U--- 168 631 -- 
## + attr: Degree (v/n), Name (v/c)
## + edges from f1b50a7:
##  [1]  1-- 21  1--106  2-- 44  2-- 52  2--106  2--110  3-- 73  3--106
##  [9]  4-- 43  4-- 61  4-- 78  4-- 84  4--106  5-- 42  5--106  6-- 42
## [17]  6-- 42  6-- 97  6-- 97  6--106  6--106  6--125  6--125  6--127
## [25]  6--127  6--129  6--129  7--106  7--106  7--150  7--150  8-- 24
## [33]  8-- 38  8-- 79  8-- 98  8-- 99  8--106  9-- 88  9--106  9--133
## [41] 10-- 57 10--106 10--128 11-- 76 11-- 85 11--106 12-- 30 12-- 80
## [49] 12--106 12--142 12--163 13-- 16 13-- 16 13-- 22 13-- 36 13-- 36
## [57] 13--106 13--106 13--106 13--166 14-- 70 14-- 94 14--106 14--114
## + ... omitted several edges

This graph data we can now feed into ggraph:

# SET THEME FOR NETWORK VISUALIZATION
theme_networkMap <- theme(
  plot.background = element_rect(fill = "beige"),
  panel.border = element_blank(),
  panel.grid = element_blank(),
  panel.background = element_blank(),
  legend.background = element_blank(),
  legend.position = "none",
  legend.title = element_text(colour = "black"),
  legend.text = element_text(colour = "black"),
  legend.key = element_blank(),
  axis.text = element_blank(), 
  axis.title = element_blank(),
  axis.ticks = element_blank()
)
# VISUALIZE NETWORK
ggraph(g, layout = "auto") +
  # geom_edge_density() +
  geom_edge_diagonal(alpha = 1, label_colour = "blue") +
  geom_node_label(aes(label = Name, size = sqrt(Degree), fill = sqrt(Degree))) +
  theme_networkMap +
  scale_fill_gradient(high = "blue", low = "lightblue") +
  labs(title = "Coauthorship Network of Jaap Paauwe",
       subtitle = "Publications with more than one Google Scholar citation included",
       caption = "paulvanderlaken.com") +
  ggsave("Paauwe_Coauthorship_Network.png", dpi = dpi, width = w, height = h)

[Figure: coauthorship network of Jaap Paauwe]

Feel free to use the code to look at your own coauthorship networks or to share this further.

Kaggle Data Science Survey 2017: Worldwide Preferences for Python & R

Kaggle conducts industry-wide surveys to assess the state of data science and machine learning. Over 17,000 individuals worldwide participated in the survey, myself included, and 171 countries and territories are represented in the data.

There is an ongoing debate regarding whether R or Python is better suited for Data Science (probably the latter, but I nevertheless prefer the former). The thousands of responses to the Kaggle survey may provide some insights into how the preferences for each of these languages are dispersed over the globe. At least, that was what I thought when I wrote the code below.

View the Kaggle Kernel here.

### PAUL VAN DER LAKEN
### 2017-10-31
### KAGGLE DATA SCIENCE SURVEY
### VISUALIZING WORLD WIDE RESPONSES
### AND PYTHON/R PREFERENCES

# LOAD IN LIBRARIES
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

# OPTIONS & STANDARDIZATION
options(stringsAsFactors = F)
theme_set(theme_light())
dpi = 600
w = 12
h = 8
wm_cor = 0.8
hm_cor = 0.8
capt = "Kaggle Data Science Survey 2017 by paulvanderlaken.com"

# READ IN KAGGLE DATA
mc <- read.csv("multipleChoiceResponses.csv") %>%
  as.tibble()

# READ IN WORLDMAP DATA
worldMap <- map_data(map = "world") %>% as.tibble()

# ALIGN KAGGLE AND WORLDMAP COUNTRY NAMES
mc$Country[!mc$Country %in% worldMap$region] %>% unique()
worldMap$region %>% unique() %>% sort(F)
mc$Country[mc$Country == "United States"] <- "USA"
mc$Country[mc$Country == "United Kingdom"] <- "UK"
mc$Country[grepl("China|Hong Kong", mc$Country)] <- "China"


# CLEAN UP KAGGLE DATA
lvls = c("","Rarely", "Sometimes", "Often", "Most of the time")
labels = c("NA", lvls[-1])
ind_data <- mc %>% 
  select(Country, WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  mutate(WorkToolsFrequencyR = factor(WorkToolsFrequencyR, 
                                      levels = lvls, labels = labels)) %>% 
  mutate(WorkToolsFrequencyPython = factor(WorkToolsFrequencyPython, 
                                           levels = lvls, labels = labels)) %>% 
  filter(!(Country == "" | is.na(WorkToolsFrequencyR) | is.na(WorkToolsFrequencyPython)))

# AGGREGATE TO COUNTRY LEVEL
country_data <- ind_data %>%
  group_by(Country) %>%
  summarize(N = n(),
            R = sum(WorkToolsFrequencyR %>% as.numeric()),
            Python = sum(WorkToolsFrequencyPython %>% as.numeric()))

# CREATE THEME FOR WORLDMAP PLOT
theme_worldMap <- theme(
    plot.background = element_rect(fill = "white"),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    panel.background = element_blank(),
    legend.background = element_blank(),
    legend.position = c(0, 0.2),
    legend.justification = c(0, 0),
    legend.title = element_text(colour = "black"),
    legend.text = element_text(colour = "black"),
    legend.key = element_blank(),
    legend.key.size = unit(0.04, "npc"),
    axis.text = element_blank(), 
    axis.title = element_blank(),
    axis.ticks = element_blank()
  )

After aligning some country names (above), I was able to start visualizing the results. A first step was to look at the responses across the globe: the greener a country, the more responses it contributed, while grey countries were not represented in the dataset. A nice addition would have been to look at the response rate relative to country population... any volunteers?

# PLOT WORLDMAP OF RESPONSE RATE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = N),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "green", high = "darkgreen", name = "Response") +
  theme_worldMap +
  labs(title = "Worldwide Response Kaggle DS Survey 2017",
       caption = capt) +
  coord_equal()

[Figure: worldwide responses to the Kaggle DS Survey 2017]

Now, let’s look at how frequently respondents use Python and R in their daily work. I created two heatmaps: one including all respondents, and one excluding the majority of respondents who indicated using neither Python nor R, probably because they did not complete the survey.

# AGGREGATE DATA TO WORKTOOL RESPONSES
worktool_data <- ind_data %>%
  group_by(WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  count()

# HEATMAP OF PREFERRED WORKTOOLS
ggplot(worktool_data, aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = log(n))) +
  geom_text(aes(label = n), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage",
       subtitle = "Most respondents indicate not using Python or R (or did not complete the survey)",
       caption = capt, 
       fill = "Log(N)") 

[Figure: heatmap of Python and R usage]

# HEATMAP OF PREFERRED WORKTOOLS
# EXCLUDING DOUBLE NA'S
worktool_data %>%
  filter(!(WorkToolsFrequencyPython == "NA" & WorkToolsFrequencyR == "NA")) %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = paste0(round(perc,3)*100,"%")), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage (non-users excluded)",
       subtitle = "There is a strong reliance on Python and less users focus solely on R",
       caption = capt, 
       fill = "N") 

[Figure: heatmap of Python and R usage, non-users excluded]

Okay, now let’s map these frequency data onto a worldmap. Because I’m interested in country-level differences in usage, I look at the relative usage of Python compared to R. The redder a country, the more Python is used by data scientists in their workflow, whereas R is the preferred tool in the bluer countries. Interestingly, there is no country where respondents use R much more than Python.

# WORLDMAP OF RELATIVE WORKTOOL PREFERENCE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = Python/R),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "blue", high = "red", name = "Python/R") +
  theme_worldMap +
  labs(title = "Relative usage of Python to R per country",
       subtitle = "Focus on Python in Russia, Israel, Japan, Ukraine, China, Norway & Belarus",
       caption = capt) +
  coord_equal() 
[Figure: relative usage of Python to R per country]
Countries are color-coded for their relative preference for Python (red/purple) or R (blue) as a Data Science tool. 167 out of 171 countries (98%) show a Python/R ratio greater than 1, indicating a preference for Python over R.
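As a quick check of that claim, the share of countries with a Python/R ratio above 1 can be computed directly from the aggregated data (a small sketch, assuming the country_data object built above):

# SHARE OF COUNTRIES PREFERRING PYTHON OVER R (illustrative check)
country_data %>%
  summarize(countries = n(),
            python_preferred = sum(Python / R > 1),
            share = python_preferred / countries)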

Thank you for reading my visualization report. Please do try and extract some other interesting insights from the data yourself.

If you liked my analysis, please upvote my Kaggle Kernel here!

Geographical Maps in ggplot2: Rectangle World Map

Maarten Lambrechts posted a tutorial in which he demonstrates the steps through which he created a Eurovision Song Contest map in R.

[Figures: Maarten’s ggplot2 Eurovision map and his ggplot2 worldmap]

Inspired by his tutorial, I decided to create a worldmap of my own, the R code for which you may find below.

options(stringsAsFactors = F) # options
library(tidyverse) # packages
# retrieve data file
link = "https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv"

geodata <- read.csv(link) %>% as.tibble() # load in geodata
str(geodata) # examine geodata
## Classes 'tbl_df', 'tbl' and 'data.frame':    192 obs. of  11 variables:
##  $ name           : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ alpha.2        : chr  "AF" "AL" "DZ" "AO" ...
##  $ alpha.3        : chr  "AFG" "ALB" "DZA" "AGO" ...
##  $ country.code   : int  4 8 12 24 10 28 32 51 36 40 ...
##  $ iso_3166.2     : chr  "ISO 3166-2:AF" "ISO 3166-2:AL" "ISO 3166-2:DZ" "ISO 3166-2:AO" ...
##  $ region         : chr  "Asia" "Europe" "Africa" "Africa" ...
##  $ sub.region     : chr  "Southern Asia" "Southern Europe" "Northern Africa" "Middle Africa" ...
##  $ region.code    : int  142 150 2 2 NA 19 19 142 9 150 ...
##  $ sub.region.code: int  34 39 15 17 NA 29 5 145 53 155 ...
##  $ x              : int  22 15 13 13 15 7 6 20 24 15 ...
##  $ y              : int  8 9 11 17 23 4 14 6 19 6 ...
# create worldmap
worldmap <- ggplot(geodata)

# add rectangle grid + labels
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3))

[Figure: rectangle grid worldmap with country labels]

# improve geoms
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 2, 
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse() 

[Figure: rectangle worldmap with region colors and reversed y-axis]

# finalize plot look
colors = c('yellow', 'red', 'white', 'pink', 'green', 'orange')
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 3,
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse() +
  scale_fill_manual(values = colors) +
  guides(fill = guide_legend(ncol = 2), col = F) +
  theme(plot.background = element_rect(fill = "blue"),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.background = element_blank(),
        legend.position = c(0, 0),
        legend.justification = c(0, 0),
        legend.title = element_text(colour = "white"),
        legend.text = element_text(colour = "white"),
        legend.key = element_blank(),
        legend.key.size = unit(0.06, "npc"),
        axis.text = element_blank(), 
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        text = element_text(colour = "white", size = 16)
        ) +
  labs(title = "ggplot2: Worldmap",
       fill = "Region", 
       caption = "paulvanderlaken.com")

[Figure: final ggplot2 rectangle worldmap]

What would you add to your worldmap? If you end up making one, please send me a copy on paulvanderlaken@gmail.com!

Sorting Algorithms 101: Visualized

Sorting is one of the central topics in most Computer Science degrees. In general, sorting refers to the process of rearranging data according to a defined pattern, with the end goal of transforming the original unsorted sequence into a sorted sequence. It lies at the heart of successful business ventures, such as Google and Amazon, but is also present in many applications we use daily, such as Excel or Facebook.

Many different algorithms have been developed to sort data. Wikipedia lists as many as 45, and there are probably many more. Some work by exchanging data points in a sequence, others insert and/or merge parts of the sequence. More importantly, some algorithms are quite efficient in terms of the time they take to sort data, needing only on the order of n or n·log(n) steps to sort n data points, whereas others are very slow, requiring as many as n² steps. Moreover, some algorithms perform consistently, taking roughly the same amount of time regardless of the original order of the data, whereas others fluctuate in processing time depending on how the data is initially ordered.
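To make this concrete, below is a minimal, purely illustrative insertion sort in R (my own sketch, not taken from any of the linked resources). Each element is shifted to the left until it reaches its sorted position, which is why the worst case requires on the order of n² comparisons:

# ILLUSTRATIVE INSERTION SORT (sketch, not production code)
insertion_sort <- function(x) {
  for (i in seq_along(x)[-1]) {
    key <- x[i]
    j <- i - 1
    while (j >= 1 && x[j] > key) {
      x[j + 1] <- x[j]  # shift larger elements one position to the right
      j <- j - 1
    }
    x[j + 1] <- key     # insert the current element at its sorted position
  }
  x
}
insertion_sort(c(5, 2, 4, 6, 1, 3))
## [1] 1 2 3 4 5 6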

I really enjoyed this video by TED-Ed on how best to sort your book collection. It provides a very intuitive introduction to sorting strategies (i.e., algorithms). Moreover, Algorithms to Live By (Christian & Griffiths, 2016) offers a great explanation of various algorithms and their computational demands, along with the amusing suggestion to invite friends and pizza over whenever you need to sort something.

The main reason for this blog post is that I stumbled across some nice videos and GIFs of sorting algorithms in action. These visualizations are not only wonderfully intriguing to look at, but also help a great deal in understanding how the sorting algorithms process the data under the hood. You might want to start with the 4-minute YouTube video below, demonstrating how nine different sorting algorithms (Selection Sort, Shell Sort, Insertion Sort, Merge Sort, Quick Sort, Heap Sort, Bubble Sort, Comb Sort, & Cocktail Sort) process a variety of datasets.

The interactive website toptal.com allows you to play around with the most well-known sorting algorithms, putting them to work on different datasets. For the grand finale, I found these GIFs and short videos of several sorting algorithms on imgur. In the visualizations below, each row of the image represents an independent list being sorted. You can see that Bubble Sort is quite slow:

Cocktail Shaker Sort already seems somewhat faster, but still takes quite a while.

For some algorithms, the visualization clearly shows that the settings you pick matter. For instance, Heap Sort is much quicker if you choose to sift down instead of up.

In contrast, for Merge Sort it doesn’t matter whether you sort by breadth first or depth first.

The imgur overview includes many more visualized sorting algorithms, but I don’t want to overload WordPress or your computer, so I’ll leave you with two types of Radix Sort; the rest you can look up yourself!

Improved Twitter Mining in R

R users have been using the twitteR package by Geoff Jentry to mine tweets for several years now. However, a recent blog post suggests that a newer package provides a better mining tool: rtweet by Michael Kearney (GitHub).

Both packages use a similar setup and require you to do some prep-work by creating a Twitter “app” (see the package instructions). However, rtweet will save you considerable API-time and post-API munging time. This is demonstrated by the examples below, where Twitter is searched for #rstats-tagged tweets, first using twitteR, then using rtweet.

library(twitteR)

# this relies on you setting up an app in apps.twitter.com
setup_twitter_oauth(
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)

r_folks <- searchTwitter("#rstats", n=300)

str(r_folks, 1)
## List of 300
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant

str(r_folks[1])
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @historying: Wow. This is an enormously helpful tutorial by @vivalosburros for anyone interested in mapping "| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2017-10-22 17:18:31"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "922150185916157952"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "Twitter for Android"
##   ..$ screenName   : chr "jasonrhody"
##   ..$ retweetCount : num 3
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame': 0 obs. of  4 variables:
##   .. ..$ url         : chr(0) 
##   .. ..$ expanded_url: chr(0) 
##   .. ..$ dispaly_url : chr(0) 
##   .. ..$ indices     : num(0) 
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID,
##   ..  getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName,
##   ..  getStatusSource, getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
##   ..  setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount,
##   ..  setRetweeted, setScreenName, setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj

The above operations required only a few seconds to complete. The returned data is definitely usable, but not in the most handy format: the package models the Twitter API onto custom R objects. It’s elegant, but also likely overkill for most operations.
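If you do want a flat data frame out of these twitteR objects, the package’s own twListToDF() helper flattens the list of status objects (a small sketch, reusing the r_folks object from above):

# FLATTEN twitteR STATUS OBJECTS INTO A DATA FRAME (sketch)
r_folks_df <- twitteR::twListToDF(r_folks)
str(r_folks_df, 1)

Here’s the rtweet version: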

library(rtweet)

# this relies on you setting up an app in apps.twitter.com
create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
) -> twitter_token

saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# ideally put this in ~/.Renviron
Sys.setenv(TWITTER_PAT=path.expand("~/.rtweet-oauth.rds"))

rtweet_folks <- search_tweets("#rstats", n=300)

dplyr::glimpse(rtweet_folks)
## Observations: 300
## Variables: 35
## $ screen_name                     "AndySugs", "jsbreker", "__rahulgupta__", "AndySugs", "jasonrhody", "sibanjan...
## $ user_id                         "230403822", "703927710", "752359265394909184", "230403822", "14184263", "863...
## $ created_at                      2017-10-22 17:23:13, 2017-10-22 17:19:48, 2017-10-22 17:19:39, 2017-10-22 17...
## $ status_id                       "922151366767906819", "922150507745079297", "922150470382125057", "9221504090...
## $ text                            "RT:  (Rbloggers)Markets Performance after Election: Day 239  https://t.co/D1...
## $ retweet_count                   0, 0, 9, 0, 3, 1, 1, 57, 57, 103, 10, 10, 0, 0, 0, 34, 0, 0, 642, 34, 1, 1, 1...
## $ favorite_count                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_quote_status                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ quote_status_id                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ is_retweet                      FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, F...
## $ retweet_status_id               NA, NA, "922085241493360642", NA, "921782329936408576", "922149318550843393",...
## $ in_reply_to_status_status_id    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_user_id      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_screen_name  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lang                            "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "ro",...
## $ source                          "IFTTT", "Twitter for iPhone", "GaggleAMP", "IFTTT", "Twitter for Android", "...
## $ media_id                        NA, "922150500237062144", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "92...
## $ media_url                       NA, "http://pbs.twimg.com/media/DMwi_oQUMAAdx5A.jpg", NA, NA, NA, NA, NA, NA,...
## $ media_url_expanded              NA, "https://twitter.com/jsbreker/status/922150507745079297/photo/1", NA, NA,...
## $ urls                            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_display                    "ift.tt/2xe1xrR", NA, NA, "ift.tt/2xe1xrR", NA, "bit.ly/2yAAL0M", "bit.ly/2yA...
## $ urls_expanded                   "http://ift.tt/2xe1xrR", NA, NA, "http://ift.tt/2xe1xrR", NA, "http://bit.ly/...
## $ mentions_screen_name            NA, NA, "DataRobot", NA, "historying vivalosburros", "NoorDinTech ikashnitsky...
## $ mentions_user_id                NA, NA, "622519917", NA, "18521423 304837258", "2511247075 739773414316118017...
## $ symbols                         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ hashtags                        "rstats DataScience", "Rstats ACSmtg", "rstats", "rstats DataScience", "rstat...
## $ coordinates                     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_id                        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code                    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_coordinates        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_type               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

This operation took the same or less time, but provides the data in a tidy, immediately usable structure.

On the rtweet website, you can read about the additional functionality this new package provides. For instance, ts_plot() provides a quick visual of the frequency of tweets. It’s possible to aggregate by the minute, i.e., by = "mins", or by some value of seconds, e.g., by = "15 secs".

## Plot time series of all tweets aggregated by second
ts_plot(rt, by = "secs")

[Figure: time series of tweet frequency per second]

ts_filter() creates a time series-like data structure, which consists of “time” (specific interval of time determined via the by argument), “freq” (the number of observations, or tweets, that fall within the corresponding interval of time), and “filter” (a label representing the filtering rule used to subset the data). If no filter is provided, the returned data object includes a “filter” variable, but all of the entries will be blank "", indicating that no filter was used. Otherwise, ts_filter() uses the regular expressions supplied to the filter argument as values for the filter variable. To make the filter labels pretty, users may also provide a character vector using the key parameter.
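Based on that description, a hedged sketch of what such a call might look like (the regular expressions and labels below are made up for illustration):

## hypothetical example: aggregate the streamed tweets by minute, subset them
## with two regular expressions, and label the resulting filters via `key`
rt_ts <- ts_filter(
  rt,
  by = "mins",
  filter = c("python|pandas", "rstats|tidyverse"),
  key = c("Python", "R")
)
head(rt_ts)  # columns: time, freq, filter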

## plot multiple time series by first filtering the data using
## regular expressions on the tweet "text" variable
rt %>%
  dplyr::group_by(screen_name) %>%
  ## The pipe operator allows you to combine this with ts_plot
  ## without things getting too messy.
  ts_plot() + 
  ggplot2::labs(
    title = "Tweets during election day for the 2016 U.S. election",
    subtitle = "Tweets collected, parsed, and plotted using `rtweet`"
  )

The developer cautions that these plots often resemble frowny faces: the first and last points appear significantly lower than the rest. This is caused by the first and last intervals of time being artificially shortened by the connection and disconnection processes. To remedy this, users may specify trim = TRUE to drop the first and last observation for each time series.
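Presumably something along these lines (an assumed call, reusing the rt object from above):

## drop the artificially small first and last intervals (assumed usage)
ts_plot(rt, by = "mins", trim = TRUE)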

[Figure: filtered time series of tweet frequency]

Give rtweet a try and let me know whether you prefer it over twitteR.

Text Mining: Pythonic Heavy Metal

This blog post summarizes work that has been posted here, here, and here.

Iain of degeneratestate.org wrote a three-part series in which he applied text mining to the lyrics of 222,623 songs from 7,364 heavy metal bands, spread over 22,314 albums, that he scraped from darklyrics.com. He applied a broad range of different analyses in Python, the code of which you can find here on Github.

For example, he starts part 1 by calculating the difficulty/complexity of each band’s lyrics using the Simple Measure of Gobbledygook (SMOG) and contrasting this with the number of swear words used, finding a nice correlation.
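The SMOG formula itself is compact; a rough sketch in R (with toy counts, and not the implementation Iain used) looks like this:

# SMOG grade = 1.0430 * sqrt(polysyllabic words * (30 / sentences)) + 3.1291
smog_grade <- function(n_polysyllables, n_sentences) {
  1.0430 * sqrt(n_polysyllables * (30 / n_sentences)) + 3.1291
}
smog_grade(n_polysyllables = 90, n_sentences = 45)  # toy counts, not real lyrics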

[Figure: ratio of swear words vs. readability; lyric complexity relates positively to the number of swear words used]

Furthermore, he ran some word importance analysis, looking at word frequencies, log-likelihood ratios, and TF-IDF scores. This allowed him to contrast the word usage of the different bands, finding, for instance, one heavy metal band that was characterized by the words “oh yeah baby got love“: fans might recognize either Motorhead, Machinehead, or Diamondhead.

[Figure: word importance scores for the example band]

Using cosine distance measures, Iain could compare the word vectors of the different bands, ultimately recognizing band similarity and how representative a song is of its band. This allowed for interesting analyses, such as a clustering of the various bands:

[Figure: cluster dendrogram of metal bands]
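Cosine similarity itself is simple to compute; here is a minimal sketch in R with made-up word-count vectors (not Iain’s data or code):

# cosine similarity between two word-count vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

band_a <- c(love = 10, death = 2, fire = 5)  # hypothetical counts
band_b <- c(love = 1,  death = 8, fire = 6)
cosine_sim(band_a, band_b)  # 1 = same direction, 0 = no overlap
# the cosine distance is simply 1 - cosine_sim(band_a, band_b)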

However, not all of his analyses worked out nicely. For example, he also applied t-SNE to visualize band similarity in a two-dimensional space, but the solution was uninformative due to low variance in the data.

He could predict the band behind a song by training a one-vs-rest logistic regression classifier on the 150-dimensional lyric space obtained through latent semantic analysis. Despite having to assign each song to one of 120 different bands, the classifier achieved a precision and recall of around 0.3 with negligible hyperparameter tuning. He used the classification errors to examine which bands get confused with each other, and visualized this using two network graphs.

[Figure: network graphs of bands the classifier confuses with each other]

In part 2, Iain tried to create a heavy metal lyric generator (which you can now try out).

His first approach was to use probabilistic distributions known as language models. Basically, he develops a Markov chain, in his opinion more of an “unsmoothed maximum-likelihood language model”, which determines the next most probable word based on the previous word(s). This model is based on observed word chains, for instance, those in the first two lines of Iron Maiden’s Number of the Beast.
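As a rough, purely illustrative sketch of that idea (in R, and decidedly not Iain’s Python code), a bigram model simply counts which word followed which and samples the next word in proportion to those counts:

# MINIMAL UNSMOOTHED BIGRAM MODEL (illustration only)
set.seed(1)
text  <- "woe to you oh earth and sea for the devil sends the beast with wrath" # approximate intro
words <- strsplit(text, " ")[[1]]
bigrams <- data.frame(w1 = head(words, -1), w2 = tail(words, -1),
                      stringsAsFactors = FALSE)

next_word <- function(current) {
  candidates <- bigrams$w2[bigrams$w1 == current]
  if (length(candidates) == 0) return(sample(words, 1))  # dead end: restart anywhere
  candidates[sample.int(length(candidates), 1)]          # ML estimate of P(next | current)
}

generate <- function(start, n = 10) {
  out <- start
  for (i in seq_len(n)) out <- c(out, next_word(tail(out, 1)))
  paste(out, collapse = " ")
}
generate("the")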

Another approach would be to train a neural network. Iain used Keras, running on an Amazon GPU instance. He recognizes the power of neural nets, but says they also come at a cost:

“The maximum likelihood models we saw before took twenty minutes to code from scratch. Even using powerful libraries, it took me a while to understand NNs well enough to use. On top of this, training the models here took days of computer time, plus more of my human time tweeking hyper parameters to get the models to converge. I lack the temporal, financial and computational resources to fully explore the hyperparameter space of these models, so the results presented here should be considered suboptimal.” – Iain

He started out with feed-forward networks at the character level. His best attempt consisted of two feed-forward layers of 512 units each, followed by a softmax output, with layer normalisation, dropout and tanh activations, which he trained for 20 epochs to minimise the mean cross-entropy. Although it quickly beat the maximum-likelihood Markov model, its longer outputs did not look like genuine heavy metal songs.

So he turned to recurrent neural networks (RNNs). The RNN Iain used contains two LSTM layers of 512 units each, followed by a fully connected softmax layer. He unrolled the sequence for 32 characters and trained the model by predicting the next 32 characters, given their immediately preceding characters, while minimizing the mean cross-entropy:

“To generate text from the RNN model, we step character-by-character through a sequence. At each step, we feed the current symbol into the model, and the model returns a probability distribution over the next character. We then sample from this distribution to get the next character in the sequence and this character goes on to become the next input to the model. The first character fed into the model at the beginning of generation is always a special start-of-sequence character.” – Iain

This approach worked quite well, and you can compare and contrast it with the earlier models here. If you’d just like to generate some lyrics, the models are hosted online at deepmetal.io.

In part 3, Iain looks into emotional arcs, examining the happiness and metalness of words and lyrics.

[Figure: exploring words in the happy/metal plane]

When applied to the combined lyrics of albums, you can examine how bands developed their signature sound over time. For example, the lyrics of Metallica’s first few albums are quite heavy metal and unhappy, before moving to a happier place. The Black Album is almost sentiment-neutral, but after that the band became ever darker and more metal, moving back towards the style of their first few albums. He applied the same analysis to the text of the Harry Potter books, of which the first and last appear especially metal.

[Figure: the evolution of Metallica’s style in the happy/metal plane]