Tag: visualization

t-SNE, the Ultimate Drum Machine and more

This blog explains t-Distributed Stochastic Neighbor Embedding (t-SNE) through the story of programmers joining forces with musicians to create the ultimate drum machine (if you are here just for the fun, you may start playing right away).

Kyle McDonald, Manny Tan, and Yotam Mann found it difficult to pinpoint to what extent some sounds are similar (ding, dong) and others are not (ding, beep), and they wanted to examine how we humans determine and experience this similarity among sounds. They teamed up with some friends at Google’s Creative Lab and the London Philharmonia to realize what they named “the Infinite Drum Machine”, which turns the most random set of sounds into a musical instrument.

The project team wanted to include as many different sounds as they could, but had little appetite for comparing, contrasting, and arranging all those sounds into musical chords themselves. Instead, they imagined that a computer could perform this laborious task. To determine the similarities among the sounds in their dataset – which literally includes a thousand different sounds, from the ngaaarh of a photocopier to the zing of an anvil – they used a fairly novel unsupervised machine learning technique called t-Distributed Stochastic Neighbor Embedding, or t-SNE for short (t-SNE Wiki; developer: Laurens van der Maaten). t-SNE specializes in dimensionality reduction for visualization purposes: it maps high-dimensional data onto a two- or three-dimensional space. For a rapid introduction to high-dimensional data and t-SNE by some smart Googlers, please watch the video below.

As the video explains, t-SNE maps complex data onto a two- or three-dimensional space, which made it really useful for comparing and grouping similar sounds. Sounds are extremely high-dimensional: each is essentially a very elaborate sequence of waves, each with a pitch, a duration, a frequency, a bass, an overall length, etcetera (clearly I am no musician). You would need a lot of information to describe a specific sound accurately. The project team compared sounds to fingerprints, as there is an immense amount of data in a single padamtss.

t-SNE takes all this information about a sound into account and compares all sounds in the dataset. Next, it creates two or three new dimensions and assigns each sound values on these new dimensions in such a way that sounds which were similar in the original high-dimensional data are also similar on the new two or three dimensions. You could say that t-SNE summarizes (most of) the information that was stored in the original complex data. This is what dimensionality reduction techniques do: they reduce the number of dimensions you need to describe data (sufficiently). Fortunately, techniques such as t-SNE are unsupervised, meaning that the project team did not have to tag or describe the sounds in their dataset manually but could just let the computer do the heavy lifting.
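
To make the idea concrete, here is a minimal sketch in R. It assumes the Rtsne package is installed and uses the built-in iris measurements as a stand-in for sound features. The drum machine team's actual features and settings are not described here, so the seed and perplexity below are my own assumptions.

```r
# Minimal t-SNE sketch; assumes the Rtsne package is installed
library(Rtsne)

set.seed(42) # t-SNE is stochastic, so fix the seed for a repeatable map

# Stand-in data: four measurements per flower instead of audio features
# Rtsne rejects duplicate rows by default, so keep unique rows only
features <- unique(as.matrix(iris[, 1:4]))

# Reduce the four original dimensions to two new ones
# (perplexity = 30 is the package default, spelled out here as an assumption)
embedding <- Rtsne(features, dims = 2, perplexity = 30)

# One row per observation, one column per new dimension
head(embedding$Y)
```

Plotting embedding$Y as a scatterplot gives the kind of two-dimensional map the drum machine uses, with similar observations ending up close together.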

The result of this project is fantastic and rightfully bears the name Infinite Drum Machine (click to play)! You can use the two-dimensional map to explore similar sounds and you can even make beats using the sequencing tool. The video below summarizes the creation process.

Amazed by this application, I wanted to know how t-SNE is being used in other projects. I found a tremendous number of applications that demonstrate how to implement t-SNE in Python, R, and even JS, and the method also seems popular in academia.

Luke Metz argues implementation in Python is fairly easy and Analytics Vidhya and a visualized blog by O’Reilly back this claim. Superstar Andrej Karpathy has an interactive t-SNE demo which allows you to compare the similarity among top Twitter users using t-SNE (I think in JavaScript). A Kaggle user and Data Science Heroes have demonstrated how to apply t-SNE in R and have compared the method to other unsupervised methods, for instance to PCA.

Clusters of similar cats/dogs in Luke Metz’s application of t-SNE.

Cho et al., 2014 have used t-SNE in their natural language processing projects as it allows for an easy examination of the similarity among words and phrases. Mnih and colleagues (2015) have used t-SNE to examine how neural networks were playing video games.

Two-dimensional t-SNE visualization of the hidden layer activity of a neural network playing Space Invaders (Mnih et al., 2015)

On a final note, while acknowledging its potential, this blog warns about the inaccuracies in t-SNE that stem from the aesthetic adjustments it often seems to make. The authors have some lovely interactive visualizations to back up their claim. They conclude that its incredible flexibility allows t-SNE to find structure where other methods cannot. Unfortunately, this also makes t-SNE results tricky to interpret, as the algorithm makes all sorts of opaque adjustments to tidy its visualizations and fit the complex information onto just two or three dimensions.

Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R

It has been twenty years since the first Harry Potter novel, the Sorcerer’s/Philosopher’s Stone, was published. To honour the series, I started a text analysis and visualization project, which my other half wittily dubbed Harry Plotter. In several blogs, I intend to demonstrate how Hadley Wickham’s tidyverse and packages that build on its principles, such as tidytext (free book), have taken programming in R to an all-new level. Moreover, I just enjoy making pretty graphs : )

In this first blog (easier read), we will look at the sentiment throughout the books. In a second blog, we have examined the stereotypes behind the Hogwarts houses.

Setup

First, we need to set up our environment in RStudio. We will need several packages for our analyses. Most importantly, Bradley Boehmke was nice enough to gather all Harry Potter books in his harrypotter package on GitHub. We need devtools to install that package the first time, but from then on we can load it as usual. Next, we load the tidytext package, which automates and tidies a lot of the text mining functionality. We also need plyr for a specific function (ldply()). The other tidyverse packages, including ggplot2, dplyr, and tidyr, which I use in almost every one of my projects, we can load in a single bundle. Finally, we load the wordcloud visualization package, which draws on tm.

After loading these packages, I set some additional default options.

# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(tidytext)
library(plyr)
library(tidyverse)
library(wordcloud)

# OPTIONS
options(stringsAsFactors = FALSE, # do not convert strings to factors upon loading
        scipen = 999, # do not convert numbers to e-values
        max.print = 200) # stop printing after 200 values

# VISUALIZATION SETTINGS
theme_set(theme_light()) # set default ggplot theme to light
fs <- 12 # default plot font size

Data preparation

With RStudio set, it’s time to load the text of each book from the harrypotter package, which we then “pipe” (%>% – another magical function from the tidyverse – specifically magrittr) along to bind all objects into a single dataframe. Here, each row represents a book, with the text of each chapter stored in a separate column. We want tidy data, so we use tidyr’s gather() function to turn each column into grouped rows. With tidytext’s unnest_tokens() function we can separate the tokens (in this case, single words) from these chapters.

# LOAD IN BOOK CHAPTERS
# TRANSFORM TO TOKENIZED DATASET
hp_words <- list(
 philosophers_stone = philosophers_stone,
 chamber_of_secrets = chamber_of_secrets,
 prisoner_of_azkaban = prisoner_of_azkaban,
 goblet_of_fire = goblet_of_fire,
 order_of_the_phoenix = order_of_the_phoenix,
 half_blood_prince = half_blood_prince,
 deathly_hallows = deathly_hallows
) %>%
 ldply(rbind) %>% # bind all chapter text to dataframe columns
 mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
 select(-.id) %>% # remove ID column
 gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
 filter(!is.na(text)) %>% # delete the rows/chapters without text
 mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
 unnest_tokens(word, text, token = 'words') # tokenize data frame
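
For a feel of what gather() and unnest_tokens() do, the sketch below runs the same two steps on a made-up two-chapter book. The toy_wide table and its text are hypothetical. Only the function calls mirror the pipeline above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical wide table: one row per book, one column per chapter
toy_wide <- tibble(
  book = "toy_book",
  `1`  = "The boy who lived.",
  `2`  = "The vanishing glass."
)

toy_words <- toy_wide %>%
  gather(key = "chapter", value = "text", -book) %>% # one row per chapter
  mutate(chapter = as.integer(chapter)) %>% # column names arrive as text
  unnest_tokens(word, text) # one row per lowercased word, punctuation dropped

toy_words
```

The two chapter columns become two rows, and those two rows become seven word rows, which is the tidy one-token-per-row format used in the rest of this blog.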

Let’s inspect our current data format with head(), which prints the first rows (default n = 6).

# EXAMINE FIRST WORDS OF SAGA
hp_words %>% head()

##                   book chapter  word
## 1   philosophers_stone       1   the
## 1.1 philosophers_stone       1   boy
## 1.2 philosophers_stone       1   who
## 1.3 philosophers_stone       1 lived
## 1.4 philosophers_stone       1    mr
## 1.5 philosophers_stone       1   and

Word frequency

A next step would be to examine word frequencies.

# PLOT WORD FREQUENCY PER BOOK
hp_words %>%
  group_by(book, word) %>%
  anti_join(stop_words, by = "word") %>% # delete stopwords
  count() %>% # summarize count per word per book
  arrange(desc(n)) %>% # highest freq on top
  group_by(book) %>% # regroup per book only
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # retain top 15 frequent words
  # create barplot
  ggplot(aes(x = -top, fill = book)) + 
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # get rid of legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1, size = fs/1.5), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add labels
       title = "Harry Plotter: Most frequent words throughout the saga") +
  facet_grid(. ~ book) + # separate plot for each book
  coord_flip() # flip axes
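
The counting and ranking steps are easier to check by eye on a tiny example. The sketch below uses a made-up word list and a one-word stop list (both hypothetical) and keeps only the single most frequent word per book.

```r
library(dplyr)

# Hypothetical tokens from two tiny 'books'
toy <- tibble(
  book = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("harry", "harry", "ron", "the", "harry", "dobby", "dobby")
)
toy_stop <- tibble(word = "the") # hypothetical stop word list

toy_top <- toy %>%
  anti_join(toy_stop, by = "word") %>% # delete stop words
  count(book, word) %>% # count per word per book
  group_by(book) %>%
  arrange(desc(n), .by_group = TRUE) %>% # highest frequency on top within each book
  mutate(top = row_number()) %>% # rank within book
  filter(top <= 1) %>% # keep the most frequent word per book
  ungroup()

toy_top
```

Here "harry" wins in book a and "dobby" in book b. Raising the cut-off in filter() to 15 gives the same top-15 selection as in the plot above.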

Unsurprisingly, Harry is the most common word in every single book and Ron and Hermione are also present. Dumbledore’s role as an (irresponsible) mentor becomes greater as the storyline progresses. The plot also nicely depicts other key characters:

Lockhart and Dobby in book 2,
Lupin in book 3,
Moody and Crouch in book 4,
Umbridge in book 5,
Ginny in book 6,
and the final confrontation with He who must not be named in book 7.

Finally, why does J.K. seem to write so obsessively about eyes that look at doors?

Estimating sentiment

Next, we turn to the sentiment of the text. tidytext includes three famous sentiment dictionaries:

AFINN: including bipolar sentiment scores ranging from -5 to 5
bing: labelling words as either positive or negative
nrc: labelling words as positive or negative and with eight emotions (e.g., anger, joy, and surprise)

The following script identifies all words that occur both in the books and the dictionaries and combines them into a long dataframe:

# EXTRACT SENTIMENT WITH THREE DICTIONARIES
hp_senti <- bind_rows(
  # 1 AFINN 
  hp_words %>% 
    inner_join(get_sentiments("afinn"), by = "word") %>% # note: newer tidytext releases name the score column `value`
    filter(score != 0) %>% # delete neutral words
    mutate(sentiment = ifelse(score < 0, 'negative', 'positive')) %>% # identify sentiment
    mutate(score = abs(score)) %>% # all scores to positive
    group_by(book, chapter, sentiment) %>% 
    mutate(dictionary = 'afinn'), # create dictionary identifier
  # 2 BING 
  hp_words %>% 
    inner_join(get_sentiments("bing"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'bing'), # create dictionary identifier
  # 3 NRC 
  hp_words %>% 
    inner_join(get_sentiments("nrc"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'nrc') # create dictionary identifier
)

# EXAMINE FIRST SENTIMENT WORDS
hp_senti %>% head()

## # A tibble: 6 x 6
## # Groups:   book, chapter, sentiment [2]
##                 book chapter      word score sentiment dictionary
## 1 philosophers_stone       1     proud     2  positive      afinn
## 2 philosophers_stone       1 perfectly     3  positive      afinn
## 3 philosophers_stone       1     thank     2  positive      afinn
## 4 philosophers_stone       1   strange     1  negative      afinn
## 5 philosophers_stone       1  nonsense     2  negative      afinn
## 6 philosophers_stone       1       big     1  positive      afinn
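
The join itself is simple: only words that appear in both the text and the dictionary survive. A minimal sketch with a made-up three-word lexicon in the AFINN layout (toy_lexicon and its scores are hypothetical, not taken from the real dictionary):

```r
library(dplyr)

# Hypothetical tokens, one word per row
toy_words <- tibble(word = c("harry", "was", "proud", "of", "the", "strange", "owl"))

# Hypothetical mini-lexicon: a word plus a signed score
toy_lexicon <- tibble(word = c("proud", "strange", "awful"),
                      score = c(2, -1, -3))

toy_senti <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>% # keep words found in both tables
  mutate(sentiment = ifelse(score < 0, "negative", "positive"), # label by sign first
         score = abs(score)) # then make all scores positive

toy_senti
```

Of the seven words, only "proud" and "strange" match the lexicon, so the result has two rows. "awful" is in the lexicon but not in the text and drops out as well.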

Wordcloud

Although wordclouds are not my favorite visualizations, they do allow for a quick display of frequencies among a large body of words.

hp_senti %>%
  group_by(word) %>%
  count() %>% # summarize count per word
  mutate(sqrt_n = sqrt(n)) %>% # take root to decrease outlier impact
  with(wordcloud(word, sqrt_n, max.words = 100))

It appears we need to correct for some words that occur in the sentiment dictionaries but have a different meaning in J.K. Rowling’s books. Most importantly, we need to filter two character names.

# DELETE SENTIMENT FOR CHARACTER NAMES
hp_senti_sel <- hp_senti %>% filter(!word %in% c("harry","moody"))

Words per sentiment

Let’s quickly sketch the remaining words per sentiment.

# VISUALIZE MOST FREQUENT WORDS PER SENTIMENT
hp_senti_sel %>% # NAMES EXCLUDED
  group_by(word, sentiment) %>%
  count() %>% # summarize count per word per sentiment
  group_by(sentiment) %>%
  arrange(sentiment, desc(n)) %>% # most frequent on top
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # keep top 15 frequent words
  ggplot(aes(x = -top, fill = factor(sentiment))) + 
  # create barplot
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add manual labels
       title = "Harry Plotter: Words carrying sentiment as counted throughout the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
  facet_grid(. ~ sentiment) + # separate plot for each sentiment
  coord_flip() # flip axes

This seems ok. Let’s continue to plot the sentiment over time.

Positive and negative sentiment throughout the series

As positive and negative sentiment is included in each of the three dictionaries, we can compare and contrast their scores.

# VISUALIZE POSITIVE/NEGATIVE SENTIMENT OVER TIME
plot_sentiment <- hp_senti_sel %>% # NAMES EXCLUDED
  group_by(dictionary, sentiment, book, chapter) %>%
  summarize(score = sum(score), # summarize AFINN scores
            count = n(), # summarize bing and nrc counts
            # move bing and nrc counts to score 
            score = ifelse(is.na(score), count, score))  %>%
  filter(sentiment %in% c('positive','negative')) %>%   # only retain bipolar sentiment
  mutate(score = ifelse(sentiment == 'negative', -score, score)) %>% # reverse negative values
  # create area plot
  ggplot(aes(x = chapter, y = score)) +    
  geom_area(aes(fill = score > 0),stat = 'identity') +
  scale_fill_manual(values = c('red','green')) + # change colors
  # add black smoothed line without standard error
  geom_smooth(method = "loess", se = FALSE, col = "black") + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Sentiment score", # add labels
       title = "Harry Plotter: Sentiment during the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
     # separate plot per book and dictionary and free up x-axes
  facet_grid(dictionary ~ book, scales = "free_x")
plot_sentiment
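
The summarize() step does three things at once: it sums the AFINN scores, falls back on a simple word count where no score exists (bing and nrc), and flips the sign for negative sentiment. A small sketch on made-up rows (toy_senti is hypothetical) shows the arithmetic. I split the steps over summarize() and mutate() for readability, which is a slight restructuring of the code above.

```r
library(dplyr)

# Hypothetical matched words: AFINN rows carry a score, bing rows do not
toy_senti <- tibble(
  dictionary = c("afinn", "afinn", "afinn", "bing", "bing"),
  sentiment  = c("positive", "negative", "negative", "negative", "negative"),
  score      = c(2, 1, 3, NA, NA)
)

toy_scores <- toy_senti %>%
  group_by(dictionary, sentiment) %>%
  summarize(total = sum(score), # NA when the dictionary has no scores
            count = n(), # number of matched words
            .groups = "drop") %>%
  mutate(score = ifelse(is.na(total), count, total), # use the count as a fallback
         score = ifelse(sentiment == "negative", -score, score)) # negatives below zero

toy_scores
```

The two negative AFINN words add up to -4, the positive one to 2, and the two unscored bing words are counted as -2.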

Let’s zoom in on the smoothed average.

plot_sentiment + coord_cartesian(ylim = c(-100,50)) # zoom in plot

Sentiment seems predominantly negative throughout the series. Particularly salient is that every book ends on a down note, except the Prisoner of Azkaban. Moreover, sentiment becomes more volatile in books four through six. These start out negative, brighten up in the middle, only to end in misery again. In her final book, J.K. Rowling depicts a world about to be conquered by the Dark Lord, and the negative average sentiment clearly reflects this grim outlook.

The bing sentiment dictionary estimates the most negative sentiment on average, but that might be due to this specific text.

Other emotions throughout the series

Finally, let’s look at the other emotions that are included in the nrc dictionary.

# VISUALIZE EMOTIONAL SENTIMENT OVER TIME
hp_senti_sel %>% # NAMES EXCLUDED 
  filter(!sentiment %in% c('negative','positive')) %>% # only retain other sentiments (nrc)
  group_by(sentiment, book, chapter) %>%
  count() %>% # summarize count
  # create area plot
  ggplot(aes(x = chapter, y = n)) +
  geom_area(aes(fill = sentiment), stat = 'identity') + 
  # add black smoothing line without standard error
  geom_smooth(aes(fill = sentiment), method = "loess", se = FALSE, col = 'black') + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Emotion score", # add labels
       title = "Harry Plotter: Emotions during the saga",
       subtitle = "Using tidytext and the nrc sentiment dictionary") +
  # separate plots per sentiment and book and free up x-axes
  facet_grid(sentiment ~ book, scales = "free_x")

This plot is less insightful as either the eight emotions are represented by similar words or J.K. Rowling combines all in her writing simultaneously. Patterns across emotions are highly similar, evidenced especially by the patterns in the Chamber of Secrets. In a next post, I will examine sentiment in a more detailed fashion, testing the differences over time and between characters statistically. For now, I hope you enjoyed these visualizations. Feel free to come back or subscribe to read my subsequent analyses.

The second blog in the Harry Plotter series examines the stereotypes behind the Hogwarts houses.

Geographical maps using Shazam Recognitions

Shazam is a mobile app that can be asked to identify a song by making it “listen” to a piece of music. Due to its immense popularity, the organization’s name quickly turned into a verb used in regular conversation (“Do you know this song? Let’s Shazam it.”). A successful identification is referred to as a Shazam recognition.

Shazam users can opt in to anonymously share their location data with Shazam. Umar Hansa used to work for Shazam and decided to plot the geospatial data of 1 billion Shazam recognitions during one of the company’s “hackdays”. The following wonderful city, country, and world maps are the result.

All visualisations (source) follow the same principle: Dots, representing successful Shazam recognitions, are plotted onto a blank geographical coordinate system. Can you guess the cities represented by these dots?

These first maps have an additional colour coding for operating systems. Can you guess which is which?

Blue dots represent iOS (Apple iPhones) and seem to cluster in the downtown areas, whereas red Android phones dominate the zones further from the city centres. Did you notice something else? Recall that Umar used a blank canvas, not a map from Google. Nevertheless, in all visualizations the road network is clearly visible. Umar guesses that passengers (hopefully not the drivers) often Shazam music playing in the car.

Try to guess the Canadian and American cities below and compare their layout to the two European cities that follow.

The maps were of Toronto, San Francisco, London, and Paris, respectively. It is just amazing how accurately they resemble the actual world. You have got to love the clear Atlantic borders of Europe in the world map below.

Are iPhones less common (among Shazam users) in Southern and Eastern Europe? In contrast, England and the big Japanese and Russian cities jump right out as iPhone hubs. In order to allow users to explore the data in more detail, Umar created an interactive tool comparing his maps to Google’s maps. A publicly available version can be accessed here (note that you can zoom in). This required quite complex code, the details of which are in his blog. For now, here is another beautiful map of England, with (the density of) Shazam recognitions reflected by color intensity on a dark background.

London is so crowded! New York also looks very cool. Central Park, the rivers and the bay are so clearly visible, whereas Governors Island is completely lost on this map.

If you liked this blog, please read Umar’s own blog post on this project for more background information, pieces of the JavaScript code, and the original images. If you wish to follow his work, you can find him on Twitter.

EDIT — Here and here you find an alternative way of visualizing geographical maps using population data as input for line maps in the R-package ggjoy.

HD version of this world map can be found on http://spatial.ly/

Fredericton Property Values — Spot the river flowing through this city

Digitizing the Tour de France 2017 – II

A few weeks back, I gave some examples of how data, predictive analytics, and visualization are changing the Tour de France experience. Today, I came across another wonderful example visualizing the sequences of geospatial data (i.e., the movement) of the cyclists during the 11th stage of the Tour de France (blue dots). Moreover, the locations of the four choppers capturing the live video feed are tracked in yellow.

This short clip again reflects the enormous amounts of rich data currently being collected in this sports event.

Google Facets: Interactive Visualization for Everybody

Last week, Google released Facets, their new, open source visualization tool. Facets consists of two interfaces that allow users to investigate their data at different levels.

Facets Overview provides users with a quick understanding of the distribution of values across the variables in their dataset. Overview is especially helpful in detecting unexpected values, missing values, unbalanced distributions, and skewed distributions. Overview will compute all kinds of statistics for every column (i.e., variable) in your dataset, along with some simple visualizations, such as histograms.

Dive is the name of the second interface of Facets. It provides an intuitive dashboard in which users can explore relationships between data points across the different variables in their dataset. The dashboard is easy to customize and users can control the position, color, and visual representation of each data point based on the underlying values.

Moreover, if the data points have images associated with them, these images can be used as the visual representations of the data points. The latter is especially helpful when Facets is used for its actual purpose: aiding in machine learning processes. The below GIF demonstrates how Facets Dive spots incorrectly labelled images with ease, allowing users to zoom in on a case-by-case level, for instance, to identify a frog that has been erroneously labelled as a cat.

Exploration of the CIFAR-10 dataset using Facets Dive

To use a demo version of the tools with your own data, visit the Facets website. For more details, visit the Facets website or Google’s Research blog on Facets.

‘Wie is de Mol?’ according to Twitter – Part 2 (s17e2)

This is a repost of my LinkedIn article of January 17, 2017.
Unfortunately, due to a lack of time, I never wrote a follow-up.
I did keep scraping the Twitter data, so who knows, it may still come…

TL;DR // Summary

Last week I posted a first blog (Dutch & English) in which I use Twitter to analyse to what extent the Wie is de Mol candidates are suspected. The results showed that the Dutch Twitter community mainly suspected Jeroen at that point, which matched the popular online polls remarkably well. After the second episode, however, Twitter has pointed to a different prime suspect, namely Diederik. In addition, on the advice of several readers, I dug somewhat deeper into the content of the tweets this week. I hope these new analyses help you avoid #tunnelvisie (tunnel vision).

Because of the positive response to the previous blog (Dutch / English), I decided to continue my WIDM project. Although Twitter only allows downloading messages up to 9 days old, I had stored the earlier messages locally, so I can now add the most recent #WIDM tweets to the earlier dataset. That brings the complete dataset to 22,696 unique (re)tweets! These are all tweets posted between December 31, 2016 and the evening of Tuesday, January 16, 2017. Despite my earlier intention, I decided not to include other hashtags in the analysis, because the earlier dataset does not contain them and the download restriction mentioned above meant I could no longer retrieve them. I did extend the analyses based on the suggestions you gave me. So if you, as a reader, have any further tips, suggestions, or comments, do not hesitate to send a message or leave a comment below this blog.

Episode 2: “Meegaand”

There was plenty of tweeting about WIDM again this week. Despite the classic laser-shooting element, the volume for this second episode was a good deal lower than during the season premiere. With ‘only’ 6,491 tweets last Saturday, about 40% fewer were posted than the week before. The number of messages on the Sunday after the episode was also considerably lower. In addition, the Dutch Twitter community appeared to have its mind elsewhere on weekdays.

During last week’s broadcast, Jeroen, Diederik, and Sanne (in that order) were mentioned most. The second episode unfolded differently. Jeroen was pushed out of the top 3 and Diederik took his place. He was mentioned most during the episode, mainly thanks to the closing stage of the broadcast, perhaps because of his convincing story about the cute little beavers (Diederik sure can tell a story). Nevertheless, he is closely followed by Roos and Sanne, whose names were each also tweeted more than 200 times during the broadcast.

Imanuelle was finally noticed as a WIDM candidate this week, after having barely been mentioned by the Dutch Twitter community for an episode and a half. Strikingly, she suddenly shoots up about 28 minutes into the episode. Does anyone know what happened there? Sanne also put in a sprint about 20 minutes after the start. Would that have been during the typewriter assignment? Or were we already laser shooting by then? In contrast to Imanuelle, candidate Thomas remains a wallflower. Whereas Vincent got an enormous boost as the eliminated candidate during the closing stage of last week’s episode, such attention is visible to a lesser extent this week for eliminated candidate Yvonne.

Adding up all tweets, Diederik has taken the baton from earlier front-runner Jeroen after episode two. As shown below, Diederik’s name was mentioned in no less than 6.4% of all tweets. Sanne and Roos also had a good episode, however, and now share third place in terms of mentions.

This ranking differs substantially from the count after episode 1. The figure below shows the relative rise/fall in attention for the various candidates. Here, the total mentions before the start of episode 2 are divided by the mentions since then. Strikingly, high-flyer Jeroen has been discussed a good deal less in relative terms since last Saturday, although he could of course only lose ground after his early peak in the first episode. Imanuelle came, as mentioned, from far down the ranking and thus saw her mentions more than double since last Saturday. Roos was already mid-table last week but has nevertheless also been mentioned almost twice as often on Twitter since the start of the second episode. Personally, I find it striking that Sigrid’s name was not posted more often. Who hides behind a woven-iron picnic table during laser shooting?! That is one way to lose those 750 euros… Furthermore, the saying ‘out of sight, out of mind’ seems to apply to tweets, as Vincent’s fame was very short-lived.

One suggestion I have received since the previous blog is that a count of the actual suspicions would be more informative than a count of the number of times a candidate’s name is mentioned. I am fully aware of this, and in the previous blog I briefly explained why I had decided against it at the time. Nevertheless, this week I looked in more detail at the actual content of the tweets. After consulting some fellow molloten (mole hunters), I zoomed in on the words mol (mole), verdenk* (to suspect), and verdacht* (suspected, suspicious). I instructed the system that mol had to be an exact match, with the exception of a hashtag. So molloot, moltalk, or #wieisdemol were not counted, but #mol was. For both verdenk and verdacht, I allowed any letters (*) to follow, so that words such as verdenkingen (suspicions) and verdachte (suspect) would be counted as well. The outcome of the final count is presented in the figure below, based on the entire dataset of tweets.
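
The matching rules described above can be sketched with a regular expression in base R. This is my reconstruction from the description, not necessarily the exact pattern used for the figure, and the example tweets are made up.

```r
# Hypothetical example tweets; the real dataset is not reproduced here
tweets <- c(
  "Diederik is de mol",
  "#wieisdemol was spannend",
  "Ik verdenk Roos",
  "Typisch molloot gedrag",
  "#mol Sanne",
  "Jeroen doet zo verdacht!"
)

# 'mol' must be a whole word (so '#mol' counts, 'molloot' and '#wieisdemol' do not)
# 'verdenk' and 'verdacht' may be followed by any further letters
pattern <- "\\b(mol\\b|verdenk\\w*|verdacht\\w*)"

# TRUE for every tweet that contains at least one of the suspicion words
is_suspicion <- grepl(pattern, tweets, ignore.case = TRUE, perl = TRUE)

data.frame(tweet = tweets, is_suspicion = is_suspicion)
```

The word boundary \\b does the heavy lifting: a hashtag sign is not a word character, so #mol starts a fresh word, whereas the mol inside #wieisdemol follows another letter and is skipped.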

Although this way of counting obviously leads to lower totals, the distribution and ranking among the candidates is surprisingly similar to the grey bar chart presented earlier. This is also apparent from the scatterplot below. The two ways of counting correlate very strongly and positively, so I am inclined to conclude that the simple count of name mentions on Twitter gives a good picture of the underlying suspicions of the Dutch Twitter community. However, I may well have overlooked important words, so do let me know in a comment below which words I should or should not include from now on. I would also like to hear which way of counting you would like me to stick to. In addition, if the response keeps up, I will try to build an interactive web app so that you can play with the words and data yourselves.

(Tip for useRs: you are better off setting xlim through coord_cartesian(), as it then does not cut the error band off your smoothing line… I only found that out later)
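
A small sketch of why that tip works, assuming ggplot2 is installed and using simulated data (toy is hypothetical). Limits set on the scale with xlim() drop the points outside the range before the smoother is fitted, while coord_cartesian() only zooms the view.

```r
library(ggplot2)

set.seed(1)
toy <- data.frame(x = 1:100) # simulated data, not the tweet counts
toy$y <- sqrt(toy$x) + rnorm(100)

base_plot <- ggplot(toy, aes(x, y)) +
  geom_smooth(method = "loess", formula = y ~ x)

# Scale limits: points outside 20-80 are removed BEFORE smoothing
p_clip <- base_plot + xlim(20, 80)

# Coordinate limits: the smoother still sees all 100 points, only the view zooms
p_zoom <- base_plot + coord_cartesian(xlim = c(20, 80))

# Inspect the x-range over which each smoother was actually computed
fit_clip <- suppressWarnings(layer_data(p_clip)) # warns about the removed rows
fit_zoom <- layer_data(p_zoom)
range(fit_clip$x)
range(fit_zoom$x)
```

The clipped fit only spans the zoomed range, so the line and its band are estimated from fewer points. The zoomed fit spans the full data, which is why its band is not cut off at the edges of the plot.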

For this blog, too, I plotted the mentions of the candidates over time. Both episodes are clearly visible in the daily graph below. On days without broadcasts it is very quiet, except for a number of tweets on the Sundays. The most significant development this week seems to be the rise of Diederik discussed earlier, with which he overtakes Jeroen. Roos has caught up well with Sanne and they now seem to share third place, especially if you take the accusatory words in …

If we compare the standings after this week with the polls on the official WIDM website and the WIDM fan page, Twitter seems to suspect Roos in particular more strongly than the poll respondents do. In addition, Sigrid and Jochem do fairly well in the polls, while they are overlooked by tweeters.

And so we have reached the end of this blog about the second episode of Wie is de Mol 2017. As you may have noticed, I try to stay as objective as possible when writing. On the one hand because, year after year, I turn out to be terrible at unmasking the mole. On the other hand because I have always forgotten half of the events again right after the episode. If you do have a keen eye, are skilled with the written word, and would enjoy adding some substance to the above in future posts, then do get in touch. You can of course also leave all your suspicions, suggestions, comments, or tips in the comments below. And do share the blog and its images with friends or on forums; you do not need to ask permission for this.

I hope you are enjoying this already classic #WIDM season as much as I am, and that after reading this blog you have perhaps come a little closer to unmasking your mole. Cheers, and hopefully see you next week!

– Paul

Link to part 1 (NL)

Link to part 1 (ENG)

Link to part 3 (NL) … still to come

About the author: Paul van der Laken is a PhD candidate at the department of Human Resource Studies at Tilburg University. In collaboration with organizations such as Shell and Unilever, Paul investigates how statistical analysis can be applied within the HR function. Among other things, he studies how organizations can make their policies on international employee assignments more data-driven, and thus more effective. In addition, Paul teaches courses and training sessions in HR data analysis at Tilburg University, TIAS Business School, and in-house at companies.