Category: text mining

‘Wie is the Mol?’ according to Twitter – Part 1 (s17e1)

This is a repost of my Linked-In article of January 10th 2017.
The Dutch version of this blog is posted here.

TL;DR // Summary

In order to analyze which of the contestants of a Dutch television game show was suspected of sabotage by public opinion, 10,000+ #WIDM tweets were downloaded and analyzed. Data analysis of this first episode demonstrates how certain contestants increasingly receive public attention whereas others are quickly abandoned. Hopefully, this wisdom-of-the-crowd approach will ultimately demonstrate who is most likely to be this years’ mole. (link to Dutch blog)

A sneak peak:

Introduction

“Wie is de Mol?” [literal translation: ‘Who is the mole?’], or WIDM in short, is a popular Dutch TV game show that has been running for 17 years. The format consists of 10 famous Dutchies (e.g., actors, comedians, novelists) being sent abroad to complete a series of challenging tasks and puzzles, amassing collective prize money along the way.

However, among the contestants is a mole. This saboteur is carefully trained by the WIDM production team beforehand and his/her secret purpose is to prevent the group from collecting any money. Emphasis on secret, as the mole can only operate if unidentified by the other contestants. Furthermore, at the end of each episode, contestants have to complete a test on their knowledge of the identity of the mole. The one whose suspicions are the furthest off is eliminated and sent back to the cold and rainy Netherlands. This process is repeated every episode until in the final episode only three contestants remain: one mole, one losing finalist, and the winner of the series and thus the prize money.

WIDM has a large, active fanbase of so-called ‘molloten‘, which roughly translates to mole-fanatics. Part of its popularity can be attributed to viewers themselves being challenged to uncover the mole before the end of the series. Although the production team assures that most of the sabotage is not shown to the viewer at home, each episode is filled with visual, musical and textual hints. Frequently, viewers come up with wild theories and detect the most bizarre patterns. In recent years, some dedicated fans even go as far as analyzing the contestants’ personal social media feeds in order to determine who was sent home early. A community has developed with multiple online fora, frequent polling of public suspicions, and even a mobile application so you can compare suspicions and compete with friends. Because of all this public effort, the identity of the mole is frequently known before the actual end of the series.

So, why this blog? Well, first off, I have followed the series for several years myself and, to be honest, my suspicions are often quite far off. Secondly, past year, I played around with twitter data analysis and WIDM seemed like a nice opportunity to dust off that R script. Third, I hoped the LinkedIn community might enjoy a step-by-step example of twitter data analysis. The following is the first of, hopefully, a series of blogs in which I try to uncover the mole using the wisdom of the tweeting crowd. I hope you enjoy it as much as I do.

Analysis & Results

To not keep you waiting, let’s start with the analysis and the results right away.

Time of creation

First, let’s examine when the #WIDM tweets were posted. Episode 1 is clearly visible in the data with most of the traffic occurring in a short timeframe on Saturday evening. Note that unfortunately Twitter only allows data to be downloaded nine days back in time.

# inspect when tweets were posted
hist(tweets.df$created, breaks = 50,xlab = 'Day & Time', main = 'Tweets by date') # simple histogram
ggplot(tweets.df) + geom_histogram(aes(created),col = 'black', fill = 'grey') + 
  labs(x = 'Date & Time', y = 'Frequency', title = '#WIDM tweets over time') + 
  theme_bw()
ggsave('e1_tweetsovertime_histogram.png')

Hashtags

Next, it seemed wise to examine which other hashtags were being used so that future search queries on WIDM can be more comprehensive.

# hashtags frequency
hashtags <- table(tolower(unlist(str_extract_all(tweets.df$text,'#\\S+\\b'))))
head(sort(hashtags,T),20)

           #widm         #moltalk        #widmtips      #wieisdemol        #widm2017 
           10272             1722             1248              360               91 
#etherdiscipline             #app            #npo1             #mol          #widm17 
              56               55               50               47               45 
          #promo             #dtv         #vincent          #oregon           #zinin 
              30               27               27               24               23 
           #2017     #chriszegers        #portland     #tunnelvisie       #ellielust 
              21               20               20               19               18

png('e1_hashtags_wordcloud.png')
wordcloud(names(hashtags),freq = log(hashtags),rot.per = 0)
dev.off()

Because the hashtag I queried was obviously overwhelmingly used in the dataset, this wordcloud depicts hashtags’ by their logarithmic frequency.

Curiously, not all tweets had #widm included in their text. Potentially this is caused by regular expressions I used (more on those later) which may have filtered out hashtags such as #widm-poule whereas Twitter may return those when #WIDM is queried.

Contestant frequencies

Using for-loops and if-statements, described later in this blog, I retrieved the frequency with which contestants were mentioned in the tweets. I had the data in three different formats and the following consists of a series of visualizations of those data.

All tweets combined, contestant Jeroen Kijk in de Vegte (hurray for Dutch surnames) was mentioned most frequently. Vincent Vianen passes him only once retweets are excluded.

If we split the data based on the time of the post relative to the episode, it becomes clear that the majority of the tweets mentioning Vincent occured during the episode’s airtime.

This is likely due to one of two reasons. First of all, Vincent was eliminated in this first episode and the production team of WIDM has the tendency to fool the viewer and frame the contestant that is going to be eliminated as the mole. Often, the eliminated contestant received more airtime and all his/her suspicious behaviors and remarks are showed. Potentially, viewers have tweeted about Vincent throughout the episode because they suspected him. Secondly, Vincent was eliminated at the end of this current episode. This may have roused some positive/negative tweets on his behalf. These would likely not be suspicions by the public though. Let’s see what the data can tell us, by plotting the cumulative name references in tweets per minute during the episode.

Hmm… Apparently, Vincent was not being suspected by Dutch Twitter folk to the extent I had expected. He is not being mentioned any more or less than other contestants (with the exception of Jeroen) up until the very end of the episode. There is a slight bump in the frequency after his blunt behavior, but sentiment for Vincent really kicks in around the 21:25 when it becomes evident that he is going home.

The graph also tells us Jeroen is quite popular throughout the entire episode, whereas both Roos and Sanne receive some last minute boosts in the latter part of the episode. Reference to the rest of the contestants seems to be fairly level.

Also in the tweets that were posted since the episode’s end, Jeroen is mentioned most.

Compared to one of the more popular WIDM polls, our Twitter results seem quite reliable. The four most suspected contestants according to the poll overlap nicely with our Twitter frequencies. The main difference is that Sanne Wallis de Vries is the number one suspect in the poll, whereas she comes in third in our results.

Let us now examine the frequencies of the individual contestants over the course of time, with aggregated frequencies before, during and after the first episode (note: no cumulative here). Note that Vincent has a dotted line as he was eliminated at the end of the first episode. Seemingly, the public immediately lost interest. Jeroen, in particular, seems to be of interest during as well as after the first episode. Enthusiasm about Diederik also increases a fair amount during and after the show. Finally, interest in Roos and Sanne keeps grows, but at a lesser rate. Excitement regarding the rest of the contestants seems to level off.

We have almost come to the end of my Twitter analysis of the first episode of ‘Wie is de Mol?’ 2017. As my main intent was to spark curiosity for WIDM, data visualization, and general programming, I hope this post is received with positive sentiment.

If this blog/post gets a sequel, my main focus would be to track contestant popularity over time, the start of which can be found in the final visualizations below. If I find the time, I will create a more fluent tracking tool, updating on a daily basis, potentially in an interactive Shiny webpage application. I furthermore hope to conduct some explorative text and sentiment analysis – for instance, examining the most frequently used terms to describe specific contestants, what emotions tweets depict, or what makes people retweet others. Possibly, there is even time to perform some network analysis – for instance, examining whether sub-communities exist among the tweeting ‘Molloten‘.

For now, I hope you enjoyed this post! Please do not hesitate to share or use its contents. Also, feel free to comment or to reach out at any time. Maybe you as a reader can suggest additional elements to investigate, or maybe you can spot some obvious errors I made. Also, feel free to download the data yourself and please share any alternative or complementary analyses you conduct. Most of the R script you can find below.in the appendices

Cheers!

Paul van der Laken

Link to Dutch blog (part 1)

About the author: Paul van der Laken is a Ph.D. student at the department of Human Resource Studies at Tilburg University. Working closely with organizations such as Shell and Unilever, Paul conducts research on the application of advanced statistical analysis within the field of HR. Among others, his studies examine how organizations can make global mobility policies more evidence-based and thus effective. Next to this, Paul provides graduate and post-graduate training on HR data analysis at Tilburg University, TIAS Business School and in-house at various organizations.

Appendix: R setup

Let me run you through the packages I used.

# load libraries
libs <- c('plyr','dplyr','tidyr','stringr','twitteR','tm','wordcloud','ggplot2')
lapply(libs,require,character.only = T)

The plyr–family make coding so much easier and prettier. (here for more of my blogs on the tidyverse)
stringr comes in handy when dealing with text data.
twitteR is obviously the package with which to download Twitter data.
Though I think I did not use tm in the current analysis, it will probably come in handy for further text analysis.
wordcloud is not necessarily useful, but does quick frequency visualizations of text data.
It takes a while to become fluent in ggplot2 but it is so much more flexible than the base R plotting. A must have IMHO and I recommend anyone who works with R to learn ggplot sooner rather than later.

Retrieving the contestants

Although it was not really needed, I wanted to load in the 2017 WIDM contestants right from the official website. I quickly regretted this decision as it took me significantly longer than just typing in the names by hand would have. Nevertheless, it posed a good learning experience. Never before had I extracted raw HTML data and, secondly, this allowed me to refresh my knowledge of regular expressions (R specific). For those of you not familiar with regex, they are tremendously valuable and make coding so much easier and prettier. I am still learning myself and found this playlist by Daniel Shiffman quite helpful and entertaining, despite the fact that it is programming in, I think, javascript, and Mr. Shiffman can become overly enthusiastic from time to time.

After extracting the raw HTML data from the WIDM website contestants page (unfortunately, the raw data did not disclose the identity of the mole), I trimmed it down until a vector containing the contestants’ full names remained. By sorting the vector and creating a color palette right away, I hope to have ensured that I use the same color per contestant in future blogs. In case you may wonder, I specifically use a color-blind friendly palette (with two additions) as I have trouble myself. : )

In a later stage, I added a vector containing the first names of the losers (for lack of a better term) to simplify visualization.

#### contestants ####
# retrieve the contestants' names from the website
# load in website
webpage <- readLines('http://wieisdemol.avrotros.nl/contestants/') 
 # retrieve contestant names
contestants <- webpage[grepl('<strong>[A-Za-z]+</strong>',webpage)]
# remove html formatting
contestants <- gsub('</*\\w+>','',trimws(contestants)) 
contestants <- sort(contestants)
# color per contestant
cbPalette <- c("#999999", "#000000", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7","#E9D2D2")
# store losing contestants
losers <- c('Vincent',rep(NA,9))

Retrieving the tweets

This is not the first blog on Twitter analysis in R. Other blogs demonstrate the sequence of steps to follow before one can extract Twitter data in a structured way.

After following these steps, I did a quick exploration of the actual Twitter feeds surrounding the WIDM series and decided that the hashtag #WIDM would serve as a good basis for my extraction. The latest time at which I ran this script is Monday 2017-01-09 17:03 GTM+0, two days after the first episode was aired. It took the system less than 3 minutes to download the 10,503 #WIDM tweets. The tweets and their metadata I stored into a data frame after which I ran a custom cleaning function to extract only the tweeted text.

Next, I subsetted the data in multiple ways. First of all, there seem to be a lot of retweets in my dataset and I expected original messages to differ from those retweeted (in a sense duplicates). Hence, I stored the original tweets in a separate file. Next, I decided to split the tweets based on the time of their creation relative to the show’s airtime. Those in the pre-subset were uploaded before the broadcast, those in the during-subset were posted during the episode, and those in the post-subset were sent after the first episode had ended and the first loser was known.

# download the tweets
system.time(tweets.raw <- searchTwitter('#WIDM', n = 50000,lang = 'nl'))

   user  system elapsed 
  87.75    0.89  159.65

tweets.df <- twListToDF(tweets.raw) # store tweets in dataframe
tweets.text.clean <- CleanTweets(tweets.df$text) # run custom cleaning function on text column
tweets.text.clean.lc <- tolower(tweets.text.clean) # convert to lower case
# store cleaned text without retweets
tweets.text.clean.lc.org <- tweets.text.clean.lc[substring(tweets.df$text,1,2) != 'RT']
# store cleaned text based on time of tweet
airdate <- '2017-01-07'
e1.start <- as.POSIXct(paste(airdate,'19:25:00')) # 20:25 GMT +1 
e1.end <- as.POSIXct(paste(airdate,'20:35:00')) # 21:35 GMT +1 
 # select all tweets before start
tweets.text.clean.lc.pre <- tweets.text.clean.lc[tweets.df$created < e1.start]
# select all tweets during
tweets.text.clean.lc.during <- tweets.text.clean.lc[tweets.df$created > e1.start & tweets.df$created < e1.end]
# select all tweets after 
tweets.text.clean.lc.post <- tweets.text.clean.lc[tweets.df$created > e1.end]

Contestant mentions

Ultimately, my goal was to create some sort of thermometer or measurement instrument to analyze which of the contestants is suspected most by the public. Some of the tweets include quite clear statements of suspicion (“verdenkingen” in Dutch) or just plain accusations:

head(tweets.text.clean[grepl('ik verdenk [A-Z]',tweets.text.clean)])
[1] " widm ik verdenk Sigrid omdat bij de executie  haar reactie erg geacteerd leek"
[2] "Oké  ik verdenk Jochem heel erg  widm" 

head(tweets.text.clean[grepl('[A-Z][a-z]+ is de mol',tweets.text.clean)],4)
[1] "Jeroen is de mol  widm"                                       
[2] "  Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"
[3] "Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"  
[4] "Net  widm teruggekeken  Ik zeg Sigrid is de mol

However, writing the regular expression(s) to retrieve all the different ways in which tweets can accuse contestants or name suspicions would be quite the job. Besides, in this early phase of the series, tweets often just mention uncommon behaviors contestants have displayed, rather than accusing them. I theorized that those who act more suspiciously would be mentioned more frequently on Twitter, and decided to do a simple count of contestant name occurrences.

What follows are three multi-layer for-loops; probably super inefficient for larger datasets, but here it does the trick in mere seconds while being relatively easy to program. In it, I loop through the different subsets I created earlier and do a contestant name count in each of these. I also count references over time and during the episode’s airtime in specific. I recommend you to scroll past it quickly.

#### LOOP :: BEFORE / DURING / AFTER ####
# times contestants are mentioned in tweets
named <- data.frame() # create empty dataframe
# loop through contestants
for(i in contestants){
  # convert contestant first name to lower case 
  name <- tolower(word(i,1))
  # create counter for number of mentions
  count.rt <- 0
  # loop through all cleaned up, lower case tweets
  for (j in 1:length(tweets.text.clean.lc)){
    # extract current tweet
    tweet <- tweets.text.clean.lc[j]
    # if contestants' name occurs in current tweet
    if(grepl(name,tweet)){
      count.rt <- count.rt + 1 # counter++
    }
  }
  
[......truncated......]
  
  # store number of mentions in dataframe
  named <- rbind.data.frame(named,
                               cbind(Contestant = i,
                                     Total = count.rt,
                                     Original= count.org,
                                     BeforeEp = count.pre,
                                     DuringEp = count.during,
                                     AfterEp = count.post),
                               stringsAsFactors = F)
  
  print(paste(i,'... done!'))
  # continue to next contestant
}


#### LOOP :: OVERTIME ####
# create empty dataframe
named.overtime <- data.frame()
# loop through every day
for(Day in unique(as.Date(tweets.df$created))){
  # select only tweets of that day
  tweets <- tweets.text.clean.lc[as.Date(tweets.df$created) == Day]
  # print progress
  cat(as.Date(as.numeric(Day), origin = '1970-01-01'),':',sep = '')
  # loop through contestants
  for(i in contestants){
    # extract first name in lower case
    name <- tolower(word(i,1))
    # set counter at zero
    Count <- 0
    # loop through every single tweet
    for(j in 1:length(tweets)){
      # extract tweet
      tweet <- tweets[j]
      if(grepl(name,tweet)){
        Count <- Count + 1
      }
    }
    # add to data frame
    named.overtime <- rbind.data.frame(
      named.overtime,
      cbind(Day,Contestant = word(i,1),Count),
      stringsAsFactors = F
    )
    # next contestant 
    # print progress
    cat(word(i,1),' ')
  }
  cat('\n')
}


#### LOOP :: DURING EPISODE ####
# create empty dataframe
named.during <- data.frame()
# loop through every minute of the episode
for(Minute in unique(minutes.during)){
  # extract the tweets during that minute
  temp <- tweets.text.clean.lc.during[minutes.during == Minute]
  # loop through every contestant
  for(i in contestants){
    # save the lowercase name
    name = tolower(word(i,1))
    # create counter
    Count <- 0
    # loop through every tweet of that minute
    for(tweet in temp){
      # if contestant is mentioned, add one to counter
      if(grepl(name,tweet)){
        Count <- Count + 1
      }
    }
    # store all data in date frame
    named.during <- rbind.data.frame(named.during,
                                       cbind(Minute,Contestant = word(i,1),Count),
                                       stringsAsFactors = F)
  }
}

After some final transformations, I have the following tables nicely stored two data frames.

> named
                Contestant Total Original BeforeEp DuringEp AfterEp
1           Diederik Jekel   358      201       16      128     214
2         Imanuelle Grives    96       65        5       43      48
3  Jeroen Kijk in de Vegte   517      335       12      218     287
4        Jochem van Gelder   157      124       15       73      69
5           Roos Schlikker   203      154        7       88     108
6    Sanne Wallis de Vries   255      194        3      106     146
7         Sigrid Ten Napel   135      102        7       65      63
8          Thomas Cammaert    97       69        5       45      47
9           Vincent Vianen   406      354       19      285     102
10      Yvonne Coldeweijer   148      109       16       66      66

> named.overtime
          Day Contestant Count Cumulative
       <date>      <chr> <int>      <int>
1  2016-12-31   Diederik     0          0
2  2016-12-31  Imanuelle     0          0
3  2016-12-31     Jeroen     0          0
4  2016-12-31     Jochem     0          0
5  2016-12-31       Roos     1          1
6  2016-12-31      Sanne     0          0
7  2016-12-31     Sigrid     0          0
8  2016-12-31     Thomas     0          0
9  2016-12-31    Vincent     0          0
10 2016-12-31     Yvonne     0          0
# ... with 90 more rows

> named.during
  Minute Contestant Count Cumulative
   <chr>      <chr> <int>      <int>
1  19:25   Diederik     0          0
2  19:25  Imanuelle     0          0
3  19:25     Jeroen     0          0
4  19:25     Jochem     0          0
5  19:25       Roos     0          0
6  19:25      Sanne     0          0

In order to summarize the frequency table above in a straightforward visual, I wrote the following custom function to automate the generation of barplots for each of the subsets I created earlier.

# assign fixed value to y axis limits to simplify comparison
y.max <- ceiling(max(named$Total)/100)*100 

# custom ggplot function for sideways barplot
GeomBarFlipped <- function(data, x, y, y.max = y.max,
x.lab = 'Contestant', y.lab = '# Mentioned', title.lab){
  ggplot(data) + 
    geom_bar(aes(x = reorder(x,y), y = y), stat = 'identity') + 
    geom_text(aes(x = reorder(x,y), y = y, label = y), 
              col = 'white', hjust = 1.3) + 
    labs(x = x.lab, y = y.lab, title = title.lab) + 
    lims(y = c(0,y.max)) + theme_bw() + coord_flip() 
}

For further details surrounding the analyses, please feel free to reach out.

“Wie is de Mol?” volgens Twitter: Deel 1 (s17e1)

Dit is een re-post van mijn Linked-In artikel van 10 januari 2017.
Deel 2 vind je hier.

TL;DR // Samenvatting

Om te achterhalen in welke mate kandidaten verdacht worden door het Nederlandse publiek heb ik 10,000+ #WIDM tweets gedownload en geanalyseerd. Hoewel niets is wat het lijkt, komt uit de tweets duidelijk naar voren dat bepaalde kandidaten zich in de ogen van Twitterend Nederland verdachter gedragen dan anderen. Het doel van deze blog(s) is tweedelig. Enerzijds hoop ik de lezer een voorbeeld te geven van hoe Twitter gegevens kunnen worden gebruikt en geanalyseerd. Anderzijds hoop ik te kunnen profiteren van de zogenoemde wisdom-of-the-crowd en op den duur te achterhalen wie er dit jaar aan het mollen is. Een voorproefje:

Introductie

“Wie is de Mol?“, of WIDM in het kort, behoeft voor Nederlands publiek eigenlijk geen introductie. De spelshow loopt inmiddels 17 jaar en heeft een fanatieke achterban. Deze zelf-benoemde ‘molloten‘ houden ieder frame van iedere aflevering nauw in de gaten, achterhalen de meeste bizarre patronen en verzinnen de wildste theoriën. Afgelopen jaren zijn er zelfs uitgebreide analyses uitgevoerd op de kandidaten hun persoonlijke social media feeds om te achterhalen wie er mogelijk vervroegd terug in Nederland was. Tevens zijn er meerdere online fora, worden verdenkingen regelmatig gepollt, hebben twee NRC-redacteurs een wekelijkse bespreking en is er zelfs een mobile applicatiezodat je jouw vrienden kunt uitdagen.

‘Waarom dan deze blog?’ vraag je je misschien af. Wel, ik ben al enkele jaren een trouwe volger van de serie, maar mijn eigen verdenkingen zijn vaak verre van goed. Met deze analyses hoop ik inzicht te krijgen welke kandidaten het meest zijn opgevallen bij de kijker thuis. Tevens is programmeren mijn werk en hobby en was dit een goed excuus om een oud R-script weer eens af te stoffen. Deze Nederlandse versie van de blog bevat weinig details over de achterliggende code. Mocht je meer willen weten of het R script willen zien, lees dan vooral de Engelse versie of stuur een berichtje.

Aflevering 1: “… ZO GEDAAN”

De tweets downloaden

Dit is zeker niet de eerste blog over Twitter data analyse in R. Andere blogs laten zien welke stappen je moet volgen om gegevens op een gestructureerde manier van Twitter te downloaden. Na deze stappen zelf te hebben uitgevoerd, heb ik een snelle zoektocht gedaan door de Twitter feeds over WIDM. De hashtag #WIDM werd in de meeste berichten gebruikt en leek dus een goed begin. Maandag 9 januari 2017 om 18:03 heb ik voor het laatst de data van Twitter gedownload. Op dat moment waren er iets minder dan twee dagen verstreken sinds aflevering 1 was uitgezonden. De 10.503 #WIDM tweets waren binnen 3 minuten gedownload waarna ik ze heb opgeschoond met een aantal regular expressions (hierover meer in de Engelse versie).

Analyse en resultaten

Met de data nu onder de knoppen kon het analyseren beginnen

Tijd van posten

Allereerst leek het van belang om te kijken wanneer de #WIDM tweets waren gepost. Aflvering 1 is duidelijk terug te zien in de onderstaande visualisatie, waar een grote piek zich precies bevindt rond de tijd van de uitzending afgelopen zaterdag. Helaas staat Twitter slechts downloads toe van tweets gedurende de afgelopen 9 dagen, dus in 2016 kon ik helaas niet veel zien.

Hashtags

Daarnaast leek het verstandig om na te gaan of ik andere gangbare hashtags over het hoofd had gezien. Zo ja, dan zou ik in vervolg analyses ook de data van andere hashtags kunnen downloaden .

# hashtags frequency
hashtags <- table(tolower(unlist(str_extract_all(tweets.df$text,'#\\S+\\b'))))
head(sort(hashtags,T),20)

           #widm         #moltalk        #widmtips      #wieisdemol        #widm2017 
           10272             1722             1248              360               91 
#etherdiscipline             #app            #npo1             #mol          #widm17 
              56               55               50               47               45 
          #promo             #dtv         #vincent          #oregon           #zinin 
              30               27               27               24               23 
           #2017     #chriszegers        #portland     #tunnelvisie       #ellielust 
              21               20               20               19               18

Naast de door mij gebruikte hashtag (#WIDM) werden #moltaks, #widmtips en #wieisdemol ook veelvuldig gebruikt. Ook Ellie Lust en Chris Zegers zijn nog steeds populair zo te zien.

Kandidaat vermeldingen

Het uiteindelijke doel dat ik voor ogen had was om een soort van thermometer of meetinstrument te programmeren waarmee ik in een oogopslag kon zien wie van de kandidaten het meest verdacht werd door Twitterend Nederland. Sommige tweets in de dataset bevatten inderdaad verdenkingen van bepaalde kandidaten en andere waren kort door de bocht mollen aan het benoemen. (Wat doet Jan Dino daar?)

head(tweets.text.clean[grepl('ik verdenk [A-Z]',tweets.text.clean)])
[1] " widm ik verdenk Sigrid omdat bij de executie  haar reactie erg geacteerd leek"
[2] "Oké  ik verdenk Jochem heel erg  widm" 

head(tweets.text.clean[grepl('[A-Z][a-z]+ is de mol',tweets.text.clean)],4)
[1] "Jeroen is de mol  widm"                                       
[2] "  Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"
[3] "Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"  
[4] "Net  widm teruggekeken  Ik zeg Sigrid is de mol

Echter, er zijn veel verschillende manieren om met woorden te zeggen in hoeverre je iemand verdenkt. Daarnaast kunnen zinnen ontkenningen of zelf dubbele ontkenningen bevatten. Hoewel dit alles te programmeren valt, besloot ik om een simpelere route te bewandelen. Mijn theorie is dat kandidaten die zich meer verdacht gedragen tijdens de uitzendingen en kandidaten waarnaar meer hints verwijzen automatisch meer besproken worden op het internet. Indien deze theorie klopt, dan zou een relatief simpele telling van het aantal keer dat kandidaten worden genoemd in tweets voldoende zijn.Deze theorie is kort door de bocht, en ik ben zeker van plan om uitgebreidere analyses te draaien, maar voor nu heiligt het doel de middelen.

Zodoende heb ik mijn laptop met verschillende for-loops en if-statements opgedragen om ieder van de 10.000+ tweets te bekijken en te tellen hoe vaak ieder van de kandidaten in deze tweets werd genoemd. Handmatig zou dit dagen duren, maar de laptop bliepte triomfantelijk na enkele seconden. Zo beschikte ik over onder andere de volgende twee datasets:

> named
                Contestant Total Original BeforeEp DuringEp AfterEp
1           Diederik Jekel   358      201       16      128     214
2         Imanuelle Grives    96       65        5       43      48
3  Jeroen Kijk in de Vegte   517      335       12      218     287
4        Jochem van Gelder   157      124       15       73      69
5           Roos Schlikker   203      154        7       88     108
6    Sanne Wallis de Vries   255      194        3      106     146
7         Sigrid Ten Napel   135      102        7       65      63
8          Thomas Cammaert    97       69        5       45      47
9           Vincent Vianen   406      354       19      285     102
10      Yvonne Coldeweijer   148      109       16       66      66

> named.overtime
          Day Contestant Count Cumulative
       <date>      <chr> <chr>      <dbl>
1  2016-12-31   Diederik     0          0
2  2016-12-31  Imanuelle     0          0
3  2016-12-31     Jeroen     0          0
4  2016-12-31     Jochem     0          0
5  2016-12-31       Roos     1          1
6  2016-12-31      Sanne     0          0
7  2016-12-31     Sigrid     0          0
8  2016-12-31     Thomas     0          0
9  2016-12-31    Vincent     0          0
10 2016-12-31     Yvonne     0          0
# ... with 90 more rows

Alle tweets samengevoegd werd kandidaat Jeroen Kijk in de Vegte het meeste genoemd gedurende 9 dagen tweet-historie. Vincent Vianen werd echter vaker genoemd als men alleen de orginele tweets zou meerekenen.

Als we uitsplitsen naar het moment dat de tweets werden gepost wordt al snel zichtbaar dat Vincent vooral werd genoemd tijdens de aflevering.

Voor diegene die WIDM niet volgen: Vincent viel af deze eerste aflevering. Dit zou kunnen verklaren waarom hij zo veel Twitter-aandacht heeft gekregen gedurende de aflevering. Ook lijkt de WIDM productie de laatste jaren extra veel zendtijd te besteden aan de kandidaat die af gaat vallen. Als om de kijker op de verkeerde voet te zetten worden alle mogelijke molacties en rare opmerkingen van de toekomstige afvaller benadrukt tijdens de aflevering. Lang verhaal kort, wellicht is het interessant om de tweets gedurende de aflevering per minuut te volgen:

Het lijkt er op dat Vincent niet meer of minder werd besproken dan de andere kandidaten tot aan het laatste kwartier van de uitzending. Dit duidt erop dat hij wellicht niet zozeer als verdacht werd gezien door twitteraars, maar dat het weggeven van zijn vrijstelling in het laatste half uur (die toch al zou worden afgepakt) en zijn aankomende vertrek uit de serie, de tweets hebben veroorzaakt. Ook Roos lijkt een eindspurt te hebben genomen in het laatste kwartier van de uitzending, en Sanne pakt nog een snelle boost in de laatste paar minuten.

Tijdens de uitzending werd er stevig over Jeroen getweet, en dit zette zich door na de aflevering waar zijn naam wederom het meeste werd genoemd. Wellicht heb ik een verdachte handeling gemist die Twitterend Nederland wel is opgevallen? De wilde theorie over zijn verstandskiezen heb ik in ieder geval zeker gemist. Naast Jeroen werden Diederik en Sanne ook veelvuldig besproken na het slot van de aflevering. Thomas en Imanuelle, daarentegen, kregen erg weinig aandacht in het algemeen. Alle tweets bij elkaar opgeteld komen ze maar net aan de 100 vermeldingen de helft waarvan na de aflevering.

Vergeleken met een van de populaire WIDM polls, doen de resultaten van onze Twitter analyse het redelijk goed. De vier meest verdachte kandidaten volgens de poll vallen mooi samen met de vermeldingen op Twitter (nadat bekend was dat Vincent afviel). Het grootste verschil is dat Sanne Wallis de Vriesde nummer een verdachte is in de pol maar bij onze resultaten op de derde plek uitkomt.

Laten we de eerdere staafdiagrammen nog eens bekijken maar nu in een grafiek waarin we de individuele kandidaten volgen over de loop van de tijd. Onderstaande grafiek geeft de opgetelde vermeldingen voor, tijdens en na de eerste aflevering weer. Vincent heeft een stippellijn gekregen omdat hij is afgevallen deze aflevering. Blijkbaar was het publiek ook meteen een deel van haar interesse in hem kwijt want zijn vermeldingen kelderen sterk meteen na afloop van de aflevering. Jeroen is de sterkste stijger, met Didierik als achtervolger. Roos en Sanne worden ook steeds regelmatiger genoemd, maar toch een stuk minder dan hun voorgangers. De rest van de groep lijkt nauwelijks te worden opgemerkt door de twitteraars.

Mocht deze blog een vervolg krijgen, dan denk ik dat de focus ligt op het volgen van kandidaat populariteit over een langere periode. Een beginnetje hier van kun je hieronder vinden in de laatste twee grafieken. Mocht ik de tijd vrij kunnen maken, dan hoop ik een dagelijkse tracker te kunnen maken, wellicht in Shiny zodat lezers zelf interactief met de achterliggende data kunnen spelen. Anderzijds zou het interessant zijn om diepgaandere text en sentiment analyses uit te voeren, bijvoorbeeld door te kijken naar welke termen worden gebruikt om kandidaten te omschrijven, welke gevoelens en emoties verstopt gaan in de tweets, of wat mensen ertoe zet een bericht te retweeten. Daarnaast zou het inzichtelijk kunnen zijn om netwerkanalyses uit te voeren, bijvoorbeeld om te achterhalen of er subgroepen bestaan onder de twitterende ‘Molloten‘.

Ik hoop dat jij net zo genoten hebt van deze blog als ik! Schroom niet om de inhoud te delen of anderwijs te gebruiken. Laat ook vooral een reactie achter onder dit bericht of stuur een persoonlijk berichtje. Wellicht kun jij als lezer bedenken wat voor informatie we nog meer uit de data kunenn trekken, of hoe we de gegevens wellicht inzichtelijker kunnen weer-/vormgeven. Ik ben in ieder geval benieuwd naar jullie reacties!

Link naar de Engels versie van deze blog (inclusief code)

Link naar deel 2 (NL)

TACIT: An open-source Text Analysis, Crawling, and Interpretation Tool

Click here for the original PDF: TACIT 2017

The first programs for (scientific) text mining are already over 50 years old. More recent efforts, such as the Linguistic Inquiry Word Count (LIWC; Tausczik & Pennebaker, 2010), have greatly improved our text analytical capabilities. Moreover, several single-purpose programs have been developed, which also consider syntactic text structures (e.g., Syntactic Complexity Analyzer [Lu, 2010], TAALES [Kyle & Crossley, 2015]).However, the widespread use of many of these programs has been hampered by two major barriers.

First, considerable technical expertise is required, which obstructs researchers without statistical backgrounds. For example, packages such as tm in R (Meyer et al., 2015) have been developed to conduct natural-language processing, but the steep learning curve forms a challenge. Additionally, the constant increase of computational processing power and the proliferation of new algorithms makes it difficult for researchers to maintain working knowledge of state-of-the-art methods.

Alternatively, most of the existing user-friendly NLP programs (and packages), such as RapidMiner (Akthar & Hahne, 2012), SAS Text Miner (Abell, 2014), or SPSS Modeler (IBM Corp., 2011), charge either a large software fee up front or a subscription fee. The cost of these programs can be prohibitively expensive for junior researchers and researchers looking to integrate new techniques into their research toolbox.

In the attached article, TACIT is introduced: Text Analysis, Crawling and Investigation Tool. TACIT is an open-source architecture that establishes a pipeline between the various stages of text-based research by integrating tools for text mining, data cleaning, and analysis under a single user-friendly architecture. In addition to being prepackaged with a range of easily applied, cutting-edge methods, TACIT’s design also allows other researchers to write their own plugins.

The authors’ hope is that TACIT can facilitate the integration and use of advancements in computational linguistics in psychological research, and by doing so can help researchers make use of the ever-growing documents of our social discourse in ways that have previously not been possible.

TL;DR // Summary

Introduction

Analysis & Results

Time of creation

Hashtags

Contestant frequencies

Paul van der Laken

Appendix: R setup

Retrieving the contestants

Retrieving the tweets

Contestant mentions

Share this:

TL;DR // Samenvatting

Introductie

Aflevering 1: “… ZO GEDAAN”

De tweets downloaden

Analyse en resultaten

Tijd van posten

Hashtags

Kandidaat vermeldingen

Share this:

Share this: