Category: report

IBM’s Watson for Oncology: A Biased and Unproven Recommendation System in Cancer Treatment?

The below reiterates and summarizes this Stat article.

Recently, I addressed how bias may slip into Machine Learning applications and this weekend I came across another real-life example: IBM’s Watson, specifically Watson for Oncology. With a single machine, IBM intended to tackle humanity’s most vexing diseases and revolutionize medicine and they quickly zeroed in on a high-profile target: cancer.

However, three years later now, a STAT investigation has found that the supercomputer isn’t living up to the lofty expectations IBM created for it. IBM claims that, through Artificial Intelligence, Watson for Oncology can generate new insights and identify “new approaches” to cancer care. However, the STAT investigation (video below) concludes that the system doesn’t create new knowledge and is artificially intelligent only in the most rudimentary sense of the term. Similarly, cancer specialists using the product argue Watson is still in its “toddler stage” when it comes to oncology.

Let’s start with the positive side. For specific treatments, Watson can scan academic literature, immediately providing the “best data” about a treatment — survival rates, for example — thereby relieving doctors of tedious literature searches. Due to this transparency, Watson may level the hierarchy commonly found in hospital settings, by holding (senior) doctors accountable to the data and empowering junior physicians to back up their arguments. Furthermore, Watson’s information may empower patients as they can be offered a comprehensive packet of treatment options, including potential treatment plans along with relevant scientific articles. Patients can do their own research about these treatments, and maybe even disagree with the doctor about the right course of action.

Although study results demonstrate that Watson saves doctors time and can have a high concordance rate with their treatment recommendations, much more research is needed. The studies were all conference abstracts, which haven’t been published in peer-reviewed journals — and all but one was either conducted by a paying customer or included IBM staff on the author list, or both. More importantly, IBM has failed to exposed Watson for Oncology to critical review by outside scientists nor have they conducted clinical trials to assess its effectiveness. It would be very interesting to examine whether Watson’s implementation is actually saving lives or making healthcare more efficient/effective.

imaging-video[1].jpg — IBM Watson Health

Such validation is especially necessary because several issues are identified. First, the actual capabilities of Watson for Oncology are not well-understood by the public, and even by some of the hospitals that use it. It’s taken nearly six years of painstaking work by data engineers and doctors to train Watson in just seven types of cancer, and keep the system updated with the latest knowledge. Moreover, because of the complexity of the underlying machine learning algorithms, the recommendations Watson puts out are a black box, and Watson can not provide the specific reasons for picking treatment A over treatment B.

Second, the system is essentially Memorial Sloan Kettering in a portable box. IBM celebrates Memorial Sloan Kettering’s role as the only trainer of Watson. After all, who better to educate the system than doctors at one of the world’s most renowned cancer hospitals? However, doctors claim that Memorial Sloan Kettering’s training has caused bias in the system, because the treatment recommendations it puts into Watson don’t always comport with the practices of doctors elsewhere in the world. When users ask Watson for advice, the system also searches published literature — some of which is curated by Memorial Sloan Kettering — to provide relevant studies and background information to support its recommendation. But the recommendation itself is derived from the training provided by the hospital’s doctors, not the outside literature.

Doctors at Memorial Sloan Kettering acknowledged their influence on Watson. “We are not at all hesitant about inserting our bias, because I think our bias is based on the next best thing to prospective randomized trials, which is having a vast amount of experience,” said Dr. Andrew Seidman, one of the hospital’s lead trainers of Watson. “So it’s a very unapologetic bias.”

However, this bias causes serious problems when Watson for Oncology is implemented in other countries/hospitals. The generally affluent population treated at Memorial Sloan Kettering doesn’t reflect the diversity of people around the world. According to Martijn van Oijen, an epidemiologist and associate professor at Academic Medical Center in the Netherlands, Watson has not been implemented in because of country level differences in treatment approaches. Similarly, oncologists at one hospital in Denmark said they have dropped implementation altogether after finding that local doctors agreed with Watson in only about 33 percent of cases. Different problems occurred in South Korea, where researchers reported that the treatment Watson most often recommended for breast cancer patients simply wasn’t covered by their national insurance system.

Kris, the lead trainer at Memorial Sloan Kettering, says nobody wants to hear the problems. “All they want to hear is that Watson is the answer. And it always has the right answer, and you get it right away, and it will be cheaper. But like anything else, it’s kind of human.”

‘Wie is de Mol?’ volgens Twitter – Deel 2 (s17e2)

Dit is een repost van mijn Linked-In artikel van 17 januari 2017.
Helaas heb ik er door gebrek aan tijd geen vervolg meer aan gegeven.
De twitter data ben ik wel blijven scrapen, dus wie weet komt het nog…

TL;DR // Samenvatting

Vorige week postte ik een eerste blog (Nederlands & Engels) waarin ik Twitter gebruik om te analyseren in hoeverre Wie is de Mol-kandidaten worden verdacht. De resultaten toonden dat Twitterend Nederland toen vooral Jeroen verdacht vond en dit kwam opvallend overeen met de populaire online polls. Na de tweede aflevering heeft Twitter echter een andere hoofdverdachte aangewezen, namelijk Diederik. Verder heb ik deze week, op aanraden van diverse lezers, iets dieper gegraven in de inhoud van de tweets. Ik hoop dat deze nieuwe analyses jou helpen #tunnelvisie te voorkomen.

Door de positieve respons op de vorige blog (Nederlands / Engels) heb ik besloten mijn WIDM project een vervolg te geven. Ondanks dat Twitter slechts toestaat om berichten tot en met 9 dagen terug te downloaden, had ik de eerdere berichten lokaal opgeslagen zodat ik nu de meest recente #WIDM tweets aan de eerdere dataset kan toevoegen. De complete dataset komt daarmee op 22,696 unieke (re)tweets! Dit zijn alle tweets gepost tussen 31 december 2016 en de avond van dinsdag 16 januari 2017. Ondanks mijn voornemen heb besloten om geen andere hashtags mee te nemen in de analyse, omdat de eerdere dataset die gegevens niet bevat en ik door de bovengenoemde download restrictie niet meer aan die gegevens kon komen. Wel heb ik de analyses uitgebreid op basis van de suggesties die jullie me hebben gegeven. Mocht jij als lezer dus nog tips, suggesties of opmerkingen hebben, schroom dan vooral niet om een berichtje te sturen of een reactie te plaatsen onder deze blog.

Aflevering 2: “Meegaand”

Er is deze week weer flink getweet over WIDM. Ondanks het klassieke laserschiet-element lag het volume deze tweede aflevering een stuk lager dan tijdens de seizoenspremière. Met ‘slechts’ 6,491 tweets afgelopen zaterdag werd er ongeveer 40% minder gepost dan vorige week. Ook het aantal berichten op de zondag na de aflevering was beduidend lager. Daarnaast bleek Twitterend Nederland doordeweeks met haar gedachten ergens anders te zitten.

Tijdens de uitzending van vorige week werden Jeroen, Diederik en Sanne (in die volgorde) het meeste genoemd. Het verloop van de tweede aflevering ziet er anders uit. Jeroen is verstoten uit de top 3 en Diederik heeft zijn plek overgenomen. Hij werd het meest genoemd tijdens de aflevering en heeft dit vooral te danken aan de slotfase van de uitzending, wellicht door zijn geloofwaardige verhaal over de schattige bevertjes (wat kan Diederik goed vertellen zeg). Desalniettemin wordt hij kort gevolgd door Roos en Sanne, wiens beider naam tijdens de uitzending ook meer dan 200 keer werd getwitterd.

Imanuelle werd deze week eindelijk opgemerkt als WIDM kandidaat, na anderhalve aflevering nauwelijks te zijn genoemd door twitterend Nederland. Opvallend is hoe zij na ongeveer 28 minuten in de aflevering opeens drastisch omhoog schiet. Iemand een idee wat daar gebeurde? Ook Sanne nam een sprintje ongeveer 20 minuten na de start. Zou dit tijdens die typmachineopdracht zijn? Of waren we toen al aan het laserschieten? Instegenstelling tot Imanuelle is en blijft kandidaat Thomas een muurbloempje. Hoewel Vincent vorige week tijdens de slotfase van de aflevering een enorme boost kreeg als afvaller is zulke belangstelling deze week in mindere mate zichtbaar voor afvaller Yvonne.

Alle tweets bij elkaar opgeteld heeft Diederik na aflevering twee het stokje overgenomen van eerdere koploper Jeroen. Zoals hieronder zichtbaar werd Diederik zijn naam in maar liefst 6.4% van alle tweets genoemd. Sanne en Roos hebben echter ook een goede aflevering gedraaid en staan nu op een gedeelde derde plaats qua vermeldingen.

Deze rangorde verschilt substantieel van de telling na aflevering 1. Onderstaande figuur geeft de relatieve stijging/daling in de belangstelling voor de verschillende kandidaten weer. Hierbij zijn de totale vermeldingen voor de start van aflevering 2 gedeeld door de vermeldingen sindsdien. Opvallend is dat hoogvlieger Jeroen relatief een stuk minder besproken is sinds afgelopen zaterdag, echter kon hij natuurlijk ook alleen maar verliezen met zijn vroege piek in de eerste aflevering. Imanuelle kwam, zoals eerder gezegd, van ver onderaan de rangorde en zag haar vermeldingen zodoende meer dan verdubbelen sinds afgelopen zaterdag. Roos stond vorige week al in de middenmoot maar is desondanks ook bijna dubbel zo vaak genoemd op Twitter sinds de start van de tweede aflevering. Persoonlijk vind ik het opvallend dat Sigrid haar naam niet vaker is gepost. Wie gaat er tijdens het laserschieten nou schuilen achter een gewoven ijzeren picknicktafel?! Zo raak je die 750 euro wel kwijt ja… Verder lijkt het spreekwoord ‘Uit het oog, uit het hart’ op te gaan als het op tweets aankomt want Vincent’s roem was van zeer korte duur.

Een suggestie heb gekregen sinds de vorige blog, is dat een telling van de daadwerkelijke verdenkingen informatiever zou zijn dan een telling van het aantal keer dat een kandidaat zijn of haar naam genoemd is. Hier ben ik mij volledig van bewust en in de vorige blog heb ik al kort uitgelegd waarom ik toentertijd besloten had dit niet te doen. Desalniettemin heb ik deze week gedetailleerder gekeken naar de daadwerkelijke inhoud van de tweets. Na beraad bij enkele mede-molloten heb ik ingezoomd op de woorden mol, verdenk* en verdacht*. Hierbij heb ik het systeem opgedragen dat moleen precieze match moest hebben, met uitzondering van een hashtag. Zo zijn bijvoorbeeld molloot, moltalk of #wieisdemol niet geteld, maar #mol wel. Bij zowel verdenk en verdacht heb ik toegestaan dat zij gevolgd mochten worden door willekeurige letters (*), zodat ook woorden zoals verdenkingen en verdachte zouden worden meegeteld. De uitkomst van de uiteindelijke telling is gepresenteerd in de figuur hieronder. Hierbij is de gehele dataset aan tweets gebruikt.

Hoewel deze manier van tellen uiteraard tot minder hoge totalen leidt, is de verdeling en rangorde onder de kandidaten verassend gelijk aan de eerder gepresenteerde grijze staafdiagram. Dit blijkt ook uit onderstaande scatterplot. De twee manieren van tellen hangen zeer sterk positief met elkaar samen en zodoende neig ik te concluderen dat de simpele telling van het aantal naamsvermeldingen op Twitter een goed beeld geeft van de onderliggende verdenkingen van twitterend Nederland. Echter is het goed mogelijk dat ik belangrijke woorden over het hoofd heb gezien, dus laat vooral in een reactie hieronder weten welke woorden ik in het vervolg wel/niet mee moet nemen. Ook hoor ik graag welke manier van tellen jullie graag hebben dat ik aanhoud. Daarnaast zal ik bij aanhoudende respons proberen een interactieve webapp maken zodat jullie zelf met de woorden en data kunnen spelen.

(Tip voor useRs: je kunt xlim beter gebruiken met coord_cartesian(), dan knipt hij de error band niet van je smoothing line af… daar kwam ik later pas achter)

Ook voor deze blog heb ik de vermeldingen van de kandidaten over de loop van de tijd uitgedraaid. Beiden afleveringen zijn goed zichtbaar in onderstaande grafiek op dagbasis. Op dagen zonder uitzendingen is het erg stil, met uitzondering van een aantal tweets op de zondag. De meest significante ontwikkeling deze week lijkt de eerder besproken stijging van Diederik, waarmee hij Jeroen inhaalt. Roos heeft een goede inhaalslag gemaakt ten opzichte van Sanne en zij lijken de derde plek nu te delen, zeker als je de beschuldigende woorden in d

Als we de stand na deze week vergelijken met de polls op de officiële WIDM website en de WIDM fanpagina, dan lijkt Twitter vooral Roos sterker te verdenken dan de respondenten van de polls dat doen. Daarnaast doen Sigrid en Jochem het vrij goed in de peilingen, terwijl zij door twitteraars over het hoofd worden gezien.

En zo zijn we aan het eind gekomen van deze blog over de tweede aflevering van Wie is de Mol 2017. Zoals je wellicht hebt gemerkt probeer ik bij het schrijven zo objectief mogelijk te blijven. Enerzijds omdat ik jaar op jaar verschrikkelijk slecht blijk in het ontmaskeren van de mol. Anderzijds omdat ik na de aflevering altijd al de helft van de gebeurtenissen al weer vergeten ben. Heb jij wel een oplettend oog, ben je bedreven in het geschreven woord en lijkt het je leuk om het bovenstaande in het vervolg van wat inhoud te voorzien neem dan vooral contact op. Verder kun je hieronder in de reacties natuurlijk ook al je verdenkingen, suggesties, opmerkingen of tips kwijt. Deel daarnaast de blog en haar plaatjes vooral met vrienden of op fora, je hoeft hiervoor geen toestemming te vragen.

Ik hoop dat jullie net zo genieten van dit nu al klassieke #WIDM seizoen als ik, en dat jullie na het lezen van deze blog wellicht iets dichter zijn gekomen bij het ontmaskeren van jullie mol. Groetjes, en hopelijk tot volgende week!

– Paul

Link naar deel 1 (NL)

Link naar deel 1 (ENG)

Link naar deel 3 (NL) … komt nog

Over de auteur: Paul van der Laken is promovendus aan het department Human Resource Studies van Tilburg University. In samenwerking met organisaties zoals Shell en Unilever onderzoekt Paul hoe statistische analyse kan worden ingezet binnen de P&O/HR-functie. Hij verdiept zich onder andere in hoe organisaties hun beleid omtrent het internationaal uitzenden van medewerkers meer data-gedreven, en dus effectiever, kunnen maken. Hiernaast geeft Paul cursussen en trainingen in HR data analyse aan Tilburg University, TIAS Business School en inhouse bij bedrijven.

‘Wie is the Mol?’ according to Twitter – Part 1 (s17e1)

This is a repost of my Linked-In article of January 10th 2017.
The Dutch version of this blog is posted here.

TL;DR // Summary

In order to analyze which of the contestants of a Dutch television game show was suspected of sabotage by public opinion, 10,000+ #WIDM tweets were downloaded and analyzed. Data analysis of this first episode demonstrates how certain contestants increasingly receive public attention whereas others are quickly abandoned. Hopefully, this wisdom-of-the-crowd approach will ultimately demonstrate who is most likely to be this years’ mole. (link to Dutch blog)

A sneak peak:

Introduction

“Wie is de Mol?” [literal translation: ‘Who is the mole?’], or WIDM in short, is a popular Dutch TV game show that has been running for 17 years. The format consists of 10 famous Dutchies (e.g., actors, comedians, novelists) being sent abroad to complete a series of challenging tasks and puzzles, amassing collective prize money along the way.

However, among the contestants is a mole. This saboteur is carefully trained by the WIDM production team beforehand and his/her secret purpose is to prevent the group from collecting any money. Emphasis on secret, as the mole can only operate if unidentified by the other contestants. Furthermore, at the end of each episode, contestants have to complete a test on their knowledge of the identity of the mole. The one whose suspicions are the furthest off is eliminated and sent back to the cold and rainy Netherlands. This process is repeated every episode until in the final episode only three contestants remain: one mole, one losing finalist, and the winner of the series and thus the prize money.

WIDM has a large, active fanbase of so-called ‘molloten‘, which roughly translates to mole-fanatics. Part of its popularity can be attributed to viewers themselves being challenged to uncover the mole before the end of the series. Although the production team assures that most of the sabotage is not shown to the viewer at home, each episode is filled with visual, musical and textual hints. Frequently, viewers come up with wild theories and detect the most bizarre patterns. In recent years, some dedicated fans even go as far as analyzing the contestants’ personal social media feeds in order to determine who was sent home early. A community has developed with multiple online fora, frequent polling of public suspicions, and even a mobile application so you can compare suspicions and compete with friends. Because of all this public effort, the identity of the mole is frequently known before the actual end of the series.

So, why this blog? Well, first off, I have followed the series for several years myself and, to be honest, my suspicions are often quite far off. Secondly, past year, I played around with twitter data analysis and WIDM seemed like a nice opportunity to dust off that R script. Third, I hoped the LinkedIn community might enjoy a step-by-step example of twitter data analysis. The following is the first of, hopefully, a series of blogs in which I try to uncover the mole using the wisdom of the tweeting crowd. I hope you enjoy it as much as I do.

Analysis & Results

To not keep you waiting, let’s start with the analysis and the results right away.

Time of creation

First, let’s examine when the #WIDM tweets were posted. Episode 1 is clearly visible in the data with most of the traffic occurring in a short timeframe on Saturday evening. Note that unfortunately Twitter only allows data to be downloaded nine days back in time.

# inspect when tweets were posted
hist(tweets.df$created, breaks = 50,xlab = 'Day & Time', main = 'Tweets by date') # simple histogram
ggplot(tweets.df) + geom_histogram(aes(created),col = 'black', fill = 'grey') + 
  labs(x = 'Date & Time', y = 'Frequency', title = '#WIDM tweets over time') + 
  theme_bw()
ggsave('e1_tweetsovertime_histogram.png')

Hashtags

Next, it seemed wise to examine which other hashtags were being used so that future search queries on WIDM can be more comprehensive.

# hashtags frequency
hashtags <- table(tolower(unlist(str_extract_all(tweets.df$text,'#\\S+\\b'))))
head(sort(hashtags,T),20)

           #widm         #moltalk        #widmtips      #wieisdemol        #widm2017 
           10272             1722             1248              360               91 
#etherdiscipline             #app            #npo1             #mol          #widm17 
              56               55               50               47               45 
          #promo             #dtv         #vincent          #oregon           #zinin 
              30               27               27               24               23 
           #2017     #chriszegers        #portland     #tunnelvisie       #ellielust 
              21               20               20               19               18

png('e1_hashtags_wordcloud.png')
wordcloud(names(hashtags),freq = log(hashtags),rot.per = 0)
dev.off()

Because the hashtag I queried was obviously overwhelmingly used in the dataset, this wordcloud depicts hashtags’ by their logarithmic frequency.

Curiously, not all tweets had #widm included in their text. Potentially this is caused by regular expressions I used (more on those later) which may have filtered out hashtags such as #widm-poule whereas Twitter may return those when #WIDM is queried.

Contestant frequencies

Using for-loops and if-statements, described later in this blog, I retrieved the frequency with which contestants were mentioned in the tweets. I had the data in three different formats and the following consists of a series of visualizations of those data.

All tweets combined, contestant Jeroen Kijk in de Vegte (hurray for Dutch surnames) was mentioned most frequently. Vincent Vianen passes him only once retweets are excluded.

If we split the data based on the time of the post relative to the episode, it becomes clear that the majority of the tweets mentioning Vincent occured during the episode’s airtime.

This is likely due to one of two reasons. First of all, Vincent was eliminated in this first episode and the production team of WIDM has the tendency to fool the viewer and frame the contestant that is going to be eliminated as the mole. Often, the eliminated contestant received more airtime and all his/her suspicious behaviors and remarks are showed. Potentially, viewers have tweeted about Vincent throughout the episode because they suspected him. Secondly, Vincent was eliminated at the end of this current episode. This may have roused some positive/negative tweets on his behalf. These would likely not be suspicions by the public though. Let’s see what the data can tell us, by plotting the cumulative name references in tweets per minute during the episode.

Hmm… Apparently, Vincent was not being suspected by Dutch Twitter folk to the extent I had expected. He is not being mentioned any more or less than other contestants (with the exception of Jeroen) up until the very end of the episode. There is a slight bump in the frequency after his blunt behavior, but sentiment for Vincent really kicks in around the 21:25 when it becomes evident that he is going home.

The graph also tells us Jeroen is quite popular throughout the entire episode, whereas both Roos and Sanne receive some last minute boosts in the latter part of the episode. Reference to the rest of the contestants seems to be fairly level.

Also in the tweets that were posted since the episode’s end, Jeroen is mentioned most.

Compared to one of the more popular WIDM polls, our Twitter results seem quite reliable. The four most suspected contestants according to the poll overlap nicely with our Twitter frequencies. The main difference is that Sanne Wallis de Vries is the number one suspect in the poll, whereas she comes in third in our results.

Let us now examine the frequencies of the individual contestants over the course of time, with aggregated frequencies before, during and after the first episode (note: no cumulative here). Note that Vincent has a dotted line as he was eliminated at the end of the first episode. Seemingly, the public immediately lost interest. Jeroen, in particular, seems to be of interest during as well as after the first episode. Enthusiasm about Diederik also increases a fair amount during and after the show. Finally, interest in Roos and Sanne keeps grows, but at a lesser rate. Excitement regarding the rest of the contestants seems to level off.

We have almost come to the end of my Twitter analysis of the first episode of ‘Wie is de Mol?’ 2017. As my main intent was to spark curiosity for WIDM, data visualization, and general programming, I hope this post is received with positive sentiment.

If this blog/post gets a sequel, my main focus would be to track contestant popularity over time, the start of which can be found in the final visualizations below. If I find the time, I will create a more fluent tracking tool, updating on a daily basis, potentially in an interactive Shiny webpage application. I furthermore hope to conduct some explorative text and sentiment analysis – for instance, examining the most frequently used terms to describe specific contestants, what emotions tweets depict, or what makes people retweet others. Possibly, there is even time to perform some network analysis – for instance, examining whether sub-communities exist among the tweeting ‘Molloten‘.

For now, I hope you enjoyed this post! Please do not hesitate to share or use its contents. Also, feel free to comment or to reach out at any time. Maybe you as a reader can suggest additional elements to investigate, or maybe you can spot some obvious errors I made. Also, feel free to download the data yourself and please share any alternative or complementary analyses you conduct. Most of the R script you can find below.in the appendices

Cheers!

Paul van der Laken

Link to Dutch blog (part 1)

About the author: Paul van der Laken is a Ph.D. student at the department of Human Resource Studies at Tilburg University. Working closely with organizations such as Shell and Unilever, Paul conducts research on the application of advanced statistical analysis within the field of HR. Among others, his studies examine how organizations can make global mobility policies more evidence-based and thus effective. Next to this, Paul provides graduate and post-graduate training on HR data analysis at Tilburg University, TIAS Business School and in-house at various organizations.

Appendix: R setup

Let me run you through the packages I used.

# load libraries
libs <- c('plyr','dplyr','tidyr','stringr','twitteR','tm','wordcloud','ggplot2')
lapply(libs,require,character.only = T)

The plyr–family make coding so much easier and prettier. (here for more of my blogs on the tidyverse)
stringr comes in handy when dealing with text data.
twitteR is obviously the package with which to download Twitter data.
Though I think I did not use tm in the current analysis, it will probably come in handy for further text analysis.
wordcloud is not necessarily useful, but does quick frequency visualizations of text data.
It takes a while to become fluent in ggplot2 but it is so much more flexible than the base R plotting. A must have IMHO and I recommend anyone who works with R to learn ggplot sooner rather than later.

Retrieving the contestants

Although it was not really needed, I wanted to load in the 2017 WIDM contestants right from the official website. I quickly regretted this decision as it took me significantly longer than just typing in the names by hand would have. Nevertheless, it posed a good learning experience. Never before had I extracted raw HTML data and, secondly, this allowed me to refresh my knowledge of regular expressions (R specific). For those of you not familiar with regex, they are tremendously valuable and make coding so much easier and prettier. I am still learning myself and found this playlist by Daniel Shiffman quite helpful and entertaining, despite the fact that it is programming in, I think, javascript, and Mr. Shiffman can become overly enthusiastic from time to time.

After extracting the raw HTML data from the WIDM website contestants page (unfortunately, the raw data did not disclose the identity of the mole), I trimmed it down until a vector containing the contestants’ full names remained. By sorting the vector and creating a color palette right away, I hope to have ensured that I use the same color per contestant in future blogs. In case you may wonder, I specifically use a color-blind friendly palette (with two additions) as I have trouble myself. : )

In a later stage, I added a vector containing the first names of the losers (for lack of a better term) to simplify visualization.

#### contestants ####
# retrieve the contestants' names from the website
# load in website
webpage <- readLines('http://wieisdemol.avrotros.nl/contestants/') 
 # retrieve contestant names
contestants <- webpage[grepl('<strong>[A-Za-z]+</strong>',webpage)]
# remove html formatting
contestants <- gsub('</*\\w+>','',trimws(contestants)) 
contestants <- sort(contestants)
# color per contestant
cbPalette <- c("#999999", "#000000", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7","#E9D2D2")
# store losing contestants
losers <- c('Vincent',rep(NA,9))

Retrieving the tweets

This is not the first blog on Twitter analysis in R. Other blogs demonstrate the sequence of steps to follow before one can extract Twitter data in a structured way.

After following these steps, I did a quick exploration of the actual Twitter feeds surrounding the WIDM series and decided that the hashtag #WIDM would serve as a good basis for my extraction. The latest time at which I ran this script is Monday 2017-01-09 17:03 GTM+0, two days after the first episode was aired. It took the system less than 3 minutes to download the 10,503 #WIDM tweets. The tweets and their metadata I stored into a data frame after which I ran a custom cleaning function to extract only the tweeted text.

Next, I subsetted the data in multiple ways. First of all, there seem to be a lot of retweets in my dataset and I expected original messages to differ from those retweeted (in a sense duplicates). Hence, I stored the original tweets in a separate file. Next, I decided to split the tweets based on the time of their creation relative to the show’s airtime. Those in the pre-subset were uploaded before the broadcast, those in the during-subset were posted during the episode, and those in the post-subset were sent after the first episode had ended and the first loser was known.

# download the tweets
system.time(tweets.raw <- searchTwitter('#WIDM', n = 50000,lang = 'nl'))

   user  system elapsed 
  87.75    0.89  159.65

tweets.df <- twListToDF(tweets.raw) # store tweets in dataframe
tweets.text.clean <- CleanTweets(tweets.df$text) # run custom cleaning function on text column
tweets.text.clean.lc <- tolower(tweets.text.clean) # convert to lower case
# store cleaned text without retweets
tweets.text.clean.lc.org <- tweets.text.clean.lc[substring(tweets.df$text,1,2) != 'RT']
# store cleaned text based on time of tweet
airdate <- '2017-01-07'
e1.start <- as.POSIXct(paste(airdate,'19:25:00')) # 20:25 GMT +1 
e1.end <- as.POSIXct(paste(airdate,'20:35:00')) # 21:35 GMT +1 
 # select all tweets before start
tweets.text.clean.lc.pre <- tweets.text.clean.lc[tweets.df$created < e1.start]
# select all tweets during
tweets.text.clean.lc.during <- tweets.text.clean.lc[tweets.df$created > e1.start & tweets.df$created < e1.end]
# select all tweets after 
tweets.text.clean.lc.post <- tweets.text.clean.lc[tweets.df$created > e1.end]

Contestant mentions

Ultimately, my goal was to create some sort of thermometer or measurement instrument to analyze which of the contestants is suspected most by the public. Some of the tweets include quite clear statements of suspicion (“verdenkingen” in Dutch) or just plain accusations:

head(tweets.text.clean[grepl('ik verdenk [A-Z]',tweets.text.clean)])
[1] " widm ik verdenk Sigrid omdat bij de executie  haar reactie erg geacteerd leek"
[2] "Oké  ik verdenk Jochem heel erg  widm" 

head(tweets.text.clean[grepl('[A-Z][a-z]+ is de mol',tweets.text.clean)],4)
[1] "Jeroen is de mol  widm"                                       
[2] "  Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"
[3] "Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"  
[4] "Net  widm teruggekeken  Ik zeg Sigrid is de mol

However, writing the regular expression(s) to retrieve all the different ways in which tweets can accuse contestants or name suspicions would be quite the job. Besides, in this early phase of the series, tweets often just mention uncommon behaviors contestants have displayed, rather than accusing them. I theorized that those who act more suspiciously would be mentioned more frequently on Twitter, and decided to do a simple count of contestant name occurrences.

What follows are three multi-layer for-loops; probably super inefficient for larger datasets, but here it does the trick in mere seconds while being relatively easy to program. In it, I loop through the different subsets I created earlier and do a contestant name count in each of these. I also count references over time and during the episode’s airtime in specific. I recommend you to scroll past it quickly.

#### LOOP :: BEFORE / DURING / AFTER ####
# times contestants are mentioned in tweets
named <- data.frame() # create empty dataframe
# loop through contestants
for(i in contestants){
  # convert contestant first name to lower case 
  name <- tolower(word(i,1))
  # create counter for number of mentions
  count.rt <- 0
  # loop through all cleaned up, lower case tweets
  for (j in 1:length(tweets.text.clean.lc)){
    # extract current tweet
    tweet <- tweets.text.clean.lc[j]
    # if contestants' name occurs in current tweet
    if(grepl(name,tweet)){
      count.rt <- count.rt + 1 # counter++
    }
  }
  
[......truncated......]
  
  # store number of mentions in dataframe
  named <- rbind.data.frame(named,
                               cbind(Contestant = i,
                                     Total = count.rt,
                                     Original= count.org,
                                     BeforeEp = count.pre,
                                     DuringEp = count.during,
                                     AfterEp = count.post),
                               stringsAsFactors = F)
  
  print(paste(i,'... done!'))
  # continue to next contestant
}


#### LOOP :: OVERTIME ####
# create empty dataframe
named.overtime <- data.frame()
# loop through every day
for(Day in unique(as.Date(tweets.df$created))){
  # select only tweets of that day
  tweets <- tweets.text.clean.lc[as.Date(tweets.df$created) == Day]
  # print progress
  cat(as.Date(as.numeric(Day), origin = '1970-01-01'),':',sep = '')
  # loop through contestants
  for(i in contestants){
    # extract first name in lower case
    name <- tolower(word(i,1))
    # set counter at zero
    Count <- 0
    # loop through every single tweet
    for(j in 1:length(tweets)){
      # extract tweet
      tweet <- tweets[j]
      if(grepl(name,tweet)){
        Count <- Count + 1
      }
    }
    # add to data frame
    named.overtime <- rbind.data.frame(
      named.overtime,
      cbind(Day,Contestant = word(i,1),Count),
      stringsAsFactors = F
    )
    # next contestant 
    # print progress
    cat(word(i,1),' ')
  }
  cat('\n')
}


#### LOOP :: DURING EPISODE ####
# create empty dataframe
named.during <- data.frame()
# loop through every minute of the episode
for(Minute in unique(minutes.during)){
  # extract the tweets during that minute
  temp <- tweets.text.clean.lc.during[minutes.during == Minute]
  # loop through every contestant
  for(i in contestants){
    # save the lowercase name
    name = tolower(word(i,1))
    # create counter
    Count <- 0
    # loop through every tweet of that minute
    for(tweet in temp){
      # if contestant is mentioned, add one to counter
      if(grepl(name,tweet)){
        Count <- Count + 1
      }
    }
    # store all data in date frame
    named.during <- rbind.data.frame(named.during,
                                       cbind(Minute,Contestant = word(i,1),Count),
                                       stringsAsFactors = F)
  }
}

After some final transformations, I have the following tables nicely stored two data frames.

> named
                Contestant Total Original BeforeEp DuringEp AfterEp
1           Diederik Jekel   358      201       16      128     214
2         Imanuelle Grives    96       65        5       43      48
3  Jeroen Kijk in de Vegte   517      335       12      218     287
4        Jochem van Gelder   157      124       15       73      69
5           Roos Schlikker   203      154        7       88     108
6    Sanne Wallis de Vries   255      194        3      106     146
7         Sigrid Ten Napel   135      102        7       65      63
8          Thomas Cammaert    97       69        5       45      47
9           Vincent Vianen   406      354       19      285     102
10      Yvonne Coldeweijer   148      109       16       66      66

> named.overtime
          Day Contestant Count Cumulative
       <date>      <chr> <int>      <int>
1  2016-12-31   Diederik     0          0
2  2016-12-31  Imanuelle     0          0
3  2016-12-31     Jeroen     0          0
4  2016-12-31     Jochem     0          0
5  2016-12-31       Roos     1          1
6  2016-12-31      Sanne     0          0
7  2016-12-31     Sigrid     0          0
8  2016-12-31     Thomas     0          0
9  2016-12-31    Vincent     0          0
10 2016-12-31     Yvonne     0          0
# ... with 90 more rows

> named.during
  Minute Contestant Count Cumulative
   <chr>      <chr> <int>      <int>
1  19:25   Diederik     0          0
2  19:25  Imanuelle     0          0
3  19:25     Jeroen     0          0
4  19:25     Jochem     0          0
5  19:25       Roos     0          0
6  19:25      Sanne     0          0

In order to summarize the frequency table above in a straightforward visual, I wrote the following custom function to automate the generation of barplots for each of the subsets I created earlier.

# assign fixed value to y axis limits to simplify comparison
y.max <- ceiling(max(named$Total)/100)*100 

# custom ggplot function for sideways barplot
GeomBarFlipped <- function(data, x, y, y.max = y.max,
x.lab = 'Contestant', y.lab = '# Mentioned', title.lab){
  ggplot(data) + 
    geom_bar(aes(x = reorder(x,y), y = y), stat = 'identity') + 
    geom_text(aes(x = reorder(x,y), y = y, label = y), 
              col = 'white', hjust = 1.3) + 
    labs(x = x.lab, y = y.lab, title = title.lab) + 
    lims(y = c(0,y.max)) + theme_bw() + coord_flip() 
}

For further details surrounding the analyses, please feel free to reach out.

“Wie is de Mol?” volgens Twitter: Deel 1 (s17e1)

Dit is een re-post van mijn Linked-In artikel van 10 januari 2017.
Deel 2 vind je hier.

TL;DR // Samenvatting

Om te achterhalen in welke mate kandidaten verdacht worden door het Nederlandse publiek heb ik 10,000+ #WIDM tweets gedownload en geanalyseerd. Hoewel niets is wat het lijkt, komt uit de tweets duidelijk naar voren dat bepaalde kandidaten zich in de ogen van Twitterend Nederland verdachter gedragen dan anderen. Het doel van deze blog(s) is tweedelig. Enerzijds hoop ik de lezer een voorbeeld te geven van hoe Twitter gegevens kunnen worden gebruikt en geanalyseerd. Anderzijds hoop ik te kunnen profiteren van de zogenoemde wisdom-of-the-crowd en op den duur te achterhalen wie er dit jaar aan het mollen is. Een voorproefje:

Introductie

“Wie is de Mol?“, of WIDM in het kort, behoeft voor Nederlands publiek eigenlijk geen introductie. De spelshow loopt inmiddels 17 jaar en heeft een fanatieke achterban. Deze zelf-benoemde ‘molloten‘ houden ieder frame van iedere aflevering nauw in de gaten, achterhalen de meeste bizarre patronen en verzinnen de wildste theoriën. Afgelopen jaren zijn er zelfs uitgebreide analyses uitgevoerd op de kandidaten hun persoonlijke social media feeds om te achterhalen wie er mogelijk vervroegd terug in Nederland was. Tevens zijn er meerdere online fora, worden verdenkingen regelmatig gepollt, hebben twee NRC-redacteurs een wekelijkse bespreking en is er zelfs een mobile applicatiezodat je jouw vrienden kunt uitdagen.

‘Waarom dan deze blog?’ vraag je je misschien af. Wel, ik ben al enkele jaren een trouwe volger van de serie, maar mijn eigen verdenkingen zijn vaak verre van goed. Met deze analyses hoop ik inzicht te krijgen welke kandidaten het meest zijn opgevallen bij de kijker thuis. Tevens is programmeren mijn werk en hobby en was dit een goed excuus om een oud R-script weer eens af te stoffen. Deze Nederlandse versie van de blog bevat weinig details over de achterliggende code. Mocht je meer willen weten of het R script willen zien, lees dan vooral de Engelse versie of stuur een berichtje.

Aflevering 1: “… ZO GEDAAN”

De tweets downloaden

Dit is zeker niet de eerste blog over Twitter data analyse in R. Andere blogs laten zien welke stappen je moet volgen om gegevens op een gestructureerde manier van Twitter te downloaden. Na deze stappen zelf te hebben uitgevoerd, heb ik een snelle zoektocht gedaan door de Twitter feeds over WIDM. De hashtag #WIDM werd in de meeste berichten gebruikt en leek dus een goed begin. Maandag 9 januari 2017 om 18:03 heb ik voor het laatst de data van Twitter gedownload. Op dat moment waren er iets minder dan twee dagen verstreken sinds aflevering 1 was uitgezonden. De 10.503 #WIDM tweets waren binnen 3 minuten gedownload waarna ik ze heb opgeschoond met een aantal regular expressions (hierover meer in de Engelse versie).

Analyse en resultaten

Met de data nu onder de knoppen kon het analyseren beginnen

Tijd van posten

Allereerst leek het van belang om te kijken wanneer de #WIDM tweets waren gepost. Aflvering 1 is duidelijk terug te zien in de onderstaande visualisatie, waar een grote piek zich precies bevindt rond de tijd van de uitzending afgelopen zaterdag. Helaas staat Twitter slechts downloads toe van tweets gedurende de afgelopen 9 dagen, dus in 2016 kon ik helaas niet veel zien.

Hashtags

Daarnaast leek het verstandig om na te gaan of ik andere gangbare hashtags over het hoofd had gezien. Zo ja, dan zou ik in vervolg analyses ook de data van andere hashtags kunnen downloaden .

# hashtags frequency
hashtags <- table(tolower(unlist(str_extract_all(tweets.df$text,'#\\S+\\b'))))
head(sort(hashtags,T),20)

           #widm         #moltalk        #widmtips      #wieisdemol        #widm2017 
           10272             1722             1248              360               91 
#etherdiscipline             #app            #npo1             #mol          #widm17 
              56               55               50               47               45 
          #promo             #dtv         #vincent          #oregon           #zinin 
              30               27               27               24               23 
           #2017     #chriszegers        #portland     #tunnelvisie       #ellielust 
              21               20               20               19               18

Naast de door mij gebruikte hashtag (#WIDM) werden #moltaks, #widmtips en #wieisdemol ook veelvuldig gebruikt. Ook Ellie Lust en Chris Zegers zijn nog steeds populair zo te zien.

Kandidaat vermeldingen

Het uiteindelijke doel dat ik voor ogen had was om een soort van thermometer of meetinstrument te programmeren waarmee ik in een oogopslag kon zien wie van de kandidaten het meest verdacht werd door Twitterend Nederland. Sommige tweets in de dataset bevatten inderdaad verdenkingen van bepaalde kandidaten en andere waren kort door de bocht mollen aan het benoemen. (Wat doet Jan Dino daar?)

head(tweets.text.clean[grepl('ik verdenk [A-Z]',tweets.text.clean)])
[1] " widm ik verdenk Sigrid omdat bij de executie  haar reactie erg geacteerd leek"
[2] "Oké  ik verdenk Jochem heel erg  widm" 

head(tweets.text.clean[grepl('[A-Z][a-z]+ is de mol',tweets.text.clean)],4)
[1] "Jeroen is de mol  widm"                                       
[2] "  Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"
[3] "Ik weet het zeker    Jandino is de mol  amoz  widm  moltalk"  
[4] "Net  widm teruggekeken  Ik zeg Sigrid is de mol

Echter, er zijn veel verschillende manieren om met woorden te zeggen in hoeverre je iemand verdenkt. Daarnaast kunnen zinnen ontkenningen of zelf dubbele ontkenningen bevatten. Hoewel dit alles te programmeren valt, besloot ik om een simpelere route te bewandelen. Mijn theorie is dat kandidaten die zich meer verdacht gedragen tijdens de uitzendingen en kandidaten waarnaar meer hints verwijzen automatisch meer besproken worden op het internet. Indien deze theorie klopt, dan zou een relatief simpele telling van het aantal keer dat kandidaten worden genoemd in tweets voldoende zijn.Deze theorie is kort door de bocht, en ik ben zeker van plan om uitgebreidere analyses te draaien, maar voor nu heiligt het doel de middelen.

Zodoende heb ik mijn laptop met verschillende for-loops en if-statements opgedragen om ieder van de 10.000+ tweets te bekijken en te tellen hoe vaak ieder van de kandidaten in deze tweets werd genoemd. Handmatig zou dit dagen duren, maar de laptop bliepte triomfantelijk na enkele seconden. Zo beschikte ik over onder andere de volgende twee datasets:

> named
                Contestant Total Original BeforeEp DuringEp AfterEp
1           Diederik Jekel   358      201       16      128     214
2         Imanuelle Grives    96       65        5       43      48
3  Jeroen Kijk in de Vegte   517      335       12      218     287
4        Jochem van Gelder   157      124       15       73      69
5           Roos Schlikker   203      154        7       88     108
6    Sanne Wallis de Vries   255      194        3      106     146
7         Sigrid Ten Napel   135      102        7       65      63
8          Thomas Cammaert    97       69        5       45      47
9           Vincent Vianen   406      354       19      285     102
10      Yvonne Coldeweijer   148      109       16       66      66

> named.overtime
          Day Contestant Count Cumulative
       <date>      <chr> <chr>      <dbl>
1  2016-12-31   Diederik     0          0
2  2016-12-31  Imanuelle     0          0
3  2016-12-31     Jeroen     0          0
4  2016-12-31     Jochem     0          0
5  2016-12-31       Roos     1          1
6  2016-12-31      Sanne     0          0
7  2016-12-31     Sigrid     0          0
8  2016-12-31     Thomas     0          0
9  2016-12-31    Vincent     0          0
10 2016-12-31     Yvonne     0          0
# ... with 90 more rows

Alle tweets samengevoegd werd kandidaat Jeroen Kijk in de Vegte het meeste genoemd gedurende 9 dagen tweet-historie. Vincent Vianen werd echter vaker genoemd als men alleen de orginele tweets zou meerekenen.

Als we uitsplitsen naar het moment dat de tweets werden gepost wordt al snel zichtbaar dat Vincent vooral werd genoemd tijdens de aflevering.

Voor diegene die WIDM niet volgen: Vincent viel af deze eerste aflevering. Dit zou kunnen verklaren waarom hij zo veel Twitter-aandacht heeft gekregen gedurende de aflevering. Ook lijkt de WIDM productie de laatste jaren extra veel zendtijd te besteden aan de kandidaat die af gaat vallen. Als om de kijker op de verkeerde voet te zetten worden alle mogelijke molacties en rare opmerkingen van de toekomstige afvaller benadrukt tijdens de aflevering. Lang verhaal kort, wellicht is het interessant om de tweets gedurende de aflevering per minuut te volgen:

Het lijkt er op dat Vincent niet meer of minder werd besproken dan de andere kandidaten tot aan het laatste kwartier van de uitzending. Dit duidt erop dat hij wellicht niet zozeer als verdacht werd gezien door twitteraars, maar dat het weggeven van zijn vrijstelling in het laatste half uur (die toch al zou worden afgepakt) en zijn aankomende vertrek uit de serie, de tweets hebben veroorzaakt. Ook Roos lijkt een eindspurt te hebben genomen in het laatste kwartier van de uitzending, en Sanne pakt nog een snelle boost in de laatste paar minuten.

Tijdens de uitzending werd er stevig over Jeroen getweet, en dit zette zich door na de aflevering waar zijn naam wederom het meeste werd genoemd. Wellicht heb ik een verdachte handeling gemist die Twitterend Nederland wel is opgevallen? De wilde theorie over zijn verstandskiezen heb ik in ieder geval zeker gemist. Naast Jeroen werden Diederik en Sanne ook veelvuldig besproken na het slot van de aflevering. Thomas en Imanuelle, daarentegen, kregen erg weinig aandacht in het algemeen. Alle tweets bij elkaar opgeteld komen ze maar net aan de 100 vermeldingen de helft waarvan na de aflevering.

Vergeleken met een van de populaire WIDM polls, doen de resultaten van onze Twitter analyse het redelijk goed. De vier meest verdachte kandidaten volgens de poll vallen mooi samen met de vermeldingen op Twitter (nadat bekend was dat Vincent afviel). Het grootste verschil is dat Sanne Wallis de Vriesde nummer een verdachte is in de pol maar bij onze resultaten op de derde plek uitkomt.

Laten we de eerdere staafdiagrammen nog eens bekijken maar nu in een grafiek waarin we de individuele kandidaten volgen over de loop van de tijd. Onderstaande grafiek geeft de opgetelde vermeldingen voor, tijdens en na de eerste aflevering weer. Vincent heeft een stippellijn gekregen omdat hij is afgevallen deze aflevering. Blijkbaar was het publiek ook meteen een deel van haar interesse in hem kwijt want zijn vermeldingen kelderen sterk meteen na afloop van de aflevering. Jeroen is de sterkste stijger, met Didierik als achtervolger. Roos en Sanne worden ook steeds regelmatiger genoemd, maar toch een stuk minder dan hun voorgangers. De rest van de groep lijkt nauwelijks te worden opgemerkt door de twitteraars.

Mocht deze blog een vervolg krijgen, dan denk ik dat de focus ligt op het volgen van kandidaat populariteit over een langere periode. Een beginnetje hier van kun je hieronder vinden in de laatste twee grafieken. Mocht ik de tijd vrij kunnen maken, dan hoop ik een dagelijkse tracker te kunnen maken, wellicht in Shiny zodat lezers zelf interactief met de achterliggende data kunnen spelen. Anderzijds zou het interessant zijn om diepgaandere text en sentiment analyses uit te voeren, bijvoorbeeld door te kijken naar welke termen worden gebruikt om kandidaten te omschrijven, welke gevoelens en emoties verstopt gaan in de tweets, of wat mensen ertoe zet een bericht te retweeten. Daarnaast zou het inzichtelijk kunnen zijn om netwerkanalyses uit te voeren, bijvoorbeeld om te achterhalen of er subgroepen bestaan onder de twitterende ‘Molloten‘.

Ik hoop dat jij net zo genoten hebt van deze blog als ik! Schroom niet om de inhoud te delen of anderwijs te gebruiken. Laat ook vooral een reactie achter onder dit bericht of stuur een persoonlijk berichtje. Wellicht kun jij als lezer bedenken wat voor informatie we nog meer uit de data kunenn trekken, of hoe we de gegevens wellicht inzichtelijker kunnen weer-/vormgeven. Ik ben in ieder geval benieuwd naar jullie reacties!

Link naar de Engels versie van deze blog (inclusief code)

Link naar deel 2 (NL)

Raet HR Benchmark

[DUTCH:] In hun benchmark rapport zet Raet haar onderzoek naar HR analytics, uitgevoerd onder HR Business Partners, uiteen. Het rapport omvat onder andere de belangrijkste ontwikkelingen op het gebied van HR analytics, de toegeschreven rol van analytics op het gebied van HR verandermanagement, en een aantal praktijkcases bij o.a. Ahold.

De overzichtspagina is al visueel aantrekkelijk, maar aan de onderzijde kunt u zich opgeven om het rapport toegezonden te krijgen.

The below reiterates and summarizes this Stat article.

Share this:

TL;DR // Samenvatting

Share this:

TL;DR // Summary

Introduction

Analysis & Results

Time of creation

Hashtags

Contestant frequencies

Appendix: R setup

Retrieving the contestants

Retrieving the tweets

Contestant mentions

Share this:

TL;DR // Samenvatting

Introductie

Aflevering 1: “… ZO GEDAAN”

De tweets downloaden

Analyse en resultaten

Tijd van posten

Hashtags

Kandidaat vermeldingen

Share this:

Share this: