Chatterplots

I’ve mentioned before that I dislike wordclouds (for instance here, or here) and apparently others share that sentiment. In his recent Medium blog, Daniel McNichol goes as far as to refer to the wordcloud as the pie chart of text data! Among others, Daniel calls wordclouds disorienting, one-dimensional, arbitrary and opaque and he mentions their lack of order,…

Become a data-driven Sommelier by text mining wine reviews

Aleszu Bajak at Storybench.org published a great demonstration of the power of text mining. He used the R tidytext package to analyse 150,000 wine reviews which Zach Thoutt had scraped from Wine Enthusiast in November of 2017. Aleszu started his analysis on only the French wines, with a simple word count per region: Next, he applied TF-IDF to surface the…

Identifying “Dirty” Twitter Bots with R and Python

Past week, I came across two programming initiatives to uncover Twitter bots and one attempt to identify fake Instagram accounts. Mike Kearney developed the R package botornot which applies machine learning to estimate the probability that a Twitter user is a bot. His default model is a gradient boosted model trained using both users-level (bio, location, number of…

Harry Plotter: Network analysis of spell usage

Apparently, I was not the only geek who decided to celebrate the 20th anniversary of the Harry Potter saga with statistical analysis. Students Moritz Haine and Markus Dienstknecht of the Data Science for Decision Making Master at Maastricht University started their own celebratory project as part of a course Information Retrieval and Text Mining. Students in…

Harry Plotter: Shiny App of Spell Usage

In my second Harry Plotter blog (22-Aug-2017), I wrote: I would like to demonstrate how regular expressions can be used to retrieve (sub)strings that follow a specific format. We could use regex to examine, for instance, when, and by whom, which magical spells are cast. Well, Prusinowskik (real name unknown) beat me to it, and how! S/He formed a comprehensive…

Sentiment Analysis of Stranger Things Seasons 1 and 2

Jordan Dworkin, a Biostatistics PhD student at the University of Pennsylvania, is one of the few million fans of Stranger Things, a 80s-themed Netflix series combining drama, fantasy, mystery, and horror. Awaiting the third season, Jordan was curious as to the emotional voyage viewers went through during the series, and he decided to examine this…

Talent Works: Data Science to improve Job Application Chances

Searching and applying to jobs can be a costly activity, requiring many hours upon hours of perfecting your motivation letter and CV. Hence, it can be very frustrating to get ghosted (not receiving a reply) for a job. Luckily, Talent Works is able to give us some general tips when it comes to improving the…

Regular Expressions in R – Part 1: Introduction and base R functions

The following is the first part of my introduction to regular expression (regex), in general, and the use of regex in R, in specific. It is loosely inspired on the swirl() tutorial by Jon Calder. I created it in R Markdown and uploaded it to RPubs, for an easier read. Regular expression A regular expression, regex or regexp…

Text Mining: Pythonic Heavy Metal

This blog summarized work that has been posted here, here, and here. Iain of degeneratestate.org wrote a three-piece series where he applied text mining to the lyrics of 222,623 songs from 7,364 heavy metal bands spread over 22,314 albums that he scraped from darklyrics.com. He applied a broad range of different analyses in Python, the code of which…