Harry Plotter: Network analysis of spell usage

Apparently, I was not the only geek who decided to celebrate the 20th anniversary of the Harry Potter saga with statistical analysis. Students Moritz Haine and Markus Dienstknecht of the Data Science for Decision Making Master at Maastricht University started their own celebratory project as part of a course Information Retrieval and Text Mining. Students in…

Regular Expressions in R – Part 1: Introduction and base R functions

The following is the first part of my introduction to regular expression (regex), in general, and the use of regex in R, in specific. It is loosely inspired on the swirl() tutorial by Jon Calder. I created it in R Markdown and uploaded it to RPubs, for an easier read. Regular expression A regular expression, regex or regexp…

Text Mining: Pythonic Heavy Metal

This blog summarized work that has been posted here, here, and here. Iain of degeneratestate.org wrote a three-piece series where he applied text mining to the lyrics of 222,623 songs from 7,364 heavy metal bands spread over 22,314 albums that he scraped from darklyrics.com. He applied a broad range of different analyses in Python, the code of which…

Analysis of Media Coverage on Refugees

Hannah Yan Han is doing #100dayprojects on data science and visual storytelling and I can only recommend that you take a look yourself. Below you find her R text analysis (#41) of UNHCR speeches and TV coverage on refugees. Unsurprisingly, nouns like asylum, repatriation, displacement, persecution, plight, and crisis appear significantly more often in UNHCR speeches on refugees than…

Visualizing #IRMA Tweets

Reddit user LucasCu90 used the R package twitteR to retrieve all tweets that were sent with #Irma and a Geocode of central Miami (25 mile radius) from Saturday September 9, to Sunday September 10, 2017 (the period of Irma’s approach and initial landfall on the Florida Keys and the mainland). From the 29,000 tweets he collected, Lucas then…

Quantifying Gastronomy

A statistical analysis of 4000 recipes and their ingredients: Quantifying Gastronomy  

Summarizing our Daily News: Clustering 100.000+ Articles in Python

Andrew Thompson was interested in what 10 topics a computer would identify in our daily news. He gathered over 140.000 new articles from the archives of 10 different sources, as you can see in the figure below. In Python, Andrew converted the text of all these articles into a manageable form (tf-idf document term matrix…

Harry Plotter: Part 2 – Hogwarts Houses and their Stereotypes

Two weeks ago, I started the Harry Plotter project to celebrate the 20th anniversary of the first Harry Potter book. I could not have imagined that the first blog would be so well received. It reached over 4000 views in a matter of days thanks to the lovely people in the data science and #rstats community that were kind enough to share it…

Scraping RStudio blogs to establish how “pleased” Hadley Wickham is.

This is reposted from DavisVaughan.com with minor modifications. Introduction A while back, I saw a conversation on twitter about how Hadley uses the word “pleased” very often when introducing a new blog post (I couldn’t seem to find this tweet anymore. Can anyone help?). Out of curiosity, and to flex my R web scraping muscles a bit,…