Tag: anniversary

Reviewing year 4 of paulvanderlaken.com

Reviewing year 4 of paulvanderlaken.com

Despite the pandemic, 2020 has been a great year for me.

Professionally, I grew into my role as data science product owner. And next to this, I got more and more freelance side gigs. Mostly teaching, but also some consultancy projects. Unfortunately, all my start-up ideas failed miserably again this year, yet I’ll keep trying : )

Personally, 2020 was also generous to us. We have a family expansion coming in 2021! (Un)Fortunately, the whole quarantaine situation provided a lot of time to make our house baby-ready!

A year in numbers

2020 was also a great year for our blog.

Here are some statistics. We reached 300 followers, on the last day of the year! Who could have imagined that?!

per post0,420,8297%
per post1,211,19-1%

This tremendous growth of the website is despite me posting a lot less frequently this year.

After a friend’s advice, I started posting less, but more regularly.

Can you spot the pattern in my 2020 posting behavior?

Compare that to my erratic 2019 posting:

Now my readers have got something to look forward to every Tuesday!

Yet, is Tuesday really the best day for me to post my stuff?

You seem to prefer visiting my blog on Wednesdays.

Let me know what you think in the comments!

I am looking forward to what 2021 has in store for my blogging. I guess a baby will result in even less posts… But we’ll just focus on quality over quantity!

I hope I can keep up with the exponential growth:

Best new articles in 2020

There are many ways in which you could define the quality of an article.

For me, the most obvious would be to look at some view-based metric. Something like the number of views, or the number of unique visitors.

Yet, some articles have been online longer than others. So maybe we should focus on the average views per day. Still these you can expect to be increase as articles have been in existance longer.

In my opinion, how an article attract viewers over time tells an interesting story. For instance, how stable are the daily viewer numbers? Are they rising? This is often indicative that external websites link to my article. Which implies it holds valuable information to a specific readership. In turn, this suggests that the article is likely to continue attracting viewers in the future.

Here is an abstract visualization. Every line represents and article. Every line/article starts in the lower left corner. On the x-axis you see the number of days since posting. So lines slowly move right, the longer they have been on my website. On the y-axis you see the total viewers it attracted.

You can see three types of blog articles: (1) articles that attract 90% of their views within the first month, (2) articles that generate a steady flow of visitors, (3) articles that never attract (m)any readers.

Here’s a different way of visualizing those same articles: by their average daily visitors (x) and the standard deviation in daily visitors (y).

Basically, I hope to write articles that get many daily visitors (high x). Yet, I also hope that my articles have either have stable (or preferably increasing) visitor numbers. This would mean that they either score low on y, or that y increases over time.

By these measures, my best articles of 2020 are, in my opinion:

  1. Bayesian statistics using R, Python, & Stan
  2. Automatically create perfect .gitignore file
  3. Create a publication-ready correlation matrix
  4. Simulating and visualizing the Monthy Hall problem in R & Python
  5. How most statistical tests are linear

Best all time reads

For the first time, my blog roll & archives were the most visited page of my website this year! A whopping 13k views!!

With regard to the most visited pages of this year, not much has changed since 2019. We see some golden oldies and I once again conclude that my viewership remains mostly R-based:

  1. R resources
  2. New to R?
  3. R tips and tricks
  4. The house always wins
  5. Simple correlation analysis in R
  6. Visualization innovations
  7. Beating battleships with algorithms and AI
  8. Regular expressions in R
  9. Learn project-based programming
  10. Simpson’s paradox

Which articles haven’t you read?

Did you know you can search for keywords or tags using the main page?

Two years of paulvanderlaken.com

Yesterday was the second anniversary of my website. I also reflected on this moment last year, and I thought to continue the tradition in 2019.

Let me start with a great, big
to all my readers for continuing to visit my website!

You are the reason I continue to write down what I read. And maybe even the reason I continued reading and learning last year, despite all other distractions [my “real” job and my PhD : )].

Also a big thank you to all my followers on Twitter and LinkedIn, and those who have taken the time to comment or like my blogs. All of you make that I gain energy from writing this blog!

With that said, let’s start the review of the past year on my blog.

Most popular blog posts of 2018

Most importantly, let’s examine what you guys liked. Which blogs attracted the most visitors? What did you guys read?

Unfortunately, WordPress does not allow you to scrape their statistics pages. However, I was able to download monthly data manually, which I could then visualize to show you some trends.

The visual below shows the cumulative amount of visitors attracted by each blog I’ve written in 2018. Here follow links to the top 8 blogs in terms of visitor numbers this year:

  1. “What’s the difference between data science, machine learning, and artificial intelligence?”, visualized. received 4355 visits. Following a viral blog by David Robinson, I try to demystify the popular terminology.
  2. The House Always Wins: Simulating 5,000,000 Games of Baccarat a.k.a. Punto Banco received 3079 views. After a visit to Holland Casino, I thought it’d be fun to approximate the odds of gambling through statistical simulation.
  3. Bayesian data analysis for newcomers received 2253 views. It contains the link to an open access paper explaining the basics of Bayesian analysis.
  4. Identifying “Dirty” Twitter Bots with R and Python received 2247 views. It tells the story of two programmers who uncover networks of filthy social media accounts.
  5. rstudio::conf 2018 summary received 1514 views. It provides links to the most salient talks and presentations of the yearly R gathering.
  6. R tips & tricks is relatively new and has only yet received 1212 views. Seperate from the R resources guide, this new list contains all the quick tricks that help you program more effectively in R.
  7. Super Resolution: A Photo Enhancer AI received 891 views and elaborates on the development of new tools that can upgrade photo and video data quality.
  8. ggstatsplot: Creating graphics including statistical details is also relatively new but already attracted 810 visitors. It explains the novel visualization package in R that allows you to quickly create elaborate statistical plots.

Biggest failures of 2018

Where there’s success, there’s failure. Some of my posts did not get a lot of attention by my readership. That’s unfortunate, as I really only take the time to blog about the stuff that I deem interesting enough. Were these failed blog posts just unlucky, or am I biased and were they simply really bad and uninteresting?

You be the judge! Here are some of the least read posts of 2018:

General statistics

Now, let’s move to some general statistics: in 2018, paulvanderlaken.com received 85.614 views, by 57.594 unique visitors. I posted 61 new blogs, consisting of a total of 31.598 words. Fifty-one visitors liked one of my posts, and 24 visitors took the time to post a comment of their own (my replies included, probably).

Compared to last year, my website did pretty well!

Unique visitors2694957594114%
Words / post625518-17%

However, the above statistics do not properly reflect the development of my website. For instance, I only really started generating traffic after my first viral post (i.e., Harry Plotter). The below graph takes that into account and better reflects the development of the traffic to my website.

The upward trend in traffic looks promising!

All time favorites

Looking back to the start of paulvanderlaken.com, let’s also examine which blogs have been performing well ever since their conception.

Clearly, most people have been coming for the R resources overview, as demonstrated by the visual below. Moreover, the majority of blog posts has not been visited much — only a handful ever cross the 1000 views mark.

The blogs that attracted a large public in 2017 (such as the original Harry Plotter and its sequal, and the Kaggle 2017 DS survey) have phased out a bit.

Fortunately, the introductory guide for newcomers to R is still kickstarting many programming careers! And on an additional positive note, more and more visitors seem to inspect the homepage and archives.

Redirected visitors

Finally , let’s have a closer look as to what brought people to my website.The below visualizes the main domains that redirected visitors.

Search engines provided the majority of traffic in both 2017 and 2018 –
mainly Google; to a lesser extent, DuckDuckGo and Bing (who in his right mind uses Norton Safe Search?!). My Twitter visitors increased in 2018 as compared to 2017, as did my traffic from this specific Quora page.

And that concludes my two year anniversary of paulvanderlaken.com review. I hope you enjoyed it, and that you will return to my website for the many more years to come : )

I end with a big shout out to my most loyal readers!
104 people have subscribed to my website (as of 2019-01-22)
and receive an update wherener I post a new blog.

Thank you for your continued support!

Want to join this group of elite followers?
Press the Follow button 
in the right toolbar, or at the bottom of this blog post.

One year of paulvanderlaken.com

One year ago, I registered the domain paulvanderlaken.com with three reasons in mind: (1) I wanted an online environment to store and showcase my pet projects, (2) to share and promote some of the great blogs and research others had been writing, and (3) to show others what I was doing on my path to “data science“. The year has been just amazing. I could not have imagined the amount of positive sentiment I received from friends, family, acquaintances, and old classmates. But, most of all, the nice reactions from complete strangers across the globe! Thank you all so much for the positive response.

To my surprise, some of my stuff actually got read!

Some random stats:

In one year, I wrote 103 blogs which got over 42,000 views by nearly 30,000 visitors. 97.5% of these views occurred in the last six months. Most referrals came via Google (45%), reddit (18%), LinkedIn (8%), Facebook (8%), and Twitter (4%), and my blogs were shared a total of 241 times. Now, 51 people follow my blog, which is best viewed on Tuesdays (31%) and around 15:00h CET (6%).

My views between January 2017 and 2018, made with ggplot2 in R.

Although my personal learning is still the main reason I maintain this blog, I am very glad people seem to enjoy tagging along. Hopefully, I can continue to discover and write about data (analysis) during the coming 12 months. For now, I’d want to thank my readers for their continued interest and, in particular, my girlfriend for coping with the numerous evenings and weekend I have wasted on my pet projects. Nonetheless, it was definitely worth the effort!

Hope to see you again soon,


Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R

Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R

It has been twenty years since the first Harry Potter novel, the sorcerer’s/philosopher’s stone, was published. To honour the series, I started a text analysis and visualization project, which my other-half wittily dubbed Harry Plotter. In several blogs, I intend to demonstrate how Hadley Wickham’s tidyverse and packages that build on its principles, such as tidytext (free book), have taken programming in R to an all-new level. Moreover, I just enjoy making pretty graphs : )

In this first blog (easier read), we will look at the sentiment throughout the books. In a second blog, we have examined the stereotypes behind the Hogwarts houses.


First, we need to set up our environment in RStudio. We will be needing several packages for our analyses. Most importantly, Bradley Boehmke was nice enough to gather all Harry Potter books in his harrypotter package on GitHub. We need devtools to install that package the first time, but from then on can load it in normally. Next, we load the tidytext package, which automates and tidies a lot of the text mining functionalities. We also need plyr for a specific function (ldply()). Other tidyverse packages we can load in a single bundle, including ggplot2dplyr, and tidyr, which I use in almost every of my projects. Finally, we load the wordcloud visualization package which draws on tm.

After loading these packages, I set some additional default options.

# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")

options(stringsAsFactors = F, # do not convert upon loading
        scipen = 999, # do not convert numbers to e-values
        max.print = 200) # stop printing after 200 values

theme_set(theme_light()) # set default ggplot theme to light
fs = 12 # default plot font size

Data preparation

With RStudio set, its time to the text of each book from the harrypotter package which we then “pipe” (%>% – another magical function from the tidyverse – specifically magrittr) along to bind all objects into a single dataframe. Here, each row represents a book with the text for each chapter stored in a separate columns. We want tidy data, so we use tidyr’s gather() function to turn each column into grouped rows. With tidytext’s unnest_tokens() function we can separate the tokens (in this case, single words) from these chapters.

hp_words <- list(
 philosophers_stone = philosophers_stone,
 chamber_of_secrets = chamber_of_secrets,
 prisoner_of_azkaban = prisoner_of_azkaban,
 goblet_of_fire = goblet_of_fire,
 order_of_the_phoenix = order_of_the_phoenix,
 half_blood_prince = half_blood_prince,
 deathly_hallows = deathly_hallows
) %>%
 ldply(rbind) %>% # bind all chapter text to dataframe columns
 mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
 select(-.id) %>% # remove ID column
 gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
 filter(!is.na(text)) %>% # delete the rows/chapters without text
 mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
 unnest_tokens(word, text, token = 'words') # tokenize data frame

Let’s inspect our current data format with head(), which prints the first rows (default n = 6).

hp_words %>% head()
##                   book chapter  word
## 1   philosophers_stone       1   the
## 1.1 philosophers_stone       1   boy
## 1.2 philosophers_stone       1   who
## 1.3 philosophers_stone       1 lived
## 1.4 philosophers_stone       1    mr
## 1.5 philosophers_stone       1   and

Word frequency

A next step would be to examine word frequencies.

hp_words %>%
  group_by(book, word) %>%
  anti_join(stop_words, by = "word") %>% # delete stopwords
  count() %>% # summarize count per word per book
  arrange(desc(n)) %>% # highest freq on top
  group_by(book) %>% # 
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # retain top 15 frequent words
  # create barplot
  ggplot(aes(x = -top, fill = book)) + 
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # get rid of legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1, size = fs/1.5), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add labels
       title = "Harry Plotter: Most frequent words throughout the saga") +
  facet_grid(. ~ book) + # separate plot for each book
  coord_flip() # flip axes


Unsuprisingly, Harry is the most common word in every single book and Ron and Hermione are also present. Dumbledore’s role as an (irresponsible) mentor becomes greater as the storyline progresses. The plot also nicely depicts other key characters:

  • Lockhart and Dobby in book 2,
  • Lupin in book 3,
  • Moody and Crouch in book 4,
  • Umbridge in book 5,
  • Ginny in book 6,
  • and the final confrontation with He who must not be named in book 7.

Finally, why does J.K. seem obsessively writing about eyes that look at doors?

Estimating sentiment

Next, we turn to the sentiment of the text. tidytext includes three famous sentiment dictionaries:

  • AFINN: including bipolar sentiment scores ranging from -5 to 5
  • bing: including bipolar sentiment scores
  • nrc: including sentiment scores for many different emotions (e.g., anger, joy, and surprise)

The following script identifies all words that occur both in the books and the dictionaries and combines them into a long dataframe:

hp_senti <- bind_rows(
  # 1 AFINN 
  hp_words %>% 
    inner_join(get_sentiments("afinn"), by = "word") %>%
    filter(score != 0) %>% # delete neutral words
    mutate(sentiment = ifelse(score < 0, 'negative', 'positive')) %>% # identify sentiment
    mutate(score = sqrt(score ^ 2)) %>% # all scores to positive
    group_by(book, chapter, sentiment) %>% 
    mutate(dictionary = 'afinn'), # create dictionary identifier
  # 2 BING 
  hp_words %>% 
    inner_join(get_sentiments("bing"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'bing'), # create dictionary identifier
  # 3 NRC 
  hp_words %>% 
    inner_join(get_sentiments("nrc"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'nrc') # create dictionary identifier

hp_senti %>% head()
## # A tibble: 6 x 6
## # Groups:   book, chapter, sentiment [2]
##                 book chapter      word score sentiment dictionary
## 1 philosophers_stone       1     proud     2  positive      afinn
## 2 philosophers_stone       1 perfectly     3  positive      afinn
## 3 philosophers_stone       1     thank     2  positive      afinn
## 4 philosophers_stone       1   strange     1  negative      afinn
## 5 philosophers_stone       1  nonsense     2  negative      afinn
## 6 philosophers_stone       1       big     1  positive      afinn


Although wordclouds are not my favorite visualizations, they do allow for a quick display of frequencies among a large body of words.

hp_senti %>%
  group_by(word) %>%
  count() %>% # summarize count per word
  mutate(log_n = sqrt(n)) %>% # take root to decrease outlier impact
  with(wordcloud(word, log_n, max.words = 100))

download (1)

It appears we need to correct for some words that occur in the sentiment dictionaries but have a different meaning in J.K. Rowling’s books. Most importantly, we need to filter two character names.

hp_senti_sel <- hp_senti %>% filter(!word %in% c("harry","moody"))

Words per sentiment

Let’s quickly sketch the remaining words per sentiment.

hp_senti_sel %>% # NAMES EXCLUDED
  group_by(word, sentiment) %>%
  count() %>% # summarize count per word per sentiment
  group_by(sentiment) %>%
  arrange(sentiment, desc(n)) %>% # most frequent on top
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # keep top 15 frequent words
  ggplot(aes(x = -top, fill = factor(sentiment))) + 
  # create barplot
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add manual labels
       title = "Harry Plotter: Words carrying sentiment as counted throughout the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
  facet_grid(. ~ sentiment) + # separate plot for each sentiment
  coord_flip() # flip axes

download (2).png

This seems ok. Let’s continue to plot the sentiment over time.

Positive and negative sentiment throughout the series

As positive and negative sentiment is included in each of the three dictionaries we can to compare and contrast scores.

plot_sentiment <- hp_senti_sel %>% # NAMES EXCLUDED
  group_by(dictionary, sentiment, book, chapter) %>%
  summarize(score = sum(score), # summarize AFINN scores
            count = n(), # summarize bing and nrc counts
            # move bing and nrc counts to score 
            score = ifelse(is.na(score), count, score))  %>%
  filter(sentiment %in% c('positive','negative')) %>%   # only retain bipolar sentiment
  mutate(score = ifelse(sentiment == 'negative', -score, score)) %>% # reverse negative values
  # create area plot
  ggplot(aes(x = chapter, y = score)) +    
  geom_area(aes(fill = score > 0),stat = 'identity') +
  scale_fill_manual(values = c('red','green')) + # change colors
  # add black smoothed line without standard error
  geom_smooth(method = "loess", se = F, col = "black") + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Sentiment score", # add labels
       title = "Harry Plotter: Sentiment during the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
     # separate plot per book and dictionary and free up x-axes
  facet_grid(dictionary ~ book, scale = "free_x")

download (3).png

Let’s zoom in on the smoothed average.

plot_sentiment + coord_cartesian(ylim = c(-100,50)) # zoom in plot

download (4).png

Sentiment seems overly negative throughout the series. Particularly salient is that every book ends on a down note, except the Prisoner of Azkaban. Moreover, sentiment becomes more volatile in books four through six. These start out negative, brighten up in the middle, just to end in misery again. In her final book, J.K. Rowling depicts a world about to be conquered by the Dark Lord and the average negative sentiment clearly resembles this grim outlook.

The bing sentiment dictionary estimates the most negative sentiment on average, but that might be due to this specific text.

Other emotions throughout the series

Finally, let’s look at the other emotions that are included in the nrc dictionary.

hp_senti_sel %>% # NAMES EXCLUDED 
  filter(!sentiment %in% c('negative','positive')) %>% # only retain other sentiments (nrc)
  group_by(sentiment, book, chapter) %>%
  count() %>% # summarize count
  # create area plot
  ggplot(aes(x = chapter, y = n)) +
  geom_area(aes(fill = sentiment), stat = 'identity') + 
  # add black smoothing line without standard error
  geom_smooth(aes(fill = sentiment), method = "loess", se = F, col = 'black') + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Emotion score", # add labels
       title = "Harry Plotter: Emotions during the saga",
       subtitle = "Using tidytext and the nrc sentiment dictionary") +
  # separate plots per sentiment and book and free up x-axes
  facet_grid(sentiment ~ book, scale = "free_x") 

download (5).png

This plot is less insightful as either the eight emotions are represented by similar words or J.K. Rowling combines all in her writing simultaneously. Patterns across emotions are highly similar, evidenced especially by the patterns in the Chamber of Secrets. In a next post, I will examine sentiment in a more detailed fashion, testing the differences over time and between characters statistically. For now, I hope you enjoyed these visualizations. Feel free to come back or subscribe to read my subsequent analyses.

The second blog in the Harry Plotter series examines the stereotypes behind the Hogwarts houses.