Category: report

Reviewing year 4 of paulvanderlaken.com

Despite the pandemic, 2020 has been a great year for me.

Professionally, I grew into my role as data science product owner. Next to this, I got more and more freelance side gigs: mostly teaching, but also some consultancy projects. Unfortunately, all my start-up ideas failed miserably again this year, yet I’ll keep trying : )

Personally, 2020 was also generous to us. We have a family expansion coming in 2021! (Un)Fortunately, the whole quarantine situation provided a lot of time to make our house baby-ready!

A year in numbers

2020 was also a great year for our blog.

Here are some statistics. We reached 300 followers on the last day of the year! Who could have imagined that?!

Statistic         2019      2020      delta
Views             107,828   150,599   40%
Visitors          70,870    100,539   42%
Followers         159       300       89%
Posts             96        72        -25%
Comments          40        59        48%
Comments / post   0.42      0.82      97%
Likes             116       86        -26%
Likes / post      1.21      1.19      -1%

This tremendous growth of the website is despite me posting a lot less frequently this year.

On a friend’s advice, I started posting less, but more regularly.

Can you spot the pattern in my 2020 posting behavior?

Compare that to my erratic 2019 posting:

Now my readers have got something to look forward to every Tuesday!

Yet, is Tuesday really the best day for me to post my stuff?

You seem to prefer visiting my blog on Wednesdays.

Let me know what you think in the comments!

I am looking forward to what 2021 has in store for my blogging. I guess a baby will result in even fewer posts… But we’ll just focus on quality over quantity!

I hope I can keep up with the exponential growth:

Best new articles in 2020

There are many ways in which you could define the quality of an article.

For me, the most obvious would be to look at some view-based metric. Something like the number of views, or the number of unique visitors.

Yet, some articles have been online longer than others. So maybe we should focus on the average views per day. Still, even these you can expect to increase as articles have been in existence longer.

In my opinion, how an article attracts viewers over time tells an interesting story. For instance, how stable are the daily viewer numbers? Are they rising? That is often indicative of external websites linking to my article, which implies it holds valuable information for a specific readership. In turn, this suggests that the article is likely to continue attracting viewers in the future.

Here is an abstract visualization. Every line represents an article. Every line/article starts in the lower left corner. On the x-axis you see the number of days since posting, so lines slowly move right the longer they have been on my website. On the y-axis you see the total number of viewers an article attracted.

You can see three types of blog articles: (1) articles that attract 90% of their views within the first month, (2) articles that generate a steady flow of visitors, (3) articles that never attract (m)any readers.

Here’s a different way of visualizing those same articles: by their average daily visitors (x) and the standard deviation in daily visitors (y).

Basically, I hope to write articles that get many daily visitors (high x). Yet I also hope that my articles have stable (or preferably increasing) visitor numbers. This would mean that they either score low on y, or that their y increases over time.
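
For the curious, here is a rough sketch of how such per-article statistics could be computed in R. This is not my actual analysis code, and the daily_views data frame (one row per article per day, with columns article and views) is hypothetical:

# Rough sketch; assumes a hypothetical data frame `daily_views`
# with columns: article (character), date (Date), views (integer)
library(dplyr)
library(ggplot2)

article_stats <- daily_views %>%
  group_by(article) %>%
  summarise(
    mean_daily_views = mean(views),  # x: average daily visitors
    sd_daily_views   = sd(views),    # y: (in)stability of those visitors
    .groups = "drop"
  )

# Articles with a high mean and low standard deviation are the stable performers
ggplot(article_stats, aes(x = mean_daily_views, y = sd_daily_views)) +
  geom_point() +
  labs(x = "Average daily views", y = "Standard deviation of daily views")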

By these measures, my best articles of 2020 are, in my opinion:

  1. Bayesian statistics using R, Python, & Stan
  2. Automatically create perfect .gitignore file
  3. Create a publication-ready correlation matrix
  4. Simulating and visualizing the Monty Hall problem in R & Python
  5. How most statistical tests are linear

Best all time reads

For the first time, my blog roll & archives page was the most visited page of my website this year! A whopping 13k views!!

With regard to the most visited pages of this year, not much has changed since 2019. We see some golden oldies and I once again conclude that my viewership remains mostly R-based:

  1. R resources
  2. New to R?
  3. R tips and tricks
  4. The house always wins
  5. Simple correlation analysis in R
  6. Visualization innovations
  7. Beating battleships with algorithms and AI
  8. Regular expressions in R
  9. Learn project-based programming
  10. Simpson’s paradox

Which articles haven’t you read?

Did you know you can search for keywords or tags using the main page?

Implementations of Trustworthy and Ethical AI (Report)

Want to consider artificial intelligence applications and implementations from an ethical standpoint? Here’s a high-level conceptual view you might like:

Kolja Verhage wrote the report “The Implementation of Trustworthy/Ethical AI in the US and Canada” in cooperation with the Netherlands Innovation Attaché Network. Based on numerous interviews with AI ethics experts, Kolja presents an overview of approaches and models for implementing ethical AI.

For over 30 years there has been academic research on ethics and technology. Over the past five years, however, we’ve seen an acceleration in the impact of algorithms on society. This has led both companies and governments across the world to think about how to govern these algorithms and control their impact on society. The first step of this has been for companies and governments to present abstract high-level principles of what they consider “Ethical AI”.

Kolja Verhage

You can access the report here: nlintheusa.com/ethical-ai/

Create a publication-ready correlation matrix, with significance levels, in R

In most (observational) research papers you read, you will probably run into a correlation matrix. Often it looks something like this:

[Image: an example correlation matrix from a factor analysis]

In Social Sciences, like Psychology, researchers like to denote the statistical significance levels of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

[Image: Table 4 from “Family moderators of relation between community ...”, a correlation matrix with significance asterisks]

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D.

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

Yes, there is the cor function, but it does not include significance levels.

Then there is the (in)famous Hmisc package, with its rcorr function. But this tool brings a whole new range of issues.

What’s this storage.mode, and what are we trying to coerce again?

Soon you figure out that Hmisc::rcorr only takes in matrices (thus with only numeric values). Hurray, now you can run a correlation analysis on your dataframe, you think…

Yet, the output is anything but publication-ready!

You wanted one correlation matrix, but now you have two… Double the trouble?
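
To illustrate, here is a minimal example on the built-in mtcars data (not taken from the original post):

# cor() returns a single matrix of coefficients, but no significance levels
cor(mtcars[, c("mpg", "hp", "wt")])

# Hmisc::rcorr() requires a numeric matrix as input...
library(Hmisc)
res <- rcorr(as.matrix(mtcars[, c("mpg", "hp", "wt")]))

# ...and returns its results as separate matrices: $r (coefficients) and $P (p-values)
res$r
res$P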

To spare future scholars the struggles of their early days of R programming, I would like to share my custom function correlation_matrix.

My correlation_matrix takes in a dataframe, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a fully formatted publication-ready correlation matrix!

You can specify many formatting options in correlation_matrix.

For instance, you can use only 2 decimals. You can focus on the lower triangle (as the lower and upper triangle values are identical). And you can drop the diagonal values:
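
For example, a call along these lines (using the arguments of the function as shared below):

# Two decimals, lower triangle only, diagonal replaced by empty strings
correlation_matrix(mtcars, digits = 2, use = "lower", replace_diagonal = TRUE)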

Or maybe you are interested in a different type of correlation coefficient, and not so much in significance levels:
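
For instance, something like:

# Spearman correlations, without significance asterisks
correlation_matrix(mtcars, type = "spearman", show_significance = FALSE)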

For other formatting options, do have a look at the source code below.

Now, to make matters even easier, I wrote a second function (save_correlation_matrix) to directly save any created correlation matrix:
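
For example, mirroring the examples in the function’s documentation below:

# Compute the formatted correlation matrix and write it to a csv file
save_correlation_matrix(mtcars, filename = "mtcars-correlation-matrix.csv",
                        digits = 2, use = "lower")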

Once you open your new correlation matrix file in Excel, it is immediately ready to be copy-pasted into Word!

If you are looking for ways to visualize your correlations, do have a look at the packages corrr and corrplot.
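
A minimal corrplot example (not part of my functions, just a pointer):

library(corrplot)

# Plot the correlation structure of a numeric data frame as colored circles
corrplot(cor(mtcars), method = "circle", type = "lower")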

I hope my functions are of help to you!

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

correlation_matrix

#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' `correlation_matrix(iris)`
#' `correlation_matrix(mtcars)`
correlation_matrix <- function(df, 
                               type = "pearson",
                               digits = 3, 
                               decimal.mark = ".",
                               use = "all", 
                               show_significance = TRUE, 
                               replace_diagonal = FALSE, 
                               replacement = ""){
  
  # check arguments; each condition is passed as a separate argument so that
  # stopifnot() evaluates all of them (a single {} block would only check the last)
  stopifnot(
    is.numeric(digits),
    digits >= 0,
    use %in% c("all", "upper", "lower"),
    is.logical(replace_diagonal),
    is.logical(show_significance),
    is.character(replacement)
  )
  # we need the Hmisc package for this
  require(Hmisc)
  
  # retain only numeric and boolean columns
  isNumericOrBoolean = vapply(df, function(x) is.numeric(x) | is.logical(x), logical(1))
  if (sum(!isNumericOrBoolean) > 0) {
    cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
  }
  df = df[isNumericOrBoolean]
  
  # transform input data frame to matrix
  x <- as.matrix(df)
  
  # run correlation analysis using Hmisc package
  correlation_matrix <- Hmisc::rcorr(x, type = type)
  R <- correlation_matrix$r # Matrix of correlation coefficients
  p <- correlation_matrix$P # Matrix of p-values
  
  # transform correlations to specific character format
  Rformatted = formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)
  
  # if there are any negative numbers, we want to put a space before the positives to align all
  if (sum(!is.na(R) & R < 0) > 0) {
    Rformatted = ifelse(!is.na(R) & R > 0, paste0(" ", Rformatted), Rformatted)
  }

  # add significance levels if desired
  if (show_significance) {
    # define notation for significance levels; spacing is important.
    stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
    Rformatted = paste0(Rformatted, stars)
  }
  
  # make all character strings equally long
  max_length = max(nchar(Rformatted))
  Rformatted = vapply(Rformatted, function(x) {
    current_length = nchar(x)
    difference = max_length - current_length
    return(paste0(x, paste(rep(" ", difference), collapse = '')))
  }, FUN.VALUE = character(1))
  
  # build a new matrix that includes the formatted correlations and their significance stars
  Rnew <- matrix(Rformatted, ncol = ncol(x))
  rownames(Rnew) <- colnames(Rnew) <- colnames(x)
  
  # replace undesired values
  if (use == 'upper') {
    Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (use == 'lower') {
    Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (replace_diagonal) {
    diag(Rnew) <- replacement
  }
  
  return(Rnew)
}

save_correlation_matrix

#' save_correlation_matrix
#' Creates and saves to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv2`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' `save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')`
#' `save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')`
save_correlation_matrix = function(df, filename, ...) {
  # write.csv2 uses a semicolon separator and comma decimal mark,
  # so the file opens neatly in (European-locale) Excel
  return(write.csv2(correlation_matrix(df, ...), file = filename))
}

Sign up to keep up to date on the latest R, Data Science & Tech content:

Top-19 articles of 2019

With only one day remaining in 2019, let’s review the year. 2019 was my third year of blogging and it went by even quicker than the previous two!

Personally, it has been a busy year for me: I started a new job, increased my speaking and teaching activities, bought and moved to my new house, and got married on top of that!

Fortunately, I also started working part-time. This way, I could still reserve some time for learning and sharing my learnings. And share I did:

I posted 95 blogs in 2019!
That means one new post every 4 days!

paulvanderlaken.com improved its online footprint as well. We received over 100k visitors in 2019! And many of you subscribed and stuck around. Our little community now includes 55 more members than it did last year! And that is not even counting the followers of my new twitter bot Artificial Stupidity!

Thank you for your continued interest!

Now, I am always curious as to what brings you to my website, so let’s have a look at some 2019 statistics (which I downloaded via my new Python scraper).

Most read articles

There is clearly a power-law distribution in how frequently my blogs are read.

Some blogs consistently attract dozens of visitors each day. Others attract only a handful of visitors over the course of a year.

These are the 19 articles which were most read in 2019. Hyperlinks are included below the bar chart. It’s a nice combination of R programming, machine learning, HR-related materials, and some entertainment (games & gambling) in between.

Which have and haven’t you read?

  1. R resources
  2. R tips and tricks
  3. New to R?
  4. Books for the modern, data-driven HR professional
  5. The house always wins
  6. Visualization innovations
  7. Simple correlation analysis in R
  8. Beating battleships with algorithms and AI
  9. Regular expressions in R
  10. Simpson’s paradox
  11. Visualizing the k-means clustering algorithm
  12. Survival of the best fit
  13. Datasets to practice and learn data science
  14. Identifying dirty twitter bots
  15. Game of Thrones map
  16. Screeps
  17. Northstar
  18. The difference between DS, ML, and AI visualized
  19. Light GBM vs. XGBoost

Rising stars

Half of these most read articles have actually been published in 2017 or ’18 already. However, of the 95 articles published in 2019, some also demonstrate promising visitor patterns:

The People Analytics books, Visual innovations, and AI Battleships are in the top 19, and several others made it too.

Some of these newer blogs haven’t had the time to mature and claim their place yet, though. Regardless, I have high hopes!

Particularly for Neural Synesthesia, which was easily one of my greatest WOW-moments for ML applications in 2019. It’s truly mesmerizing to see a GAN traverse its latent space.

Reading & posting patterns

I have been posting quite regularly throughout the year, apart from a holiday to Thailand at the start of January and the start of my new job in February.

While I write and post most of my blogs during the weekend, I guess I should consider postponing publishing, as you guys are mostly active on Tuesdays and Wednesdays!

Statistical summary of 2019

What better way to end 2019 than with a statistical summary?

I have posted more and shorter blogs, and you’ve rewarded me with more visits and more likes (also per post). However, we need more discussion!

Statistic         2018     2019      Δ
Views             85,614   107,388   +25%
Unique visitors   57,594   70,615    +23%
Posts             61       95        +56%
Words / post      518      371       -40%
Likes             51       111       +118%
Comments          24       16        -33%

As of 29/12/2019

2020 Outlook

It took some time to get started, but halfway through 2017 my blog started attracting an audience. People stayed on during 2018, and visitor numbers continued to increase through 2019.

With an ongoing expansion from R into Python, and an increased focus on sharing resources, applications, and novelties related to data visualization and machine learning, I have a lot more in store for 2020!

I hope you stick around for the ride!

Please like, subscribe, share, and comment, and we’ll make sure 2020 will be at least as interesting and full of (machine) learning as 2019 has been!

Zeit’s interactive visualization of the 2019 European election results

Zeit — the German newspaper — analyzed recent election results in over 80,000 regions of Europe. They discovered many patterns, from the radical left to the extremist right. Moreover, they allow you to find patterns yourself, for instance in your own region.

They published the summarized election results in this beautiful interactive map of Europe.

The map is beautifully color-coded for the dominant political view (Conservative, Green, Liberal, Socialist, Far left, or Far right) per region. Moreover, you can select any of these views and look for the regions where it received relatively many votes. Like in the map below, where I opted for the Liberal view, which finds its strongest support in regions of the Netherlands, France, Czechia, Romania, Denmark, Estonia, and Finland.

For instance, the region of Tilburg in the Netherlands — where I live — voted mostly Liberal, as depicted by the yellow coloring of the Netherlands. In contrast, in the German border regions conservative and socialist parties received the most votes, whereas in the Belgian border regions uncategorizable parties received the most votes.

Zeit discovered some cool patterns themselves as well, as discussed in the original article. These include:

  • Right-Wing Populists in Poland
  • North-South divides in Italy and Spain
  • Considerable support for regional parties in Catalonia, Belgium, Scotland and Italy
  • Dominant Green and Liberal views in the Netherlands, France, and Germany

Have a look yourself; it’s a great example of open-access, data-driven journalism!

Pimp my RMD: Tips for R Markdown – by Yan Holtz

R Markdown creates dynamic documents from R code, including reports, dashboards, presentations, and even books. Have a look at this awesome gallery of R Markdown examples.

Yan Holtz recently created a neat little overview of handy R Markdown tips and tricks that improve the appearance of output documents. He dubbed this overview Pimp my RMD. Have a look, it’s worth it!

Via https://rmarkdown.rstudio.com/authoring_quick_tour.html