Tag: r

David Robinson’s R Programming Screencasts

David Robinson’s R Programming Screencasts

David Robinson (aka drob) is one of the best known R programmers.

Since a couple of years David has been sharing his knowledge through streaming screencasts of him programming. It’s basically part of R’s #tidytuesday movement.

Alex Cookson decided to do us all a favor and annotate all these screencasts into a nice overview.


Here you can search for video material of David using a specific function or method. There are already over a thousand linked fragments!

Very useful if you want to learn how to visualize data using ggplot2 or plotly, how to work with factors in forcats, or how to tidy data using tidyr and dplyr.

For instance, you could search for specific R functions and packages you want to learn about:

Thanks David for sharing your knowledge, and thanks Alex for maintaining this overview!

Visualizing and interpreting Cohen’s d effect sizes

Visualizing and interpreting Cohen’s d effect sizes

Cohen’s d (wiki) is a statistic used to indicate the standardised difference between two means. Resarchers often use it to compare the averages between groups, for instance to determine that there are higher outcomes values in a experimental group than in a control group.

Researchers often use general guidelines to determine the size of an effect. Looking at Cohen’s d, psychologists often consider effects to be small when Cohen’s d is between 0.2 or 0.3, medium effects (whatever that may mean) are assumed for values around 0.5, and values of Cohen’s d larger than 0.8 would depict large effects (e.g., University of Bath).

The two groups’ distributions belonging to small, medium, and large effects visualized

Kristoffer Magnusson hosts this Cohen’s d effect size comparison tool on his website the R Psychologist, but recently updated the visualization and its interactivity. And the tool looks better than ever:

Moreover, Kristoffer adds some nice explanatons of the numbers and their interpretation in real life situations:

If you find the tool useful, please consider buying Kristoffer a coffee or buying one of his beautiful posters, like the one above, or below:

Frequentisme betekenis testen poster horizontaal image 0

By the way, Kristoffer hosts many other interesting visualization tools (most made with JavaScript’s D3 library) on statistics and statistical phenomena on his website, have a look!

How to Write a Git Commit Message, in 7 Steps

How to Write a Git Commit Message, in 7 Steps

Version control is an essential tool for any software developer. Hence, any respectable data scientist has to make sure his/her analysis programs and machine learning pipelines are reproducible and maintainable through version control.

Often, we use git for version control. If you don’t know what git is yet, I advise you begin here. If you work in R, start here and here. If you work in Python, start here.

This blog is intended for those already familiar working with git, but who want to learn how to write better, more informative git commit messages. Actually, this blog is just a summary fragment of this original blog by Chris Beams, which I thought deserved a wider audience.

Chris’ 7 rules of great Git commit messaging

  1. Separate subject from body with a blank line
  2. Limit the subject line to 50 characters
  3. Capitalize the subject line
  4. Do not end the subject line with a period
  5. Use the imperative mood in the subject line
  6. Wrap the body at 72 characters
  7. Use the body to explain what and why vs. how

For example:

Summarize changes in around 50 characters or less

More detailed explanatory text, if necessary. Wrap it to about 72
characters or so. In some contexts, the first line is treated as the
subject of the commit and the rest of the text as the body. The
blank line separating the summary from the body is critical (unless
you omit the body entirely); various tools like `log`, `shortlog`
and `rebase` can get confused if you run the two together.

Explain the problem that this commit is solving. Focus on why you
are making this change as opposed to how (the code explains that).
Are there side effects or other unintuitive consequences of this
change? Here's the place to explain them.

Further paragraphs come after blank lines.

 - Bullet points are okay, too

 - Typically a hyphen or asterisk is used for the bullet, preceded
   by a single space, with blank lines in between, but conventions
   vary here

If you use an issue tracker, put references to them at the bottom,
like this:

Resolves: #123
See also: #456, #789

If you’re having a hard time summarizing your commits in a single line or message, you might be committing too many changes at once. Instead, you should try to aim for what’s called atomic commits.

Cover image by XKCD#1296

Predictive Power Score: Finding predictive patterns in your dataset

Predictive Power Score: Finding predictive patterns in your dataset

Last week, I shared this Medium blog on PPS — or Predictive Power Score — on my LinkedIn and got so many enthousiastic responses, that I had to share it with here too.

Basically, the predictive power score is a normalized metric (values range from 0 to 1) that shows you to what extent you can use a variable X (say age) to predict a variable Y (say weight in kgs).

A PPS high score of, for instance, 0.85, would show that weight can be predicted pretty good using age.

A low PPS score, of say 0.10, would imply that weight is hard to predict using age.

The PPS acts a bit like a correlation coefficient we’re used too, but it is also different in many ways that are useful to data scientists:

  1. PPS also detects and summarizes non-linear relationships
  2. PPS is assymetric, so that it models Y ~ X, but not necessarily X ~ Y
  3. PPS can summarize predictive value of / among categorical variables and nominal data

However, you may argue that the PPS is harder to interpret than the common correlation coefficent:

  1. PPS can reflect quite complex and very different patterns
  2. Therefore, PPS are hard to compare: a 0.5 may reflect a linear relationship but also many other relationships
  3. PPS are highly dependent on the used algorithm: you can use any algorithm from OLS to CART to full-blown NN or XGBoost. Your algorithm hihgly depends the patterns you’ll detect and thus your scores
  4. PPS are highly dependent on the the evaluation metric (RMSE, MAE, etc).

Here’s an example picture from the original blog, showing a case in which PSS shows the relevant predictive value of Y ~ X, whereas a correlation coefficient would show no relationship whatsoever:


Here’s two more pictures from the original blog showing the differences with a standard correlation matrix on the Titanic data:

I highly suggest you read the original blog for more details and information, and that you check out the associated Python package ppscore:

Installing the package:

pip install ppscore

Calculating the PPS for a given pandas dataframe:

import ppscore as pps
pps.score(df, "feature_column", "target_column")

You can also calculate the whole PPS matrix:


There’s no R package yet, but it should not be hard to implement this general logic.

Florian Wetschoreck — the author — already noted that there may be several use cases where he’d think PPS may add value:

Find patterns in the data [red: data exploration]: The PPS finds every relationship that the correlation finds — and more. Thus, you can use the PPS matrix as an alternative to the correlation matrix to detect and understand linear or nonlinear patterns in your data. This is possible across data types using a single score that always ranges from 0 to 1.

Feature selection: In addition to your usual feature selection mechanism, you can use the predictive power score to find good predictors for your target column. Also, you can eliminate features that just add random noise. Those features sometimes still score high in feature importance metrics. In addition, you can eliminate features that can be predicted by other features because they don’t add new information. Besides, you can identify pairs of mutually predictive features in the PPS matrix — this includes strongly correlated features but will also detect non-linear relationships.

Detect information leakage: Use the PPS matrix to detect information leakage between variables — even if the information leakage is mediated via other variables.

Data Normalization: Find entity structures in the data via interpreting the PPS matrix as a directed graph. This might be surprising when the data contains latent structures that were previously unknown. For example: the TicketID in the Titanic dataset is often an indicator for a family.

Generative art: Let your computer design you a painting

Generative art: Let your computer design you a painting

I really like generative art, or so-called algorithmic art. Basically, it means you take a pattern or a complex system of rules, and apply it to create something new following those patterns/rules.

When I finished my PhD, I got a beautiful poster of where the k-nearest neighbors algorithms was used to generate a set of connected points.

Marcus Volz’ nearest neighbors graph, via https://marcusvolz.com/#nearest-neighbour-graph

My first piece of generative art.

As we recently moved into our new house, I decided I wanted to have a brother for the knn-poster. So I did some research in algorithms I wanted to use to generate a painting. I found some very cool ones, of which I unforunately can’t recollect the artists anymore:

However, I preferred to make one myself. So we again turned to the work of the author that made the knn-poster: Marcus Volz.

He has written (in R) many other algorithms. And we found that one specifically nicely matched the knn-poster. His metropolis – or generative city:

Marcus’ generative city, via https://marcusvolz.com/#generative-city

However, I wanted to make one myself, so I download Marcus code, and tweaked it a bit. Most importantly, I made it start in the center, made it fill up the whole space, and I made it run more efficient so I could generate a couple dozen large cities quickly, and pick the one I liked most. Here’s the end result:

And in action, in my living room:

Free Springer Books during COVID19

Free Springer Books during COVID19

Book publisher Springer just released over 400 book titles that can be downloaded free of charge following the corona-virus outbreak.

Here’s fhe full overview: https://link.springer.com/search?facet-content-type=%22Book%22&package=mat-covid19_textbooks&facet-language=%22En%22&sortOrder=newestFirst&showAll=true

Most of these books will normally set you back about $50 to $150, so this is a great deal!

There are many titles on computer science, programming, business, psychology, and here are some specific titles that might interest my readership:

Note that I only got to page 8 of 21, so there are many more free interesting titles out there!

Join 234 other followers