Category: statistics

ppsr live on CRAN!

ppsr live on CRAN!

Finding predictive patterns in your dataset with one line of code!

Today — March 2nd 2021 — my first R package was published on the comprehensive R archive network (CRAN).

ppsr is the R implementation of the Predictive Power Score (PPS).

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. You can read more about the concept in earlier blog posts (here and here), or here on Github, or via Medium.

With the ppsr package live on CRAN, it is now super easy to install the package and examine the predictive relationships in your dataset:

Here’s the official ppsr CRAN page for those interested:

ppsr: An R implementation of the Predictive Power Score

ppsr: An R implementation of the Predictive Power Score

Update March, 2021: My R package for the predictive power score (ppsr) is live on CRAN!
Try install.packages("ppsr") in your R terminal to get the latest version.

A few months ago, I wrote about the Predictive Power Score (PPS): a handy metric to quickly explore and quantify the relationships in a dataset.

As a social scientist, I was taught to use a correlation matrix to describe the relationships in a dataset. Yet, in my opinion, the PPS provides three handy advantages:

  1. PPS works for any type of data, also nominal/categorical variables
  2. PPS quantifies non-linear relationships between variables
  3. PPS acknowledges the asymmetry of those relationships

Florian Wetschoreck came up with the PPS idea, wrote the original blog, and programmed a Python implementation of it (called ppscore).

Yet, I work mostly in R and I was very keen on incorporating this powertool into my general data science workflow.

So, over the holiday period, I did something I have never done before: I wrote an R package!

It’s called ppsr and you can find the code here on github.

Installation

# You can get the official version from CRAN:
install.packages("ppsr")

## Or you can get the development version from GitHub:
# install.packages('devtools')
# devtools::install_github('https://github.com/paulvanderlaken/ppsr')

Usage

The ppsr package has three main functions that compute PPS:

  • score() – which computes an x-y PPS
  • score_predictors() – which computes X-y PPS
  • score_matrix() – which computes X-Y PPS

Visualizing PPS

Subsequently, there are two main functions that wrap around these computational functions to help you visualize your PPS using ggplot2:

  • visualize_predictors() – producing a barplot of all X-y PPS
  • visualize_matrix() – producing a heatmap of all X-Y PPS
PPS matrix for iris

Note that Species is a nominal/categorical variable, with three character/text options.

A correlation matrix would not be able to show us that the type of iris Species can be predicted extremely well by the petal length and width, and somewhat by the sepal length and width. Yet, particularly sepal width is not easily predicted by the type of species.

Correlation matrix for iris

Exploring mtcars

It takes about 10 seconds to run 121 decision trees with visualize_matrix(mtcars). Yet, the output is much more informative than the correlation matrix:

  • cyl can be much better predicted by mpg than the other way around
  • the classification of vs can be done well using nearly all variables as predictors, except for am
  • yet, it’s hard to predict anything based on the vs classification
  • a cars’ am can’t be predicted at all using these variables
PPS matrix for mtcars

The correlation matrix does provides insights that are not provided by the PPS matrix. Most importantly, the sign and strength of any linear relationship that may exist. For instance, we can deduce that mpg relates strongly negatively with cyl.

Yet, even though half of the matrix does not provide any additional information (due to the symmetry), I still find it hard to derive the most important relations and insights at a first glance.

Moreover, the rows and columns for vs and am are not very informative in this correlation matrix as it contains pearson correlations coefficients by default, whereas vs and am are binary variables. The same can be said for cyl, gear and carb, which contain ordinal categories / integer data, so you can discuss the value of these coefficients depicted here.

Correlation matrix for mtcars

Exploring trees

In R, there are many datasets built in via the datasets package. Let’s explore some using the ppsr::visualize_matrix() function.

datasets::trees has data on 31 trees’ girth, height and volume.

visualize_matrix(datasets::trees) shows that both girth and volume can be used to predict the other quite well, but not perfectly.

Let’s have a look at the correlation matrix.

The scores here seem quite higher in general. A near perfect correlation between volume and girth.

Is it near perfect though? Let’s have a look at the underlying data and fit a linear model to it.

You will still be pretty far off the real values when you use a linear model based on Girth to predict Volume. This is what the original PPS of 0.65 tried to convey.

Actually, I’ve run the math for this linaer model and the RMSE is still 4.11. Using just the mean Volume as a prediction of Volume will result in 16.17 RMSE. If we map these RMSE values on a linear scale from 0 to 1, we would get the PPS of our linear model, which is about 0.75.

So, actually, the linear model is a better predictor than the decision tree that is used as a default in the ppsr package. That was used to generate the PPS matrix above.

Yet, the linear model definitely does not provide a perfect prediction, even though the correlation may be near perfect.

Conclusion

In sum, I feel using the general idea behind PPS can be very useful for data exploration.

Particularly in more data science / machine learning type of projects. The PPS can provide a quick survey of which targets can be predicted using which features, potentially with more complex than just linear patterns.

Yet, the old-school correlation matrix also still provides unique and valuable insights that the PPS matrix does not. So I do not consider the PPS so much an alternative, as much as a complement in the toolkit of the data scientist & researcher.

Enjoy the R package, or the Python module for that matter, and let me know if you see any improvements!

Reviewing year 4 of paulvanderlaken.com

Reviewing year 4 of paulvanderlaken.com

Despite the pandemic, 2020 has been a great year for me.

Professionally, I grew into my role as data science product owner. And next to this, I got more and more freelance side gigs. Mostly teaching, but also some consultancy projects. Unfortunately, all my start-up ideas failed miserably again this year, yet I’ll keep trying : )

Personally, 2020 was also generous to us. We have a family expansion coming in 2021! (Un)Fortunately, the whole quarantaine situation provided a lot of time to make our house baby-ready!

A year in numbers

2020 was also a great year for our blog.

Here are some statistics. We reached 300 followers, on the last day of the year! Who could have imagined that?!

Statistic20192020delta
Views107.828150.59940%
Visitors70.870100.53942%
Followers15930089%
Posts9672-25%
Comments405948%
per post0,420,8297%
Likes11686-26%
per post1,211,19-1%

This tremendous growth of the website is despite me posting a lot less frequently this year.

After a friend’s advice, I started posting less, but more regularly.

Can you spot the pattern in my 2020 posting behavior?

Compare that to my erratic 2019 posting:

Now my readers have got something to look forward to every Tuesday!

Yet, is Tuesday really the best day for me to post my stuff?

You seem to prefer visiting my blog on Wednesdays.

Let me know what you think in the comments!

I am looking forward to what 2021 has in store for my blogging. I guess a baby will result in even less posts… But we’ll just focus on quality over quantity!

I hope I can keep up with the exponential growth:

Best new articles in 2020

There are many ways in which you could define the quality of an article.

For me, the most obvious would be to look at some view-based metric. Something like the number of views, or the number of unique visitors.

Yet, some articles have been online longer than others. So maybe we should focus on the average views per day. Still these you can expect to be increase as articles have been in existance longer.

In my opinion, how an article attract viewers over time tells an interesting story. For instance, how stable are the daily viewer numbers? Are they rising? This is often indicative that external websites link to my article. Which implies it holds valuable information to a specific readership. In turn, this suggests that the article is likely to continue attracting viewers in the future.

Here is an abstract visualization. Every line represents and article. Every line/article starts in the lower left corner. On the x-axis you see the number of days since posting. So lines slowly move right, the longer they have been on my website. On the y-axis you see the total viewers it attracted.

You can see three types of blog articles: (1) articles that attract 90% of their views within the first month, (2) articles that generate a steady flow of visitors, (3) articles that never attract (m)any readers.

Here’s a different way of visualizing those same articles: by their average daily visitors (x) and the standard deviation in daily visitors (y).

Basically, I hope to write articles that get many daily visitors (high x). Yet, I also hope that my articles have either have stable (or preferably increasing) visitor numbers. This would mean that they either score low on y, or that y increases over time.

By these measures, my best articles of 2020 are, in my opinion:

  1. Bayesian statistics using R, Python, & Stan
  2. Automatically create perfect .gitignore file
  3. Create a publication-ready correlation matrix
  4. Simulating and visualizing the Monthy Hall problem in R & Python
  5. How most statistical tests are linear

Best all time reads

For the first time, my blog roll & archives were the most visited page of my website this year! A whopping 13k views!!

With regard to the most visited pages of this year, not much has changed since 2019. We see some golden oldies and I once again conclude that my viewership remains mostly R-based:

  1. R resources
  2. New to R?
  3. R tips and tricks
  4. The house always wins
  5. Simple correlation analysis in R
  6. Visualization innovations
  7. Beating battleships with algorithms and AI
  8. Regular expressions in R
  9. Learn project-based programming
  10. Simpson’s paradox

Which articles haven’t you read?

Did you know you can search for keywords or tags using the main page?

Is R-squared Useless?

Is R-squared Useless?

Coming from a social sciences background, I learned to use R-squared as a way to assess model performance and goodness of fit for regression models.

Yet, in my current day job, I nearly never use the metric any more. I tend to focus on predictive power, with metrics such as MAE, MSE, or RMSE. These make much more sense to me when comparing models and their business value, and are easier to explain to stakeholders as an added bonus.

I recently wrote about the predictive power score as an alternative to correlation analysis.

Are there similar alternatives that render R-squared useless? And why?

Here’s an interesting blog explaining the standpoints of Cosma Shalizi of Carnegie Mellon University:

  • R-squared does not measure goodness of fit.
  • R-squared does not measure predictive error.
  • R-squared does not allow you to compare models using transformed responses.
  • R-squared does not measure how one variable explains another.

I have never found a situation where R-squared helped at all.

Professor Cosma Shalizi (according to Clay Ford)

Bayesian Statistics using R, Python, and Stan

Bayesian Statistics using R, Python, and Stan

For a year now, this course on Bayesian statistics has been on my to-do list. So without further ado, I decided to share it with you already.

Richard McElreath is an evolutionary ecologist who is famous in the stats community for his work on Bayesian statistics.

At the Max Planck Institute for Evolutionary Anthropology, Richard teaches Bayesian statistics, and he was kind enough to put his whole course on Statistical Rethinking: Bayesian statistics using R & Stan open access online.

You can find the video lectures here on Youtube, and the slides are linked to here:

Richard also wrote a book that accompanies this course:

For more information abou the book, click here.

For the Python version of the code examples, click here.

10 Guidelines to Better Table Design

10 Guidelines to Better Table Design

Jon Schwabisch recently proposed ten guidelines for better table design.

Next to the academic paper, Jon shared his recommendations in a Twitter thread.

Let me summarize them for you:

  • Right-align your numbers
  • Left-align your texts
  • Use decimals appropriately (one or two is often enough)
  • Display units (e.g., $, %) sparsely (e.g., only on first row)
  • Highlight outliers
  • Highlight column headers
  • Use subtle highlights and dividers
  • Use white space between rows and columns
  • Use white space (or dividers) to highlight groups
  • Use visualizations for large tables
Afbeelding
Highlights in a table. Via twitter.com/jschwabish/status/1290324966190338049/photo/2
Afbeelding
Visualizations in a table. Via twitter.com/jschwabish/status/1290325409570197509/photo/3
Afbeelding
Example of a well-organized table. Via twitter.com/jschwabish/status/1290325663543627784/photo/2