Grant McDermott developed this new R package I wish I had thought of: parttree

parttree includes a set of simple functions for visualizing decision tree partitions in R with ggplot2. The package is not yet on CRAN, but can be installed from GitHub using:

Using the familiar ggplot2 syntax, we can simply add decision tree boundaries to a plot of our data.

In this example from his Github page, Grant trains a decision tree on the famous Titanic data using the parsnip package. And then visualizes the resulting partition / decision boundaries using the simple function geom_parttree()

library(parsnip)
library(titanic) ## Just for a different data set
set.seed(123) ## For consistent jitter
titanic_train$Survived = as.factor(titanic_train$Survived)
## Build our tree using parsnip (but with rpart as the model engine)
ti_tree =
decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification") %>%
fit(Survived ~ Pclass + Age, data = titanic_train)
## Plot the data and model partitions
titanic_train %>%
ggplot(aes(x=Pclass, y=Age)) +
geom_jitter(aes(col=Survived), alpha=0.7) +
geom_parttree(data = ti_tree, aes(fill=Survived), alpha = 0.1) +
theme_minimal()

Super awesome!

This visualization precisely shows where the trained decision tree thinks it should predict that the passengers of the Titanic would have survived (blue regions) or not (red), based on their age and passenger class (Pclass).

This will be super helpful if you need to explain to yourself, your team, or your stakeholders how you model works. Currently, only rpart decision trees are supported, but I am very much hoping that Grant continues building this functionality!

Ryan Holbrook made awesome animated GIFs in R of several classifiers learning a decision rule boundary between two classes. Basically, what you see is a machine learning model in action, learning how to distinguish data of two classes, say cats and dogs, using some X and Y variables.

These visuals can be great to understand these algorithms, the models, and their learning process a bit better.

Here’s the original tweet, with the logistic regression animation. If you follow it, you will find a whole thread of classifier GIFs. These I extracted, pasted, and explained below.

Below is the GIF which I extracted using EZgif.com.

What you see is observations from two classes, say cats and dogs, each represented using colored dots. The dots are placed along X and Y axes, which represent variables about the observations. Their tail lengths and their hairyness, for instance.

Now there’s an optimal way to seperate these classes, which is the dashed line. That line best seperates the cats from the dogs based on these two variables X and Y. As this is an optimal boundary given this data, it is stable, it does not change.

However, there’s also a solid black line, which does change. This line represents the learned boundary by the machine learning model, in this case using logistic regression. As the model is shown more data, it learns, and the boundary is updated. This learned boundary represents the best line with which the model has learned to seperate cats from dogs.

Anything above the boundary is predicted to be class 1, a dog. Everything below predicted to be class 2, a cat. As logistic regression results in a linear model, the seperation boundary is very much linear/straight.

These animations are great to get a sense of how the models come to their boundaries in the back-end.

For instance, other machine learning models are able to use non-linear boundaries to dinstinguish classes, such as this quadratic discriminant analysis (qda). This “learned” boundary is much closer to the optimal boundary:

Next, we have the k-nearest neighbors algorithm, which predicts for each point (animal) the class (cat/dog) based on the “k” points closest to it. As you see, this results in a highly fluctuating, localized boundary.

Now, Ryan decided to push the challenge, and simulate new data for two classes with a more difficult decision boundary. The new data and optimal boundaries look like this:

On these data, Ryan put a whole range of non-linear models to work.

Like this support-vector machine, which tries to create optimal boundaries built of support vectors around all the cats and all the dohs (this is definitely not a technical, error-free explanation of what’s happening here).

Let’s jump into some tree-based algorithms and the resulting models. A decision tree classifies data based on multiple, sequential, binary splits. Here, Ryan trained a simple decision tree:

As well as it’s big brother, a random forest, which uses hundreds of trees in the back end and thus results in a more flexible boundary:

Extreme gradient boosting is also a tree-based algorithm, which leverages many machine learning techniques to optimize the bias-variance tradeoff. Here’s an earlier blog on how to get started with Xgboost in Python or R:

XGBOOST stands for eXtreme Gradient Boosting. A big brother of the earlier AdaBoost, XGB is a supervised learning algorithm that uses an ensemble of adaptively boosted decision trees. For those unfamiliar with adaptive boosting algorithms, here’s a 2-minute explanation video and a written tutorial. Although XGBOOST often performs well in predictive tasks, the training process can be quite time-consuming (similar to other bagging/boosting algorithms (e.g., random forest)).

In a recent blog, Analytics Vidhya compares the inner workings as well as the predictive accuracy of the XGBOOST algorithm to an upcoming boosting algorithm: Light GBM. The blog demonstrates a stepwise implementation of both algorithms in Python. The table below reflects the main conclusion of the comparison: Although the algorithms are comparable in terms of their predictive performance, light GBM is much faster to train. With continuously increasing data volumes, light GBM, therefore, seems the way forward.

Laurae also benchmarked lightGBM against xgboost on a Bosch dataset and her results show that, on average, LightGBM (binning) is between 11x to 15x faster than xgboost (without binning):

However, the differences get smaller as more threads are used due to thread inefficiencies (idle-time increases because threads are not scheduled a next task fast enough).

Neil Schneider tested the three algorithms for gradient boosting in R (GBM, xgboost, and lightGBM) and sums up their (dis)advantages:

GBM has no specific advantages but its disadvantages include no early stopping, slower training and decreased accuracy,

xgboost has demonstrated successful on kaggle and though traditionally slower than lightGBM, tree_method = 'hist' (histogram binning) provides a significant improvement.

lightGBM has the advantages of training efficiency, low memory usage, high accuracy, parallel learning, corporate support, and scale-ability. However, its’ newness is its main disadvantage because there is little community support.