Tidy Machine Learning with R’s purrr and tidyr

Jared Wilber posted this great walkthrough where he codes a simple R data pipeline using purrr and tidyr to train a large variety of models and methods on the same base data, all in a non-repetitive, reproducible, clean, and thus tidy fashion. Really impressive workflow!

Comparison between R dplyr and data.table code

Atrebas created this extremely helpful overview page showing how to program standard data manipulation and data transformation routines in R’s famous packages dplyr and data.table. The document has been been inspired by this stackoverflow question and by the data.table cheat sheet published by Karlijn Willems. Resources for data.table can be found on the data.table wiki, in the data.table vignettes,…

ROC, AUC, precision, and recall visually explained

A receiver operating characteristic (ROC) curve displays how well a model can classify binary outcomes. An ROC curve is generated by plotting the false positive rate of a model against its true positive rate, for each possible cutoff value. Often, the area under the curve (AUC) is calculated and used as a metric showing how well…

Python for R users

Wanting to broaden your scope and learn a new programming language? This great workshop was given at EARL 2018 by Mango Solutions and helps R programmers transition into Python building on their existing R knowledge. The workshop includes exercises that introduce you to the key concepts of Python and some of its most powerful packages…

Putting R in Production, by Heather Nolis & Mark Sellors

It is often said that R is hard to put into production. Fortunately, there are numerous talks demonstrating the contrary. Here’s one by Heather Nolis, who productionizes R models at T-Mobile. Her teams even shares open-source version of some of their productionized Tensorflow models on github. Read more about that model here. There’s another great…

Recreating graphics from the Fundamentals of Data Visualization

Claus Wilke wrote the Fundamentals of Data Visualization – a great resource that’s definitely high on my list of recommended data visualization books. In a recent post, Claus shared the link to a GitHub repository where he hosts some of the R programming code with which Claus made the graphics for his dataviz book. The…

Python vs. R for Data Science, by Norm Matloff

Cover image via Hacker Noon. Norm Matloff is a professor of Computer Science at University College Davis. He recently updated his viewpoint on whether R or Python is the best language for Data Science. While I normally hate those opinionated comparisons, Norm’s outline of the two languages’ (dis)advantages is actually quite balanced and well-versed. I…

Generalized Additive Models Tutorial in R, by Noam Ross

Generalized Additive Models — or GAMs in short — have been somewhat of a mystery to me. I’ve known about them, but didn’t know exactly what they did, or when they’re useful. That came to an end when I found out about this tutorial by Noam Ross. In this beautiful, online, interactive course, Noam allows…

GIF visualizations of Type 1 and Type 2 error in relation to sample size

On twitter, I came across the tweet below showing some great GIF visualizations on the dangers of taking small samples. Created by Andrew Stewart, and tweeted by John Holbein, the visuals show samples taken from a normal distributed variable with a mean of 10 and a standard deviation of 2. In the left section, Andrew…