Category: best practices

Tidy Machine Learning with R’s purrr and tidyr

Jared Wilber posted this great walkthrough where he codes a simple R data pipeline using purrr and tidyr to train a large variety of models and methods on the same base data, all in a non-repetitive, reproducible, clean, and thus tidy fashion. Really impressive workflow!

Learn Git Branching: An Interactive Tutorial

Peter Cottle built this great interactive Git tutorial that teaches you all vital branching skills right in your browser. It’s interactive, beautiful, and very informative, introducing every concept and Git command in a step-by-step fashion.

Have a look yourself: https://learngitbranching.js.org/

Here’s the associated GitHub repository for those interested in forking.

The tutorial includes many levels that progressively teach you the Git commands you’ll need to apply version control on a daily basis:

There’s also a sandbox mode where you can interactively explore and build your own Git tree.

LearnGitBranching is a git repository visualizer, sandbox, and a series of educational tutorials and challenges. Its primary purpose is to help developers understand git through the power of visualization (something that’s absent when working on the command line). This is achieved through a game with different levels to get acquainted with the different git commands.
You can input a variety of commands into LearnGitBranching (LGB) — as commands are processed, the nearby commit tree will dynamically update to reflect the effects of each command.
This visualization combined with tutorials and “levels” can help both beginners and intermediate developers polish their version control skills. A quick demo is available here: https://pcottle.github.com/learnGitBranching/?demo
Or, you can launch the application normally here: https://pcottle.github.com/learnGitBranching/
https://github.com/pcottle/learnGitBranching

Artificial Stupidity – by Vincent Warmerdam @PyData 2019 London

PyData is famous for it’s great talks on machine learning topics. This 2019 London edition, Vincent Warmerdam again managed to give a super inspiring presentation. This year he covers what he dubs Artificial Stupidity™. You should definitely watch the talk, which includes some great visual aids, but here are my main takeaways:

Vincent speaks of Artificial Stupidity, of machine learning gone HorriblyWrong™ — an example of which below — for which Vincent elaborates on three potential fixes:

Image result for paypal but still learning got scammed — Example of a model that goes HorriblyWrong™, according to Vincent’s talk.

1. Predict Less, but Carefully

Vincent argues you shouldn’t extrapolate your predictions outside of your observed sampling space. Even better: “Not predicting given uncertainty is a great idea.” As an alternative, we could for instance design a fallback mechanism, by including an outlier detection model as the first step of your machine learning model pipeline and only predict for non-outliers.

I definately recommend you watch this specific section of Vincent’s talk because he gives some very visual and intuitive explanations of how extrapolation may go HorriblyWrong™.

Be careful! One thing we should maybe start talking about to our bosses: Algorithms merely automate, approximate, and interpolate. It’s the extrapolation that is actually kind of dangerous.
Vincent Warmerdam @ Pydata 2019 London

Basically, we can choose to not make automated decisions sometimes.

2. Constrain thy Features

What we feed to our models really matters. […] You should probably do something to the data going into your model if you want your model to have any sort of fairness garantuees.
Vincent Warmerdam @ Pydata 2019 London

Often, simply removing biased features from your data does not reduce bias to the extent we may have hoped. Fortunately, Vincent demonstrates how to remove biased information from your variables by applying some cool math tricks.

Unfortunately, doing so will often result in a lesser predictive accuracy. Unsurprisingly though, as you are not closely fitting the biased data any more. What makes matters more problematic, Vincent rightfully mentions, is that corporate incentives often not really align here. It might feel that you need to pick: it’s either more accuracy or it’s more fairness.

However, there’s a nice solution that builds on point 1. We can now take the highly accurate model and the highly fair model, make predictions with both, and when these predictions differ, that’s a very good proxy where you potentially don’t want to make a prediction. Hence, there may be observations/samples where we are comfortable in making a fair prediction, whereas in most other situations we may say “right, this prediction seems unfair, we need a fallback mechanism, a human being should look at this and we should not automate this decision”.

Vincent does not that this is only one trick to constrain your model for fairness, and that fairness may often only be fair in the eyes of the beholder. Moreover, in order to correct for these biases and unfairness, you need to know about these unfair biases. Although outside of the scope of this specific topic, Vincent proposes this introduces new ethical issues:

Basically, we can choose to put our models on a controlled diet.

3. Constrain thy Model

Vincent argues that we should include constraints (based on domain knowledge, or common sense) into our models. In his presentation, he names a few. For instance, monotonicity, which implies that the relationship between X and Y should always be either entirely non-increasing, or entirely non-decreasing. Incorporating the previously discussed fairness principles would be a second example, and there are many more.

If we every come up with a model where more smoking leads to better health, that’s bad. I have enough domain knowledge to say that that should never happen. So maybe I should just make a system where I can say “look this one column with relationship to Y should always be strictly negative”.
Vincent Warmerdam @ Pydata 2019 London

Basically, we can integrate domain knowledge or preferences into our models.

Conclusion: Watch the talk!

5 Quick Tips for Coding in the Classroom, by Kelly Bodwin

Kelly Bodwin is an Assistant Professor of Statistics at Cal Poly (San Luis Obispo) and teaches multiple courses in statistical programming. Based on her experiences, she compiled this great shortlist of five great tips to teach programming.

Kelly truly mentions some best practices, so have a look at the original article, which she summarized as follows:

1. Define your terms

Establish basic coding vocabulary early on.

What is the console, a script, the environment?
What is a function a variable, a dataframe?
What are strings, characters, and integers?

2. Be deliberate about teaching versus bypassing peripheral skills

Use tools like RStudio Cloud, R Markdown, and the usethis package to shelter students from setup.

Personally, this is what kept me from learning Python for a long time — the issues with starting up.

Kelly provides this personal checklist of peripherals skills including which ones she includes in her introductory courses:

Course Type	Install/Update R and RStudio	R Markdown fluency	Package management	Data management	File and folder organization	GitHub
Intro Stat for Non-Majors	⚠️	⚠️	❌	❌	❌	❌
Intro Stat for Majors	✅	✅	⚠️	⚠️	⚠️	⚠️
Advanced Statistics	✅	✅	✅	✅	⚠️	⚠️
Intro to Statistical Computation	✅	✅	✅	✅	✅	✅

✅ = required course skill
⚠️ = optional, proceed with caution
❌ = avoid entirely
via https://teachdatascience.com/teaching_programming_tips/

3. Read code like English

The best way to debug is to read your process out loud as a sentence.

Basically Kelly argues that you should learn students to be able to translate their requirements into (R) code.

When you continuously read out your code as step-by-step computer instructions, students will learn to translate their own desires to computer instructions.

4. Require good coding practices from Day One

Kelly refers to this great talk by Jenny Bryan on “good” code and how to recognize it.

Kelly’s personal best practice included:

Clear code formatting
Object names follow consistent conventions
Lack of unnecessary code repetition
Reproducibility
Unit tests before large calculations
Commenting and/or documentation

For more R style guides, see my R resources overview.

5. Leave room for creativity

Open-ended questions (like “here’s a dataset, do a cool analysis“) let students explore and shine.

Large parts of the above were copied from this original article by Kelly Boldwin. I highly recommend you have a look at the original, and at the website hosting it: teachdatascience.com

Cover picture by freecodecamp.org.

Survival of the Best Fit: A webgame on AI in recruitment

Survival of the Best Fit is a webgame that simulates what happens when companies automate their recruitment and selection processes.

You – playing as the CEO of a starting tech company – are asked to select your favorite candidates from a line-up, based on their resumés.

As your simulated company grows, the time pressure increases, and you are forced to automate the selection process.

Fortunately, some smart techies working for your company propose training a computer to hire just like you just did.

They don’t need anything but the data you just generated and some good old supervised machine learning!

To avoid spoilers, try the game yourself and see what happens!

The game only takes a few minutes, and is best played on mobile.

www.survivalofthebestfit.com/ via Medium

Survival of the Best Fit was built by Gabor Csapo, Jihyun Kim, Miha Klasinc, and Alia ElKattan. They are software engineers, designers and technologists, advocating for better software that allows members of the public to question its impact on society.

You don’t need to be an engineer to question how technology is affecting our lives. The goal is not for everyone to be a data scientist or machine learning engineer, though the field can certainly use more diversity, but to have enough awareness to join the conversation and ask important questions.
With Survival of the Best Fit, we want to reach an audience that may not be the makers of the very technology that impact them everyday. We want to help them better understand how AI works and how it may affect them, so that they can better demand transparency and accountability in systems that make more and more decisions for us.
survivalofthebestfit.com

I found that the game provides a great intuitive explanation of how (humas) bias can slip into A.I. or machine learning applications in recruitment, selection, or other human resource management practices and processes.

If you want to read more about people analytics and machine learning in HR, I wrote my dissertation on the topic and have many great books I strongly recommend.

Finally, here’s a nice Medium post about the game.

https://www.survivalofthebestfit.com/game/

Note, as Joachin replied below, that the game apparently does not learn from user-input, but is programmed to always result in bias towards blues.
I kind of hoped that there was actually an algorithm “learning” in the backend, and while the developers could argue that the bias arises from the added external training data (you picked either Google, Apple, or Amazon to learn from), it feels like a bit of a disappointment that there is no real interactivity here.

Logistic regression is not fucked, by Jake Westfall

Recently, I came across a social science paper that had used linear probability regression. I had never heard of linear probability models (LPM), but it seems just an application of ordinary least squares regression but to a binomial dependent variable.

According to some, LPM is a commonly used alternative for logistic regression, which is what I was learned to use when the outcome is binary.

Potentially because of my own social science background (HRM), using linear regression without a link transformation on binary data just seems very unintuitive and error-prone to me. Hence, I sought for more information.

I particularly liked this article by Jake Westfall, which he dubbed “Logistic regression is not fucked”, following a series of blogs in which he talks about methods that are fucked and not useful.

Jake explains the classification problem and both methods inner workings in a very straightforward way, using great visual aids. He shows how LMP would differ from logistic models, and why its proposed benefits are actually not so beneficial. Maybe I’m in my bubble, but Jake’s arguments resonated.

Read his article yourself:
http://jakewestfall.org/blog/index.php/2018/03/12/logistic-regression-is-not-fucked/

Here’s the summary:
Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to.