
Artificial Stupidity – by Vincent Warmerdam @PyData 2019 London


PyData is famous for its great talks on machine learning topics. At this 2019 London edition, Vincent Warmerdam again managed to give a super inspiring presentation. This year he covers what he dubs Artificial Stupidity™. You should definitely watch the talk, which includes some great visual aids, but here are my main takeaways:

Vincent speaks of Artificial Stupidity: machine learning gone HorriblyWrong™ (an example of which is shown below), for which he elaborates on three potential fixes:

Example of a model that goes HorriblyWrong™, according to Vincent’s talk.

1. Predict Less, but Carefully

Vincent argues you shouldn’t extrapolate your predictions outside of your observed sampling space. Even better: “Not predicting given uncertainty is a great idea.” As an alternative, we could for instance design a fallback mechanism by including an outlier detection model as the first step of the machine learning pipeline, and only predict for non-outliers.
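To make that concrete, here is a minimal sketch of such a fallback mechanism. This is my own illustration, not code from the talk; I use scikit-learn’s IsolationForest as the outlier detector, but any detector would do:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

# Toy training data, standing in for your real dataset.
rng = np.random.RandomState(42)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Step 1: the outlier detector learns the observed sampling space.
detector = IsolationForest(random_state=42).fit(X)

# Step 2: the actual model learns to predict within that space.
model = LogisticRegression().fit(X, y)

def predict_carefully(X_new):
    """Predict only for points the detector marks as inliers (+1);
    return None for outliers, signalling a fallback (e.g. human review)."""
    inlier = detector.predict(X_new) == 1
    return np.where(inlier, model.predict(X_new), None)

# The first point resembles the training data; the second clearly does not.
print(predict_carefully(np.array([[0.1, -0.2], [25.0, 30.0]])))
# e.g. [0 None]
```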

I definitely recommend you watch this specific section of Vincent’s talk, because he gives some very visual and intuitive explanations of how extrapolation may go HorriblyWrong™.

Be careful! One thing we should maybe start talking about to our bosses: Algorithms merely automate, approximate, and interpolate. It’s the extrapolation that is actually kind of dangerous.

Vincent Warmerdam @ PyData 2019 London

Basically, we can choose to not make automated decisions sometimes.

2. Constrain thy Features

What we feed to our models really matters. […] You should probably do something to the data going into your model if you want your model to have any sort of fairness guarantees.

Vincent Warmerdam @ PyData 2019 London

Often, simply removing biased features from your data does not reduce bias to the extent we may have hoped. Fortunately, Vincent demonstrates how to remove biased information from your variables by applying some cool math tricks.
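To give a flavour of the kind of math trick involved, here is my own minimal sketch of the general idea (not necessarily the exact procedure from the talk): project each feature onto the sensitive attribute and keep only the orthogonal remainder, so that no linear trace of that attribute survives.

```python
import numpy as np

def remove_information(X, s):
    """Return X with the linear component of sensitive attribute s
    projected out of every feature column."""
    s = s - s.mean()                 # centre so the intercept is untouched
    X = X - X.mean(axis=0)
    coef = (X.T @ s) / (s @ s)       # least-squares slope of each feature on s
    return X - np.outer(s, coef)     # subtract each feature's projection onto s

# Toy check: feature 0 is strongly driven by s before filtering.
rng = np.random.RandomState(0)
s = rng.binomial(1, 0.5, size=200).astype(float)
X = np.c_[2 * s + rng.normal(size=200), rng.normal(size=200)]
X_fair = remove_information(X, s)
print(np.corrcoef(X_fair[:, 0], s)[0, 1])  # close to 0: no linear trace of s left
```

(If I recall correctly, Vincent’s own scikit-lego package ships a proper preprocessing step along these lines.)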

Unfortunately, filtering information out this way often results in lower predictive accuracy. That is unsurprising, as you are no longer closely fitting the biased data. What makes matters more problematic, Vincent rightfully mentions, is that corporate incentives often do not really align here: it may feel like you have to pick either more accuracy or more fairness.

However, there’s a nice solution that builds on point 1. We can take the highly accurate model and the highly fair model, make predictions with both, and treat any disagreement between the two as a strong signal that we may not want to make an automated prediction at all. For observations where both models agree, we can comfortably make a fair prediction; in the other situations we may say: “right, this prediction seems unfair, we need a fallback mechanism, a human being should look at this and we should not automate this decision”.
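A rough sketch of that disagreement check, assuming two already-fitted classifiers (the “fair” model below is just a heavily regularized stand-in for an actual fairness-constrained model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_or_defer(model_accurate, model_fair, X_new):
    """Predict where the accurate and the fair model agree;
    return None where they disagree, as a signal to not automate."""
    pred_accurate = model_accurate.predict(X_new)
    pred_fair = model_fair.predict(X_new)
    return np.where(pred_accurate == pred_fair, pred_fair, None)

# Toy illustration with two deliberately different models.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
model_accurate = LogisticRegression().fit(X, y)
model_fair = LogisticRegression(C=0.001).fit(X, y)  # stand-in for a fairness-constrained model
print(predict_or_defer(model_accurate, model_fair, X[:5]))
```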

Vincent does note that this is only one trick to constrain your model for fairness, and that fairness is often only fair in the eyes of the beholder. Moreover, in order to correct for these unfair biases, you need to know about them in the first place. Although outside the scope of this specific topic, Vincent proposes that this introduces new ethical issues.

Basically, we can choose to put our models on a controlled diet.

3. Constrain thy Model

Vincent argues that we should include constraints (based on domain knowledge, or common sense) into our models. In his presentation, he names a few. For instance, monotonicity, which implies that the relationship between X and Y should always be either entirely non-increasing, or entirely non-decreasing. Incorporating the previously discussed fairness principles would be a second example, and there are many more.

If we ever come up with a model where more smoking leads to better health, that’s bad. I have enough domain knowledge to say that that should never happen. So maybe I should just make a system where I can say “look, this one column with relationship to Y should always be strictly negative”.

Vincent Warmerdam @ PyData 2019 London
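The talk does not prescribe a specific library, but as one concrete way to impose such a constraint in practice, scikit-learn’s histogram-based gradient boosting accepts a per-feature monotonicity constraint:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor  # scikit-learn >= 1.0

# Toy data: feature 0 is "cigarettes per day", feature 1 is unrelated noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 2))
health = 100 - 3 * X[:, 0] + rng.normal(scale=5, size=500)

# monotonic_cst: -1 forces a non-increasing effect for feature 0,
# 0 leaves feature 1 unconstrained.
model = HistGradientBoostingRegressor(monotonic_cst=[-1, 0]).fit(X, health)

# The fitted model can now never claim "more smoking, better health":
print(model.predict(np.array([[2.0, 5.0], [8.0, 5.0]])))  # first >= second
```

XGBoost and LightGBM offer a similar monotone_constraints parameter.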

Basically, we can integrate domain knowledge or preferences into our models.

Conclusion: Watch the talk!

Putting R in Production, by Heather Nolis & Mark Sellors


It is often said that R is hard to put into production. Fortunately, there are numerous talks demonstrating the contrary.

Here’s one by Heather Nolis, who productionizes R models at T-Mobile. Her team even shares open-source versions of some of their productionized TensorFlow models on GitHub. Read more about that model here.

There’s another great talk on the RStudio website. In this talk, Mark Sellors discusses some of the misinformation around the idea of what “putting something into production” actually means, and provides some tips on overcoming obstacles.

Cover image via Fotolia.

Northstar: The interactive, drag-and-drop data science platform by MIT


MIT researchers have spent years developing the new drag-and-drop analytics tools they call Northstar.

Northstar is an interactive data science platform that rethinks how people interact with data. It empowers users without programming experience, background in statistics or machine learning expertise to explore and mine data through an intuitive user interface, and effortlessly build, analyze, and evaluate machine learning (ML) pipelines.

northstar.mit.edu/

Northstar starts as a blank, white interface. Users upload datasets into the system, which appear in a “datasets” box on the left. Any data labels will automatically populate a separate “attributes” box below. There’s also an “operators” box that contains various algorithms, as well as the new AutoML tool. All data are stored and analyzed in the cloud.

news.mit.edu/2019/drag-drop-data-analytics-0627

You can read more about the tool’s functionalities in this MIT news article, which includes several promising GIFs.

Moreover, on the Northstar website you can find this longer video explaining the tool in detail.

https://vimeo.com/342787403

While Northstar looks insanely cool and promising, I do worry about putting such power in the hands of people without much experience in statistics and/or machine learning. We all know how easily errors and bias may slip into data-driven processes, so I am curious to see how these next-gen tools will be deployed and used.

Generalized Additive Models Tutorial in R, by Noam Ross


Generalized Additive Models — or GAMs in short — have been somewhat of a mystery to me. I’ve known about them, but didn’t know exactly what they did, or when they’re useful. That came to an end when I found out about this tutorial by Noam Ross.

In this beautiful, online, interactive course, Noam allows you to program several GAMs yourself (in R) and to progressively learn about the different functions and features. I am currently halfway through, but am already very much enjoying it.

If you’re already familiar with linear models and want to learn something new, I strongly recommend this course!

The interactive course asks you to program several GAMs yourself https://noamross.github.io/gams-in-r-course/
You progressively learn how to run, interpret, and visualize GAMs yourself https://noamross.github.io/gams-in-r-course/
After a while you are even able to visualize smoothed interactions between variables https://noamross.github.io/gams-in-r-course/
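The course itself is in R, built around the mgcv package. To give a quick flavour of what a GAM does, here is a rough Python analogue using the pyGAM library (my example, not part of the course): instead of fitting one global slope, the model fits a smooth spline per feature.

```python
import numpy as np
from pygam import LinearGAM, s

# Toy data with a clearly non-linear effect that a straight line would miss.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

# s(0) requests a smooth spline term for feature 0 instead of a slope.
gam = LinearGAM(s(0)).fit(X, y)
print(gam.predict(np.array([[0.0], [np.pi / 2], [np.pi]])))
# Roughly sin(x): about 0, 1, and 0 again; the smooth follows the curve.
```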
Free Programming Books (I still need to read)


There are multiple unread e-mails in my inbox.

Links to books.

Just sitting there. Waiting to be opened, read. For months already.

The sender, you ask? Me. Paul van der Laken.

A nuisance that guy, I tell you. He keeps sending me reminders, of stuff to do, books to read. Books he’s sure a more productive me would enjoy.

Now, I could wipe my inbox. Be done with it. But I don’t want to lose this digital to-do list… Perhaps I should put them here instead. So you can help me read them!

Each of the below links represents a formidable book on programming! (I hear)
And there are free versions! Have a quick peek. A peek won’t hurt you:

Disclaimer: This page contains one or more links to Amazon.
Any purchases made through those links provide us with a small commission that helps to host this blog.

The books listed above have a publicly accessible version linked. Some are legitimate. Other links are somewhat shady.
If you feel like you learned something from reading one of the books (which you surely will), please buy a hardcopy version. Or an e-book. At the very least, reach out to the author and share what you appreciated in his/her work.
It takes valuable time to write a book, and we should encourage and cherish those who take that time.

For more books on R programming, check out my R resources overview.

For books on data analytics and (behavioural) psychology in (HR) management, check out Books for the modern data-driven HR professional.

Awful AI: A curated list of scary usages of artificial intelligence


I found this amazingly horrifying list called Awful Artificial Intelligence:

Artificial intelligence [AI] in its current state is unfair, easily susceptible to attacks and notoriously difficult to control. Nevertheless, more and more concerning uses of AI technology are appearing in the wild.

[Awful A.I.] aims to track all of them. We hope that Awful AI can be a platform to spur discussion for the development of possible contestational technology (to fight back!).

David Dao on the Awful A.I. GitHub repository

The Awful A.I. list contains a few dozen applications of machine learning where the results were less than optimal for several of the parties involved. These AI solutions resulted in discrimination, disinformation (fake news), or mass surveillance, or violated privacy and ethical norms in many other ways.

We’ve all heard of Cambridge Analytica, but there are many more on this Awful A.I. list:

Deep Fakes – Deep Fakes is an artificial intelligence-based human image synthesis technique. It is used to combine and superimpose existing images and videos onto source images or videos. Deepfakes may be used to create fake celebrity pornographic videos or revenge porn. [AI assisted fake porn][CNN Interactive Report]

David Dao on the Awful A.I. GitHub repository

Social Credit System – Using a secret algorithm, Sesame credit constantly scores people from 350 to 950, and its ratings are based on factors including considerations of “interpersonal relationships” and consumer habits. [summary][Foreign Correspondent (video)][travel ban]

David Dao on the Awful A.I. GitHub repository

SenseTime & Megvii – Based on Face Recognition technology powered by deep learning algorithm, SenseFace and Megvii provides integrated solutions of intelligent video analysis, which functions in target surveillance, trajectory analysis, population management. [summary][forbes][The Economist (video)]

David Dao on the Awful A.I. GitHub repository

Check out the full list here.

David Dao is a PhD student at DS3Lab, part of the computer science department at ETH Zurich, and maintains the Awful A.I. list. The cover photo was created by LargeStupidity on Drawception.