Tag: selection

How Booking.com deals with Selection Bias

I came across this PyData 2018 talk by Lucas Bernadi of Booking.com where he talks about the importance of selection bias for practical applications of machine learning.

We can’t just throw data into machines and expect to see any meaning […], we need to think [about this]. I see a strong trend in the practitioners community to just automate everything, to just throw data into a black box and expect to get money out of it, and I really don’t believe in that.
Lucas Bernadi in https://www.youtube.com/watch?v=3ZWCKr0vDtc

All pictures below are slides from the above video.

My summary / interpretation

Lucas highlights an example he has been working on at Booking.com, where they seek to predict which searching activities on their website are for family trips.

What happens is that people forget to specify that they intend to travel as a family, forget to input one/two/three child travellers will come along on the trip, and end up not being able to book the accomodations that come up during their search. If Booking.com would know, in advance, that people (may) be searching for family accomodations, they can better guide these bookers to family arrangements.

The problem here is that many business processes in real life look and act like a funnel. Samples drop out of the process during the course of it. So too the user search activity on Booking.com’s website acts like a funnel.

People come to search for arrangements
Less people end up actually booking arrangements
Even less people actually go on their trip
And even less people then write up a review

However, only for those people that end up writing a review, Booking.com knows 100% certain that they it concerned a family trip, as that is the moment the user can specify so. Of all other people, who did not reach stage 4 of the funnel, Booking.com has no (or not as accurate an) idea whether they were looking for family trips.

Such a funnel thus inherently produces business data with selection bias in it. Only for people making it to the review stage we know whether they were family trips or not. And only those labeled data can be used to train our machine learning model.

And now for the issue: if you train and evaluate a machine learning model on data generated with such a selection bias, your observed performance metrics will not reflect the actual performance of your machine learning model!

Actually, they are pretty much overestimates.

This is very much an issue, even though many ML practitioners don’t see aware. Selection bias makes us blind as to the real performance of our machine learning models. It produces high variance in the region of our feature space where labels are missing. This leads us to being overconfident in our ability to predict whether some user is looking for a family trip. And if the mechanism causing the selection bias is still there, we could never find out that we are overconfident. Consistently estimating, say, 30% of people are looking for family trips, whereas only 25% are.

Fortunately, Lucas proposes a very simple solution! Just adding more observations can (partially) alleviate this detrimental effect of selection bias. Although our bias still remains, the variance goes down and the difference between our observed and actual performance decreases.

A second issue and solution to selection bias relates to propensity (see also): the extent to which your features X influence not only the outcome Y, but also the selection criteria s.

If our features X influence both the outcome Y but also the selection criteria s, selection bias will occur in your data and can thus screw up your conclusion. In order to inspect to what extent this occurs in your setting, you will want to estimate a propensity model. If that model is good, and X appears valuable in predicting s, you have a selection bias problem.

Via a propensity model s ~ X, we quantify to what extent selection bias influences our data and model. The nice thing is that we, as data scientists, control the features X we use to train a model. Hence, we could just use only features X that do not predict s to predict Y. Conclusion: we can conduct propensity-based feature selection in our Y ~ X by simply avoiding features X that predicted s!

Still, Lucas does point that this becomes difficult when you have valuable features that predict both s and Y. Hence, propensity-based feature selection may end up cost(ing) you performance, as you will need to remove features relevant to Y.

I am sure I explained this phenomena worse than Lucas did himself, so please do have a look at the original PyData 2018 Amsterdam video!

Survival of the Best Fit: A webgame on AI in recruitment

Survival of the Best Fit is a webgame that simulates what happens when companies automate their recruitment and selection processes.

You – playing as the CEO of a starting tech company – are asked to select your favorite candidates from a line-up, based on their resumés.

As your simulated company grows, the time pressure increases, and you are forced to automate the selection process.

Fortunately, some smart techies working for your company propose training a computer to hire just like you just did.

They don’t need anything but the data you just generated and some good old supervised machine learning!

To avoid spoilers, try the game yourself and see what happens!

The game only takes a few minutes, and is best played on mobile.

www.survivalofthebestfit.com/ via Medium

Survival of the Best Fit was built by Gabor Csapo, Jihyun Kim, Miha Klasinc, and Alia ElKattan. They are software engineers, designers and technologists, advocating for better software that allows members of the public to question its impact on society.

You don’t need to be an engineer to question how technology is affecting our lives. The goal is not for everyone to be a data scientist or machine learning engineer, though the field can certainly use more diversity, but to have enough awareness to join the conversation and ask important questions.
With Survival of the Best Fit, we want to reach an audience that may not be the makers of the very technology that impact them everyday. We want to help them better understand how AI works and how it may affect them, so that they can better demand transparency and accountability in systems that make more and more decisions for us.
survivalofthebestfit.com

I found that the game provides a great intuitive explanation of how (humas) bias can slip into A.I. or machine learning applications in recruitment, selection, or other human resource management practices and processes.

If you want to read more about people analytics and machine learning in HR, I wrote my dissertation on the topic and have many great books I strongly recommend.

Finally, here’s a nice Medium post about the game.

https://www.survivalofthebestfit.com/game/

Note, as Joachin replied below, that the game apparently does not learn from user-input, but is programmed to always result in bias towards blues.
I kind of hoped that there was actually an algorithm “learning” in the backend, and while the developers could argue that the bias arises from the added external training data (you picked either Google, Apple, or Amazon to learn from), it feels like a bit of a disappointment that there is no real interactivity here.

Animated Citation Gates turned into Selection Gates

Bret Beheim — senior researcher at the Max Planck Institute for Evolutionary Anthropology — posted a great GIF animation of the response to his research survey. He calls the figure citation gates, relating the year of scientific publication to the likelihood that the research materials are published open-source or accessible.

To generate the visualization, Bret used R’s base plotting functionality combined with Thomas Lin Pedersen‘s R package tweenrto animate it.

I've been experimenting with R animations using the tweenR package for visualizing the results of our reproducibility survey, and I think it turned out pretty nice. pic.twitter.com/MRerAWHNYT
— Bret Beheim (@babeheim) November 17, 2018

Bret shared his R code for the above GIF of his citation gates on GitHub. With the open source code, this amazing visual display inspired others to make similar GIFs for their own projects. For example, Anne-Wil Kruijt’s dance of the confidence intervals:

Two wks ago I built a shiny 'CI demo' app for a job interview. Yet
I wasn't quite content with it. Then 2 days ago @babeheim posted an amazing gif (srsly, go check it!). Super inspired, & borrowing heavily from his code: my rendition of 'the Dance of the Confidence Intervals' pic.twitter.com/ORheOBBzDm
— Anne-Wil Kruijt (@t_awkr) November 20, 2018

A spin-off of the citation gates: A gif showing confidence intervals of sample means.

Applied to a Human Resource Management context, we could use this similar animation setup to explore, for instance, recruitment, selection, or talent management processes.

Unfortunately, I couldn’t get the below figure to animate properly yet, but I am working on it (damn ggplot2 facets). It’s a quick simulation of how this type of visualization could help to get insights into the recruitment and selection process for open vacancies.

The figure shows how nearly 200 applicants — sorted by their age — go through several selection barriers. A closer look demonstrates that some applicants actually skip the screening and assessment steps and join via a fast lane in the first interview round, which could happen, for instance, when there are known or preferred internal candidates. When animated, such insights would become more clearly visible.

Robert Coombs and his application robot

Robert Coombs wanted to see whether he could land a new job. He was aware that, these days, organizations often employ applicant tracking systems to progress/fail incoming applications. Hence, Robert concluded that he had two challenges in his search for a new job:

He was up against leaders in their field, so his resume wouldn’t simply jump to the top of the pile.
Robots would read his application, along with those of his competition.

Being a tech enthusiast and having some programming skills, he decided to build his own application robot, capable of sending a customized CV and resume to the thousands of jobs posted online every day, in a matter of seconds. I strongly recommend you read his full story here, but these were his conclusions:

It’s not how you apply, it’s who you know. And if you don’t know someone, don’t bother.
Companies are trying to fill a position with minimal risk, not discover someone who breaks the mold.
The number of jobs you apply to has no correlation to whether you’ll be considered, and you won’t be considered for jobs you don’t get the chance to apply to.

What I found most amusing is that he A/B tested one normal-looking cover letter and a letter in which he that admits right in the second sentence that it was being sent by a robot. “Now, one of those letters should have performed either a lot better or a lot worse than the other. For my purposes, I didn’t care which” he states. But as far as he could tell from the results of this experiment, it seems that nobody even reads cover letters anymore – not even the robots supposedly used in application tracking systems.

Multi-Armed Bandits: The Smart Alternative for A/B Testing

Just as humans, computers learn by experience.The purpose of A/B testing is often to collect data to decide whether intervention A or B is better. As such, we provide one group with intervention A whereas another group receives intervention B. With the data of these two groups coming in, the computer can statistically estimate which intervention (A or B) is more effective. The more data the computer has, the more certain the estimate is. Here, a trade-off exists: we need to collect data on both interventions to be certain which is best. But we don’t want to conduct an inefficient intervention, say B, if we are quite sure already that intervention A is better.

In his post, Corné de Ruijt of Endouble writes about multi-armed bandit algorithms, which try to optimize this trade-off: “Multi-armed bandit algorithms try to overcome the high missed opportunity cost involved in learning, by exploiting and exploring at the same time. Therefore, these methods are in particular interesting when there is a high lost opportunity cost involved in the experiment, and when exploring and exploiting must be performed during a limited time interval.“

In the full article, you can read Corné’s comparison of this multi-armed bandit approach to the traditional A/B testing approach using a recruitment and selection example. For those of you who are interested in reading how anyone can apply this algorithm and others to optimize our own daily decisions, I highly recommend the book Algorithms to Live By: The Computer Science of Human Decisions available on Amazon or the Dutch bol.com.