Tag: personality

How 457 data scientists failed to predict life outcomes

This blog highlights a recent PNAS paper in which 457 data scientists and academic scholars were challenged to use machine learning to predict life outcomes from a rich dataset.

Yet I cannot summarize the result better than this tweet by the author of the paper:

Over 750 scientific papers have used the Fragile Families dataset.

The dataset is famous for its rich cohort (survey) data on the included families’ lives and their children’s upbringing. It includes a whopping 12,942 variables!

Some of these variables reflect interesting life outcomes of the included families.

These include, for instance, the children’s grade point average (GPA) and grit, but also whether the family was ever evicted or experienced hardship, and whether the primary caregiver had received job training or was laid off at work.

You can read more about the exact data contents in the paper’s appendix.

A visual representation of the data
via pnas.org/content/pnas/117/15/8398/F1.medium.gif

Now Matthew Salganik and his co-authors shared this enormous dataset with over 160 teams consisting of 457 academic researchers and data scientists, each of them well versed in statistics and predictive modelling.

These data scientists were challenged with this task: by all means possible, build the most predictive model for the six life outcomes (GPA, eviction, and so on).

The scientists could use all the Fragile Families data, and any algorithm they liked, and their final model and its predictions would be compared against the actual life outcomes in a holdout sample.

According to the paper, many of these teams used machine-learning methods that are not typically used in social science research and that explicitly seek to maximize predictive accuracy.

Now, here’s the summary again:

If hundreds of [data] scientists created predictive algorithms with high-quality data, how well would the best predict life outcomes?

Not very well.


Even the best of the 160 teams’ predictions bore disappointingly little resemblance to the actual life outcomes. None of the trained models/algorithms achieved an R-squared above 0.25.
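To make that number concrete, here is a minimal Python sketch of an R-squared computed on a holdout sample. It assumes the model is benchmarked against simply predicting the training-set mean for everyone, which is my reading of how the challenge was scored. The function name and the toy GPAs are made up for illustration and do not come from the Fragile Families data.

```python
def holdout_r_squared(y_true, y_pred, train_mean):
    """1 - (squared error of the model) / (squared error of always predicting train_mean)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    sse_model = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sse_baseline = sum((y - train_mean) ** 2 for y in y_true)
    return 1 - sse_model / sse_baseline


# Toy, made-up GPAs: the predictions point in the right direction
# but barely move away from the mean.
actual = [2.0, 2.5, 3.0, 3.5, 4.0]
predicted = [2.9, 2.95, 3.0, 3.05, 3.1]

r2 = holdout_r_squared(actual, predicted, train_mean=3.0)
print(round(r2, 2))  # about 0.19
```

A value of 1 would mean perfect predictions, 0 means no better than guessing the mean, and negative values mean worse than guessing the mean.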

Via twitter.com/msalganik/status/1263886779603705856/photo/1

Here’s that same plot again, but from the original publication and with more detail:

Via pnas.org/content/117/15/8398

Wondering what the best R-squared values of around 0.20 look like? Here’s the disappointing reality of plot C enlarged: the actual (true) GPAs on the x-axis, plotted against the best team’s predicted GPAs on the y-axis.

Via twitter.com/msalganik/status/1263886781449191424/photo/1

Sure, there’s some relationship, with higher actual scores getting higher (average) predictions. But it ain’t much.

Moreover, there’s very little variation in the predictions. They all clump together in the range of about 2.1 to 3.8… that’s not really setting apart the geniuses from the less bright!
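This clumping is partly what a low R-squared implies. As a back-of-the-envelope sketch, assume the predictions behave like ordinary least-squares fitted values (an assumption on my part; the teams’ models were far more varied). Then the share of variance explained equals the variance of the predictions relative to the variance of the true outcome:

```latex
R^2 = \frac{\operatorname{Var}(\hat{y})}{\operatorname{Var}(y)}
\quad\Longrightarrow\quad
\frac{\operatorname{SD}(\hat{y})}{\operatorname{SD}(y)} = \sqrt{R^2} = \sqrt{0.20} \approx 0.45
```

Under that assumption, an R-squared of about 0.20 means the predictions spread less than half as widely as the true GPAs.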

Matthew sums up the implications quite nicely in one of his tweets:

For policymakers deploying predictive algorithms in high-stakes decisions, our result is a reminder of a basic fact: one should not assume that algorithms predict well. That must be demonstrated with transparent, empirical evidence.


According to Matthew, this “collective failure of 160 teams” is hard to ignore. And this failure highlights the understanding vs. predicting paradox: these data have been used to generate knowledge on how the world works in over 750 papers, yet few checked to see whether these same data and the scientific models would be useful to predict the life outcomes we’re trying to understand.

I was super excited to read this paper and I love the approach. It is actually quite closely linked to a series of papers I have been working on with Brian Spisak and Brian Doornenbal on trying to predict which people will emerge as organizational leaders. (hint: we could not really, at least not based on their personality)

Apparently, others were as excited as I am about this paper, as Filiz Garip already published a commentary paper on this research piece. Unfortunately, it’s behind a paywall so I haven’t read it yet.

Moreover, if you want to learn more about the approaches the 160 data science teams took in modelling these life outcomes, here are twelve papers in which some teams share their attempts.

Very curious to hear what you think of the paper and its implications. You can access it here, and I’d love to read your comments below.

Simpson’s Paradox: Two HR examples with R code.

Simpson (1951) demonstrated that a statistical relationship observed within a population (i.e., a group of individuals) could be reversed within all subgroups that make up that population. This phenomenon, where X seems to relate to Y in a certain way but flips direction when the population is split by W, has since been referred to as Simpson’s paradox. Other names, according to Wikipedia, include the Simpson-Yule effect, reversal paradox, or amalgamation paradox.

The most famous example has to be the seemingly gender-biased Berkeley admission rates:

“Examination of aggregate data on graduate admissions to the University of California, Berkeley, for fall 1973 shows a clear but misleading pattern of bias against female applicants. Examination of the disaggregated data reveals few decision-making units that show statistically significant departures from expected frequencies of female admissions, and about as many units appear to favor women as to favor men. If the data are properly pooled, taking into account the autonomy of departmental decision making, thus correcting for the tendency of women to apply to graduate departments that are more difficult for applicants of either sex to enter, there is a small but statistically significant bias in favor of women. […] The bias in the aggregated data stems not from any pattern of discrimination on the part of admissions committees, which seem quite fair on the whole, but apparently from prior screening at earlier levels of the educational system.” – part of abstract of Bickel, Hammel, & O’Connell (1975)

In a table, the effect becomes clear. While it seems as if women are rejected more often overall, they are actually rejected less often at the departmental level. Women simply applied to the more selective departments more often (E & C below), resulting in the overall lower admission rate for women (35% as opposed to 44% for men).
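The arithmetic behind such a reversal is easy to reproduce. The Python sketch below uses made-up applicant counts for two hypothetical departments (these are not the Berkeley figures) in which women have the higher admission rate in each department, yet the lower rate overall, because most of them apply to the hard department.

```python
# Hypothetical counts per department and gender: (applicants, admitted).
applications = {
    "easy department": {"men": (800, 480), "women": (200, 130)},
    "hard department": {"men": (200, 40), "women": (800, 200)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction."""
    return admitted / applicants


def overall_rate(gender):
    """Admission rate for one gender after pooling all departments."""
    applicants = sum(dept[gender][0] for dept in applications.values())
    admitted = sum(dept[gender][1] for dept in applications.values())
    return rate(applicants, admitted)


for name, dept in applications.items():
    print(name, {gender: round(rate(*counts), 2) for gender, counts in dept.items()})
print("overall", {gender: round(overall_rate(gender), 2) for gender in ("men", "women")})
```

Per department the women’s rates are 0.65 vs. 0.60 and 0.25 vs. 0.20, but pooled they drop to 0.33 against 0.52 for men.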

Image: Berkeley admissions table illustrating Simpson’s paradox
Copied from Bits of Pi

Examples in HR

Simpson’s paradox can easily occur in organizational or human resources settings as well. Let me run you through two illustrated examples that I simulated:

Assume you run a company of 1000 employees and you have asked all of them to fill out a Big Five personality survey. For each individual, you therefore have a score for the personality trait Neuroticism, which can run from 0 (not at all neurotic) to 7 (very neurotic). Now you are interested in the extent to which employees’ Neuroticism relates to their Job Performance (measured 0 – 100) and their Salary (measured in euros per year). To get a sense of the effects, you may decide to visualize both relations in scatter plots:

[Figure: scatter plots of Neuroticism against Job Performance and against Salary]

From these visualizations it would look like Neuroticism relates significantly and positively to both employees’ performance and their yearly salary. Should you select more neurotic people to improve your overall company performance? Or are you discriminating against emotionally stable (non-neurotic) employees when it comes to salary?

Taking a closer look at the subgroups in your data, however, you might find very different relationships. For instance, the positive relationship between neuroticism and performance may only apply to technical positions, but not to employees in service-oriented jobs.

[Figure: Neuroticism against Job Performance, split by job type]

Similarly, splitting the employees by education level, it becomes clear that there is a relationship between neuroticism and education level that may explain the earlier association with salary. More educated employees receive higher salaries and within these groups, neuroticism is actually related to lower yearly income.

[Figure: Neuroticism against Salary, split by education level]

If you’d like to see the code used to simulate these data and generate the examples, you can find the R markdown file here on Rpubs.
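For readers who prefer Python, here is a separate, minimal sketch of the salary example. It is not a port of the R markdown file: the three education levels, the group means, the slopes, and the noise levels are all assumptions chosen only to produce the reversal, and `simulate_employees` and `pearson` are hypothetical helper names.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_employees(n_per_group=300, seed=1):
    """Simulate (neuroticism, salary) pairs per education level."""
    rng = random.Random(seed)
    # Assumed values: (mean neuroticism, base salary in euros per year).
    levels = {"low": (2.0, 25000), "medium": (3.5, 40000), "high": (5.0, 55000)}
    data = {}
    for level, (mean_neuro, base_salary) in levels.items():
        rows = []
        for _ in range(n_per_group):
            # Keep scores on the 0-7 scale used in the text.
            neuro = min(7.0, max(0.0, rng.gauss(mean_neuro, 0.6)))
            # Within a level, higher neuroticism lowers salary.
            salary = base_salary - 3000 * (neuro - mean_neuro) + rng.gauss(0, 2000)
            rows.append((neuro, salary))
        data[level] = rows
    return data


data = simulate_employees()
pooled = [row for rows in data.values() for row in rows]
pooled_r = pearson(*zip(*pooled))
within_r = {level: pearson(*zip(*rows)) for level, rows in data.items()}

print("pooled correlation:", round(pooled_r, 2))
for level, r in within_r.items():
    print(level, "education:", round(r, 2))
```

The pooled correlation comes out positive because more educated employees are, by construction, both more neurotic and better paid, while the correlation inside every education level is negative.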

Solving the paradox

Kievit and colleagues (2013) argue that Simpson’s paradox may occur in a wide variety of research designs, methods, and questions, particularly within the social and medical sciences. As such, they propose several means to “control” or minimize the risk of it occurring. The paradox may be prevented from occurring altogether by more rigorous research design: testing mechanisms in longitudinal or intervention studies. However, this is not always feasible. Alternatively, the researchers suggest that data visualization may help recognize the patterns and subgroups and thereby diagnose paradoxes. This may be easy if your data looks like this:

[Figure from Kievit et al. (2013): data in which the subgroups are clearly visible]

But rather hard, or even impossible, when your data looks more like the below:

[Figure from Kievit et al. (2013): data in which the subgroups cannot be told apart visually]

Clustering may nevertheless help to detect Simpson’s paradox when it is not directly observable in the data. To this end, Kievit and Epskamp (2012) developed a tool to facilitate the detection of hitherto undetected patterns of association in existing datasets. It is written in R, a language tailored to a wide variety of statistical analyses, which makes it well suited for integration into a regular analysis workflow. As an R package, the tool is freely available and specializes in detecting cases of Simpson’s paradox for bivariate continuous data with categorical grouping variables (also known as Robinson’s paradox), a very common inference type for psychologists. Finally, its code is open source and can be extended and improved upon depending on the nature of the data being studied.

One example of application is provided in the paper, for a dataset on coffee and neuroticism. A regression analysis would suggest a significant positive association between coffee and neuroticism overall. However, when the detection algorithm of the R package is applied, a different picture appears: the analysis shows that there are three latent clusters present and that the purported positive relationship only holds for one cluster whereas it is negative in the others.
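The core comparison (pooled trend versus trend per cluster) can be sketched in a few lines of Python. This is not the R package’s algorithm: the package estimates latent clusters from the data, whereas this sketch assumes the cluster labels are already known. The function names and the coffee/neuroticism numbers are made up for illustration.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def find_reversals(groups):
    """Return the pooled slope and the groups whose slope has the opposite sign."""
    pooled = [point for points in groups.values() for point in points]
    pooled_slope = slope(pooled)
    reversed_groups = [
        name for name, points in groups.items()
        if slope(points) * pooled_slope < 0
    ]
    return pooled_slope, reversed_groups


# Made-up (cups of coffee per day, neuroticism score) pairs per known cluster.
groups = {
    "cluster 1": [(1, 2.0), (2, 1.8), (3, 1.5)],
    "cluster 2": [(3, 3.0), (4, 3.6), (5, 4.1)],
    "cluster 3": [(5, 5.5), (6, 5.2), (7, 5.0)],
}

pooled_slope, reversed_groups = find_reversals(groups)
print("pooled slope is positive:", pooled_slope > 0)
print("clusters with the opposite sign:", reversed_groups)
```

In this toy data the pooled slope is positive, while clusters 1 and 3 slope downward, mirroring the pattern described for the coffee example.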

[Figure from Kievit et al. (2013): coffee and neuroticism, with the detected clusters]

Update 24-10-2017: minutephysics – one of my favorite YouTube channels – uploaded a video explaining Simpson’s paradox very intuitively in a medical context:

Update 01-11-2017: minutephysics uploaded a follow-up video:

The paradox is that we remain reluctant to fight our biases, even when they are put in plain sight.

EAWOP 2017 – Takeaways

Last week, I attended the 2017 conference of the European Association of Work and Organizational Psychology (EAWOP), which was hosted by University College Dublin. There were many interesting sessions, the venue was amazing, and Dublin is a lovely city. Personally, I mostly enjoyed the presentations on selection and assessment test validity, and below are my main takeaways:

  • Professor Stephen Woods gave a most interesting presentation on the development of a periodic table of personality. You can find the related 2016 JAP article here. Woods compares the most commonly used personality indices, “plotting” each scale on a two-dimensional circumplex of the most strongly related Big-Five OCEAN scales. This creates a structure that closely resembles a periodic table, with which he demonstrates which elements of personality are well-researched and which require more scholarly attention. In the presentation, Woods furthermore reviewed the relationship of several of these elements and their effect on job-related outcomes. You can find the abstracts of the larger personality & analytics symposium here.
  • One of the symposia focused on social desirability, impression management, and faking behaviors in personality measurement. The first presentation by Patrick Dunlop elaborated on the various ways in which to measure faking, such as bogus items, social desirability scales, or measures of blatant extreme responses. Dunlop’s exemplary study on repeat applicants to firefighter positions was highly amusing. Second, Nicolas Roulin demonstrated how the perceived competitive climate in organizations can cause applicants to positively inflate most of their personality scores, with the exception of their self-reported Extraversion and Conscientiousness, which seemed quite stable no matter the perceived competitiveness. Third, Pelt (Ph.D. at Erasmus University and IXLY) demonstrated how (after some statistical corrections) the level of social desirability in personality tests can be reduced by using forced-choice instead of Likert scales. If practitioners catch on, this will likely become the new status quo. The fourth presentation was also highly relevant, proposing to use items that are less biased in their formulation towards specific personality traits: Extraversion is often promoted, whereas items on Introversion inherently have negative connotations (e.g., “shyness”). Fifth and most interestingly, Van der Linden (also Erasmus) showed how a higher-order factor analysis on the Big-Five OCEAN scales results in a single factor of personality, commonly referred to as the Big-One or the general factor of personality. This one factor could represent some sort of social desirability, but according to meta-analytical results presented by Van der Linden, the factor correlates .88 with emotional intelligence! Moreover, it consistently predicts performance behaviors (also as rated by supervisors or in 360 assessments) better than the Big-Five factors separately, with only Conscientiousness retaining some incremental validity.
You can find the abstracts and the author details of the symposium here.


  • Schäpers (Free University Berlin) demonstrated in three independent experiments that the situational or contextual prompts in a situational judgment test (SJT) do not matter for its validity. In other words, excluding the work-related critical incidents that precede each item did not affect the predictive validity: not for general mental ability, personality dimensions, emotional intelligence, or job performance. Actually, the validity improved a little for certain outcomes. These results suggest that SJTs may measure something completely different from what was previously assumed. Schäpers found similar effects for written and video-based SJTs. The abstract of Schäpers’ paper can be found here.
  • Finally, assessment vendor cut-e was the main sponsor of the conference. Among other things, they presented their new tool chatAssess, which brings SJTs to a mobile environment. Via this link (https://maptq.com/default/home/nl/start/2tkxsmdi) you can run a demo using the password demochatassess. The abstract of this larger session on game-based assessment can be found here.


The rest of the 2017 EAWOP program can be viewed here.