Tag: science

How 457 data scientists failed to predict life outcomes

This blog post highlights a recent PNAS paper in which 457 data scientists and academic scholars were challenged to use machine learning to predict life outcomes from a rich dataset.

Yet I cannot summarize the result better than this tweet by the author of the paper:

Over 750 scientific papers have used the Fragile Families dataset.

The dataset is famous for its rich cohort (survey) data on the included families’ lives and their children’s upbringings. It includes a whopping 12,942 variables!

Some of these variables reflect interesting life outcomes of the included families.

For instance, the children’s grade point averages (GPA) and grit, but also whether the family was ever evicted or experienced hardship, or whether the primary caregiver had received job training or was laid off at work.

You can read more about the exact data contents in the paper’s appendix.

A visual representation of the data
via pnas.org/content/pnas/117/15/8398/F1.medium.gif

Now Matthew Salganik and his co-authors shared this enormous dataset with over 160 teams, consisting of 457 academic researchers and data scientists, each of them well versed in statistics and predictive modelling.

These data scientists were challenged with this task: by all means possible, build the most predictive model for six life outcomes (e.g., GPA, eviction).

The scientists could use all the Fragile Families data and any algorithm they liked; their final models and predictions would be compared against the actual life outcomes in a holdout sample.

According to the paper, many of these teams used machine-learning methods that are not typically used in social science research and that explicitly seek to maximize predictive accuracy.
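To give a feel for what such a pipeline involves, here is a minimal sketch of a holdout evaluation with scikit-learn. The data below is synthetic and merely stands in for the thousands of survey variables; this is not any team’s actual model.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data: 2,000 families, 50 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
y = 0.3 * X[:, 0] + rng.normal(size=2000)  # one continuous outcome, e.g. GPA

# Hold out part of the sample, as the challenge did, and score on it only.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print("Holdout R-squared:", r2_score(y_hold, model.predict(X_hold)))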

Now, here’s the summary again:

If hundreds of [data] scientists created predictive algorithms with high-quality data, how well would the best predict life outcomes?

Not very well.

@msalganik

Even the best of the 160 teams’ predictions bore only a disappointing resemblance to the actual life outcomes. None of the trained models achieved an R-squared of over 0.25.

Image via twitter.com/msalganik/status/1263886779603705856/photo/1

Here’s that same plot again, but from the original publication and with more detail:

Via pnas.org/content/117/15/8398

Wondering what such a best R-squared of around 0.20 looks like? Here’s the disappointing reality of plot C enlarged: the actual GPAs on the x-axis, plotted against the best team’s predicted GPAs on the y-axis.

Image via twitter.com/msalganik/status/1263886781449191424/photo/1

Sure, there’s some relationship, with higher actual scores getting higher (average) predictions. But it ain’t much.

Moreover, there’s very little variation in the predictions. They all clump together in the range of about 2.1 to 3.8… that’s not really setting apart the geniuses from the less bright!

Matthew sums up the implications quite nicely in one of his tweets:

For policymakers deploying predictive algorithms in high-stakes decisions, our result is a reminder of a basic fact: one should not assume that algorithms predict well. That must be demonstrated with transparent, empirical evidence.

@msalganik

According to Matthew, this “collective failure of 160 teams” is hard to ignore. The failure highlights the understanding-versus-predicting paradox: these data have been used in over 750 papers to generate knowledge about how the world works, yet few checked whether these same data and scientific models could actually predict the life outcomes we are trying to understand.

I was super excited to read this paper and I love the approach. It is actually quite closely linked to a series of papers I have been working on with Brian Spisak and Brian Doornenbal on trying to predict which people will emerge as organizational leaders. (hint: we could not really, at least not based on their personality)

Apparently, others were as excited as I am about this paper, as Filiz Garip already published a commentary paper on this research piece. Unfortunately, it’s behind a paywall so I haven’t read it yet.

Moreover, if you want to learn more about the approaches the 160 data science teams took in modelling these life outcomes, here are twelve papers in which some teams share their attempts.

Very curious to hear what you think of the paper and its implications. You can access it here, and I’d love to read your comments below.

The 12 Truths of Machine Learning – by Delip Rao

In this original blog post, with an equally original title, Delip Rao poses twelve (+1) harsh truths about the real-world practice of machine learning. I found it quite enlightening to read a non-hyped article about ML for once, particularly because Delip’s experiences seem to overlap quite nicely with the principles of software design and Agile working.

I’ve copied Delip’s 12 truths as headers below. If they spark your interest, read more here:

  1. It has to work
  2. No matter how hard you push and no matter what the priority, you can’t increase the speed of light
  3. With sufficient thrust, pigs fly just fine. However, this is not necessarily a good idea
  4. Some things in life can never be fully appreciated nor understood unless experienced firsthand
  5. It is always possible to agglutinate multiple separate problems into a single complex interdependent solution. In most cases, this is a bad idea
  6. It is easier to ignore or move a problem around than it is to solve it
  7. You always have to tradeoff something
  8. Everything is more complicated than you think
  9. You will always under-provision resources
  10. One size never fits all. Your model will make embarrassing errors all the time despite your best intentions
  11. Every old idea will be proposed again with a different name and a different presentation, regardless of whether it works
  12. Perfection has been reached not when there is nothing left to add, but when there is nothing left to take away

Delip added in a +1, with his zero-indexed truth: You are Not a Scientist.

Yes, that’s all of you building stuff with machine learning with a “scientist” in the title, including all of you with PhDs, has-been-academics, and academics with one foot in the industry. Machine learning (and other AI application areas, like NLP, Vision, Speech, …) is an engineering research discipline (as opposed to science research).

Delip Rao via deliprao.com/archives/227

Delip [bio] is the VP of Research at AI Foundation, where he leads speech, language, and vision research efforts for generating and detecting artificial content. You can find his personal weblog here.

Cover image via the-vital-edge.com/lie-detector

Google’s Dataset Search: Direct access to 25 million interesting datasets

I used to keep a repository of links to interesting datasets for learning data science. However, I can now retire that page, as Google has launched its new service, Dataset Search.

The “world wide web” hosts millions of datasets, on nearly any topic you can think of. Google’s Dataset Search has indexed almost 25 million of these datasets, giving you a single entry point to search for datasets online. After a year of testing, Dataset Search is now officially out of beta.

Since leaving beta, Dataset Search includes filters based on the type of dataset you want (e.g., tables, images, text) and on whether the dataset is open access. For datasets covering a geographic area, you can view a map. The quality of dataset descriptions has improved greatly, and the tool now has a mobile version.

Anyone who publishes data can make their datasets discoverable in Dataset Search by describing the dataset’s properties in a structured schema (schema.org’s Dataset markup) on their own web pages.
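For the curious, here is a minimal sketch of what such markup can look like, generated with Python; all field values are hypothetical placeholders, not a real dataset.

import json

# Minimal schema.org "Dataset" description; every value below is hypothetical.
dataset_markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example survey dataset",
    "description": "A hypothetical description of what the dataset contains.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data.csv",
    }],
}

# Embedding this JSON-LD in a <script type="application/ld+json"> tag on the
# page that documents the dataset lets crawlers pick it up.
print(json.dumps(dataset_markup, indent=2))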

How to Read Scientific Papers

Cover image via wikihow.com/Read-a-Scientific-Paper

Reddit is a treasure trove of random stuff. However, every now and then, in the better groups, quite valuable topics pop up. Here’s one I came across on r/statistics:

Particularly the advice by grandzooby seemed worth a like, and he linked to several useful resources which I’ve summarized for you below.

An 11-step guide to reading a paper

Jennifer Raff — assistant professor at the University of Kansas — wrote this 3-page guide on how to read papers. It elaborates on 11 main pieces of advice for reading academic papers:

  1. Begin by reading the introduction, skip the abstract.
  2. Identify the general problem: “What problem is this research field trying to solve?”
  3. Try to uncover the reason and need for this specific study.
  4. Identify the specific problem: “What problems is this paper trying to solve?”
  5. Identify what the researchers are going to do to solve that problem
  6. Read & identify the methods: draw the studies in diagrams
  7. Read & identify the results: write down the main findings
  8. Determine whether the results solve the specific problem
  9. Read the conclusions and determine whether you agree
  10. Read the abstract
  11. Find out what others say about this paper

Jennifer also dedicated a more elaborate blog post to the matter (to which u/grandzooby refers).

4-step Infographic

Natalia Rodriguez made a beautiful infographic with some general advice for Elsevier:

Via https://www.elsevier.com/connect/infographic-how-to-read-a-scientific-paper

How to take notes while reading

Mary Purugganan and Jan Hewitt of Rice University propose slightly different steps for reading academic papers, though to me they seem more like general pointers to keep in mind:

  1. Skim the article and identify its structure
  2. Distinguish its main points
  3. Generate questions before and during reading
  4. Draw inferences while reading
  5. Take notes while reading

Regarding note-taking, Mary and Jan propose the following template, which may prove useful:

  • Citation:
  • URL:
  • Keywords:
  • General subject:
  • Specific subject:
  • Hypotheses:
  • Methodology:
  • Results:
  • Key points:
  • Context (in the broader field/your work):
  • Significance (to the field/your work):
  • Important figures/tables (description/page numbers):
  • References for further reading:
  • Other comments:

Scholars sharing their experiences

Science Magazine dedicated a long read to how to seriously read scientific papers, in which they asked multiple scholars to share their experiences and tips.

Anatomy of a scientific paper

This 13-page guide by the American Society of Plant Biologists was recommended by some, but I personally don’t find it as useful as the other advice here. Nevertheless, for the layperson, it does include a nice visualization of the anatomy of scientific papers:

Via https://aspb.org/wp-content/uploads/2016/04/HowtoReadScientificPaper.pdf

Learning How to Learn

One Reddit user recommended this Coursera course, Learning How to Learn: Powerful mental tools to help you master tough subjects. It’s free and can be taken in English, as well as Portuguese, Spanish, or Chinese.

This course gives you easy access to the invaluable learning techniques used by experts in art, music, literature, math, science, sports, and many other disciplines. We’ll learn about how the brain uses two very different learning modes and how it encapsulates (“chunks”) information. We’ll also cover illusions of learning, memory techniques, dealing with procrastination, and best practices shown by research to be most effective in helping you master tough subjects.

https://www.coursera.org/learn/learning-how-to-learn

Overviews of Graph Classification and Network Clustering methods

Thanks to Sebastian Raschka I am able to share this great GitHub overview page of relevant graph classification techniques, and the scientific papers behind them. The overview divides the algorithms into four groups:

  1. Factorization
  2. Spectral and Statistical Fingerprints
  3. Deep Learning
  4. Graph Kernels

Moreover, the overview contains links to similar collections on community detection, classification/regression trees, and gradient boosting papers with implementations.

As well as a link to relevant graph classification benchmark datasets.
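As a toy illustration of the second family, “Spectral and Statistical Fingerprints”, here is a small Python sketch that turns each graph into a fixed-length feature vector of Laplacian eigenvalues and feeds it to an off-the-shelf classifier. The random graphs and the networkx/scikit-learn choices are my own simplification, not a method from the overview.

import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two hypothetical classes of random graphs that differ only in edge density.
graphs, labels = [], []
for label, p in [(0, 0.10), (1, 0.20)]:
    for i in range(100):
        graphs.append(nx.erdos_renyi_graph(30, p, seed=1000 * label + i))
        labels.append(label)

# "Spectral fingerprint": the k smallest Laplacian eigenvalues of each graph.
def spectral_fingerprint(g, k=10):
    return np.sort(nx.laplacian_spectrum(g))[:k]

X = np.array([spectral_fingerprint(g) for g in graphs])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))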

2019 Shortlist for the Royal Society Prize for Science Books

Since 1988, the Royal Society has celebrated outstanding popular science writing and authors.

Each year, a panel of expert judges chooses the book that they believe makes popular science writing compelling and accessible to the public.

Over the decades, the Prize has celebrated some notable winners including Bill Bryson and Stephen Hawking.

The author of the winning book receives £25,000 and £2,500 is awarded to each of the five shortlisted books. And this year’s shortlist includes some definite must-reads on data and statistics!

Infinite Powers – by Steven Strogatz

The captivating story of mathematics’ greatest ever idea: calculus. Without it, there would be no computers, no microwave ovens, no GPS, and no space travel. But before it gave modern man almost infinite powers, calculus was behind centuries of controversy, competition, and even death. 

Taking us on a thrilling journey through three millennia, Professor Steven Strogatz charts the development of this seminal achievement, from the days of Archimedes to today’s breakthroughs in chaos theory and artificial intelligence. Filled with idiosyncratic characters from Pythagoras to Fourier, Infinite Powers is a compelling human drama that reveals the legacy of calculus in nearly every aspect of modern civilisation, including science, politics, medicine, philosophy, and more.

https://royalsociety.org/grants-schemes-awards/book-prizes/science-book-prize/2019/infinite-powers/

Invisible Women – by Caroline Criado Perez

Imagine a world where your phone is too big for your hand, where your doctor prescribes a drug that is wrong for your body, where in a car accident you are 47% more likely to be seriously injured, where every week the countless hours of work you do are not recognised or valued. If any of this sounds familiar, chances are that you’re a woman.

Invisible Women shows us how, in a world largely built for and by men, we are systematically ignoring half the population. It exposes the gender data gap–a gap in our knowledge that is at the root of perpetual, systemic discrimination against women, and that has created a pervasive but invisible bias with a profound effect on women’s lives. From government policy and medical research, to technology, workplaces, urban planning and the media, Invisible Women reveals the biased data that excludes women.

https://royalsociety.org/grants-schemes-awards/book-prizes/science-book-prize/2019/invisible-women/

Six Impossible Things – by John Gribbin

This book does not deal with data or statistics specifically, but might even be more interesting, as it covers the topic of quantum physics:

Quantum physics is strange. It tells us that a particle can be in two places at once. That particle is also a wave, and everything in the quantum world can be described entirely in terms of waves, or entirely in terms of particles, whichever you prefer. 

All of this was clear by the end of the 1920s, but to the great distress of many physicists, let alone ordinary mortals, nobody has ever been able to come up with a common sense explanation of what is going on. Physicists have sought ‘quanta of solace’ in a variety of more or less convincing interpretations. 

This short guide presents us with the six theories that try to explain the wild wonders of quantum. All of them are crazy, and some are crazier than others, but in this world crazy does not necessarily mean wrong, and being crazier does not necessarily mean more wrong.

https://royalsociety.org/grants-schemes-awards/book-prizes/science-book-prize/2019/six-impossible-things/

The other shortlisted books