Category: best practices

Propensity Score Matching Explained Visually

Propensity score matching (wiki) is a statistical matching technique that attempts to estimate the effect of a treatment (e.g., intervention) by accounting for the factors that predict whether an individual would be eligble for receiving the treatment. The wikipedia page provides a good example setting:

Say we are interested in the effects of smoking on health. Here, smoking would be considered the treatment, and the ‘treated’ are simply those who smoke. In order to find a cause-effect relationship, we would need to run an experiment and randomly assign people to smoking and non-smoking conditions. Of course such experiments would be unfeasible and/or unethical, as we can’t ask/force people to smoke when we suspect it may do harm.
We will need to work with observational data instead. Here, we estimate the treatment effect by simply comparing health outcomes (e.g., rate of cancer) between those who smoked and did not smoke. However, this estimation would be biased by any factors that predict smoking (e.g., social economic status). Propensity score matching attempts to control for these differences (i.e., biases) by making the comparison groups (i.e., smoking and non-smoking) more comparable.

Lucy D’Agostino McGowan is a post-doc at Johns Hopkins Bloomberg School of Public Health and co-founder of R-Ladies Nashville. She wrote a very nice blog explaining what propensity score matching is and showing how to apply it to your dataset in R. Lucy demonstrates how you can use propensity scores to weight your observations in such a way that accounts for the factors that correlate with receiving a treatment. Moreover, her explainations are strenghtened by nice visuals that intuitively demonstrate what the weighting does to the “pseudo-populations” used to estimate the treatment effect.

Have a look yourself: https://livefreeordichotomize.com/2019/01/17/understanding-propensity-score-weighting/

Recommended Books on Data Visualization

Disclaimer: This page contains one or more links to Amazon.
Any purchases made through those links provide us with a small commission that helps to host this blog.

Data visualization and the (in)effective communication of information are salient topics on this blog. I just love to read and write about best practices related to data visualization (or bad practices), or to explore novel types of complex graphs. However, I am not always online, and I am equally fond of reading about data visualization offline.

These amazing books about data visualization
are written by some of the leading experts in the dataviz scene:

#MakeoverMonday: Improving How We Visualize and Analyze Data, One Chart at a Time — by Andy Kreibel
Better Presentations: A Guide for Scholars, Researchers, and Wonks — by Jon Schwabish
D3.js in Action: Data Visualization with Javascript — by Elijah Meeks
Data Visualization: A Practical Introduction — by Kieran Healy

Data Visualization: Charts, Maps, and Interactive Graphics — by Robert Grant
Fundamentals of Data Visualization — by Claus Wilke
Information is Beautiful — by David McCandless
Interactive Data Visualization for the Web: An Introduction to Designing with D3 — by Scott Murray
Knowledge is Beautiful — by David McCandless
Storytelling with Data — by Cole Nussbaumer Knaflic
Storytelling with Data: Let’s Practice — by Cole Nussbaumer Knaflic

The Functional Art: An introduction to information graphics and visualization (Voices That Matter) — by Alberto Cairo
The Truthful Art: Data, Charts, and Maps for Communication — by Alberto Cairo
Visual Miscellaneum — by David McCandless
Visualization Analysis and Design — by Tamara Munzner
Visualize This — by Nathan Yau

Happy reading!

If you are also interested in programming and machine learning, have a look at this list of free programming books.

Learn from the Pros: How media companies visualize data

Past months, multiple companies shared their approaches to data visualization and their lessons learned.

Click the companies in the list below to jump to their respective section

The Financial Times
The Britisch Broadcast Corporation
The Economist
FiveThirtyEight

Financial Times

The Financial Times (FT) released a searchable database of the many data visualizations they produced over the years. Some lovely examples include:

Graphic showing what May needs to happen to get her deal over the line when MPs vote on Friday — Data visualization belonging to a recent Brexit piece by the FT, viahttps://www.ft.com/graphics

Dutch housing graphic — Searching the FT database for *European House Prices* via https://www.ft.com/graphics returns this map of the Netherlands.

BBC

The BBC released a free cookbook for data visualization using R programming. Here is the associated Medium post announcing the book.

The BBC data team developed an R package (bbplot) which makes the process of creating publication-ready graphics in their in-house style using R’s ggplot2 library a more reproducible process, as well as making it easier for people new to R to create graphics.

Apart from sharing several best practices related to data visualization, they walk you through the steps and R code to create graphs such as the below:

One of the graphs the BBC cookbook will help you create, via https://bbc.github.io/rcookbook/

Ultimately, you should be able to reproduce these BBC style graphics, viahttps://medium.com/bbc-visual-and-data-journalism/how-the-bbc-visual-and-data-journalism-team-works-with-graphics-in-r-ed0b35693535

Economist

The data team at the Economist also felt a need to share their lessons learned via Medium. They show some of their most misleading, confusing, and failing graphics of the past years, and share the following mistakes and their remedies:

Truncating the scale (image #1 below)
Forcing a relationship by cherry-picking scales
Choosing the wrong visualisation method (image #2 below)
Taking the “mind-stretch” a little too far (image #3 below)
Confusing use of colour (image #4 below)
Including too much detail
Lots of data, not enough space

Moreover, they share the data behind these failing and repaired data visualizations:

Via https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368

FiveThirtyEight

I could not resist including this (older) overview of the 52 best charts FiveThirtyEight claimed they made.

All 538’s data visualizations are just stunningly beautiful and often very
ingenious, using new chart formats to display complex patterns. Moreover, the range of topics they cover is huge. Anything ranging from their traditional background — politics — to great cover stories on sumo wrestling and pricy wine.

Viahttps://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/

Via https://fivethirtyeight.com/features/the-52-best-and-weirdest-charts-we-made-in-2016/ You should definitely check out the original cover story via https://projects.fivethirtyeight.com/sumo/

18 Pitfalls of Data Visualization

Maarten Lambrechts is a data journalist I closely follow online, with great delight. Recently, he shared on Twitter his slidedeck on the 18 most common data visualization pitfalls. You will probably already be familiar with most, but some (like #14) were new to me:

Save pies for dessert
Don’t cut bars
Don’t cut time axes
Label directly
Use colors deliberately
Avoid chart junk
Scale circles by area
Avoid double axes
Correlation is no causality
Don’t do 3D
Sort on the data
Tell the story
1 chart, 1 message
Common scales on small mult’s
#Endrainbow
Normalise data on maps
Sometimes best map is no map
All maps lie

Even though most of these 18 rules below seem quite obvious, even the European Commissions seems to break them every now and then:

Can you spot what’s wrong with this graph?

Play Your Charts Right: Tips for Effective Data Visualization – by Geckoboard

In a world where data really matters, we all want to create effective charts. But data visualization is rarely taught in schools, or covered in on-the-job training. Most of us learn as we go along, and therefore we often make choices or mistakes that confuse and disorient our audience.
From overcomplicating or overdressing our charts, to conveying an entirely inaccurate message, there are common design pitfalls that can easily be avoided. We’ve put together these pointers to help you create simpler charts that effectively get across the meaning of your data.
Geckoboard

Based on work by experts such as Stephen Few, Dona Wong, Albert Cairo, Cole Nussbaumer Knaflic, and Andy Kirk, the authors at Geckoboard wrote down a list of recommendations which I summarize below:

Present the facts

Start your axis at zero whenever possible, to prevent misinterpretation. Particularly bar charts.
The width and height of line and scatter plots influence its messages.
Area and size are hard to interpret. Hence, there’s often a better alternative to the pie chart. Read also this.

Via Geckoboard
Via Geckoboard

Less is more

Use colors for communication, not decoration.
Diminish non-data ink, to draw attention to that which matters.
Do not use the third dimension, unless you are plotting it.
Avoid overselling numerical accuracy with precise decimal values.

Via Geckoboard
Via Geckoboard

Keep it simple

Annotate your plots; include titles, labels or scales.
Avoid squeezing too much information in a small space. For example, avoid a second x- or y-axis whenever possible.
Align your numbers right, literally.
Don’t go for fancy; go for clear. If you have few values, just display the values.

Via Geckoboard
Via Geckoboard

Infographic summary

Avoid bar plots for continuous data! Do this instead:

Tracey Weissgerber, Natasa Milic, Stacey Winham, and Vesna Garovic wrote this interesting 2015 paper on bar graphs. By a systematic review of physiology research, they demonstrate we need to reconsider how we present continuous data in small samples.

Bar and line plots are commonly used to display continuous data. This is problematic, as many different data distributions can lead to the same bar or line graph. Nevertheless, the rarely used scatterplots, box plots, and histograms much better allow users to critically evaluate continuous data.

They provide many interesting visuals that underline their argument.

For instance, the four datasets below (B, C, D, and E) will all result in the same barplot (A), whereas they demonstrate quite different characteristics.

Alternatively, bar plots are often used for to display group means when observations within groups may not be independent. For instance, it could be that the bars below represent two measurement occassians, and that each of our sampled observations occurs in both. In that case, the scatterplots with connected dots may be more suitable. While the bars in plot A would represent datasets B, C, and D, these are clearly different when viewed in scatterplots.

Also, a lot of meaningful information is typically lost in bar plots. For instance, the number of observations in a group. But also the distribution of values. While the former can be added (see B below), the latter can much better be shown in a scatter plot like C (below).

Actually, in a later blog post, lead researcher Tracey Weissgerber shares the below visual. It highlights the distractive irrelevance of bar plot and the information that is lost (becomes invisible) when opting for a bar chart.

Tracey refactored this into a similar visual of her own:

So what can you do instead, you may ask yourself. To this question too, Tracey has an answer, sharing the below overview of alternatives options:

She made another overview which may help you pick the best visual for your data. This one takes your intention behind the visual as a starting point, though is unfortunately a bit low quality: