18 Pitfalls of Data Visualization


Maarten Lambrechts is a data journalist whom I closely follow online, with great delight. Recently, he shared his slide deck on the 18 most common data visualization pitfalls on Twitter. You will probably already be familiar with most, but some (like #14) were new to me:

  1. Save pies for dessert
  2. Don’t cut bars
  3. Don’t cut time axes
  4. Label directly
  5. Use colors deliberately
  6. Avoid chart junk
  7. Scale circles by area
  8. Avoid double axes
  9. Correlation is no causality
  10. Don’t do 3D
  11. Sort on the data
  12. Tell the story
  13. 1 chart, 1 message
  14. Common scales on small mult’s
  15. #Endrainbow
  16. Normalise data on maps
  17. Sometimes best map is no map
  18. All maps lie
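Rule 7 trips people up more often than you'd think: doubling a circle's radius quadruples its area, so sizing circles by radius wildly exaggerates differences. A quick R sketch of the difference (my own illustration, not from the slide deck):

```r
# Scaling circles by area, not radius:
# a value twice as large should get a circle with twice the AREA.
values <- c(10, 20, 40)

# Wrong: radius proportional to the value (area grows quadratically)
radius_wrong <- values

# Right: radius proportional to the square root of the value,
# so that pi * r^2 is proportional to the value
radius_right <- sqrt(values / pi)

# Check: the resulting areas are proportional to the original values
areas <- pi * radius_right^2
all.equal(areas, values)  # TRUE
```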

Even though most of these 18 rules seem quite obvious, even the European Commission seems to break them every now and then:

Can you spot what’s wrong with this graph?

R Image Art, by Michael Freeman


Michael Freeman — information researcher at the University of Washington — was asked whether he could manipulate images using only R programming, and he decided to give it a try. In his blog, Michael demonstrates how he used the ggplot2 and imager packages, among others, to go from this original photo:

To this dot representation:

And this Voronoi diagram:

Awful AI: A curated list of scary usages of artificial intelligence


I found this amazingly horrifying list called Awful Artificial Intelligence:

Artificial intelligence [AI] in its current state is unfair, easily susceptible to attacks and notoriously difficult to control. Nevertheless, more and more concerning uses of AI technology are appearing in the wild.

[Awful A.I.] aims to track all of them. We hope that Awful AI can be a platform to spur discussion for the development of possible contestational technology (to fight back!).

David Dao on the Awful A.I. github repository

The Awful A.I. list contains a few dozen applications of machine learning where the results were less than optimal for those involved. These AI solutions resulted in discrimination, disinformation (fake news), or mass surveillance, or violated privacy and ethics in many other ways.

We’ve all heard of Cambridge Analytica, but there are many more on this Awful A.I. list:

Deep Fakes – Deep Fakes is an artificial intelligence-based human image synthesis technique. It is used to combine and superimpose existing images and videos onto source images or videos. Deepfakes may be used to create fake celebrity pornographic videos or revenge porn. [AI assisted fake porn][CNN Interactive Report]

David Dao on the Awful A.I. github repository

Social Credit System – Using a secret algorithm, Sesame credit constantly scores people from 350 to 950, and its ratings are based on factors including considerations of “interpersonal relationships” and consumer habits. [summary][Foreign Correspondent (video)][travel ban]

David Dao on the Awful A.I. github repository

SenseTime & Megvii – Based on Face Recognition technology powered by deep learning algorithm, SenseFace and Megvii provides integrated solutions of intelligent video analysis, which functions in target surveillance, trajectory analysis, population management. [summary][forbes][The Economist (video)]

David Dao on the Awful A.I. github repository

Check out the full list here.

David Dao is a PhD student at DS3Lab — part of the computer science department of ETH Zurich — and maintains the Awful A.I. list. The cover photo was created by LargeStupidity on Drawception.

Play Your Charts Right: Tips for Effective Data Visualization – by Geckoboard


In a world where data really matters, we all want to create effective charts. But data visualization is rarely taught in schools, or covered in on-the-job training. Most of us learn as we go along, and therefore we often make choices or mistakes that confuse and disorient our audience.
From overcomplicating or overdressing our charts, to conveying an entirely inaccurate message, there are common design pitfalls that can easily be avoided. We’ve put together these pointers to help you create simpler charts that effectively get across the meaning of your data.

Geckoboard

Based on work by experts such as Stephen Few, Dona Wong, Alberto Cairo, Cole Nussbaumer Knaflic, and Andy Kirk, the authors at Geckoboard compiled a list of recommendations, which I summarize below:

Present the facts

  • Start your axis at zero whenever possible, to prevent misinterpretation. This is especially important for bar charts.
  • The width and height of line and scatter plots influence their message.
  • Area and size are hard to interpret. Hence, there’s often a better alternative to the pie chart. Read also this.

Less is more

  • Use colors for communication, not decoration.
  • Diminish non-data ink, to draw attention to that which matters.
  • Do not use the third dimension, unless you are plotting it.
  • Avoid overselling numerical accuracy with precise decimal values.

Keep it simple

  • Annotate your plots; include titles, labels or scales.
  • Avoid squeezing too much information in a small space. For example, avoid a second x- or y-axis whenever possible.
  • Align your numbers right, literally.
  • Don’t go for fancy; go for clear. If you have few values, just display the values.
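Several of these tips can be combined in just a few lines of ggplot2. The sketch below, using a made-up data frame of my own, applies a zero baseline, sorting on the data, direct value labels instead of a legend, and minimal non-data ink:

```r
library(ggplot2)

# Hypothetical data: revenue per product
sales <- data.frame(
  product = c("A", "B", "C"),
  revenue = c(120, 90, 150)
)

# geom_col bars start at zero; reorder() sorts on the data;
# geom_text labels directly; theme_minimal() trims non-data ink
p <- ggplot(sales, aes(x = reorder(product, revenue), y = revenue)) +
  geom_col(fill = "grey70") +
  geom_text(aes(label = revenue), hjust = -0.2) +
  coord_flip() +
  labs(x = NULL, y = "Revenue") +
  theme_minimal()
p
```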

Infographic summary

rstudio::conf 2019 summary


Cool intro video!
Thanks to Amelia for pointing me to it

Welcome to rstudio::conf 2019

Similar to last year, I was not able to attend rstudio::conf 2019.

Fortunately, so much of the conference is shared on Twitter and media outlets that I still felt included. Here are some things that I liked and learned from, despite the Austin-Tilburg distance.

All presentations are streamed

One great thing about rstudio::conf is that all presentations are streamed and later posted on the RStudio website.

Of what I’ve already reviewed, I really liked Jenny Bryan’s presentation on lazy evaluation, Max Kuhn’s presentation on parsnip, and teaching data science with puzzles by Irene Steves. Also, the gt package is a serious power tool! And I was already a gganimate fanboy, as you know from here and here.

One of the insights shared in Jenny Bryan’s talk that can be a life-saver

I think I’m going to watch all talks over the coming weekends!

Slides & Extra Materials

There’s an official rstudio-conf repository on GitHub hosting many materials in an orderly fashion.

Karl Broman made his own awesome GitHub repository with links to the videos, the slides, and all kinds of extra resources.

Karl’s handy github repo of rstudio::conf

All takeaways in a handy #rstudioconf Shiny app

Garrick Aden-Buie made a fabulous Shiny app that allows you to review all #rstudioconf tweets during and since the conference. It even includes some random statistics about the tweets, and a page with all the shared media.

Some random takeaways

Via this tweet about this rstudio::conf presentation
Some words of wisdom by Emily Robinson (whom we know from here)
You should consider joining #tidytuesday!

Extra: Online RStudio Webinars

Did you know that RStudio also posts all the webinars they host? There really are some hidden pearls among them. For instance, this presentation by Nathan Stephens on rendering rmarkdown to powerpoint will save me tons of work, and those new to broom will also be astonished by this webinar by Alex Hayes.

Visualization innovations: Waffleplots and Swarmplots


Last year witnessed the creation of many novel types of data visualization. I already discussed some of the lesser-known ones, jokingly referred to as xenographics.

Two new visualization formats seem to stick around though. And as always, it was not long before someone created special R packages for them. Get ready to meet waffleplots and swarmplots!

Waffleplot

Waffleplots — also called square pie charts — are very useful in communicating parts of a whole for categorical quantities. Bob Rudis (twitter) — scholar and R developer among many other things — did us all a favor and created the R waffle package.

First, we need to install and load the waffle package.

install.packages("waffle") # install waffle package
library(waffle) # load in package

I will use the famous iris data to demonstrate both plots.

Since waffleplots work with frequencies, I will specifically use the iris$Species data stored as a frequency table.

(spec <- table(iris$Species))

    setosa versicolor  virginica 
        50         50         50 

Now, let’s produce our first waffle plot:

waffle(spec)

Here, we see every single flower in the iris dataset represented by a tile. This immediately conveys the group sizes in the dataset. Looks pretty, huh?

But we can play around with the display settings. For instance, let’s change the number of rows and the placement of the legend. Building on ggplot2, the waffle package works very intuitively:

waffle(spec, rows = 3, legend_pos = "bottom")

Or, in case we want to highlight a specific value, we could play around with the colors a bit.

waffle(spec, rows = 15, colors = c("lightgrey", "darkgrey", "red"))

The plot is a bit crowded, though, with each flower as a separate tile. We can reduce the number of tiles by simply dividing the values in our frequency table, like so:

# do not forget to annotate what each square represents!
w1 <- waffle(spec / 10, rows = 5, xlab = "1 square = 10 flowers")
w1

Finally, you might want to combine multiple waffles into a single visual. You can do this with the aptly named waffle::iron function. Like so:

iron(
  waffle(spec / 5, rows = 5, title = "iron() combines waffles"),
  w1
)

I am definitely going to use this package in my daily work. I just love the visual simplicity.

As a final remark, the waffle GitHub page states that the use_glyph argument can be used to replace the tiles with pictures via the extrafont package; however, I could not get the code to work.

The visual resulting from the use_glyph waffle example via github.

The ggplot2 waffle extension geom_waffle is being developed as we speak, but is not hosted on CRAN yet.

Swarmplot

A second innovation comes in the form of the so-called swarmplot, or beeswarm plot, which is already hosted on CRAN in the form of the ggbeeswarm package, developed by Erik Clarke (I think this guy) and Scott Sherrill-Mix.

Some examples hosted on the GitHub page also use the iris dataset, so you can have a look at those. However, I created my own visuals because I prefer theme_light. Hence, I first install the ggbeeswarm package along with ggplot2, and then set the default theme to theme_light.

install.packages("ggplot2")
install.packages("ggbeeswarm")

library(ggplot2)
library(ggbeeswarm)

theme_set(theme_light()) # sets a default ggplot theme

geom_beeswarm

There are two main functions in the ggbeeswarm package, geom_beeswarm and geom_quasirandom. Let’s start with the actual beeswarm.

Generating a beeswarm plot is as simple as any other ggplot:

ggplot(iris, aes(Species, Sepal.Length)) + geom_beeswarm()

As this is an “official” ggplot2 extension, most functionality works the same as in any other geom_*. Thus, adding colors or increasing point size is easy:

ggplot(iris, aes(Species, Sepal.Length, col = Species)) + geom_beeswarm(size = 2)

For larger sizes, you might want to adjust the spacing between the points using the cex argument.

ggplot(iris, aes(Species, Sepal.Length, col = Species)) + geom_beeswarm(size = 3, cex = 3)

Points in a beeswarm plot are automatically grouped and plotted side by side along the x variable, but you can turn that off with the groupOnX argument.

ggplot(iris, aes(Species, Sepal.Length, col = Species)) + geom_beeswarm(groupOnX = FALSE)

Finally, if you have another grouping variable besides the one on the axis (e.g., whether Sepal.Length exceeds 5, as below), you might want to use the dodge.width argument to separate the groups.

ggplot(iris, aes(Species, Sepal.Length, col = Sepal.Length > 5)) + geom_beeswarm(dodge.width = 0.5)

geom_quasirandom

The second function in the ggbeeswarm package is geom_quasirandom, an alternative to the original geom_jitter. Basically, it’s a convenient tool to offset points within categories to reduce overplotting.

ggplot(iris, aes(Species, Sepal.Length, col = Species)) + geom_quasirandom()

Instead of the quasirandom offset, the geom allows for many other methods, including a smiley face pattern : )

ggplot(iris, aes(Species, Sepal.Length, col = Species)) + geom_quasirandom(size = 2, method = "smiley")

There is also an earlier package on CRAN, called beeswarm, but it doesn’t seem to be maintained anymore. Moreover, its syntax more or less resembles R’s base::plot, whereas I personally have a strong preference for ggplot2.