Daily Art by Saskia Freeke

Saskia Freeke (Twitter) is a Dutch artist, creative coder, interaction designer, visual designer, and educator working from Amsterdam. She has created an awesome digital art piece every day since January 1st, 2015. Her ever-growing collection includes some animated visual masterpieces.

My personal favorites are Saskia’s moving works, her GIFs:

Saskia uses Processing to create her art. Processing is a Java-based language, also often used by Daniel Shiffman, whom we know from The Coding Train.

18 Pitfalls of Data Visualization

Maarten Lambrechts is a data journalist I closely follow online, with great delight. Recently, he shared his slide deck on the 18 most common data visualization pitfalls on Twitter. You will probably already be familiar with most of them, but some (like #14) were new to me:

  1. Save pies for dessert
  2. Don’t cut bars
  3. Don’t cut time axes
  4. Label directly
  5. Use colors deliberately
  6. Avoid chart junk
  7. Scale circles by area
  8. Avoid double axes
  9. Correlation is no causality
  10. Don’t do 3D
  11. Sort on the data
  12. Tell the story
  13. 1 chart, 1 message
  14. Common scales on small mult’s
  15. #Endrainbow
  16. Normalise data on maps
  17. Sometimes best map is no map
  18. All maps lie
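
Pitfall 7 is easy to check with a bit of arithmetic. Here is a minimal sketch in R, using made-up values that do not come from Maarten's slides: if the radius is made proportional to the value, the drawn area grows with the square of the value, whereas a radius based on the square root keeps the area proportional to the data.

```r
# Hypothetical values to encode as circles (not from the slide deck)
values <- c(10, 40, 90)

# Pitfall: radius proportional to the value, so area grows with value squared
radius_wrong <- values
area_wrong <- pi * radius_wrong^2

# Better: radius proportional to the square root, so area equals the value
radius_right <- sqrt(values / pi)
area_right <- pi * radius_right^2

# Areas relative to the smallest circle
area_wrong / area_wrong[1]  # 1 16 81: the largest value looks 81 times as big
area_right / area_right[1]  # 1 4 9: matches the actual ratios in the data
```

In ggplot2, the default scale_size() already maps values to area, while scale_radius() maps them to the radius.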

Even though most of the 18 rules above seem quite obvious, even the European Commission seems to break them every now and then:

Can you spot what’s wrong with this graph?

Play Your Charts Right: Tips for Effective Data Visualization – by Geckoboard

In a world where data really matters, we all want to create effective charts. But data visualization is rarely taught in schools, or covered in on-the-job training. Most of us learn as we go along, and therefore we often make choices or mistakes that confuse and disorient our audience.
From overcomplicating or overdressing our charts, to conveying an entirely inaccurate message, there are common design pitfalls that can easily be avoided. We’ve put together these pointers to help you create simpler charts that effectively get across the meaning of your data.

Geckoboard

Based on work by experts such as Stephen Few, Dona Wong, Alberto Cairo, Cole Nussbaumer Knaflic, and Andy Kirk, the authors at Geckoboard wrote down a list of recommendations, which I summarize below:

Present the facts

  • Start your axis at zero whenever possible to prevent misinterpretation, particularly in bar charts.
  • The width and height of line and scatter plots influence their message.
  • Area and size are hard to interpret. Hence, there’s often a better alternative to the pie chart.
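
To see why the zero baseline matters for bar charts, here is a small sketch in R with invented numbers (the two values and the helper drawn_ratio() are purely illustrative, not part of Geckoboard's material). It compares how much taller the larger bar is drawn than the smaller one, depending on where the axis starts.

```r
# Hypothetical values for two bars
sales <- c(A = 95, B = 100)

# Ratio between the drawn heights of the tallest and the shortest bar,
# given the value at which the y-axis starts (illustrative helper)
drawn_ratio <- function(values, axis_start) {
  heights <- values - axis_start
  max(heights) / min(heights)
}

drawn_ratio(sales, axis_start = 0)   # about 1.05: B looks 5% taller, as it is
drawn_ratio(sales, axis_start = 90)  # 2: B looks twice as tall as A
```

Cutting the axis at 90 turns a 5% difference into a bar that is drawn twice as tall.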

Less is more

  • Use colors for communication, not decoration.
  • Diminish non-data ink, to draw attention to that which matters.
  • Do not use the third dimension, unless you are plotting it.
  • Avoid overselling numerical accuracy with precise decimal values.
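
As a quick illustration of the last point, here is a sketch in R with hypothetical numbers: a rate computed from a few dozen people does not deserve seven decimals, so it is rounded before it is displayed.

```r
# Hypothetical attrition numbers from a small team
leavers <- 7
headcount <- 43
rate <- leavers / headcount

# Printing the raw value suggests more certainty than 43 people can give
print(rate)

# Two significant digits, or a whole percentage, are plenty here
rate_rounded <- signif(rate, 2)
rate_label <- sprintf("%.0f%%", 100 * rate)

print(rate_rounded)  # 0.16
print(rate_label)    # "16%"
```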

Keep it simple

  • Annotate your plots; include titles, labels or scales.
  • Avoid squeezing too much information in a small space. For example, avoid a second x- or y-axis whenever possible.
  • Align your numbers right, literally.
  • Don’t go for fancy; go for clear. If you have few values, just display the values.
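
Right-aligning numbers is something base R can do for you. The sketch below uses made-up KPI values; format() pads all numbers to a common width, so the digits line up by magnitude when printed in a monospaced font.

```r
# Hypothetical KPI values of very different magnitudes
kpi <- c(Revenue = 1250000, Hires = 32, Leavers = 7)

# format() right-justifies numbers to a common width by default
numbers <- format(kpi, big.mark = ",")

# Pad the labels to a common width as well (left-aligned text)
labels <- format(names(kpi))

# Print one line per KPI
cat(paste(labels, numbers), sep = "\n")
```

With the numbers right-aligned, the reader can compare orders of magnitude at a glance, without counting digits.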

Infographic summary

#100DaysOfCode: Machine Learning & Data Visualization

2018 seemed to be the year of challenges going viral on the web. Most of them were plain stupid and/or dangerous. However, one viral challenge I did like: #100DaysOfCode

1. Code minimum an hour every day for the next 100 days.

2. Tweet your progress every day with the #100DaysOfCode hashtag.

3. Each day, reach out to at least two people on Twitter who are also doing the challenge

100 Days of Code rulebook

Many (aspiring) programming professionals took part in this challenge, sharing their learning journeys in domains ranging from web development to machine learning and data visualization.

With this blog, I wanted to share two of those learning journeys that stood out for me.

Machine learning

First, there’s Avik Jain’s 100 days of Machine Learning code repository on GitHub. Avik’s repository contains all learning activities he followed during the 53 days of programming he completed. Some of Avik’s entries really stood out, and I particularly liked his educational infographics:

Just look at the wonderful design and visual aids on this decision tree for dummies infographic, pseudocode and all:

Day 23: Decision trees for dummies. This just looks fabulous right?!

Apart from the infographics, Avik also links to many very well produced tutorials that helped him improve his machine learning skills, such as the free Python for Data Science Handbook Avik worked through, or this YouTube tutorial on deep learning in Python with TensorFlow and Keras:

Although Avik didn’t seem to have completed the full 100 days, many others did.

Data visualization

I have blogged about Hannah Yan Han’s 100 days of code project before, but she definitely deserves another mention here. Her 100 days revolved around data science, data visualization, and storytelling using both R and Python. You can find her #100DaysOfCode Medium page here, and her associated GitHub repository here.

For example, one day Hannah explored where instant noodles come from, how they are served, and whether people like them or not.

On a different day, she would examine which sports are the toughest:

Or how scientific researchers migrate across the globe:

Hannah used many different plot types in those 100 days, including some lesser-known ones, like these UpSet plots on TED talk data:

Heck, she even made her own R package to generate Mondriaan-like paintings on one of the days:

What I found so great about Hannah’s project is that she picked a novel dataset every couple of days. Moreover, she used an extremely large variety of visualization formats. All visuals were equally beautiful, but Hannah made sure to pick the right one for the purpose she was trying to serve. If you are interested in data visualization, you seriously should check out Hannah’s 100DaysOfCode Medium page.

Animated vs. Static Data Visualizations

GIFs or animations are rising quickly in the data visualization world (see for instance here).

However, in my personal experience, they are not as widely used in business settings. You might even say animations are frowned upon by, for instance, LinkedIn, which removed the option to even post GIFs on their platform!

Nevertheless, animations can be pretty useful sometimes. For instance, they can display what happens during a process, like an analytical model converging, which can be useful for didactic purposes. Alternatively, they can be great for showing or highlighting trends over time.

I am curious what you think are the pros and cons of animations. Below, I posted two visualizations of the same data. The data consist of simulated workforce trends, including new hires and employee attrition over the course of twelve months.

versus

Would you prefer the static, or the animated version? Please do share your thoughts in the comments below, or on the respective LinkedIn and Twitter posts!


Want to reproduce these plots? Or play with the data? Here’s the R code:

# LOAD IN PACKAGES ####
# install.packages('devtools')
# devtools::install_github('thomasp85/gganimate')
library(tidyverse)
library(gganimate)
library(here)


# SET CONSTANTS ####
# data
HEADCOUNT = 270
HIRE_RATE = 0.12
HIRE_ADDED_SEASONALITY = rep(floor(seq(14, 0, length.out = 6)), 2)
LEAVER_RATE = 0.16
LEAVER_ADDED_SEASONALITY = c(rep(0, 3), 10, rep(0, 6), 7, 12)

# plot
TEXT_SIZE = 12
LINE_SIZE1 = 2
LINE_SIZE2 = 1.1
COLORS = c("darkgreen", "red", "blue")

# saving
PLOT_WIDTH = 8
PLOT_HEIGHT = 6
FRAMES_PER_POINT = 5


# HELPER FUNCTIONS ####
# capitalize the first character of each string
capitalize_string = function(text_string){
  paste0(toupper(substring(text_string, 1, 1)), substring(text_string, 2, nchar(text_string)))
}


# SIMULATE WORKFORCE DATA ####
set.seed(1)

# generate random leavers and some seasonality
leavers <- rbinom(length(month.abb), HEADCOUNT, LEAVER_RATE / length(month.abb)) + LEAVER_ADDED_SEASONALITY

# generate random hires and some seasonality
joiners <- rbinom(length(month.abb), HEADCOUNT, HIRE_RATE / length(month.abb)) + HIRE_ADDED_SEASONALITY

# combine in dataframe
data.frame(
  month = factor(month.abb, levels = month.abb, ordered = TRUE)
  , workforce = HEADCOUNT - cumsum(leavers) + cumsum(joiners)
  , left = leavers
  , hires = joiners
) ->
  wf

# transform to long format
wf_long <- gather(wf, key = "variable", value = "value", -month)
# capitalize the name of variables
wf_long$variable <- capitalize_string(wf_long$variable)


# VISUALIZE & ANIMATE ####
# draw workforce plot
ggplot(wf_long, aes(x = month, y = value, group = variable)) +
  # variable names were capitalized above, so match "Workforce" to thicken that line
  geom_line(aes(col = variable, size = variable == "Workforce")) +
  scale_color_manual(values = COLORS) +
  scale_size_manual(values = c(LINE_SIZE2, LINE_SIZE1), guide = "none") +
  guides(color = guide_legend(override.aes = list(size = c(rep(LINE_SIZE2, 2), LINE_SIZE1)))) +
  # theme_PVDL() +
  labs(x = NULL, y = NULL, color = "KPI", caption = "paulvanderlaken.com") +
  ggtitle("Workforce size over the course of a year") +
  NULL ->
  workforce_plot

# ggsave(here("workforce_plot.png"), workforce_plot, dpi = 300, width = PLOT_WIDTH, height = PLOT_HEIGHT)

# animate the plot
workforce_plot +
  geom_segment(aes(xend = 12, yend = value), linetype = 2, colour = 'grey') +
  geom_label(aes(x = 12.5, label = paste(variable, value), col = variable),
             hjust = 0, size = 5) +
  # gganimate >= 1.0 takes only `along`; lines are separated by the group aesthetic
  transition_reveal(as.numeric(month)) +
  enter_grow() +
  coord_cartesian(clip = 'off') +
  theme(
    plot.margin = margin(5.5, 100, 11, 5.5)
    , legend.position = "none"
  ) ->
  animated_workforce

anim_save(here("workforce_animation.gif"),
          animate(animated_workforce, nframes = nrow(wf) * FRAMES_PER_POINT,
                  width = PLOT_WIDTH, height = PLOT_HEIGHT, units = "in", res = 300))

Data Visualization Tools & Resources

There’s this amazing overview of helpful dataviz resources at www.visualisingdata.com/resources!

Browse through hundreds of helpful data visualization tools, programs, and services. All neatly organized by Andy Kirk in categories: data handling, applications, programming, web-based, qualitative, mapping, specialist, and colour. What a great repository!

A snapshot of www.visualisingdata.com/resources

Looking for expert books on data visualization?
Have a look at these recommendations!

Chatterplots

I’ve mentioned before that I dislike wordclouds (for instance here, or here) and apparently others share that sentiment. In his recent Medium blog, Daniel McNichol goes as far as to refer to the wordcloud as the pie chart of text data! Among other things, Daniel calls wordclouds disorienting, one-dimensional, arbitrary, and opaque, and he mentions their lack of order, information, and scale.

Wordcloud of the negative characteristics of wordclouds, via Medium

Instead of using wordclouds, Daniel suggests we turn to alternative approaches. For instance, in their book Text Mining with R, Julia Silge and David Robinson suggest using bar charts or network graphs, providing the necessary R code. Another alternative is provided in Daniel’s blog: the chatterplot!

While Daniel didn’t invent this unorthodox wordcloud-like plot, he might have been the first to name it a chatterplot. Daniel’s chatterplot uses a full x/y Cartesian plane, turning the arbitrary, merely exploratory wordcloud into a more quantitatively sound, information-rich visualization.

The geom_text() function of the R package ggplot2 (or, for better legibility, geom_text_repel() from ggrepel) is perfectly suited for making a chatterplot. Interesting variables for the axes, apart from the regular word frequencies, can easily be computed using the R tidytext package.

Here’s an example generated by Daniel, plotting words simultaneously by their frequency of occurrence in comments on Hacker News articles (y-axis) and by the popularity of the comments in which the word was used (log of the ranking, on the x-axis).

[CHATTERPLOTs are] like a wordcloud, except there’s actual quantitative logic to the order, placement & aesthetic aspects of the elements, along with an explicit scale reference for each. This allows us to represent more, multidimensional information in the plot, & provides the viewer with a coherent visual logic & direction by which to explore the data.

Daniel McNichol via Medium

I highly recommend the use of these chatterplots over their less-informative wordcloud counterpart, and strongly suggest you read Daniel’s original blog, in which you can also find the R code for the above visualizations.
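
If you just want a feel for the geom_text() approach mentioned above, here is a minimal sketch on a toy corpus. To be clear, this is not Daniel's code: the comments, the scores, and all object names are made up for illustration, and the word statistics are computed with base R instead of tidytext to keep the example self-contained. The plot is only drawn if ggplot2 is installed.

```r
# Toy corpus (hypothetical): each comment has a text and a popularity score
comments <- data.frame(
  text = c("charts need clear labels",
           "clear charts beat fancy charts",
           "labels beat legends",
           "fancy legends confuse readers"),
  score = c(120, 45, 80, 10),
  stringsAsFactors = FALSE
)

# One row per word occurrence, carrying along the score of its comment
words <- strsplit(tolower(comments$text), "[^a-z]+")
tokens <- data.frame(
  word = unlist(words),
  score = rep(comments$score, lengths(words)),
  stringsAsFactors = FALSE
)

# Per word: mean score of the comments using it (x-axis) and frequency (y-axis)
chatter <- aggregate(score ~ word, data = tokens, FUN = mean)
chatter$n <- as.vector(table(tokens$word)[as.character(chatter$word)])

# Draw the chatterplot: words positioned by score and frequency
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  chatter_plot <- ggplot(chatter, aes(x = score, y = n, label = word)) +
    geom_text(aes(size = n, colour = score), show.legend = FALSE) +
    scale_x_log10() +
    labs(x = "Mean comment score (log scale)", y = "Word frequency")
  print(chatter_plot)
}
```

On a real corpus the words will overlap, which is where swapping geom_text() for ggrepel's geom_text_repel() comes in handy.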