7 Reasons You Should Use Dot Graphs, by Maarten Lambrechts

In my data visualization courses, I often refer to the hierarchy of visual encodings proposed by Cleveland and McGill. In their 1984 paper, they presented the table below, which ranks how well different visual encodings of data allow readers to accurately assess differences between data values.

DOI: 10.2307/2288400

Since then, this table has been used and copied by many data visualization experts, and adapted to more visually appealing layouts, like this one by Alberto Cairo, referenced in a blog post by Maarten Lambrechts:

Via http://www.thefunctionalart.com/

Now, this brings me to the point of this blog post, in which I want to share an older post by Maarten Lambrechts. I came across Maarten’s post only yesterday, but it touches on many topics I’ve covered earlier on my own website and during my courses. It’s mainly about the relative effectiveness and efficiency of using dots/points in data visualizations.

Basically, dots are often the most accurate and to the point (pun intended). By the latter I mean that, in terms of ink used, dots/points are more efficient than bars. Or, as Maarten says:

Points go beyond where lines and bars stop. Sounds weird, especially for those who remember from their math classes that a line is an infinite collection of points. But in visualization, points can do so much more than lines. Here are seven reasons why you should use more dot graphs, with some examples.

http://www.maartenlambrechts.com/2015/05/03/to-the-point-7-reasons-you-should-use-dot-graphs.html

Maarten touches on the research of Cleveland and McGill, on a PLOS article that advocates avoiding bars for continuous data, and on how to redesign charts to use more efficient dot/point encodings. I really loved one redesign example Maarten shares. Unfortunately, it is in Dutch, but both graphs show pretty much the same data, and the simpler one better communicates the main message.

Do have a look at the rest of Maarten’s original blog post. I love how he ends it with some practical advice: a nice lookup table for those wondering how to efficiently use points/dots to represent their n-dimensional data (a small ggplot2 sketch follows the list):

  • For comparisons of a single dimension across many categories: 1-dimensional scatterplots.
  • For detecting skewed or bimodal distributions in 2 variables: connected 1-dimensional scatterplots (slopegraphs).
  • For showing relationships between 2 variables: 2-dimensional scatterplots.
  • For representing 4-dimensional data (3 numeric and 1 categorical, or 4 numeric): bubble charts. These can also represent 3 numeric dimensions, or 2 numeric and 1 categorical.
  • For representing 4-dimensional data + time: animated bubble charts (aka Rosling graphs).
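
As announced above, here is a small ggplot2 sketch of two entries from this lookup table: a 1-dimensional scatterplot and a bubble chart. It uses the built-in mtcars data and is my own illustration, not Maarten’s:

library(ggplot2)

# 1-dimensional scatterplot: one numeric variable (mpg) across categories (cyl)
ggplot(mtcars, aes(x = mpg, y = factor(cyl))) +
  geom_point(alpha = 0.6)

# bubble chart: 3 numeric dimensions (wt, mpg, hp) + 1 categorical (cyl)
ggplot(mtcars, aes(x = wt, y = mpg, size = hp, color = factor(cyl))) +
  geom_point(alpha = 0.6) +
  scale_size_continuous(range = c(1, 10))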

Making GIFs with Processing

Processing is a flexible software sketchbook and a language for learning how to code within the context of the visual arts. It’s open-source, there are many online materials, and the language itself is very accessible.

I recently stumbled upon 17-year-old Joseff Nic from Cardiff, who has been making GIFs in Processing only since 2018, but whose work is turning out fantastic already. You can hire him here, and have a look at his Twitter, Tumblr, or Dribbble channels for the originals:

Here are some more of Joseff’s creations:

[GIF: Wave Swirl loop]

Some of Joseff’s work seems inspired by David Whyte, a graphic designer from Ireland. His portfolio is quite impressive as well, very visually pleasing, and you can hire him here.

[GIFs: Bees & Bombs Roulette, StudioPhi spiral, Knot]

If you’re interested in learning Processing, Daniel Shiffman demonstrates how to create the most amazing things in Processing via his YouTube channel The Coding Train, which I’ve covered before.

ROC, AUC, precision, and recall visually explained

A receiver operating characteristic (ROC) curve displays how well a model can classify binary outcomes. An ROC curve is generated by plotting the false positive rate of a model against its true positive rate, for each possible cutoff value. Often, the area under the curve (AUC) is calculated and used as a metric showing how well a model can classify data points.
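
To make this concrete, here is a minimal, dependency-free R sketch (my own illustration, not from the sources below) of how an ROC curve and its AUC can be computed by sweeping the cutoff over all observed score values:

set.seed(42)
scores <- c(rnorm(100, mean = 1), rnorm(100, mean = 0)) # predicted scores
labels <- c(rep(1, 100), rep(0, 100))                   # true binary outcomes

# sweep the cutoff from high to low; everything >= cutoff is predicted positive
cutoffs <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(cutoffs, function(co) mean(scores[labels == 1] >= co))
fpr <- sapply(cutoffs, function(co) mean(scores[labels == 0] >= co))

# AUC via the trapezoidal rule over the (fpr, tpr) points
x <- c(0, fpr, 1); y <- c(0, tpr, 1)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
plot(x, y, type = "l", xlab = "False positive rate", ylab = "True positive rate")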

If you’re interested in learning more about ROC and AUC, I recommend this short Medium blog, which contains this neat graphic:

Dariya Sydykova, a graduate student in the Wilke lab at the University of Texas at Austin, shared some great animations of how model accuracy and model cutoffs alter the ROC curve and the AUC metric. The quotes and animations below are from the associated GitHub repository.

ROC & AUC

The plot on the left shows the distributions of predictors for the two outcomes, and the plot on the right shows the ROC curve for these distributions. The vertical line that travels left-to-right is the cutoff value. The red dot that travels along the ROC curve corresponds to the false positive rate and the true positive rate for the cutoff value given in the plot on the left.

The traveling cutoff demonstrates the trade-off between trying to classify one outcome correctly and trying to classify the other outcome correctly. When we try to increase the true positive rate, we also increase the false positive rate. When we try to decrease the false positive rate, we decrease the true positive rate.

[Animation: cutoff.gif]

The shape of an ROC curve changes when a model changes the way it classifies the two outcomes.

The animation [below] starts with a model that cannot tell one outcome from the other, and the two distributions completely overlap (essentially a random classifier). As the two distributions separate, the ROC curve approaches the left-top corner, and the AUC value of the curve increases. When the model can perfectly separate the two outcomes, the ROC curve forms a right angle and the AUC becomes 1.

Precision-Recall

Two other metrics that are often used to quantify model performance are precision and recall.

Precision (also called positive predictive value) is defined as the number of true positives divided by the total number of positive predictions. Hence, precision quantifies what percentage of the positive predictions were correct: how correct your model’s positive predictions were.

Recall (also called sensitivity) is defined as the number of true positives divided by the total number of true positives and false negatives (i.e., all actual positives). Hence, recall quantifies what percentage of the actual positives you were able to identify: how sensitive your model was in identifying positives.
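
As a quick illustration of these two definitions, here is a tiny R helper (my own sketch, not Dariya’s code) that computes both metrics from predicted and actual binary labels:

precision_recall <- function(pred, actual) {
  tp <- sum(pred == 1 & actual == 1) # true positives
  fp <- sum(pred == 1 & actual == 0) # false positives
  fn <- sum(pred == 0 & actual == 1) # false negatives
  c(precision = tp / (tp + fp), # share of positive predictions that are correct
    recall    = tp / (tp + fn)) # share of actual positives that are found
}

precision_recall(pred = c(1, 1, 0, 1, 0), actual = c(1, 0, 0, 1, 1))
# precision = 0.67, recall = 0.67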

Dariya also made some visualizations of precision-recall curves:

Precision-recall curves also display how well a model can classify binary outcomes, but they do so differently than ROC curves do: a precision-recall curve plots the true positive rate (recall or sensitivity) against the positive predictive value (precision).

In the animation below, the middle panel shows the ROC curve with its AUC, and the right panel shows the associated precision-recall curve.

Similarly to the ROC curve, when the two outcomes separate, precision-recall curves will approach the top-right corner. Typically, a model that produces a precision-recall curve that is closer to the top-right corner is better than a model that produces a precision-recall curve that is skewed towards the bottom of the plot.
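
Sticking with the toy scores and labels from the ROC sketch above (again my own illustration), the precision-recall curve can be traced by sweeping the same cutoffs:

recall <- sapply(cutoffs, function(co) mean(scores[labels == 1] >= co))
precision <- sapply(cutoffs, function(co) {
  pred_pos <- scores >= co # everything at or above the cutoff is predicted positive
  sum(pred_pos & labels == 1) / sum(pred_pos)
})
plot(recall, precision, type = "l", xlim = c(0, 1), ylim = c(0, 1))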

Class imbalance

Class imbalance happens when the number of observations in one class differs from the number of observations in the other class. For example, one of the distributions has 1000 observations and the other has 10. An ROC curve tends to be more robust to class imbalance than a precision-recall curve.

In this animation [below], both distributions start with 1000 outcomes. The blue one is then reduced to 50. The precision-recall curve changes shape more drastically than the ROC curve, and the AUC value mostly stays the same. We also observe this behaviour when the other distribution is reduced to 50.

Here’s the same, but now with the red distribution shrinking to just 50 samples.
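
Here is a quick numeric illustration (my own, with made-up rates) of why precision reacts to class imbalance while the ROC ingredients do not: fix the true positive rate at 0.8 and the false positive rate at 0.1, and shrink the positive class from 1000 to 50 observations.

n_neg <- 1000 # the negative class stays fixed
for (n_pos in c(1000, 50)) {
  tp <- 0.8 * n_pos # true positives at TPR = 0.8
  fp <- 0.1 * n_neg # false positives at FPR = 0.1
  cat(sprintf("n_pos = %4d: precision = %.2f\n", n_pos, tp / (tp + fp)))
}
# precision drops from 0.89 to 0.29, even though TPR and FPR are unchanged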

Dariya invites you to use these visualizations for educational purposes:

Please feel free to use the animations and scripts in this repository for teaching or learning. You can directly download the gif files for any of the animations, or you can recreate them using these scripts. Each script is named according to the animation it generates (i.e. animate_ROC.r generates ROC.gif, animate_SD.r generates SD.gif, etc.).

Want to learn more about the different evaluation metrics for machine learning? Here’s a nice how-to guide by Neptune.ai demonstrating different metrics applied in Python.

Daily Art by Saskia Freeke

Saskia Freeke (Twitter) is a Dutch artist, creative coder, interaction designer, visual designer, and educator working from Amsterdam. She has been creating an awesome digital art piece every day since January 1st, 2015. Her ever-growing collection includes some animated, visual masterpieces.

My personal favorites are Saskia’s moving works, her GIFs:

Saskia uses Processing to create her art. Processing is a Java-based language, also often used by Daniel Shiffman, whom we know from the Coding Train.

Animated Snow in R, 2.0: gganimate API update

Last year, inspired by a tweet from Ilya Kashnitsky, I wrote a snow animation, which you can read all about here.

Now, this year, the old code no longer worked due to an update to the gganimate API. I set out to simply refactor the code, but decided to give the whole thing a minor update along the way. Below, you find the 2.0 version of my R snow animation.

# PACKAGES ####
pkg <- c("here", "tidyverse", "gganimate", "animation")
sapply(pkg, function(x) {
  # install any missing package, then load it
  if (!x %in% rownames(installed.packages())) install.packages(x)
  library(x, character.only = TRUE)
})

# CUSTOM FUNCTIONS ####
map_to_range <- function(x, from, to) {
  # Shifting the vector so that min(x) == 0
  x <- x - min(x)
  # Scaling to the range of [0, 1]
  x <- x / max(x)
  # Scaling to the needed amplitude
  x <- x * (to - from)
  # Shifting to the needed level
  x + from
}
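# example (added for illustration):
# map_to_range(1:5, from = 10, to = 20)
# #> [1] 10.0 12.5 15.0 17.5 20.0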

# CONSTANTS ####
N <- 500     # number of flakes
TIMES <- 100 # number of loops
XPOS_DELTA <- 0.01
YSPEED_MIN <- 0.005
YSPEED_MAX <- 0.03
FLAKE_SIZE_COINFLIP <- 5
FLAKE_SIZE_COINFLIP_PROB <- 0.1
FLAKE_SIZE_MIN <- 4
FLAKE_SIZE_MAX <- 20

# INITIALIZE DATA ####
set.seed(1)

size <- runif(N) + rbinom(N, FLAKE_SIZE_COINFLIP, FLAKE_SIZE_COINFLIP_PROB) # random flake size
yspeed <- map_to_range(size, YSPEED_MIN, YSPEED_MAX)

# create storage vectors
xpos <- rep(NA, N * TIMES)
ypos <- rep(NA, N * TIMES)

# loop through simulations
for (i in seq(TIMES)) {
  if (i == 1) {
    # initiate values
    xpos[1:N] <- runif(N, min = -0.1, max = 1.1)
    ypos[1:N] <- runif(N, min = 1.1, max = 2)
  } else {
    # specify datapoints to update
    first_obs <- (N * i - N + 1)
    last_obs <- (N * i)
    # update x position: random shift
    xpos[first_obs:last_obs] <- xpos[(first_obs - N):(last_obs - N)] - runif(N, min = -XPOS_DELTA, max = XPOS_DELTA)
    # update y position: lower by yspeed
    ypos[first_obs:last_obs] <- ypos[(first_obs - N):(last_obs - N)] - yspeed
    # reset if passed bottom of screen
    xpos <- ifelse(ypos < -0.1, runif(N), xpos) # restart at random x
    ypos <- ifelse(ypos < -0.1, 1.1, ypos)      # restart just above top
  }
}


# VISUALIZE DATA ####
cbind.data.frame(ID = rep(1:N, TIMES),
                 x = xpos,
                 y = ypos,
                 s = size,
                 t = rep(1:TIMES, each = N)) %>%
  # create animation
  ggplot() +
  geom_point(aes(x, y, size = s, alpha = s), color = "white", pch = 42) +
  scale_size_continuous(range = c(FLAKE_SIZE_MIN, FLAKE_SIZE_MAX)) +
  scale_alpha_continuous(range = c(0.2, 0.8)) +
  coord_cartesian(c(0, 1), c(0, 1)) +
  theme_void() +
  theme(legend.position = "none",
        panel.background = element_rect("black")) +
  transition_time(t) +
  ease_aes('linear') ->
  snow_plot

snow_anim <- animate(snow_plot, nframes = TIMES, width = 600, height = 600)

If you want to see some spin-offs of last year’s code:

Animated vs. Static Data Visualizations

GIFs and animations are quickly gaining popularity in the data visualization world (see, for instance, here).

However, in my personal experience, they are not as widely used in business settings. You might even say animations are frowned upon by, for instance, LinkedIn, which removed the option to post GIFs on its platform!

Nevertheless, animations can be pretty useful sometimes. For instance, they can display what happens during a process, like an analytical model converging, which can be useful for didactic purposes. Alternatively, they can be great for showing or highlighting trends over time.

I am curious what you think the pros and cons of animations are. Below, I posted two visualizations of the same data. The data consist of simulated workforce trends, including new hires and employee attrition over the course of twelve months.

versus

Would you prefer the static or the animated version? Please do share your thoughts in the comments below, or on the respective LinkedIn and Twitter posts!


Want to reproduce these plots? Or play with the data? Here’s the R code:

# LOAD IN PACKAGES ####
# install.packages('devtools')
# devtools::install_github('thomasp85/gganimate')
library(tidyverse)
library(gganimate)
library(here)


# SET CONSTANTS ####
# data
HEADCOUNT <- 270
HIRE_RATE <- 0.12
HIRE_ADDED_SEASONALITY <- rep(floor(seq(14, 0, length.out = 6)), 2)
LEAVER_RATE <- 0.16
LEAVER_ADDED_SEASONALITY <- c(rep(0, 3), 10, rep(0, 6), 7, 12)

# plot
TEXT_SIZE <- 12
LINE_SIZE1 <- 2
LINE_SIZE2 <- 1.1
COLORS <- c("darkgreen", "red", "blue")

# saving
PLOT_WIDTH <- 8
PLOT_HEIGHT <- 6
FRAMES_PER_POINT <- 5


# HELPER FUNCTIONS ####
capitalize_string <- function(text_string) {
  paste0(toupper(substring(text_string, 1, 1)), substring(text_string, 2, nchar(text_string)))
}
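# example (added for illustration):
# capitalize_string("workforce")
# #> [1] "Workforce"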


# SIMULATE WORKFORCE DATA ####
set.seed(1)

# generate random leavers and some seasonality
leavers <- rbinom(length(month.abb), HEADCOUNT, LEAVER_RATE / length(month.abb)) + LEAVER_ADDED_SEASONALITY

# generate random hires and some seasonality
joiners <- rbinom(length(month.abb), HEADCOUNT, HIRE_RATE / length(month.abb)) + HIRE_ADDED_SEASONALITY

# combine in dataframe
data.frame(
  month = factor(month.abb, levels = month.abb, ordered = TRUE),
  workforce = HEADCOUNT - cumsum(leavers) + cumsum(joiners),
  left = leavers,
  hires = joiners
) ->
  wf

# transform to long format
wf_long <- gather(wf, key = "variable", value = "value", -month)
# capitalize the variable names
wf_long$variable <- capitalize_string(wf_long$variable)


# VISUALIZE & ANIMATE ####
# draw workforce plot
ggplot(wf_long, aes(x = month, y = value, group = variable)) +
  geom_line(aes(col = variable, size = variable == "Workforce")) + # values were capitalized above
  scale_color_manual(values = COLORS) +
  scale_size_manual(values = c(LINE_SIZE2, LINE_SIZE1), guide = FALSE) +
  guides(color = guide_legend(override.aes = list(size = c(rep(LINE_SIZE2, 2), LINE_SIZE1)))) +
  # theme_PVDL() +
  labs(x = NULL, y = NULL, color = "KPI", caption = "paulvanderlaken.com") +
  ggtitle("Workforce size over the course of a year") +
  NULL ->
  workforce_plot

# ggsave(here("workforce_plot.png"), workforce_plot, dpi = 300, width = PLOT_WIDTH, height = PLOT_HEIGHT)

# animate the plot
workforce_plot +
  geom_segment(aes(xend = 12, yend = value), linetype = 2, colour = 'grey') +
  geom_label(aes(x = 12.5, label = paste(variable, value), col = variable),
             hjust = 0, size = 5) +
  transition_reveal(variable, along = as.numeric(month)) +
  enter_grow() +
  coord_cartesian(clip = 'off') +
  theme(
    plot.margin = margin(5.5, 100, 11, 5.5),
    legend.position = "none"
  ) ->
  animated_workforce

anim_save(here("workforce_animation.gif"),
          animate(animated_workforce, nframes = nrow(wf) * FRAMES_PER_POINT,
                  width = PLOT_WIDTH, height = PLOT_HEIGHT, units = "in", res = 300))