Category: statistics

Logistic regression is not fucked, by Jake Westfall

Recently, I came across a social science paper that used a linear probability model. I had never heard of linear probability models (LPM), but they turn out to be simply ordinary least squares regression applied to a binary dependent variable.

According to some, the LPM is a commonly used alternative to logistic regression, which is what I was taught to use when the outcome is binary.

Perhaps because of my own social science background (HRM), using linear regression on binary data without a link function just seems unintuitive and error-prone to me. Hence, I went looking for more information.
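
To make the contrast concrete, here is a minimal sketch in R, using simulated data (not the paper's data), of how the two models would be fit to the same binary outcome:

# Minimal sketch on simulated data: a binary outcome y and one predictor x
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))

# Linear probability model: plain OLS on the 0/1 outcome
lpm <- lm(y ~ x)

# Logistic regression: a GLM with a logit link
logit <- glm(y ~ x, family = binomial(link = "logit"))

# The LPM coefficient is a change in probability per unit of x;
# the logistic coefficient is a change in log-odds per unit of x
coef(lpm)
coef(logit)

# Note: LPM fitted probabilities are not guaranteed to stay within [0, 1]
range(fitted(lpm))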

I particularly liked this article by Jake Westfall, which he dubbed “Logistic regression is not fucked”, a follow-up to a series of blog posts in which he discusses methods that are fucked and not useful.

Jake explains the classification problem and both methods’ inner workings in a very straightforward way, using great visual aids. He shows how the LPM differs from the logistic model, and why its proposed benefits are actually not so beneficial. Maybe I’m in my bubble, but Jake’s arguments resonated with me.

Read his article yourself:
http://jakewestfall.org/blog/index.php/2018/03/12/logistic-regression-is-not-fucked/

Here’s the summary:
Arguments against the use of logistic regression due to problems with “unobserved heterogeneity” proceed from two distinct sets of premises. The first argument points out that if the binary outcome arises from a latent continuous outcome and a threshold, then observed effects also reflect latent heteroskedasticity. This is true, but only relevant in cases where we actually care about an underlying continuous variable, which is not usually the case. The second argument points out that logistic regression coefficients are not collapsible over uncorrelated covariates, and claims that this precludes any substantive interpretation. On the contrary, we can interpret logistic regression coefficients perfectly well in the face of non-collapsibility by thinking clearly about the conditional probabilities they refer to. 
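
The non-collapsibility point is easy to verify yourself. Below is a small R sketch of my own (not from Jake's post): a covariate z is generated independently of the predictor x, yet dropping it from the model changes the logistic coefficient for x, even though both coefficients describe perfectly sensible (but different) conditional probabilities.

# Toy simulation of non-collapsibility
set.seed(2)
n <- 100000
x <- rbinom(n, 1, 0.5)                          # "treatment", randomized
z <- rbinom(n, 1, 0.5)                          # covariate, independent of x
y <- rbinom(n, 1, plogis(-1 + 1 * x + 2 * z))   # true conditional log-odds ratio for x is 1

# Conditional model (adjusting for z) recovers a coefficient near 1
coef(glm(y ~ x + z, family = binomial))["x"]

# Marginal model (ignoring z) yields a smaller coefficient,
# even though x and z are independent: non-collapsibility
coef(glm(y ~ x, family = binomial))["x"]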

GIF visualizations of Type 1 and Type 2 error in relation to sample size

On twitter, I came across the tweet below showing some great GIF visualizations on the dangers of taking small samples.

Created by Andrew Stewart and tweeted by John Holbein, the visuals show samples drawn from a normally distributed variable with a mean of 10 and a standard deviation of 2. In the left panel, Andrew took several samples of 20 observations. In the right panel, the sample size was increased to 500.

Just look at how much the distribution and the estimated mean change for small samples!

Andrew shared his code via GitHub, so I was able to download and tweak it a bit to make my own version.

Andrew’s version seems to be concerned with potential Type 1 errors when small samples are taken. A Type 1 error occurs when you reject your null hypothesis (you reject “there is no effect”) while you should not have, because there actually is no effect.

You can see this in the distributions Andrew sampled from in the tweet above. The data for conditions A (red) and B (blue) are sampled from the same distribution, with mean 10 and standard deviation 2. While there should thus be no difference between the groups, small samples may cause researchers to erroneously conclude that there is a difference between conditions A and B due to the observed data.
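
You can check with a quick simulation of your own (this sketch is mine, not Andrew's code) that, with a t-test at alpha = 0.05, this false-positive risk hovers around 5% regardless of the sample size:

# Both conditions share the same distribution (mean 10, sd 2),
# so every "significant" t-test here is a Type 1 error
set.seed(1)
false_positives <- replicate(10000, {
  a <- rnorm(20, mean = 10, sd = 2)
  b <- rnorm(20, mean = 10, sd = 2)
  t.test(a, b)$p.value < 0.05
})
mean(false_positives)  # should hover around 0.05, the chosen alpha level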

We could use Andrew’s basic code and tweak it a bit to simulate a setting in which Type 2 errors could occur. A Type 2 error occurs when you fail to reject your null hypothesis (you maintain “there is no effect”) even though there actually is an effect, which you thus missed.

To illustrate this, I adapted Andrew’s code: I sampled data for condition B using a normal distribution with a slightly higher mean value of 11, as opposed to the mean of 10 for condition A. The standard deviation remained the same in both conditions (2).

Next, I drew 10 data samples from both conditions, for various sample sizes: 10, 20, 50, 100, 250, 500, and even 1000 observations. After drawing these samples for both conditions, I ran a simple t-test to compare their means and estimated whether any observed difference could be considered significant (at the alpha = 0.05 level, i.e., 95% confidence).

In the end, I visualized the results in a similar fashion as Andrew did. Below are the results.

As you can see, in only 1 of our 10 samples of size 10 were we able to conclude that there was a difference in means. In other words, we drew the wrong conclusion 90% of the time.

After increasing the sample size to 100, we strongly decrease our risk of Type 2 errors: we are now down to 20% incorrect conclusions.

At this point though, I decided to rework Andrew’s code even more, to clarify the message.

I was not so much interested in the estimated distribution, which currently only distracts. Similarly, the points and axes could be toned down a bit. Moreover, I’d like to be able to see when the condition samples have significantly different means, so let’s add a 95% confidence interval and some text. Finally, let’s increase the number of samples drawn per sample size to, say, 100, to reduce the influence of chance on our Type 2 error rate estimates.

Let’s rerun the code and generate some GIFs!

The GIF below demonstrates that small samples of only 10 observations per condition have only about an 11% probability of detecting the difference in means when the true difference is 1 (half the standard deviation of 2). In other words, there is an 89% chance of a Type 2 error occurring, where we fail to reject the null hypothesis due to sampling error.

Doubling the sample size to 20 more than doubles our detection rate: we now correctly identify the difference 28% of the time.

With 50 observations the Type 2 error rate drops to 34%.

Finally, with sample sizes of 100 and up, our results become reasonably reliable. We are now able to correctly identify the true difference over 95% of the time.

With a true difference of half the standard deviation, further increases in the sample size start to lose their added value. For instance, a sample size of 250 already uncovers the effect in all 100 samples, so doubling to 500 would not make sense.

I hope you liked the visuals. If you are interested in this kind of analysis, or want to estimate how large a sample you need in your own study, have a look at power analysis. Such analyses can help you determine the best setup for your own research initiatives.
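
For the exact setting simulated above (a true difference of 1 with a standard deviation of 2), base R's power.t.test() gives the analytical answer right away; a quick sketch:

# Analytical power for the simulated setting: delta = 1, sd = 2 (Cohen's d = 0.5)
power.t.test(n = 10, delta = 1, sd = 2, sig.level = 0.05)        # power with 10 observations per condition
power.t.test(delta = 1, sd = 2, sig.level = 0.05, power = 0.80)  # n per condition needed for 80% power (roughly 64)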


If you’d like to reproduce or change the graphics above, here is the R code. Note that it is strongly inspired by Andrew’s original code.

# setup -------------------------------------------------------------------

# The new version of gganimate by Thomas Lin Pedersen - @thomasp85 may not yet be on CRAN so use devtools
# devtools::install_github('thomasp85/gganimate')

library(ggplot2)
library(dplyr)
library(glue)
library(magrittr)
library(gganimate)




# main function to create and save the animation --------------------------

save_created_animation = function(sample_size, 
                                  samples = 100, 
                                  colors = c("red", "blue"), 
                                  Amean = 10, Asd = 2, 
                                  Bmean = 11, Bsd = 2, 
                                  seed = 1){
  
  ### generate the data
  
  # set the seed
  set.seed(seed)

  # set the names of our variables
  cnames <- c("Score", "Condition", "Sample") 

  # create an empty data frame to store our simulated samples
  df <- data.frame(matrix(rep(NA_character_, samples * sample_size * 2 * length(cnames)), ncol = length(cnames), dimnames = list(NULL, cnames)), stringsAsFactors = FALSE)
  
  # create an empty vector to store whether t.test identifies significant difference in means
  result <- rep(NA_real_, samples)
  
  # run a for loop to iteratively simulate the samples
  for (i in seq_len(samples)) {
    # draw random samples for both conditions
    a <- rnorm(sample_size, mean = Amean, sd = Asd) 
    b <- rnorm(sample_size, mean = Bmean, sd = Bsd) 
    # test whether the difference in the means of the two samples is significant 
    result[i] = t.test(a, b)$p.value < 0.05
    # add the identifiers for both conditions, and for the sample iteration
    a <- cbind(a, rep(glue("A\n(μ={Amean}; σ={Asd})"), sample_size), rep(i, sample_size))
    b <- cbind(b, rep(glue("B\n(μ={Bmean}; σ={Bsd})"), sample_size), rep(i, sample_size))
    # bind the two sampled conditions together in a single matrix and set its names
    ab <- rbind(a, b)
    colnames(ab) <- cnames
    # push the matrix into its reserved spot in the reserved dataframe
    df[((i - 1) * sample_size * 2 + 1):((i * (sample_size * 2))), ] <- ab
  }
  
  
  
  ### prepare the data
  
  # create a custom function to calculate the standard error
  se <- function(x) sd(x) / sqrt(length(x))
  
  df %>%
    # switch data types for condition and score
    mutate(Condition = factor(Condition)) %>%
    mutate(Score = as.numeric(Score)) %>%
    # calculate the mean and standard error to be used in the error bar
    group_by(Condition, Sample) %>%
    mutate(Score_Mean = mean(Score)) %>% 
    mutate(Score_SE = se(Score)) ->
    df
  
  # create a new dataframe storing the result per sample 
  df_result <- data.frame(Sample = unique(df$Sample), Result = result, stringsAsFactors = FALSE)
  
  # and add this result to the dataframe
  df <- left_join(df, df_result, by = "Sample")
  
  # identify whether some, but not all, samples identified the difference in means
  # if so, store the string "only ", later to be added into the subtitle
  result_mention_adj <- ifelse(sum(result) != 0 & sum(result) < length(result), "only ", "")


  
  ### create a custom theme
  
  textsize <- 16
  
  my_theme <- theme(
    text = element_text(size = textsize),
    axis.title.x = element_text(size = textsize),
    axis.title.y = element_text(size = textsize),
    axis.text.y = element_text(hjust = 0.5, vjust = 0.75),
    axis.text = element_text(size = textsize),
    legend.title = element_text(size = textsize),
    legend.text =  element_text(size = textsize),
    legend.position = "right",
    plot.title = element_text(lineheight = .8, face = "bold", size = textsize),
    panel.border = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
    axis.line = element_line(color = "grey", size = 0.5, linetype = "solid"),
    axis.ticks = element_line(color = "grey")
  )
  
  # store the chosen colors in a named vector for use as palette, 
  # and add the colors for (in)significant results
  COLORS = c(colors, "black", "darkgrey")
  names(COLORS) = c(levels(df$Condition), "1", "0")
  
  
  ### create the animated plot
  
  df %>%
    ggplot(aes(y = Score, x = Condition, fill = Condition, color = Condition)) +
    geom_point(aes(y = Score), position = position_jitter(width = 0.25), alpha = 0.20, stroke = NA, size = 1) +
    geom_errorbar(aes(ymin = Score_Mean - 1.96 * Score_SE, ymax = Score_Mean + 1.96 * Score_SE), width = 0.10, size = 1.5) +
    geom_text(data = . %>% filter(as.numeric(Condition) == 1), 
              aes(x = levels(df$Condition)[1], y = Result * 10 + 5, 
                  label = ifelse(Result == 1, "Significant!", "Insignificant!"),
                  col = as.character(Result)), position = position_nudge(x = -0.5), size = 5) +
    transition_states(Sample, transition_length = 1, state_length = 2) +
    guides(fill = FALSE) +
    guides(color = FALSE) +
    scale_x_discrete(limits = rev(levels(df$Condition)), breaks = rev(levels(df$Condition))) +
    scale_y_continuous(limits = c(0, 20), breaks = seq(0, 20, 5)) +
    scale_color_manual(values = COLORS) +
    scale_fill_manual(values = COLORS) +
    coord_flip() +
    theme_minimal() +
    my_theme +
    labs(x = "Condition") +
    labs(y = "Dependent variable") +
    labs(title = glue("When drawing {samples} samples of {sample_size} observations per condition")) +
    labs(subtitle = glue("The difference in means is identified in {result_mention_adj}{sum(result)} of {length(result)} samples")) +
    labs(caption = "paulvanderlaken.com | adapted from github.com/ajstewartlang") ->
    ani
  
  ### save the animated plot
  
  anim_save(paste0(paste("sampling_error", sample_size, sep = "_"), ".gif"), 
            animate(ani, nframes = samples * 10, duration = samples, width = 600, height = 400))
  
}




# call animation function for different sample sizes ----------------------

# !!! !!! !!!
# the number of samples is set to 100 by default
# if left at 100, each function call will take a long time!
# add argument `samples = 10` to get quicker results, like so:
# save_created_animation(10, samples = 10)
# !!! !!! !!!

save_created_animation(10)
save_created_animation(20)
save_created_animation(50)
save_created_animation(100)
save_created_animation(250)
save_created_animation(500)

Propensity Score Matching Explained Visually

Propensity score matching (wiki) is a statistical matching technique that attempts to estimate the effect of a treatment (e.g., an intervention) by accounting for the factors that predict whether an individual would be eligible to receive the treatment. The Wikipedia page provides a good example setting:

Say we are interested in the effects of smoking on health. Here, smoking would be considered the treatment, and the ‘treated’ are simply those who smoke. In order to find a cause-effect relationship, we would need to run an experiment and randomly assign people to smoking and non-smoking conditions. Of course, such an experiment would be infeasible and/or unethical, as we can’t ask or force people to smoke when we suspect it may do harm.
We will need to work with observational data instead. Here, we estimate the treatment effect by simply comparing health outcomes (e.g., rates of cancer) between those who smoked and those who did not. However, this estimate would be biased by any factors that predict smoking (e.g., socio-economic status). Propensity score matching attempts to control for these differences (i.e., biases) by making the comparison groups (i.e., smoking and non-smoking) more comparable.

Lucy D’Agostino McGowan is a post-doc at Johns Hopkins Bloomberg School of Public Health and co-founder of R-Ladies Nashville. She wrote a very nice blog post explaining what propensity score matching is and showing how to apply it to your dataset in R. Lucy demonstrates how you can use propensity scores to weight your observations in such a way that accounts for the factors that correlate with receiving a treatment. Moreover, her explanations are strengthened by nice visuals that intuitively demonstrate what the weighting does to the “pseudo-populations” used to estimate the treatment effect.
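
For the gist of what Lucy visualizes, here is a bare-bones sketch of propensity score weighting on simulated data (my own toy example with made-up variable names, not her code):

# Bare-bones sketch of inverse probability (propensity score) weighting
set.seed(3)
n <- 5000
ses     <- rnorm(n)                               # confounder, e.g. socio-economic status
smoker  <- rbinom(n, 1, plogis(-0.5 + ses))       # treatment depends on the confounder
outcome <- 1 + 2 * smoker + 1.5 * ses + rnorm(n)  # outcome depends on both; true effect is 2

# Step 1: model the probability of treatment given the confounder(s)
ps <- glm(smoker ~ ses, family = binomial)$fitted.values

# Step 2: inverse probability weights for the average treatment effect
w <- ifelse(smoker == 1, 1 / ps, 1 / (1 - ps))

# Naive (confounded) versus weighted (adjusted) estimate of the treatment effect
mean(outcome[smoker == 1]) - mean(outcome[smoker == 0])
weighted.mean(outcome[smoker == 1], w[smoker == 1]) -
  weighted.mean(outcome[smoker == 0], w[smoker == 0])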

Have a look yourself: https://livefreeordichotomize.com/2019/01/17/understanding-propensity-score-weighting/

How to find two identical Skittles packs?

In a hilarious experiment, the anonymous mathematician behind the website Possibly Wrong estimated that s/he would only need to open “about 400-500” packs of Skittles to find two identical packs.

From January 12th to April 6th, s/he put it to the test and counted the contents of an astonishing 468 packs, containing over 27,000 individual Skittles! Read all about the experiment here.

Overview of the contents of the Skittles packs, the duplicates encircled.
Via https://possiblywrong.wordpress.com/2019/04/06/follow-up-i-found-two-identical-packs-of-skittles-among-468-packs-with-a-total-of-27740-skittles/
Contents of the two duplicate Skittles packs.
Via https://possiblywrong.wordpress.com/2019/04/06/follow-up-i-found-two-identical-packs-of-skittles-among-468-packs-with-a-total-of-27740-skittles/
Animating causal inference methods

Some time back the animations below went sort of viral in the statistical programming community. In them, economics professor Nick Huntington-Klein demonstrates step-by-step how statistical tests estimate effect sizes.

You will find several other animations in Nick’s original blog post and the associated Twitter thread.

Moreover, if you are interested in the R code used to generate these animations, have a look at the causalgraphs GitHub repository.

Controlling for a variable

Matching on a variable

Differences in differences
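
The intuition behind the first animation, controlling for a variable, boils down to residualizing both the treatment and the outcome on the control variable before relating them. A minimal sketch of that idea on simulated data (my own example, not Nick's code):

# "Controlling for w" = removing the part of x and y explained by w,
# then relating what is left over (the Frisch-Waugh-Lovell idea)
set.seed(4)
n <- 1000
w <- rnorm(n)                   # confounder
x <- 0.8 * w + rnorm(n)         # treatment, partly driven by w
y <- 1 * x + 2 * w + rnorm(n)   # outcome; the true effect of x is 1

x_res <- resid(lm(x ~ w))       # x with the w-part removed
y_res <- resid(lm(y ~ w))       # y with the w-part removed

coef(lm(y_res ~ x_res))["x_res"]  # close to 1, and identical to...
coef(lm(y ~ x + w))["x"]          # ...the multiple regression coefficient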


StatQuest: Statistical concepts, clearly explained

Josh Starmer is an assistant professor in the genetics department of the University of North Carolina at Chapel Hill.

But more importantly:
Josh is the mastermind behind StatQuest!

StatQuest is a YouTube channel (and website) dedicated to explaining complex statistical concepts (like data distributions, probability, or novel machine learning algorithms) in simple terms.

Once you watch one of Josh’s “Stat-Quests”, you immediately recognize the effort he put into this project. Using great visuals, a just-about-right pace, and relatable examples, Josh makes statistics accessible to everyone. For instance, take this series on logistic regression:

And do you really know what happens under the hood when you run a principal component analysis? After this video you will:
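
If you want to check that intuition in code afterwards, a PCA essentially boils down to an eigendecomposition of the covariance matrix of the centered data. A minimal sketch (mine, not Josh's) that you can verify against R's built-in prcomp():

# PCA "by hand" versus prcomp(), using the numeric columns of iris
X   <- scale(iris[, 1:4], center = TRUE, scale = FALSE)  # center the data
eig <- eigen(cov(X))                                     # eigendecomposition of the covariance matrix

eig$vectors        # principal component directions (loadings)
sqrt(eig$values)   # standard deviations along each component

prcomp(iris[, 1:4])  # same loadings and standard deviations (signs may flip)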

Or are you more interested in learning the fundamental concepts behind machine learning? Then Josh has some videos for you, for instance on bias and variance or gradient descent:
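
As a taster for the gradient descent video: the core of the algorithm is just a short loop that repeatedly steps downhill along the derivative. A minimal sketch (my own toy example, not Josh's):

# Minimal gradient descent: minimize f(x) = (x - 3)^2
f_grad <- function(x) 2 * (x - 3)  # derivative of f
x  <- 0                            # starting value
lr <- 0.1                          # learning rate (step size)
for (i in 1:100) {
  x <- x - lr * f_grad(x)          # take a small step against the gradient
}
x  # converges towards 3, the minimum of f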

With nearly 200 videos and counting, StatQuest is truly an amazing resource for students and teachers on topics related to statistics and data analytics. For some of the concepts, Josh even posted videos walking you through the analysis steps and the interpretation of results in the R language.


StatQuest started out as an attempt to explain statistics to my co-workers – who are all genetics researchers at UNC-Chapel Hill. They did these amazing experiments, but they didn’t always know what to do with the data they generated. That was my job. But I wanted them to understand that what I do isn’t magic – it’s actually quite simple. It only seems hard because it’s all wrapped up in confusing terminology and typically communicated using equations. I found that if I stripped away the terminology and communicated the concepts using pictures, it became easy to understand.

Over time I made more and more StatQuests and now it’s my passion on YouTube.

Josh Starmer via https://statquest.org/about/