Tag: stats

Bayesian Statistics using R, Python, and Stan

For a year now, this course on Bayesian statistics has been on my to-do list. So without further ado, I decided to share it with you already.

Richard McElreath is an evolutionary ecologist who is famous in the stats community for his work on Bayesian statistics.

At the Max Planck Institute for Evolutionary Anthropology, Richard teaches Bayesian statistics, and he was kind enough to put his whole course on Statistical Rethinking: Bayesian statistics using R & Stan open access online.

You can find the video lectures here on Youtube, and the slides are linked to here:

Richard also wrote a book that accompanies this course:

For more information abou the book, click here.

For the Python version of the code examples, click here.

Probability Distributions mapped and explained by their relationships

Sean Owen created this handy cheat sheet that shows the most common probability distributions mapped by their underlying relationships.

Probability distributions are fundamental to statistics, just like data structures are to computer science. They’re the place to start studying if you mean to talk like a data scientist.
Sean Owen (via)

Owen argues that the probability distributions relate to each other in intuitive and interesting ways that makes it easier for you to recall them. For instance, several follow naturally from the Bernoulli distribution. Having this map by hand should thus help you really understand what these distributions imply.

On top of that, it’s just a nice geeky network poster!

Sean’s map of the relationships between probability distributions (via)

Now, Sean didn’t just make a fancy map. In the original blog he also explains each of the distributions and how it relates to the others. Having this knowledge is vital to being a good data scientist / analyst.

You can sometimes get away with simple analysis using R or scikit-learn without quite understanding distributions, just like you can manage a Java program without understanding hash functions. But it would soon end in tears, bugs, bogus results, or worse: sighs and eye-rolling from stats majors.
Sean Owen (via)

For instance, here’s Sean explaining the Binomial distribution:

The binomial distribution may be thought of as the sum of outcomes of things that follow a Bernoulli distribution. Toss a fair coin 20 times; how many times does it come up heads? This count is an outcome that follows the binomial distribution. Its parameters are n, the number of trials, and p, the probability of a “success” (here: heads, or 1). Each flip is a Bernoulli-distributed outcome, or trial. Reach for the binomial distribution when counting the number of successes in things that act like a coin flip, where each flip is independent and has the same probability of success.
Sean Owen (via)

Header image via Alison-Static

Visualizing Sampling Distributions in ggplot2: Adding area under the curve

Thank you ggplot2tutor for solving one of my struggles. Apparently this is all it takes:

ggplot(NULL, aes(x = c(-3, 3))) +
  stat_function(fun = dnorm, geom = "line")

I can’t begin to count how often I have wanted to visualize a (normal) distribution in a plot. For instance to show how my sample differs from expectations, or to highlight the skewness of the scores on a particular variable. I wish I’d known earlier that I could just add one simple geom to my ggplot!

Want a different mean and standard deviation, just add a list to the args argument:

ggplot(NULL, aes(x = c(0, 20))) +
  stat_function(fun = dnorm,
                geom = "area",
                args = list(
                  mean = 10,
                  sd = 3
                ))

Need a different distribution? Just pass a different distribution function to stat_function. For instance, an F-distribution, with the df function:

ggplot(NULL, aes(x = c(0, 5))) +
  stat_function(fun = df,
                geom = "area",
                args = list(
                  df1 = 2,
                  df2 = 10
                ))

You can make it is complex as you want. The original ggplot2tutor blog provides this example:

ggplot(NULL, aes(x = c(-3, 5))) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    alpha = .3
  ) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    xlim = c(qnorm(.95), 4)
  ) +
  stat_function(
    fun = dnorm,
    geom = "line",
    linetype = 2,
    fill = "steelblue",
    alpha = .5,
    args = list(
      mean = 2
    )
  ) +
  labs(
    title = "Type I Error",
    x = "z-score",
    y = "Density"
  ) +
  scale_x_continuous(limits = c(-3, 5))

Have a look at the original blog here: https://ggplot2tutor.com/sampling_distribution/sampling_distribution/

ROC, AUC, precision, and recall visually explained

A receiver operating characteristic (ROC) curve displays how well a model can classify binary outcomes. An ROC curve is generated by plotting the false positive rate of a model against its true positive rate, for each possible cutoff value. Often, the area under the curve (AUC) is calculated and used as a metric showing how well a model can classify data points.

If you’re interest in learning more about ROC and AUC, I recommend this short Medium blog, which contains this neat graphic:

Dariya Sydykova, graduate student at the Wilke lab at the University of Texas at Austin, shared some great visual animations of how model accuracy and model cutoffs alter the ROC curve and the AUC metric. The quotes and animations are from the associated github repository.

ROC & AUC

The plot on the left shows the distributions of predictors for the two outcomes, and the plot on the right shows the ROC curve for these distributions. The vertical line that travels left-to-right is the cutoff value. The red dot that travels along the ROC curve corresponds to the false positive rate and the true positive rate for the cutoff value given in the plot on the left.
The traveling cutoff demonstrates the trade-off between trying to classify one outcome correctly and trying to classify the other outcome correcly. When we try to increase the true positive rate, we also increase the false positive rate. When we try to decrease the false positive rate, we decrease the true positive rate.

The shape of an ROC curve changes when a model changes the way it classifies the two outcomes.
The animation [below] starts with a model that cannot tell one outcome from the other, and the two distributions completely overlap (essentially a random classifier). As the two distributions separate, the ROC curve approaches the left-top corner, and the AUC value of the curve increases. When the model can perfectly separate the two outcomes, the ROC curve forms a right angle and the AUC becomes 1.

Precision-Recall

Two other metrics that are often used to quantify model performance are precision and recall.

Precision (also called positive predictive value) is defined as the number of true positives divided by the total number of positive predictions. Hence, precision quantifies what percentage of the positive predictions were correct: How correct your model’s positive predictions were.

Recall (also called sensitivity) is defined as the number of true positives divided by the total number of true postives and false negatives (i.e. all actual positives). Hence, recall quantifies what percentage of the actual positives you were able to identify: How sensitive your model was in identifying positives.

Dariya also made some visualizations of precision-recall curves:

Precision-recall curves also displays how well a model can classify binary outcomes. However, it does it differently from the way an ROC curve does. Precision-recall curve plots true positive rate (recall or sensitivity) against the positive predictive value (precision).

In the middle, here below, the ROC curve with AUC. On the right, the associated precision-recall curve.

Similarly to the ROC curve, when the two outcomes separate, precision-recall curves will approach the top-right corner. Typically, a model that produces a precision-recall curve that is closer to the top-right corner is better than a model that produces a precision-recall curve that is skewed towards the bottom of the plot.

Class imbalance

Class imbalance happens when the number of outputs in one class is different from the number of outputs in another class. For example, one of the distributions has 1000 observations and the other has 10. An ROC curve tends to be more robust to class imbalanace that a precision-recall curve.
In this animation [below], both distributions start with 1000 outcomes. The blue one is then reduced to 50. The precision-recall curve changes shape more drastically than the ROC curve, and the AUC value mostly stays the same. We also observe this behaviour when the other disribution is reduced to 50.

Here’s the same, but now with the red distribution shrinking to just 50 samples.

Dariya invites you to use these visualizations for educational purposes:

Please feel free to use the animations and scripts in this repository for teaching or learning. You can directly download the gif files for any of the animations, or you can recreate them using these scripts. Each script is named according to the animation it generates (i.e. animate_ROC.r generates ROC.gif, animate_SD.r generates SD.gif, etc.).

Want to learn more about the different evaluation metrics for machine learning? Here’s a nice how-to guide by Neptune.ai demonstrating different metrics applied in Python.

5 Quick Tips for Coding in the Classroom, by Kelly Bodwin

Kelly Bodwin is an Assistant Professor of Statistics at Cal Poly (San Luis Obispo) and teaches multiple courses in statistical programming. Based on her experiences, she compiled this great shortlist of five great tips to teach programming.

Kelly truly mentions some best practices, so have a look at the original article, which she summarized as follows:

1. Define your terms

Establish basic coding vocabulary early on.

What is the console, a script, the environment?
What is a function a variable, a dataframe?
What are strings, characters, and integers?

2. Be deliberate about teaching versus bypassing peripheral skills

Use tools like RStudio Cloud, R Markdown, and the usethis package to shelter students from setup.

Personally, this is what kept me from learning Python for a long time — the issues with starting up.

Kelly provides this personal checklist of peripherals skills including which ones she includes in her introductory courses:

Course Type	Install/Update R and RStudio	R Markdown fluency	Package management	Data management	File and folder organization	GitHub
Intro Stat for Non-Majors	⚠️	⚠️	❌	❌	❌	❌
Intro Stat for Majors	✅	✅	⚠️	⚠️	⚠️	⚠️
Advanced Statistics	✅	✅	✅	✅	⚠️	⚠️
Intro to Statistical Computation	✅	✅	✅	✅	✅	✅

✅ = required course skill
⚠️ = optional, proceed with caution
❌ = avoid entirely
via https://teachdatascience.com/teaching_programming_tips/

3. Read code like English

The best way to debug is to read your process out loud as a sentence.

Basically Kelly argues that you should learn students to be able to translate their requirements into (R) code.

When you continuously read out your code as step-by-step computer instructions, students will learn to translate their own desires to computer instructions.

4. Require good coding practices from Day One

Kelly refers to this great talk by Jenny Bryan on “good” code and how to recognize it.

Kelly’s personal best practice included:

Clear code formatting
Object names follow consistent conventions
Lack of unnecessary code repetition
Reproducibility
Unit tests before large calculations
Commenting and/or documentation

For more R style guides, see my R resources overview.

5. Leave room for creativity

Open-ended questions (like “here’s a dataset, do a cool analysis“) let students explore and shine.

Large parts of the above were copied from this original article by Kelly Boldwin. I highly recommend you have a look at the original, and at the website hosting it: teachdatascience.com

Cover picture by freecodecamp.org.

StatQuest: Statistical concepts, clearly explained

Josh Starmer is assistant professor at the genetics department of the University of North Carolina at Chapel Hill.

But more importantly:
Josh is the mastermind behind StatQuest!

StatQuest is a Youtube channel (and website) dedicated to explaining complex statistical concepts — like data distributions, probability, or novel machine learning algorithms — in simple terms.

Once you watch one of Josh’s “Stat-Quests”, you immediately recognize the effort he put into this project. Using great visuals, a just-about-right pace, and relateable examples, Josh makes statistics accessible to everyone. For instance, take this series on logistic regression:

And do you really know what happens under the hood when you run a principal component analysis? After this video you will:

Or are you more interested in learning the fundamental concepts behind machine learning, then Josh has some videos for you, for instance on bias and variance or gradient descent:

With nearly 200 videos and counting, StatQuest is truly an amazing resource for students ‘and teachers on topics related to statistics and data analytics. For some of the concepts, Josh even posted videos running you through the analysis steps and results interpretation in the R language.

StatQuest started out as an attempt to explain statistics to my co-workers – who are all genetics researchers at UNC-Chapel Hill. They did these amazing experiments, but they didn’t always know what to do with the data they generated. That was my job. But I wanted them to understand that what I do isn’t magic – it’s actually quite simple. It only seems hard because it’s all wrapped up in confusing terminology and typically communicated using equations. I found that if I stripped away the terminology and communicated the concepts using pictures, it became easy to understand.
Over time I made more and more StatQuests and now it’s my passion on YouTube.
Josh Starmer via https://statquest.org/about/