Tag: distribution

Probability Distributions mapped and explained by their relationships

Sean Owen created this handy cheat sheet that shows the most common probability distributions mapped by their underlying relationships.

Probability distributions are fundamental to statistics, just like data structures are to computer science. They’re the place to start studying if you mean to talk like a data scientist.
Sean Owen (via)

Owen argues that the probability distributions relate to each other in intuitive and interesting ways that makes it easier for you to recall them. For instance, several follow naturally from the Bernoulli distribution. Having this map by hand should thus help you really understand what these distributions imply.

On top of that, it’s just a nice geeky network poster!

Sean’s map of the relationships between probability distributions (via)

Now, Sean didn’t just make a fancy map. In the original blog he also explains each of the distributions and how it relates to the others. Having this knowledge is vital to being a good data scientist / analyst.

You can sometimes get away with simple analysis using R or scikit-learn without quite understanding distributions, just like you can manage a Java program without understanding hash functions. But it would soon end in tears, bugs, bogus results, or worse: sighs and eye-rolling from stats majors.
Sean Owen (via)

For instance, here’s Sean explaining the Binomial distribution:

The binomial distribution may be thought of as the sum of outcomes of things that follow a Bernoulli distribution. Toss a fair coin 20 times; how many times does it come up heads? This count is an outcome that follows the binomial distribution. Its parameters are n, the number of trials, and p, the probability of a “success” (here: heads, or 1). Each flip is a Bernoulli-distributed outcome, or trial. Reach for the binomial distribution when counting the number of successes in things that act like a coin flip, where each flip is independent and has the same probability of success.
Sean Owen (via)

Header image via Alison-Static

Visualizing Sampling Distributions in ggplot2: Adding area under the curve

Thank you ggplot2tutor for solving one of my struggles. Apparently this is all it takes:

ggplot(NULL, aes(x = c(-3, 3))) +
  stat_function(fun = dnorm, geom = "line")

I can’t begin to count how often I have wanted to visualize a (normal) distribution in a plot. For instance to show how my sample differs from expectations, or to highlight the skewness of the scores on a particular variable. I wish I’d known earlier that I could just add one simple geom to my ggplot!

Want a different mean and standard deviation, just add a list to the args argument:

ggplot(NULL, aes(x = c(0, 20))) +
  stat_function(fun = dnorm,
                geom = "area",
                args = list(
                  mean = 10,
                  sd = 3
                ))

Need a different distribution? Just pass a different distribution function to stat_function. For instance, an F-distribution, with the df function:

ggplot(NULL, aes(x = c(0, 5))) +
  stat_function(fun = df,
                geom = "area",
                args = list(
                  df1 = 2,
                  df2 = 10
                ))

You can make it is complex as you want. The original ggplot2tutor blog provides this example:

ggplot(NULL, aes(x = c(-3, 5))) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    alpha = .3
  ) +
  stat_function(
    fun = dnorm,
    geom = "area",
    fill = "steelblue",
    xlim = c(qnorm(.95), 4)
  ) +
  stat_function(
    fun = dnorm,
    geom = "line",
    linetype = 2,
    fill = "steelblue",
    alpha = .5,
    args = list(
      mean = 2
    )
  ) +
  labs(
    title = "Type I Error",
    x = "z-score",
    y = "Density"
  ) +
  scale_x_continuous(limits = c(-3, 5))

Have a look at the original blog here: https://ggplot2tutor.com/sampling_distribution/sampling_distribution/

How to find two identical Skittles packs?

In a hilarious experiment the anonymous mathematician behind the website Possibly Wrong estimated that s/he only needed to open “about 400-500” packs of Skittles to find an identifical pack.

From January 12th up to April 6th, s/he put it to the test and counted the contents of an astonishing 468 packs, containing over 27.000 individual Skittles! Read all about the experiment here.