
ggstatsplot: Creating graphics including statistical details

This pearl had been resting in my inbox for quite a while before I was able to add it to my R resources list. According to its GitHub page, ggstatsplot is an extension of the ggplot2 package for creating graphics with details from statistical tests included in the plots themselves. It is targeted primarily at the behavioral sciences community and aims to produce information-rich plots with a single line of code. The package is maintained by Indrajeet Patil and is still under development. Nevertheless, its functionality is already quite impressive. You can download the latest stable version via:

utils::install.packages(pkgs = "ggstatsplot")

Or download the development version via:

devtools::install_github(
  repo = "IndrajeetPatil/ggstatsplot", # package path on GitHub
  dependencies = TRUE,                 # installs packages which ggstatsplot depends on
  upgrade_dependencies = TRUE          # updates any out of date dependencies
)

The package currently supports many different statistical plots, including:

?ggbetweenstats
?ggscatterstats
?gghistostats
?ggpiestats
?ggcorrmat
?ggcoefstats
?combine_plots
?grouped_ggbetweenstats
?grouped_ggscatterstats
?grouped_gghistostats
?grouped_ggpiestats
?grouped_ggcorrmat

Let’s take a closer look at the first one:

ggbetweenstats

This function creates either a violin plot, a box plot, or a mix of the two for between-group or between-condition comparisons, and detailed results from statistical tests can be added in the subtitle. The simplest function call looks like the one below, but much more complex information can be added and specified.

set.seed(123) # to get reproducible results

# the functions work approximately the same as ggplot2
ggstatsplot::ggbetweenstats(
  data = datasets::iris, 
  x = Species, 
  y = Sepal.Length,
  messages = FALSE
) +   
# and can be adjusted using the same, original function calls
  ggplot2::coord_cartesian(ylim = c(3, 8)) + 
  ggplot2::scale_y_continuous(breaks = seq(3, 8, by = 1))

All pictures copied from the GitHub page of ggstatsplot [original]
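To get a feel for the kind of between-group test that ends up in such a subtitle, here is a minimal base R sketch. It assumes that a Welch one-way ANOVA is a reasonable stand-in; which test ggstatsplot actually reports depends on the package version and the arguments used, so treat this as an illustration and not as the package's internals.

```r
# Welch's one-way ANOVA: does mean sepal length differ between species?
# var.equal = FALSE avoids assuming equal variances across the groups.
welch <- stats::oneway.test(
  formula = Sepal.Length ~ Species,
  data = datasets::iris,
  var.equal = FALSE
)

welch$statistic  # F statistic
welch$parameter  # numerator and denominator degrees of freedom
welch$p.value    # p-value of the test
```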

ggscatterstats

Not all plots are ggplot2-compatible though; ggscatterstats, for instance, is not. Nevertheless, it produces a very powerful plot in my opinion.

ggstatsplot::ggscatterstats(
  data = datasets::iris, 
  x = Sepal.Length, 
  y = Petal.Length,
  title = "Dataset: Iris flower data set",
  messages = FALSE
)
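The association behind a scatterplot like this can also be checked by hand. The sketch below assumes a plain Pearson correlation test is what you want to see; it is a base R illustration, not a reproduction of whatever ggscatterstats prints.

```r
# Pearson correlation between sepal length and petal length
ct <- stats::cor.test(
  x = datasets::iris$Sepal.Length,
  y = datasets::iris$Petal.Length,
  method = "pearson"
)

ct$estimate  # Pearson's r
ct$conf.int  # 95% confidence interval for r
ct$p.value   # p-value for the null hypothesis r = 0
```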

ggcorrmat

ggcorrmat is also quite impressive, producing correlograms with only minimal amounts of code as it wraps around ggcorrplot. The defaults already produce publication-ready correlation matrices:

ggstatsplot::ggcorrmat(
  data = datasets::iris,
  corr.method = "spearman",
  sig.level = 0.005,
  cor.vars = Sepal.Length:Petal.Width,
  cor.vars.names = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width"),
  title = "Correlalogram for length measures for Iris species",
  subtitle = "Iris dataset by Anderson",
  caption = expression(
    paste(
      italic("Note"),
      ": X denotes correlation non-significant at ",
      italic("p "),
      "< 0.005; adjusted alpha"
    )
  )
)
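The numbers underlying such a correlogram are just a correlation matrix. A minimal base R sketch, assuming the same four iris variables and Spearman's method as in the call above:

```r
# Spearman rank correlations between the four iris measurements
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
rho <- stats::cor(datasets::iris[, vars], method = "spearman")

round(rho, 2)  # the matrix that a correlogram visualizes
```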

ggcoefstats

Finally, ggcoefstats is a wrapper around GGally::ggcoef, creating a plot with the regression coefficients’ point estimates as dots with confidence interval whiskers. Here’s an example with some detailed specifications:

ggstatsplot::ggcoefstats(
  x = stats::lm(formula = mpg ~ am * cyl,
                data = datasets::mtcars),
  point.color = "red",
  vline.color = "#CC79A7",
  vline.linetype = "dotdash",
  stats.label.size = 3.5,
  stats.label.color = c("#0072B2", "#D55E00", "darkgreen"),
  title = "Car performance predicted by transmission and cylinder count",
  subtitle = "Source: 1974 Motor Trend US magazine"
) +                                    
  ggplot2::scale_y_discrete(labels = c("transmission", "cylinders", "interaction")) +
  ggplot2::labs(x = "regression coefficient",
                y = NULL)
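The dots and whiskers in a coefficient plot correspond to point estimates and confidence intervals that base R can return directly. A minimal sketch for the same model, assuming 95% intervals:

```r
# Same model as in the plot above
fit <- stats::lm(formula = mpg ~ am * cyl, data = datasets::mtcars)

# Point estimates next to their 95% confidence intervals
estimates <- cbind(
  estimate = stats::coef(fit),
  stats::confint(fit, level = 0.95)
)

estimates[-1, ]  # drop the intercept, leaving am, cyl, and am:cyl
```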

I for one am very curious to see how Indrajeet will further develop this package, and whether academics will start using it as a default in publishing.

Transitioning from Excel to R: Dictionary of common functions

Alyssa Columbus maintains this GitHub repository with a great tutorial on how to move from Excel to R. Very useful for beginning useRs, the repository’s tutorial includes a translation table between Excel and R functions:

Excel Formula	R Function	Type
ABS	`abs`	Arithmetic
ADDRESS	`assign`	Essentials
AND	`&`,`&&`,`all`	Boolean
AVERAGE, AVG, AVERAGEIF	`mean`	Arithmetic
BETADIST	`pbeta`	Statistics
BETAINV	`qbeta`	Statistics
BINOMDIST	`pbinom` when cumulative,`dbinom` when not	Statistics
CEILING	`ceiling`	Arithmetic
CELL	`str` has the same idea	Essentials
CHIDIST, CHISQDIST	`pchisq`	Statistics
CHIINV, CHISQINV	`qchisq`	Statistics
CHITEST	`chisq.test`	Statistics
CHOOSE	`switch`	Essentials
CLEAN	`gsub`	Text
COLS, COLUMNS	`ncol`	Essentials
COLUMN	`col`,`:`,`seq`	Essentials
COMBIN	`choose`	Essentials
CONCATENATE	`paste`	Text
CONFIDENCE	`-qnorm(alpha/2)*std/sqrt(n)`	Statistics
CORREL	`cor`	Statistics
COUNT, COUNTIF	`length`	Arithmetic
COVAR	`cov`	Statistics
CRITBINOM	`qbinom`	Statistics
DELTA	`identical`	Boolean
EXACT	`==`	Boolean
EXP	`exp`	Arithmetic
EXPONDIST	`pexp` when cumulative,`dexp` when not	Statistics
FACT	`factorial`	Arithmetic
FACTDOUBLE	`dfactorial` in the `phangorn` package	Arithmetic
FDIST	`pf`	Statistics
FIND	`regexpr`,`grepl`,`grep`	Text
FINV	`qf`	Statistics
FISHER	`atanh`	Arithmetic
FISHERINV	`tanh`	Arithmetic
FIXED	`format`,`sprintf`,`formatC`	Essentials
FLOOR	`floor`	Arithmetic
FORECAST	`predict` on an `lm` object	Statistics
FREQUENCY	`cut`,`table`	Arithmetic
FTEST	`var.test`	Statistics
GAMMADIST	`pgamma` if last argument T,`dgamma` if last arg. F	Statistics
GAMMAINV	`qgamma`	Statistics
GAMMALN	`lgamma`	Statistics
GAUSS	`pnorm(x) - 0.5`	Statistics
GCD	`gcd`	Arithmetic
GEOMEAN	`exp(mean(log(x)))`	Arithmetic
GESTEP	`>=`	Boolean
HARMEAN	`harmonic.mean` in the `psych` package	Arithmetic
HLOOKUP	`match`,`merge`	Essentials
HYPGEOMDIST	`dhyper`	Statistics
IF	`if`,`ifelse`	Essentials
IFERROR	`try`,`tryCatch`	Essentials
INDEX	`x[y,z]`	Essentials
INDIRECT	`get`	Essentials
INT	`as.integer`(not for negative numbers),`floor`	Arithmetic
INTERCEPT	first element of `coef` of an `lm` object	Statistics
ISLOGICAL	`is.logical`	Boolean
ISNA	`is.na`	Boolean
ISNUMBER	`is.numeric`	Boolean
ISTEXT	`is.character`	Boolean
KURT	`kurtosis` in the `moments` package	Statistics
LARGE	`sort`	Statistics
LCM	`scm` in the `schoolmath` package	Arithmetic
LEFT	`substr`	Text
LEN, LENGTH	`nchar`	Text
LINEST	`lm`	Statistics
LN, LOG	`log`	Arithmetic
LOG10	`log10`	Arithmetic
LOGINV	`qlnorm`	Statistics
LOGNORMDIST	`plnorm`	Statistics
LOWER	`tolower`	Text
MATCH	`match`,`which`	Essentials
MAX	`max` (sometimes `pmax`)	Arithmetic
MDETERM	`det`	Arithmetic
MEDIAN	`median`	Arithmetic
MID	`substr`	Text
MIN	`min` (sometimes `pmin`)	Arithmetic
MINVERSE	`solve`	Arithmetic
MMULT	`%*%`	Arithmetic
MOD	`%%`	Arithmetic
MODE	`as.numeric(names(which.max(table(x))))`	Arithmetic
MUNIT	`diag`	Arithmetic
N	`as.numeric`	Arithmetic
NEGBINOMDIST	`dnbinom`	Statistics
NORMDIST, NORMSDIST	`pnorm` when cumulative,`dnorm` when not	Statistics
NORMINV, NORMSINV	`qnorm`	Statistics
NOT	`!`	Boolean
NOW	`date`,`Sys.time`	Essentials
OR	`|`,`||`,`any`	Boolean
PEARSON	`cor`	Statistics
PERCENTILE	`quantile`	Statistics
PERCENTRANK	`ecdf`	Statistics
PERMUT	`function(n,k) {choose(n,k)*factorial(k)}`	Arithmetic
PERMUTATIONA	`n^k`	Arithmetic
PHI	`dnorm`	Statistics
POISSON	`ppois` when cumulative,`dpois` when not	Statistics
POWER	`^`	Arithmetic
PROB	`ecdf`	Statistics
PRODUCT	`prod`	Arithmetic
PROPER	`toupper`	Text
QUARTILE	`quantile`	Arithmetic
QUOTIENT	`%/%`	Arithmetic
RAND	`runif`	Arithmetic
RANDBETWEEN	`sample`	Arithmetic
RANK	`rank`	Essentials
REPLACE	`sub`,`gsub`	Text
REPT	`rep` and `paste` or `paste0`	Text
RIGHT	`substring`	Text
ROUND	`round`	Arithmetic
ROUNDDOWN	`floor`	Arithmetic
ROUNDUP	`ceiling`	Arithmetic
ROW	`row`,`:`,`seq`	Essentials
ROWS	`nrow`	Essentials
RSQ	`summary` of `lm` object	Statistics
SEARCH	`regexpr`,`grep`	Text
SIGN	`sign`	Arithmetic
SKEW	`skewness` in the `moments` package	Statistics
SLOPE	in `coef` of `lm` object	Statistics
SMALL	`sort`	Arithmetic
SQRT	`sqrt`	Arithmetic
STANDARDIZE	`scale`	Statistics
STD, STDEV	`sd`	Arithmetic
STEYX	`predict` on an `lm` object	Statistics
STRING	`format`,`sprintf`,`formatC`	Text
SUBSTITUTE	`sub`,`gsub`,`paste`	Essentials
SUM, SUMIF	`sum`	Arithmetic
SUMPRODUCT	`crossprod`	Arithmetic
TDIST	`pt`	Statistics
TEXT	`format`,`sprintf`,`formatC`	Text
TINV	`abs(qt(x/2,data))`	Statistics
TODAY	`Sys.Date`	Essentials
TRANSPOSE	`t`	Arithmetic
TREND	`fitted` of an `lm` object	Statistics
TRIM	`sub`	Essentials
TRIMMEAN	`mean(x,trim=tr/2)`	Arithmetic
TRUNC	`trunc`	Essentials
TTEST	`t.test`	Statistics
TYPE	`typeof`,`mode`,`class`	Essentials
UPPER	`toupper`	Text
VALUE	`as.numeric`	Arithmetic
VAR	`var`	Essentials
VLOOKUP	`match`,`merge`	Essentials
WEEKDAY	`weekdays`	Essentials
WEIBULL	`pweibull` when cumulative,`dweibull` when not	Statistics
ZTEST	`pnorm`	Statistics
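
A few rows of the table can be tried out directly. The sketch below uses made-up numbers and a hypothetical `prices` lookup table; the SUMIF and COUNTIF lines show one common way to add the condition via logical indexing, which the table itself does not spell out.

```r
x <- c(2, 4, 4, 8, 16)  # hypothetical values, as if taken from a column

geo_mean  <- exp(mean(log(x)))                       # GEOMEAN
mode_x    <- as.numeric(names(which.max(table(x))))  # MODE
sum_if    <- sum(x[x > 3])                           # SUMIF with criterion ">3"
count_if  <- length(x[x > 3])                        # COUNTIF with criterion ">3"

# VLOOKUP: find the row with match(), then pick the wanted column
prices <- data.frame(item = c("apple", "pear", "plum"),
                     price = c(0.5, 0.7, 0.9))
pear_price <- prices$price[match("pear", prices$item)]

# CONFIDENCE(alpha, std, n): half-width of a normal confidence interval
alpha <- 0.05
std <- 2
n <- 100
half_width <- -qnorm(alpha / 2) * std / sqrt(n)
```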

The House Always Wins: Simulating 5,000,000 Games of Baccarat a.k.a. Punto Banco

The associated GitHub repository with R code.

Last weekend, I visited the casino with some friends. Of all games, we enjoy North-American-style Baccarat the most. This type of Baccarat is often called Punto Banco. In short, Punto Banco is a card game in which two hands compete: the “player” and the “banker”. During each coup (a round of play), both hands get dealt either 2 or 3 cards, depending on a complex drawing schema, and all cards have a certain value. Put simply, the hand with the highest total value of cards wins the coup, after which a new one starts. Before each coup, gamblers may bet on which of the hands will win. Neither hand is in any way associated with the actual house or player/gambler, so bets may be placed on both. All in all, three different bets can be placed in a game of Punto Banco:

The player hand has the highest total value, in which case the player wins (Punto);
The banker hand has the highest total value, in which case the banker wins (Banco);
The player and banker hands have equal total value, in which case there is a tie (Egalité).

If a gambler correctly bets either Punto or Banco, their bets get a 100% payoff. However, a house tax will often be applied to Banco wins. For instance, Banco wins may only pay off 95% or specific Banco wins (e.g., total card value of 5) may pay off less (e.g., 50%). Depending on house rules, a correct bet on a tie (Egalité) will pay off either 800% or 900%. A wrong bet on Punto or Banco stands in case Egalité is dealt. In all other cases of wrong bets, the house takes the money.
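
These payout rules fit in a small function. The sketch below is one possible encoding, not the code from the repository: it assumes a one-unit stake, the 50% payoff for Banco winning with a total of 5, and the 9-to-1 payoff for Egalité. The function name `payoff` and the outcome labels are made up for this illustration.

```r
# Net result of a one-unit bet, expressed as a multiple of the stake.
# Assumptions: Banco winning with a total of 5 pays 50%, a correct
# Egalite bet pays 9 to 1, and Punto/Banco bets stand on a tie.
payoff <- function(bet, outcome, banco_total = NA) {
  if (bet == outcome) {
    if (bet == "Egalite") return(9)
    if (bet == "Banco" && !is.na(banco_total) && banco_total == 5) return(0.5)
    return(1)
  }
  if (outcome == "Egalite") return(0)  # wrong Punto/Banco bet stands on a tie
  -1                                   # every other wrong bet is lost
}

payoff("Banco", "Banco", banco_total = 5)  # reduced Banco payoff
payoff("Punto", "Egalite")                 # bet stands, nothing won or lost
```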

My friends and I like Punto Banco because it is completely random but seems “gameable”. Punto Banco is played with six or eight decks, so there is no way to know which cards will come next. Moreover, the card-drawing rules are quite complex, so you never really know what’s going to happen. Sometimes both Punto and Banco get only two cards; at other times, the hand you bet on will get its third card, which might just turn things around. Punto Banco’s perceived gameability comes from our human tendency to see patterns in randomness. Often, casinos will place a monitor showing the last fifty or so results (see below) to tempt gamblers to (erroneously) spot and bet on patterns. Alternatively, you might think it’s smart to bet against the table (play Punto when everybody else goes for Banco) or to bet on whatever won the last hand. As the hands are dealt in quick succession, and the minimum bet is often 10+ euros/dollars, Punto Banco is a quick way to find out how lucky you are.

Examples of Baccarat monitors, often placed next to a table.
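
The drawing rules mentioned above can be written down compactly. The sketch below encodes the standard Punto Banco third-card tableau as far as I know it; it is not taken from the repository, and the function name `resolve_coup` is made up. As a simplification, every coup is dealt from a freshly shuffled six-deck shoe and Banco's third card always comes from the sixth position.

```r
set.seed(123)  # to get reproducible results

# Card values in a six-deck shoe: ace = 1, 2-9 = face value, 10/J/Q/K = 0
shoe <- rep(c(1:9, 0, 0, 0, 0), times = 4 * 6)

# Resolve one coup from six card values, in the order Punto, Banco,
# Punto, Banco, Punto's third card, Banco's third card.
resolve_coup <- function(cards) {
  punto <- (cards[1] + cards[3]) %% 10
  banco <- (cards[2] + cards[4]) %% 10
  if (punto < 8 && banco < 8) {  # no "natural" 8 or 9: third cards possible
    third <- NA                  # value of Punto's third card, if drawn
    if (punto <= 5) {
      third <- cards[5]
      punto <- (punto + third) %% 10
    }
    banco_draws <- if (is.na(third)) {
      banco <= 5                 # Punto stood: Banco draws on 0-5
    } else {
      banco <= 2 ||
        (banco == 3 && third != 8) ||
        (banco == 4 && third %in% 2:7) ||
        (banco == 5 && third %in% 4:7) ||
        (banco == 6 && third %in% 6:7)
    }
    if (banco_draws) banco <- (banco + cards[6]) %% 10
  }
  winner <- if (punto > banco) "Punto" else if (banco > punto) "Banco" else "Egalite"
  list(winner = winner, punto = punto, banco = banco)
}

# Simplification: each coup starts from a freshly shuffled shoe
n_coups <- 20000
winners <- replicate(n_coups, resolve_coup(sample(shoe, 6))$winner)
round(prop.table(table(winners)), 3)  # share of Banco, Egalite, and Punto
```

A real shoe is dealt down over many coups, so the simplification ignores the small dependence between consecutive hands.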

So back to last weekend’s trip to the casino. Unfortunately, my friends and I lost quite some money at the Punto Banco table. We know the house has an edge (though smaller than in other games) but normally we are quite lucky. We often discuss what would be good strategies to minimize the house’s edge. Obviously, you want to play as few games as possible, but that’s as far as we got in terms of strategy. Normally, we just test our luck and randomly bet Punto or Banco, and sparingly on Egalité.

As a statistical programmer, I thought it might be interesting to simulate the game and its odds from the bottom up. On the one hand, I wanted to get a sense of how favorable the odds are to the house. On the other hand, I was curious as to what extent strategies may be more or less successful in retaining at least some of your hard-earned cash.

In my simulations, I follow the Holland Casino Punto Banco rules, meaning a six-deck shoe and a Banco win with 5 pays out 50%. I did adopt the more lenient 9-1 payoff for Egalité though. Several hours of programming and a few million simulated Baccarat hands later, here are the results:

Do not play Baccarat / Punto Banco if you do not want to lose your money. Obviously, it’s best to not set foot in the casino if you can’t afford to lose some money. However, I eagerly pay for the entertainment value I get from it.
You lose least if you stick to Banco. Despite having only a 50% payoff when Banco wins with 5, the odds are best for Banco due to the drawing rules. Indeed, according to the Wizard of Odds, the house edge for Banco (1.06%) is slightly lower than that of Punto (1.24%).
Whatever you do, do not bet on Egalité. Because most casinos pay out 8 to 1 in case of a correctly predicted tie, betting on one seems about the worst gambling strategy out there. With a house edge of over 14%, you are better off playing most other games (Wizard of Odds). Although casinos paying out a tie 9 to 1 decrease the house edge to just below 5%, this is still way worse than playing either Punto or Banco.

The figure below shows the results of the five strategies I tested using 50,000 simulations of 100 consecutive hands. Based on the results, I was reluctant to develop and test other strategies as results look quite straightforward: play Banco. Additionally, Wikipedia cites Thorp (1984, original reference unknown) who suggested that there are no strategies that will really result in any significant player advantage, except maybe for the endgame of a deck, which presumably requires a lot of card counting. If you nevertheless want to test other strategies, please be my guest, here are my five:

Punto: Always bet on Punto.
Banco: Always bet on Banco.
Egalité: Always bet on Egalité.
LastHand: Bet on the outcome of the last hand/coup.
LastHand_PB: Bet on the outcome of the last hand/coup, only if this was Punto or Banco.
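
To make the two LastHand strategies concrete, the sketch below turns a sequence of outcomes into the bets each strategy would place. It is an illustration with a made-up helper `place_bets`, not the repository code. Two details are assumptions of this sketch: the LastHand strategies sit out the first hand, and LastHand_PB sits out the hand after a tie.

```r
# Bets a strategy would place given the observed outcomes.
# NA means "no bet this hand".
place_bets <- function(outcomes, strategy) {
  n <- length(outcomes)
  previous <- c(NA, outcomes[-n])  # outcome of the preceding hand
  switch(strategy,
    Punto       = rep("Punto", n),
    Banco       = rep("Banco", n),
    Egalite     = rep("Egalite", n),
    LastHand    = previous,
    LastHand_PB = ifelse(previous %in% c("Punto", "Banco"), previous, NA),
    stop("unknown strategy: ", strategy)
  )
}

# Hypothetical sequence of six coups, purely for illustration
outcomes <- c("Banco", "Banco", "Egalite", "Punto", "Banco", "Punto")

place_bets(outcomes, "LastHand")     # follows ties as well
place_bets(outcomes, "LastHand_PB")  # skips the hand after a tie
```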

The above figure depicts the expected value of each strategy over a series of consecutive hands played. Clearly, the payoff is quite linear, independent of your strategy: the more hands you play, the more you lose. However, it is also clear that some strategies outperform others. After 100 hands of Baccarat, playing only Banco will on average result in a total loss just below the amount you wager per hand. For example, if you bet 10 euros every hand, you will have lost about 9 euros after 100 rounds, on average. This is in line with the ~1% house edge reported by the Wizard of Odds. Similarly, betting only Punto will result in a loss of about 130% of the bet amount, which is close to the 1.24% house edge reported by the Wizard of Odds. Betting on Punto or Banco based on whichever won last (LastHand_PB) performs somewhere in between these two strategies, losing just over 100% of the bet amount in 100 hands. Your expected losses increase when you just bet on whichever outcome came last, including Egalité, resulting in a loss of around 150% after 100 hands. This is mainly because betting on Egalité, which seems about the worst strategy ever, will result in a remarkable 493.9% loss after 100 hands.
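
The link between a house edge and the expected loss is simple arithmetic: edge times number of hands times bet size. A minimal sketch using the Wizard of Odds edges quoted earlier (the helper name `expected_loss` is made up for this illustration):

```r
# Expected loss after a number of hands at a fixed bet size
expected_loss <- function(house_edge, n_hands, bet) {
  house_edge * n_hands * bet
}

loss_banco <- expected_loss(house_edge = 0.0106, n_hands = 100, bet = 10)
loss_punto <- expected_loss(house_edge = 0.0124, n_hands = 100, bet = 10)

c(Banco = loss_banco, Punto = loss_punto)  # in euros, for 10-euro bets
```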

Apart from these average or expected values, I was also interested in the spread of outcomes of our thousands of simulations. Particularly because gamblers on a lucky streak may win much more when betting on Egalité, as the payoff is larger (8-1 or 9-1). The figure below shows that any strategy including Egalité will indeed result in a wider spread of outcomes. Betting on Egalité may thus be a good strategy if you are by some miracle divinely lucky, have information on which cards are coming next, or have an agreement with the dealer (disclaimer: this is a joke, please do not ever bet on Egalité with the intention of making money or try to cheat at the casino).
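
Why Egalité widens the spread can be seen from the per-hand standard deviation of each bet. The sketch below uses rounded, purely illustrative outcome probabilities (an assumption, not output of the simulations), a 9-to-1 tie payoff, and ignores the reduced Banco payoff.

```r
# Expected value and standard deviation of a single one-unit bet
bet_summary <- function(payoffs, probs) {
  stopifnot(length(payoffs) == length(probs), abs(sum(probs) - 1) < 1e-8)
  ev <- sum(probs * payoffs)
  c(expected = ev, sd = sqrt(sum(probs * (payoffs - ev)^2)))
}

# Illustrative probabilities of each outcome (assumed, rounded)
p <- c(Punto = 0.446, Banco = 0.459, Egalite = 0.095)

# Payoffs per outcome, in the order Punto, Banco, Egalite
punto_bet   <- bet_summary(payoffs = c(1, -1, 0), probs = p)
egalite_bet <- bet_summary(payoffs = c(-1, -1, 9), probs = p)

rbind(Punto = punto_bet, Egalite = egalite_bet)
```

Over 100 independent hands, the standard deviation of the total grows by a factor of 10 (the square root of 100), so the much larger per-hand spread of an Egalité bet carries over to the whole session.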

If you want to know how I programmed these simulations, please visit the associated GitHub repository or reach out. I intend to simulate the payoff for various other casino games in the near future (first up: Blackjack), so if you are interested, keep an eye on my website or Twitter.


Robust Effect Sizes for Independent Group Comparisons

Guillaume Rousselet explains how and when group comparisons with Cohen’s d fail, and what robust statistics one could use instead:

basic statistics

When I was an undergrad, I was told that beyond a certain sample size (n=30 if I recall correctly), t-tests and ANOVAs are fine. This was a lie. I wished I had been taught robust methods and that t-tests and ANOVAs on means are only a few options among many alternatives. Indeed, t-tests and ANOVAs on means are not robust to outliers, skewness, heavy-tails, and for independent groups, differences in skewness, variance (heteroscedasticity) and combinations of these factors (Wilcox & Keselman, 2003; Wilcox, 2012). The main consequence is a lack of statistical power. For this reason, it is often advised to report a measure of effect size to determine, for instance, if a non-significant effect (based on some arbitrary p value threshold) could be due to lack of power, or reflect a genuine lack of effect. The rationale is that an effect could be associated with a sufficiently large effect…


Advanced GIFs in R

Rafa Irizarry is a biostatistics professor and one of the three people behind SimplyStatistics.org (the others are Jeff Leek and Roger Peng). They post ideas that they find interesting and their blog contributes greatly to discussion of science/popular writing.

Rafa is the creator of many data visualization GIFs that have recently trended on the web, and in a recent post he provides all the source code behind the beautiful imagery. I sincerely recommend you check out the original blog if you want to find out more, but here are the GIFs:

Simpson’s paradox is a statistical phenomenon where an observed relationship within a population reverses within all subgroups that make up that population. Rafa visualized it wonderfully in a GIF that took only twenty-some lines of R code:
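
The reversal is easy to reproduce with simulated data. The sketch below is not Rafa's code; it uses three hypothetical subgroups in which y decreases with x, while groups with a higher x also have a higher baseline y.

```r
set.seed(123)  # to get reproducible results

n <- 200                          # observations per subgroup
group <- rep(1:3, each = n)

# Group means of x are 2, 4, 6; within each group y falls as x rises,
# but the group baseline of y rises from group to group.
x <- rnorm(3 * n, mean = 2 * group, sd = 0.5)
y <- 3 * group - (x - 2 * group) + rnorm(3 * n, sd = 0.3)

pooled_cor <- cor(x, y)  # relationship in the whole population
within_cor <- sapply(split(data.frame(x, y), group),
                     function(d) cor(d$x, d$y))

pooled_cor  # positive
within_cor  # negative in every subgroup
```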

A different statistical phenomenon is discussed at the end of the original blog: namely the ecological fallacy. It occurs when correlations that occur on the group-level are erroneously extrapolated to the individual-level. Rafa used the gapminder data included in the dslabs package to illustrate the fallacy: there is a very high correlation at the region level and a lower correlation at the individual country level:
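
The same pattern can be mimicked with simulated data instead of gapminder. In this sketch, hypothetical regions share a common level in x and y, while individual units add a lot of noise around it; all numbers are made up for illustration.

```r
set.seed(123)  # to get reproducible results

n_regions <- 10
n_per_region <- 50
region <- rep(seq_len(n_regions), each = n_per_region)

# Each region has a shared level; individuals scatter widely around it
region_level <- seq(-2, 2, length.out = n_regions)
x <- region_level[region] + rnorm(length(region), sd = 2)
y <- region_level[region] + rnorm(length(region), sd = 2)

individual_cor <- cor(x, y)
region_cor <- cor(as.numeric(tapply(x, region, mean)),
                  as.numeric(tapply(y, region, mean)))

c(individual = individual_cor, region = region_cor)
```

Averaging within regions removes most of the individual noise, which is why the region-level correlation comes out much higher than the individual-level one.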

The gapminder data is also used in the next GIF. This mimics Hans Rosling’s famous animation during his talk on New Insights on Poverty, but then made with R and gganimate by Rafa:

A next visualization demonstrates how the UN voting data (of Erik Voeten and Anton Strezhnev) can be used to examine different voting behaviors. It seems to reduce the voting data to a two-dimensional factor structure, and seemingly there are three distinct groups of voters these days, with particularly the USA and Israel far removed from other members:

The next GIFs are more statistical. The one below demonstrates how the local regression (LOESS) works. Simply speaking, LOESS determines the relationship for a local subset of the population and when you iteratively repeat this for all local subsets in a population you get a nicely fitting LOESS curve, the red line in Rafa’s GIF:
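
The "local subset" idea can be sketched by hand and compared with `stats::loess`. This is an illustration on simulated data, not Rafa's code; the helper `local_fit` is made up, uses tricube weights and a local straight line, and will only approximately match what `loess` computes internally.

```r
set.seed(123)  # to get reproducible results

x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)

# Fit a weighted straight line to the points closest to x0 and
# return its prediction at x0.
local_fit <- function(x0, x, y, span) {
  k <- ceiling(span * length(x))  # number of points in the local subset
  d <- abs(x - x0)
  h <- sort(d)[k]                 # distance to the k-th nearest point
  w <- ifelse(d <= h, (1 - (d / h)^3)^3, 0)  # tricube weights
  fit <- stats::lm(y ~ x, weights = w)
  unname(predict(fit, newdata = data.frame(x = x0)))
}

# Repeat the local fit for every x value to trace out the curve
manual <- vapply(x, function(x0) local_fit(x0, x, y, span = 0.2), numeric(1))

# Built-in version with local straight lines (degree = 1)
smoothed <- predict(stats::loess(y ~ x, span = 0.2, degree = 1))

cor(manual, smoothed)  # the two curves should be nearly identical
```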

I am not quite sure how to interpret the next one, but Rafa explains that it visualizes a random forest’s predictions using only one predictor variable. I think that different trees would then provide different predictions because they leverage different training samples, and an ensemble of those trees would then improve predictive accuracy?

The next one is my favorite I think. This animation illustrates how a highly accurate test would function in a population with low prevalence of true values (e.g., disease, applicant success). More details are in the original blog or here.
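
The arithmetic behind this is Bayes' rule. The sketch below uses hypothetical numbers (a test with 99% sensitivity and specificity, and a prevalence of 1%), and the helper name `ppv` is made up for this illustration.

```r
# Probability that a positive test result is a true positive
ppv <- function(prevalence, sensitivity, specificity) {
  true_pos  <- prevalence * sensitivity
  false_pos <- (1 - prevalence) * (1 - specificity)
  true_pos / (true_pos + false_pos)
}

rare   <- ppv(prevalence = 0.01, sensitivity = 0.99, specificity = 0.99)
common <- ppv(prevalence = 0.30, sensitivity = 0.99, specificity = 0.99)

c(rare = rare, common = common)
```

With a prevalence of 1%, true positives and false positives are equally frequent, so only half of the positive results are correct despite the accurate test.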

The blog ends with a rather funny animation of the only good use of pie charts, according to Rafa:

Libratus: A Texas Hold-Em Poker AI

Four of the best professional poker players in the world (Dong Kim, Jason Les, Jimmy Chou, and Daniel McAulay) were recently beaten by Libratus, a poker-playing AI developed at Carnegie Mellon University and run at the Pittsburgh Supercomputing Center. Over a period of 20 days of continuous play (10h/day), each of these four professionals lost to Libratus heads-up, in a whopping total of 120,000 hands of No Limit Texas Hold-em Poker.

A player may face 10 to the power of 160 different situations in Texas Hold-em Poker: more than the number of atoms in the universe. It took extensive machine learning to compute and prioritize the computation of the most rewarding actions in these situations. Libratus works by running extensive simulations, taking into account the way the professionals play, and figuring out the best counter strategy. Although it is not without flaws, any “holes” the players found in Libratus’ strategy could not be exploited for long, as the algorithm would quickly learn and adapt to prevent further exploitation. The experience was completely different from playing a human player, the professionals argue, as Libratus would make both tiny and huge bets and would continuously change its strategy and plays.

The video below provides more detailed information and also shows the million-dollar margin by which Libratus won at the end of the twenty-day poker (training) marathon: