Tag: Excel # Create a publication-ready correlation matrix, with significance levels, in R

In most (observational) research papers you read, you will probably run into a correlation matrix. Often it looks something like this:

In Social Sciences, like Psychology, researchers like to denote the statistical significance levels of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D..

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

Yes, there is the `cor` function, but it does not include significance levels.

Then there the (in)famous `Hmisc` package, with its `rcorr` function. But this tool provides a whole new range of issues.

What’s this `storage.mode`, and what are we trying to coerce again?

Soon you figure out that `Hmisc::rcorr` only takes in matrices (thus with only numeric values). Hurray, now you can run a correlation analysis on your dataframe, you think…

Yet, the output is all but publication-ready!

You wanted one correlation matrix, but now you have two… Double the trouble?

To spare future scholars the struggle of the early day R programming, I would like to share my custom function `correlation_matrix`.

My `correlation_matrix` takes in a dataframe, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a fully formatted publication-ready correlation matrix!

You can specify many formatting options in `correlation_matrix`.

For instance, you can use only 2 decimals. You can focus on the lower triangle (as the lower and upper triangle values are identical). And you can drop the diagonal values:

Or maybe you are interested in a different type of correlation coefficients, and not so much in significance levels:

For other formatting options, do have a look at the source code below.

Now, to make matters even more easy, I wrote a second function (`save_correlation_matrix`) to directly save any created correlation matrices:

Once you open your new correlation matrix file in Excel, it is immediately ready to be copy-pasted into Word!

If you are looking for ways to visualize your correlations do have a look at the packages `corrr` and `corrplot`.

I hope my functions are of help to you!

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

## `correlation_matrix`

``````#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' `correlation_matrix(iris)`
#' `correlation_matrix(mtcars)`
correlation_matrix <- function(df,
type = "pearson",
digits = 3,
decimal.mark = ".",
use = "all",
show_significance = TRUE,
replace_diagonal = FALSE,
replacement = ""){

# check arguments
stopifnot({
is.numeric(digits)
digits >= 0
use %in% c("all", "upper", "lower")
is.logical(replace_diagonal)
is.logical(show_significance)
is.character(replacement)
})
# we need the Hmisc package for this
require(Hmisc)

# retain only numeric and boolean columns
isNumericOrBoolean = vapply(df, function(x) is.numeric(x) | is.logical(x), logical(1))
if (sum(!isNumericOrBoolean) > 0) {
cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
}
df = df[isNumericOrBoolean]

# transform input data frame to matrix
x <- as.matrix(df)

# run correlation analysis using Hmisc package
correlation_matrix <- Hmisc::rcorr(x, type = type)
R <- correlation_matrix\$r # Matrix of correlation coeficients
p <- correlation_matrix\$P # Matrix of p-value

# transform correlations to specific character format
Rformatted = formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)

# if there are any negative numbers, we want to put a space before the positives to align all
if (sum(!is.na(R) & R < 0) > 0) {
Rformatted = ifelse(!is.na(R) & R > 0, paste0(" ", Rformatted), Rformatted)
}

# add significance levels if desired
if (show_significance) {
# define notions for significance levels; spacing is important.
stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
Rformatted = paste0(Rformatted, stars)
}

# make all character strings equally long
max_length = max(nchar(Rformatted))
Rformatted = vapply(Rformatted, function(x) {
current_length = nchar(x)
difference = max_length - current_length
return(paste0(x, paste(rep(" ", difference), collapse = ''), sep = ''))
}, FUN.VALUE = character(1))

# build a new matrix that includes the formatted correlations and their significance stars
Rnew <- matrix(Rformatted, ncol = ncol(x))
rownames(Rnew) <- colnames(Rnew) <- colnames(x)

# replace undesired values
if (use == 'upper') {
Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
} else if (use == 'lower') {
Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
} else if (replace_diagonal) {
diag(Rnew) <- replacement
}

return(Rnew)
}``````

## `save_correlation_matrix`

``````#' save_correlation_matrix
#' Creates and save to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' `save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')`
#' `save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')`
save_correlation_matrix = function(df, filename, ...) {
return(write.csv2(correlation_matrix(df, ...), file = filename))
}
``````

Sign up to keep up to date on the latest R, Data Science & Tech content:

# Data Visualization Tools & Resources

There’s this amazing overview of helpful dataviz resources atwww.visualisingdata.com/resources!

Browse through hundreds of helpful data visualization tools, programs, and services. All neatly organized by Andy Kirk in categories: data handling, applications, programming, web-based, qualitative, mapping, specialist, and colour. What a great repository!

Looking for expert books on data visualization?
Have a look at these recommendations! # Transitioning from Excel to R: Dictionary of common functions

Alyssa Columbus published maintains this GitHub repository with a great tutorial on how to move from Excel to R. Very useful for beginning useRs, the repository’s tutorial includes a translation table between Excel and R functions:

Excel Formula R Function Type
ABS `abs` Arithmetic
ADDRESS `assign` Essentials
AND `&`,`&&`,`all` Boolean
AVERAGE, AVG, AVERAGEIF `mean` Arithmetic
BETADIST `pbeta` Statistics
BETAINV `qbeta` Statistics
BINOMDIST `pbinom` when cumulative,`dbinom` when not Statistics
CEILING `ceiling` Arithmetic
CELL `str` has the same idea Essentials
CHIDIST, CHISQDIST `pchisq` Statistics
CHIINV, CHISQINV `qchisq` Statistics
CHITEST `chisq.test` Statistics
CHOOSE `switch` Essentials
CLEAN `gsub` Text
COLS, COLUMNS `ncol` Essentials
COLUMN `col`,`:`,`seq` Essentials
COMBIN `choose` Essentals
CONCATENATE `paste` Text
CONFIDENCE `-qnorm(alpha/2)*std/sqrt(n)` Statistics
CORREL `cor` Statistics
COUNT, COUNTIF `length` Arithmetic
COVAR `cov` Statistics
CRITBINOM `qbinom` Statistics
DELTA `identical` Boolean
EXACT `==` Boolean
EXP `exp` Arithmetic
EXPONDIST `pexp` when cumulative,`dexp` when not Statistics
FACT `factorial` Arithmetic
FACTDOUBLE `dfactorial` in the `phangorn` package Arithmetic
FDIST `pf` Statistics
FIND `regexpr`,`grepl`,`grep` Text
FINV `qf` Statistics
FISHER `atanh` Arithmetic
FISHERINV `tanh` Arithmetic
FIXED `format`,`sprintf`,`formatC` Essentials
FLOOR `floor` Arithmetic
FORECAST `predict` on an `lm` object Statistics
FREQUENCY `cut`,`table` Arithmetic
FTEST `var.test` Statistics
GAMMADIST `pgamma` if last argument T,`dgamma` if last arg. F Statistics
GAMMAINV `qgamma` Statistics
GAMMALN `lgamma` Statistics
GAUSS `pnorm(x) - 0.5` Statistics
GCD `gcd` Arithmetic
GEOMEAN `exp(mean(log(x)))` Arithmetic
GESTEP `>=` Boolean
HARMEAN `harmonic.mean` in the `psych` package Arithmetic
HLOOKUP `match`,`merge` Essentials
HYPGEOMDIST `dhyper` Statistics
IF `if`,`ifelse` Essentials
IFERROR `try`,`tryCatch` Essentials
INDEX `x[y,z]` Essentials
INDIRECT `get` Essentials
INT `as.integer`(not for negative numbers),`floor` Arithmetic
INTERCEPT first element of `coef` of an `lm` object Statistics
ISLOGICAL `is.logical` Boolean
ISNA `is.na` Boolean
ISNUMBER `is.numeric` Boolean
ISTEXT `is.character` Boolean
KURT `kurtosis` in the `moments` package Statistics
LARGE `sort` Statistics
LCM `scm` in the `schoolmath` package Arithmetic
LEFT `substr` Text
LEN, LENGTH `nchar` Text
LINEST `lm` Statistics
LN, LOG `log` Arithmetic
LOG10 `log10` Arithmetic
LOGINV `qlnorm` Statistics
LOGNORMDIST `plnorm` Statistics
LOWER `tolower` Text
MATCH `match`,`which` Essentials
MAX `max` (sometimes `pmax`) Arithmetic
MDETERM `det` Arithmetic
MEDIAN `median` Arithmetic
MID `substr` Text
MIN `min` (sometimes `pmin`) Arithmetic
MINVERSE `solve` Arithmetic
MMULT `%*%` Arithmetic
MOD `%%` Arithmetic
MODE `as.numeric(names(which.max(table(x))))` Arithmetic
MUNIT `diag` Arithmetic
N `as.numeric` Arithmetic
NEGBINOMDIST `dnbinom` Statistics
NORMDIST, NORMSDIST `pnorm` when cumulative,`dnorm` when not Statistics
NORMINV, NORMSINV `qnorm` Statistics
NOT `!` Boolean
NOW `date`,`Sys.time` Essentials
OR ` `,`
PEARSON `cor` Statistics
PERCENTILE `quantile` Statistics
PERCENTRANK `ecdf` Statistics
PERMUT `function(n,k) {choose(n,k)*factorial(k)}` Arithmetic
PERMUTATIONA `n^k` Arithmetic
PHI `dnorm` Statistics
POISSON `ppois` when cumulatic,`dpois` when not Statistics
POWER `^` Arithmetic
PROB `ecdf` Statistics
PRODUCT `prod` Arithmetic
PROPER `toupper` Text
QUARTILE `quantile` Arithmetic
QUOTIENT `%/%` Arithmetic
RAND `runif` Arithmetic
RANDBETWEEN `sample` Arithmetic
RANK `rank` Essentials
REPLACE `sub`,`gsub` Text
REPT `rep` and `paste` or `paste0` Text
RIGHT `substring` Text
ROUND `round` Arithmetic
ROUNDDOWN `floor` Arithmetic
ROUNDUP `ceiling` Arithmetic
ROW `row`,`:`,`seq` Essentials
ROWS `nrow` Essentials
RSQ `summary` of `lm` object Statistics
SEARCH `regexpr`,`grep` Text
SIGN `sign` Arithmetic
SKEW `skewness` in the `moments` package Statistics
SLOPE in `coef` of `lm` object Statistics
SMALL `sort` Arithmetic
SQRT `sqrt` Arithmetic
STANDARDIZE `scale` Statitics
STD, STDEV `sd` Arithmetic
STEYX `predict` on an `lm` object Statistics
STRING `format`,`sprintf`,`formatC` Text
SUBSTITUTE `sub`,`gsub`,`paste` Essentials
SUM, SUMIF `sum` Arithmetic
SUMPRODUCT `crossprod` Arithmetic
TDIST `pt` Statistics
TEXT `format`,`sprintf`,`formatC` Text
TINV `abs(qt(x/2,data))` Statistics
TODAY `Sys.Date` Essentials
TRANSPOSE `t` Arithmetic
TREND `fitted` of an `lm` object Statistics
TRIM `sub` Essentials
TRIMMEAN `mean(x,trim=tr/2)` Arithmetic
TRUNC `trunc` Essentials
TTEST `t.test` Statistics
TYPE `typeof`,`mode`,`class` Essentials
UPPER `toupper` Text
VALUE `as.numeric` Arithmetic
VAR `var` Essentials
VLOOKUP `match`,`merge` Essentials
WEEKDAY `weekdays` Essentials
WEIBULL `pweibull` when cumulative,`dweibull` when not Statistics
ZTEST `pnorm` Statistics