In most (observational) research papers you read, you will probably run into a **correlation matrix**. Often it looks something like this:

In Social Sciences, like Psychology, researchers like to denote the **statistical significance levels** of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D.

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

**Yes**, there is the `cor` function, but it does not include significance levels.
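To make that gap concrete, here is a minimal base R sketch. The three `mtcars` columns are picked only for illustration; the point is that `cor` gives you the whole matrix without p-values, while `cor.test` gives you a p-value but only for one pair at a time.

```r
# Base R only: `cor` returns coefficients, `cor.test` tests a single pair
x <- mtcars[, c("mpg", "hp", "wt")]

r_matrix <- cor(x)                 # 3 x 3 matrix of Pearson coefficients, no p-values
single   <- cor.test(x$mpg, x$hp)  # one pair: estimate plus p-value

print(round(r_matrix, 3))
print(single$p.value)
```

Getting p-values for every cell this way means looping over all column pairs yourself, which is exactly the chore a matrix-level function should take away.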

Then there is the (in)famous `Hmisc` package, with its `rcorr` function. But this tool provides a **whole new range of issues**.

What’s this `storage.mode`, and what are we trying to coerce again?

Soon you figure out that `Hmisc::rcorr` only takes in matrices *(thus with only numeric values)*. **Hurray**, now you can run a correlation analysis on your *dataframe*, you think…

Yet, the output is **anything but publication-ready**!

You wanted one correlation matrix, but now you have two… **Double the trouble?**
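As a rough illustration of that output, here is a small sketch. It assumes the `Hmisc` package is installed (the block skips itself otherwise), and the column selection is again arbitrary.

```r
# Assumes the Hmisc package is installed; the block is skipped otherwise
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # rcorr wants a numeric matrix, so convert the data frame first
  res <- Hmisc::rcorr(as.matrix(mtcars[, c("mpg", "hp", "wt")]), type = "pearson")

  print(names(res))       # coefficients and p-values live in separate elements
  print(round(res$r, 3))  # matrix of correlation coefficients
  print(res$P)            # matrix of p-values
}
```

The coefficients (`res$r`) and the p-values (`res$P`) come back as two separate matrices, so combining them into one table is still left to you.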

To **spare future scholars the struggle** of the early days of R programming, I would like to share my *custom function* `correlation_matrix`.

My `correlation_matrix` takes in a *dataframe*, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a **fully formatted publication-ready correlation matrix**!

You can specify **many formatting options** in `correlation_matrix`.

For instance, you can use only 2 decimals. You can focus on the lower triangle *(as the lower and upper triangle values are identical)*. And you can drop the diagonal values:
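The mechanics behind those options can be sketched in base R. This is not the function itself, only a minimal illustration that uses plain `cor` on three arbitrarily chosen `mtcars` columns: format to 2 decimals, then blank the upper triangle and the diagonal.

```r
# Base R sketch: format a correlation matrix, then blank the redundant cells
R <- cor(mtcars[, c("mpg", "hp", "wt")])
Rfmt <- formatC(R, format = "f", digits = 2)  # character matrix with 2 decimals

# keep the lower triangle only; diag = TRUE also blanks the diagonal
Rfmt[upper.tri(Rfmt, diag = TRUE)] <- ""

print(Rfmt, quote = FALSE)
```

Because the matrix is symmetric, nothing is lost by blanking one triangle, and the diagonal only ever holds each variable's correlation with itself.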

Or maybe you are interested in a **different type of correlation coefficients**, and not so much in significance levels:

For other formatting options, do have a look at the **source code below**.

Now, to make matters **even easier**, I wrote a second function (`save_correlation_matrix`) to directly save any created correlation matrices:

Once you open your new correlation matrix file in Excel, it is **immediately ready** to be copy-pasted into Word!
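One detail worth knowing is that the saving function relies on `write.csv2`, which writes semicolon-separated files. A minimal base R sketch of what ends up on disk, using a hypothetical two-variable stand-in matrix and a temporary file instead of a real output path:

```r
# Base R sketch with a hypothetical stand-in matrix and a temporary file
m <- matrix(c("1.00", "-0.78***", "-0.78***", "1.00"), nrow = 2,
            dimnames = list(c("mpg", "hp"), c("mpg", "hp")))

path <- tempfile(fileext = ".csv")
write.csv2(m, file = path)  # semicolon-separated, unlike write.csv

print(readLines(path))      # raw file contents

# read it back as text so the significance stars survive
back <- read.csv2(path, row.names = 1, colClasses = "character")
print(back)
```

If your Excel installation expects comma-separated files, swapping in `write.csv` is probably the simpler route.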

If you are looking for ways to **visualize** your correlations, do have a look at the packages `corrr` and `corrplot`.

**I hope my functions are of help to you!**

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

`correlation_matrix`

```
#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a correlation matrix
#' @export
#'
#' @examples
#' correlation_matrix(iris)
#' correlation_matrix(mtcars)
correlation_matrix <- function(df,
                               type = "pearson",
                               digits = 3,
                               decimal.mark = ".",
                               use = "all",
                               show_significance = TRUE,
                               replace_diagonal = FALSE,
                               replacement = "") {
  # check arguments; each condition is passed separately so that all of them are tested
  # (a single `{ ... }` block would only test its last expression)
  stopifnot(
    is.numeric(digits),
    digits >= 0,
    use %in% c("all", "upper", "lower"),
    is.logical(replace_diagonal),
    is.logical(show_significance),
    is.character(replacement)
  )
  # we need the Hmisc package for this
  if (!requireNamespace("Hmisc", quietly = TRUE)) {
    stop("Package 'Hmisc' is required; install it with install.packages('Hmisc').")
  }
  # retain only numeric and boolean columns
  isNumericOrBoolean <- vapply(df, function(x) is.numeric(x) || is.logical(x), logical(1))
  if (any(!isNumericOrBoolean)) {
    cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
  }
  df <- df[isNumericOrBoolean]
  # transform input data frame to matrix
  x <- as.matrix(df)
  # run correlation analysis using Hmisc package
  rc <- Hmisc::rcorr(x, type = type)
  R <- rc$r # matrix of correlation coefficients
  p <- rc$P # matrix of p-values
  # transform correlations to specific character format
  Rformatted <- formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)
  # if there are any negative numbers, we want to put a space before the non-negatives to align all
  if (any(!is.na(R) & R < 0)) {
    Rformatted <- ifelse(!is.na(R) & R >= 0, paste0(" ", Rformatted), Rformatted)
  }
  # add significance levels if desired
  if (show_significance) {
    # define notation for significance levels
    stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
    Rformatted <- paste0(Rformatted, stars)
  }
  # make all character strings equally long by padding with trailing spaces
  max_length <- max(nchar(Rformatted))
  Rformatted <- vapply(Rformatted, function(x) {
    difference <- max_length - nchar(x)
    paste0(x, strrep(" ", difference))
  }, FUN.VALUE = character(1))
  # build a new matrix that includes the formatted correlations and their significance stars
  Rnew <- matrix(Rformatted, ncol = ncol(x))
  rownames(Rnew) <- colnames(Rnew) <- colnames(x)
  # replace undesired values
  if (use == 'upper') {
    Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (use == 'lower') {
    Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (replace_diagonal) {
    diag(Rnew) <- replacement
  }
  return(Rnew)
}
```

`save_correlation_matrix`

```
#' save_correlation_matrix
#' Creates and saves to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv2`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')
#' save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')
save_correlation_matrix <- function(df, filename, ...) {
  # write.csv2 writes a semicolon-separated file
  return(write.csv2(correlation_matrix(df, ...), file = filename))
}
```


Fantastic, finally!

THANK YOU!


Please add one worked-out example or vignette, thanks.


There are examples listed in the article and in the code, Duleep. They work with the datasets already included in R: iris or mtcars.


This is so cool! Thank you.


Great, thank you! One question, can you use this package to calculate the correlation matrix split by groups? I have some experimental results with different treatments, and I am interested in the correlation matrix by treatment.

Thanks!

Gerardo


Hi Gerardo. I am not currently in reach of a computer. Yet I think something like `lapply(split(df, group_var), correlation_matrix)` should work.

`split` first splits your df into a list with separate dfs per group of the group_var, and then `lapply` applies the `correlation_matrix` function to each list element (split df), returning separate correlation matrices in a list. Have a look at the `split` and `lapply` base function documentation for how they work precisely.
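A minimal sketch of that idea, using the built-in `iris` data. Base `cor` stands in for `correlation_matrix` here purely so the snippet runs without `Hmisc`; swapping in `correlation_matrix` should work the same way.

```r
# Base R sketch; `cor` stands in for `correlation_matrix` to stay self-contained
numeric_cols <- iris[, 1:4]

# split() needs the grouping values themselves, not just a column name
by_species <- split(numeric_cols, iris$Species)

# one correlation matrix per group, returned as a named list
cor_by_species <- lapply(by_species, cor)

print(names(cor_by_species))
print(round(cor_by_species$setosa, 2))
```

Note that the grouping vector is passed as `iris$Species` rather than as a bare column name.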


Is there a problem with columns containing only one level or value such as `correlation_matrix(dplyr::mutate(mtcars, aa=0) %>% as.data.frame())`?


Hi Jimbou! I had not accounted for error-handling in case of missing correlations (when there is no variation in one of the variables).

I have changed the code and the function can now handle such cases. The respective correlation matrix column/row will contain NaNs. Moreover, I’ve improved the function slightly to make all correlation value strings equally long.

I should really stop posting code on my website and open github repositories with change requests ; )

Thanks for noticing this error though!
