In most (observational) research papers you read, you will probably run into a **correlation matrix**. Often it looks something like this:

In the social sciences, such as psychology, researchers like to denote the **statistical significance levels** of the correlation coefficients, often using asterisks (i.e., *). Then the table will look more like this:

Regardless of my personal preferences and opinions, I had to make many of these tables for the scientific (non-)publications of my Ph.D.

I remember that, when I first started using R, I found it quite difficult to generate these correlation matrices automatically.

**Yes**, there is the `cor` function, but it does not include significance levels.
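
As a minimal base-R illustration of that gap (the three `mtcars` columns are picked arbitrarily for the example): `cor` returns only the coefficients, while a p-value has to be requested one pair at a time through `cor.test`.

```r
# Base-R sketch: coefficients come from cor(), p-values need cor.test()
x <- mtcars[, c("mpg", "hp", "wt")]

# cor() returns only the matrix of correlation coefficients
r <- cor(x)
print(round(r, 3))

# cor.test() adds a p-value, but only for one pair of variables at a time
test <- cor.test(x$mpg, x$hp)
print(test$estimate)  # same coefficient as r["mpg", "hp"]
print(test$p.value)   # the significance information that cor() lacks
```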

Then there is the (in)famous `Hmisc` package, with its `rcorr` function. But this tool brings a **whole new range of issues**.

What’s this `storage.mode`, and what are we trying to coerce again?

Soon you figure out that `Hmisc::rcorr` only takes in matrices *(thus with only numeric values)*. **Hurray**, now you can run a correlation analysis on your *dataframe*, you think…
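
The conversion step itself can be sketched in base R. This assumes the built-in `iris` data, whose `Species` column is a factor; the variable names are only illustrative.

```r
# A data frame with a non-numeric column turns into a character matrix
storage.mode(as.matrix(iris))

# Keep only the numeric columns before converting to a matrix
is_num <- vapply(iris, is.numeric, logical(1))
x <- as.matrix(iris[is_num])
storage.mode(x)
```

The first call reports a character matrix, which is of no use for correlations; after dropping `Species`, the matrix is stored as doubles.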

Yet, the output is **anything but publication-ready**!

You wanted one correlation matrix, but now you have two… **Double the trouble?**

To **spare future scholars the struggles** of early-day R programming, I would like to share my *custom function* `correlation_matrix`.

My `correlation_matrix` takes in a *dataframe*, selects only the numeric (and boolean/logical) columns, calculates the correlation coefficients and p-values, and outputs a **fully formatted publication-ready correlation matrix**!
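
To make those steps concrete, here is a rough base-R sketch of the same idea. It is not the function itself: it uses `cor` and `cor.test` instead of `Hmisc::rcorr`, and the column selection and variable names are assumptions made for the example.

```r
# Hypothetical base-R sketch: coefficients + p-values -> one formatted matrix
x <- mtcars[, c("mpg", "hp", "qsec")]
vars <- names(x)

r <- cor(x)  # correlation coefficients

# p-values, computed pair by pair (the diagonal stays NA)
p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
for (i in vars) {
  for (j in vars) {
    if (i != j) p[i, j] <- cor.test(x[[i]], x[[j]])$p.value
  }
}

# map p-values to asterisks, using the thresholds from the source code below
stars <- ifelse(is.na(p), "",
         ifelse(p < .001, "***",
         ifelse(p < .01, "**",
         ifelse(p < .05, "*", ""))))

# glue the rounded coefficients and their stars together in a single matrix
formatted <- matrix(paste0(formatC(r, format = "f", digits = 2), stars),
                    nrow = nrow(r), dimnames = dimnames(r))
print(formatted, quote = FALSE)
```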

You can specify **many formatting options** in `correlation_matrix`.

For instance, you can use only 2 decimals. You can focus on the lower triangle *(as the lower and upper triangle values are identical)*. And you can drop the diagonal values:
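
What these three options do can be sketched in base R on a plain `cor` matrix (a stand-in, so the example runs without the function or `Hmisc`; the columns are chosen arbitrarily):

```r
# Base-R sketch: two decimals, lower triangle only, diagonal dropped
r <- cor(mtcars[, c("mpg", "hp", "wt")])
out <- formatC(r, format = "f", digits = 2)  # character matrix, 2 decimals

# blank the upper triangle and, via diag = TRUE, the diagonal as well
out[upper.tri(out, diag = TRUE)] <- ""
print(out, quote = FALSE)
```

Going by the arguments documented in the source code below, the corresponding call would be `correlation_matrix(mtcars, digits = 2, use = "lower", replace_diagonal = TRUE)`.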

Or maybe you are interested in a **different type of correlation coefficients**, and not so much in significance levels:
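
As a small base-R illustration of what a different coefficient type means (again using `cor` as a stand-in and arbitrary `mtcars` columns): Spearman's coefficient is the Pearson correlation computed on ranks.

```r
# Base-R sketch: Pearson versus Spearman coefficients
x <- mtcars[, c("mpg", "hp", "wt")]

pearson  <- cor(x, method = "pearson")
spearman <- cor(x, method = "spearman")

# Spearman is the Pearson correlation of the rank-transformed columns
ranked <- cor(apply(x, 2, rank))

print(round(pearson, 2))
print(round(spearman, 2))
```

In `correlation_matrix`, the documented arguments for this are `type = "spearman"` and `show_significance = FALSE`.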

For other formatting options, do have a look at the **source code below**.

Now, to make matters **even easier**, I wrote a second function (`save_correlation_matrix`) to directly save any created correlation matrices:

Once you open your new correlation matrix file in Excel, it is **immediately ready** to be copy-pasted into Word!
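
Under the hood, `save_correlation_matrix` relies on `write.csv2`, which writes a semicolon-separated file (the convention Excel expects in locales that use a decimal comma). A minimal sketch of what ends up in the file, using a made-up 2x2 matrix and a temporary file rather than real results:

```r
# Made-up formatted matrix, standing in for the output of correlation_matrix()
m <- matrix(c("1.00", "0.50*", "0.50*", "1.00"), nrow = 2,
            dimnames = list(c("a", "b"), c("a", "b")))

# write to a temporary file, the same way save_correlation_matrix does
tmp <- tempfile(fileext = ".csv")
write.csv2(m, file = tmp)

# inspect the raw file: one header line plus one line per row
csv_lines <- readLines(tmp)
print(csv_lines)
```

If your Excel expects comma-separated files instead, swapping in `write.csv` is the obvious adjustment, though that is a change to the function as posted.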

If you are looking for ways to **visualize** your correlations, do have a look at the packages `corrr` and `corrplot`.

**I hope my functions are of help to you!**

Do reach out if you get to use them in any of your research papers!

I would be super interested and feel honored.

`correlation_matrix`

```
#' correlation_matrix
#' Creates a publication-ready / formatted correlation matrix, using `Hmisc::rcorr` in the backend.
#'
#' @param df dataframe; containing numeric and/or logical columns to calculate correlations for
#' @param type character; specifies the type of correlations to compute; gets passed to `Hmisc::rcorr`; options are `"pearson"` or `"spearman"`; defaults to `"pearson"`
#' @param digits integer/double; number of decimals to show in the correlation matrix; gets passed to `formatC`; defaults to `3`
#' @param decimal.mark character; which decimal.mark to use; gets passed to `formatC`; defaults to `.`
#' @param use character; which part of the correlation matrix to display; options are `"all"`, `"upper"`, `"lower"`; defaults to `"all"`
#' @param show_significance boolean; whether to add `*` to represent the significance levels for the correlations; defaults to `TRUE`
#' @param replace_diagonal boolean; whether to replace the correlations on the diagonal; defaults to `FALSE`
#' @param replacement character; what to replace the diagonal and/or upper/lower triangles with; defaults to `""` (empty string)
#'
#' @return a character matrix of formatted correlations
#' @export
#'
#' @examples
#' correlation_matrix(iris)
#' correlation_matrix(mtcars)
correlation_matrix <- function(df,
                               type = "pearson",
                               digits = 3,
                               decimal.mark = ".",
                               use = "all",
                               show_significance = TRUE,
                               replace_diagonal = FALSE,
                               replacement = "") {
  # check arguments; each condition is passed separately so that all are tested
  # (inside a single { } block only the last expression would be checked)
  stopifnot(
    is.numeric(digits),
    digits >= 0,
    use %in% c("all", "upper", "lower"),
    is.logical(replace_diagonal),
    is.logical(show_significance),
    is.character(replacement)
  )
  # we need the Hmisc package for this
  if (!requireNamespace("Hmisc", quietly = TRUE)) {
    stop("Package 'Hmisc' is required for correlation_matrix()")
  }
  # retain only numeric and boolean columns
  isNumericOrBoolean <- vapply(df, function(x) is.numeric(x) || is.logical(x), logical(1))
  if (any(!isNumericOrBoolean)) {
    cat('Dropping non-numeric/-boolean column(s):', paste(names(isNumericOrBoolean)[!isNumericOrBoolean], collapse = ', '), '\n\n')
  }
  df <- df[isNumericOrBoolean]
  # transform input data frame to matrix
  x <- as.matrix(df)
  # run correlation analysis using Hmisc package
  result <- Hmisc::rcorr(x, type = type)
  R <- result$r # matrix of correlation coefficients
  p <- result$P # matrix of p-values
  # transform correlations to specific character format
  Rformatted <- formatC(R, format = 'f', digits = digits, decimal.mark = decimal.mark)
  # if there are any negative numbers, we want to put a space before the positives to align all
  if (any(!is.na(R) & R < 0)) {
    Rformatted <- ifelse(!is.na(R) & R > 0, paste0(" ", Rformatted), Rformatted)
  }
  # add significance levels if desired
  if (show_significance) {
    # define notations for significance levels; missing p-values get no stars
    stars <- ifelse(is.na(p), "", ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", ""))))
    Rformatted <- paste0(Rformatted, stars)
  }
  # make all character strings equally long by padding with trailing spaces
  max_length <- max(nchar(Rformatted))
  Rformatted <- vapply(Rformatted, function(x) {
    paste0(x, strrep(" ", max_length - nchar(x)))
  }, FUN.VALUE = character(1))
  # build a new matrix that includes the formatted correlations and their significance stars
  Rnew <- matrix(Rformatted, ncol = ncol(x))
  rownames(Rnew) <- colnames(Rnew) <- colnames(x)
  # replace undesired values
  if (use == 'upper') {
    Rnew[lower.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (use == 'lower') {
    Rnew[upper.tri(Rnew, diag = replace_diagonal)] <- replacement
  } else if (replace_diagonal) {
    diag(Rnew) <- replacement
  }
  return(Rnew)
}
```

`save_correlation_matrix`

```
#' save_correlation_matrix
#' Creates and saves to file a fully formatted correlation matrix, using `correlation_matrix` and `Hmisc::rcorr` in the backend
#' @param df dataframe; passed to `correlation_matrix`
#' @param filename either a character string naming a file or a connection open for writing. "" indicates output to the console; passed to `write.csv2`
#' @param ... any other arguments passed to `correlation_matrix`
#'
#' @return NULL
#'
#' @examples
#' save_correlation_matrix(df = iris, filename = 'iris-correlation-matrix.csv')
#' save_correlation_matrix(df = mtcars, filename = 'mtcars-correlation-matrix.csv', digits = 3, use = 'lower')
save_correlation_matrix <- function(df, filename, ...) {
  # write.csv2 produces a semicolon-separated file
  return(write.csv2(correlation_matrix(df, ...), file = filename))
}
```


Fantastic, finally!

THANK YOU!


Please add one worked-out example or vignette, thanks.


There are examples listed in the article and in the code, Duleep. They work with the datasets already included in R: iris or mtcars.


This is so cool! Thank you.


Great, thank you! One question, can you use this package to calculate the correlation matrix split by groups? I have some experimental results with different treatments, and I am interested in the correlation matrix by treatment.

Thanks!

Gerardo


Hi Gerardo. I am not currently in reach of a computer. Yet I think something like `lapply(split(df, group_var), correlation_matrix)` should work.

`split` first splits your df into a list with separate dfs per group on the group_var, and then `lapply` applies the correlation_matrix function to each list element (split df), returning separate correlation matrices in a list. Have a look at the `split` and `lapply` base function documentation for how precisely they work.
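
A self-contained sketch of that split-and-apply pattern, with base R's `cor` standing in for `correlation_matrix` and the built-in `iris` data standing in for the experimental data (both are assumptions made so the example runs on its own):

```r
# Base-R sketch: one correlation matrix per group
numeric_cols <- iris[, 1:4]  # the four measurement columns

# split() makes one data frame per species; lapply() runs cor() on each
by_species <- lapply(split(numeric_cols, iris$Species), cor)

names(by_species)            # one list element per group
print(round(by_species$setosa, 2))
```

With the function from the post, `cor` would be replaced by `correlation_matrix`, as in the one-liner suggested above.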


Is there a problem with columns containing only one level or value such as `correlation_matrix(dplyr::mutate(mtcars, aa=0) %>% as.data.frame())`?


Hi Jimbou! I had not accounted for error-handling in case of missing correlations (when there is no variation in one of the variables).

I have changed the code and the function can now handle such cases. The respective correlation matrix column/row will contain NaNs. Moreover, I’ve improved the function slightly to make all correlation value strings equally long.
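
The underlying problem can be reproduced in base R alone. This is a sketch with `cor` rather than `Hmisc::rcorr`; note that base R reports the undefined correlations as `NA` (with a warning) rather than `NaN`.

```r
# A constant column has zero variance, so its correlations are undefined
d <- transform(mtcars[, c("mpg", "hp")], aa = 0)

# cor() warns that the standard deviation is zero and returns NA
r <- suppressWarnings(cor(d))
print(round(r, 3))
```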

I should really stop posting code on my website and open github repositories with change requests ; )

Thanks for noticing this error though!


Hey there, great work on this! I just had one issue: similar to some of your images (e.g. https://paulvanderlaken.files.wordpress.com/2020/07/image-9.png), I am having the symbols " " appear in all fields. Do you know what causes this or how to remove them?


Hi BM, could you share the exact code you used? I can’t see the exact issue based on this information.


I copied the provided code for the function into an R Markdown sheet and used the following code to produce the correlation matrix:

`correlation_matrix(mydata.BM.morning.sleep, show_significance = TRUE, digits = 2, use = "lower", replace_diagonal = TRUE, replacement = "")`


I still can’t deduce the issue here. What happens if you try using the function without providing all the arguments? If you give it just your data? The data is numeric right?


Yeah, I’ve tried it without any arguments and it’s exactly the same. It looks like these images:

The “” don’t appear when using save_correlation_matrix though.


Yeah, those "" are supposed to appear, as they indicate that the correlation coefficients are stored as textual (character) values in R. That is necessary as they are a combination of the numerical coefficients and the textual significance indicators (***). Once you export them to Excel, the "" disappear, as Excel does not use them to indicate that data is textual. Does this clarify your issue?

If you want to create a correlation table in R markdown without the “” you can look into further manipulating the output correlation matrix using the gt package to turn it into a pretty table.
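
A lighter-weight option, if the quotes only bother you in printed output, is base R's `print(..., quote = FALSE)` or `noquote`. A minimal sketch with a made-up 2x2 character matrix standing in for the function's output:

```r
# Made-up character matrix, standing in for the output of correlation_matrix()
m <- matrix(c("1.00", "0.50*", "0.50*", "1.00"), nrow = 2,
            dimnames = list(c("a", "b"), c("a", "b")))

print(m)                 # default printing wraps every cell in quotes
print(m, quote = FALSE)  # same matrix, printed without the quotes
noquote(m)               # wrapper object that always prints without quotes

# capture both printouts to compare them programmatically
quoted   <- capture.output(print(m))
unquoted <- capture.output(print(m, quote = FALSE))
```

The values themselves are unchanged either way; only the way they are displayed differs.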


Ah, yep! That completely makes sense and it’s not an issue as it doesn’t appear in the saved output. Thank you so much for coding this – it’s great!


Hi Paul, this is so great! I was searching for an easy-to-use method for this issue and this one is fantastic! Thanks a lot for this work and especially for sharing it with others, this is how it works!


Hi, this works amazingly, thank you! However, I had to tweak one little thing and wanted to let you know about it: to my understanding, the function does not feed “type” adequately into rcorr, which made me unable to run Spearman correlations. I changed “[…] type = )” to “[…]type = type)” and then it worked. 🙂
