Markdown is a great tool for integrating data analysis and report writing. Rosanna van Hespen wrote a great five-blog guide on how to write your thesis in R Markdown:
Harry Plotter: Part 2 – Hogwarts Houses and their Stereotypes
Two weeks ago, I started the Harry Plotter project to celebrate the 20th anniversary of the first Harry Potter book. I could not have imagined that the first blog would be so well received. It reached over 4000 views in a matter of days thanks to the lovely people in the data science and #rstats community who were kind enough to share it (special thanks to MaraAverick and DataCamp). The response from the Harry Potter community, for instance on reddit, was also just overwhelming.
Part 2: Hogwarts Houses
All in all, I could not resist a sequel and in this second post we will explore the four houses of Hogwarts: Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. At the end of today’s post we will end up with visualizations like this:

Various stereotypes exist regarding these houses and a textual analysis seemed a perfect way to uncover their origins. More specifically, we will try to identify which words are most unique, informative, important or otherwise characteristic for each house by means of ratio and tf-idf statistics. Additionally, we will try to estimate a personality profile for each house using these characteristic words and the emotions they relate to. Again, we rely strongly on ggplot2 for our visualizations, but we will also be using the treemaps of treemapify. Moreover, I have a special surprise this second post, as I found the original Harry Potter font, which will definitely make the visualizations feel more authentic. Of course, we will conduct all analyses in a tidy manner using tidytext and the tidyverse.
I hope you will enjoy this blog and that you’ll be back for more. To be the first to receive new content, please subscribe to my website www.paulvanderlaken.com, follow me on Twitter, or add me on LinkedIn. Additionally, if you would like to contribute to, collaborate on, or need assistance with a data science project or venture, please feel free to reach out.
R Setup
All analyses were performed in RStudio, and knit using rmarkdown so that you can follow my steps.
In terms of setup, we will need some of the same packages as last time. Bradley Boehmke gathered the text of the Harry Potter books in his harrypotter package. We need devtools to install that package the first time, but from then on we can load it as usual. We need plyr for ldply(). We load most other tidyverse packages in a single bundle and add tidytext. Finally, I load the Harry Potter font and set some default plotting options.
# SETUP ####
# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(plyr)
library(tidyverse)
library(tidytext)
# VISUALIZATION SETTINGS
# custom Harry Potter font
# http://www.fontspace.com/category/harry%20potter
library(extrafont)
font_import(paste0(getwd(),"/fontomen_harry-potter"), prompt = F) # load in custom Harry Potter font
windowsFonts(HP = windowsFont("Harry Potter"))
theme_set(theme_light(base_family = "HP")) # set default ggplot theme to light
default_title = "Harry Plotter: References to the Hogwarts houses" # set default title
default_caption = "www.paulvanderlaken.com" # set default caption
dpi = 600 # set default dpi
Importing and Transforming Data
Before we import and transform the data in one large piping chunk, I need to specify some variables.
First, I tell R the house names, which we are likely to need often, so standardization will help prevent errors. Next, my girlfriend was kind enough to help me (colorblind) select the primary and secondary colors for the four houses. Here, the ggplot2 color guide in my R resources list helped a lot! Finally, I specify the regular expression (tutorials) which we will use a couple of times in order to identify whether text includes any of the four house names.
# DATA PREPARATION ####
houses <- c('gryffindor', 'ravenclaw', 'hufflepuff', 'slytherin') # define house names
houses_colors1 <- c("red3", "yellow2", "blue4", "#006400") # specify primary colors
houses_colors2 <- c("#FFD700", "black", "#B87333", "#BCC6CC") # specify secondary colors
regex_houses <- paste(houses, collapse = "|") # regular expression
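To make the regular expression concrete, here is a minimal sketch of how the collapsed pattern behaves with grepl(). It uses base R only, and the three example sentences are made up for illustration; they are not taken from the books.

```r
# Toy illustration (base R only); the example sentences are made up.
houses <- c("gryffindor", "ravenclaw", "hufflepuff", "slytherin")
regex_houses <- paste(houses, collapse = "|")  # alternation: match any house name
print(regex_houses)

toy_sentences <- c(
  "ten points to gryffindor",
  "harry looked at ron",
  "the slytherins jeered"
)

# grepl() returns TRUE when any of the alternatives matches as a substring,
# so the plural "slytherins" also counts as a reference to slytherin
has_house <- grepl(regex_houses, toy_sentences)
print(has_house)
```

Note that the match is case-sensitive. The lowercase pattern works in the pipeline because unnest_tokens() converts text to lowercase by default.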
Import Data and Tidy
Ok, let’s import the data now. You may recognize pieces of the code below from last time, but this version runs slightly smoother after some optimization. Have a look at the current data format.
# LOAD IN BOOK TEXT
houses_sentences <- list(
`Philosophers Stone` = philosophers_stone,
`Chamber of Secrets` = chamber_of_secrets,
`Prisoner of Azkaban` = prisoner_of_azkaban,
`Goblet of Fire` = goblet_of_fire,
`Order of the Phoenix` = order_of_the_phoenix,
`Half Blood Prince` = half_blood_prince,
`Deathly Hallows` = deathly_hallows
) %>%
# TRANSFORM TO TOKENIZED DATASET
ldply(cbind) %>% # bind all chapters to dataframe
mutate(.id = factor(.id, levels = unique(.id), ordered = T)) %>% # identify associated book
unnest_tokens(sentence, `1`, token = 'sentences') %>% # separate sentences
filter(grepl(regex_houses, sentence)) %>% # exclude sentences without house reference
cbind(sapply(houses, function(x) grepl(x, .$sentence))) # identify references
# examine
max.char = 30 # define max sentence length
houses_sentences %>%
mutate(sentence = ifelse(nchar(sentence) > max.char, # cut off long sentences
paste0(substring(sentence, 1, max.char), "..."),
sentence)) %>%
head(5)
## .id sentence gryffindor
## 1 Philosophers Stone "well, no one really knows unt... FALSE
## 2 Philosophers Stone "and what are slytherin and hu... FALSE
## 3 Philosophers Stone everyone says hufflepuff are a... FALSE
## 4 Philosophers Stone "better hufflepuff than slythe... FALSE
## 5 Philosophers Stone "there's not a single witch or... FALSE
## ravenclaw hufflepuff slytherin
## 1 FALSE TRUE TRUE
## 2 FALSE TRUE TRUE
## 3 FALSE TRUE FALSE
## 4 FALSE TRUE TRUE
## 5 FALSE FALSE TRUE
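The last step of the pipeline, which adds one TRUE/FALSE column per house, may be easier to follow on a tiny example. The sketch below uses base R only and two made-up sentences; the names toy and toy_flagged are just for illustration.

```r
# Toy illustration (base R only); the sentences are made up.
houses <- c("gryffindor", "ravenclaw", "hufflepuff", "slytherin")
toy <- data.frame(
  sentence = c("better hufflepuff than slytherin", "gryffindor!"),
  stringsAsFactors = FALSE
)

# sapply() over the house names returns a logical matrix with one row per
# sentence and one column per house, named after the house
flags <- sapply(houses, function(x) grepl(x, toy$sentence))

# cbind() attaches those columns to the sentences
toy_flagged <- cbind(toy, flags)
print(toy_flagged)
```

With a single sentence, sapply() would simplify to a named vector rather than a matrix, so this trick assumes at least two rows.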
Transform to Long Format
Ok, looking great, but not tidy yet. We need to gather the columns and put them in a long dataframe. Thinking ahead, it would be nice to already capitalize the house names, for which I wrote a custom Capitalize() function.
# custom capitalization function
Capitalize = function(text){
paste0(substring(text,1,1) %>% toupper(),
substring(text,2))
}
# TO LONG FORMAT
houses_long <- houses_sentences %>%
gather(key = house, value = test, -sentence, -.id) %>%
mutate(house = Capitalize(house)) %>% # capitalize names
filter(test) %>% select(-test) # delete rows where house not referenced
# examine
houses_long %>%
mutate(sentence = ifelse(nchar(sentence) > max.char, # cut off long sentences
paste0(substring(sentence, 1, max.char), "..."),
sentence)) %>%
head(20)
## .id sentence house
## 1 Philosophers Stone i've been asking around, and i... Gryffindor
## 2 Philosophers Stone "gryffindor," said ron. Gryffindor
## 3 Philosophers Stone "the four houses are called gr... Gryffindor
## 4 Philosophers Stone you might belong in gryffindor... Gryffindor
## 5 Philosophers Stone " brocklehurst, mandy" went to... Gryffindor
## 6 Philosophers Stone "finnigan, seamus," the sandy-... Gryffindor
## 7 Philosophers Stone "gryffindor!" Gryffindor
## 8 Philosophers Stone when it finally shouted, "gryf... Gryffindor
## 9 Philosophers Stone well, if you're sure -- better... Gryffindor
## 10 Philosophers Stone he took off the hat and walked... Gryffindor
## 11 Philosophers Stone "thomas, dean," a black boy ev... Gryffindor
## 12 Philosophers Stone harry crossed his fingers unde... Gryffindor
## 13 Philosophers Stone resident ghost of gryffindor t... Gryffindor
## 14 Philosophers Stone looking pleased at the stunned... Gryffindor
## 15 Philosophers Stone gryffindors have never gone so... Gryffindor
## 16 Philosophers Stone the gryffindor first years fol... Gryffindor
## 17 Philosophers Stone they all scrambled through it ... Gryffindor
## 18 Philosophers Stone nearly headless nick was alway... Gryffindor
## 19 Philosophers Stone professor mcgonagall was head ... Gryffindor
## 20 Philosophers Stone over the noise, snape said, "a... Gryffindor
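For readers without the tidyverse at hand, the wide-to-long step can be sketched in base R. This is not the gather() call used above but a rough equivalent under the same logic: stack one block of rows per house column, then keep only the rows where the house was referenced. The wide toy data are made up.

```r
# Same helper as above, written without the pipe
Capitalize <- function(text) {
  paste0(toupper(substring(text, 1, 1)), substring(text, 2))
}

# Made-up wide data: one logical column per house
wide <- data.frame(
  sentence   = c("better hufflepuff than slytherin", "gryffindor!"),
  gryffindor = c(FALSE, TRUE),
  hufflepuff = c(TRUE, FALSE),
  slytherin  = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)
house_cols <- setdiff(names(wide), "sentence")

# Stack one block of rows per house column (base R stand-in for gather())
long <- do.call(rbind, lapply(house_cols, function(h) {
  data.frame(sentence = wide$sentence,
             house = Capitalize(h),
             test = wide[[h]],
             stringsAsFactors = FALSE)
}))

# Keep rows where the house is referenced and drop the helper column
long <- long[long$test, c("sentence", "house")]
print(long)
```

A sentence that mentions two houses ends up on two rows, once per house, which is exactly what the later word counts per house rely on.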
Visualize House References
Woohoo, so tidy! Now comes the fun part: visualization. The following code plots how often houses are mentioned overall, and in each book separately.
# set plot width & height
w = 10; h = 6
# PLOT REFERENCE FREQUENCY
houses_long %>%
group_by(house) %>%
summarize(n = n()) %>% # count sentences per house
ggplot(aes(x = desc(house), y = n)) +
geom_bar(aes(fill = house), stat = 'identity') +
geom_text(aes(y = n / 2, label = house, col = house), # center text
size = 8, family = 'HP') +
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = 'none') +
labs(title = default_title,
subtitle = "Combined references in all Harry Potter books",
caption = default_caption,
x = '', y = 'Name occurrence') +
coord_flip()

# PLOT REFERENCE FREQUENCY OVER TIME
houses_long %>%
group_by(.id, house) %>%
summarize(n = n()) %>% # count sentences per house per book
ggplot(aes(x = .id, y = n, group = house)) +
geom_line(aes(col = house), size = 2) +
scale_color_manual(values = houses_colors1) +
theme(legend.position = 'bottom',
axis.text.x = element_text(angle = 15, hjust = 0.5, vjust = 0.5)) + # rotate x axis text
labs(title = default_title,
subtitle = "References throughout the Harry Potter books",
caption = default_caption,
x = NULL, y = 'Name occurrence', color = 'House')

The Harry Potter font looks wonderful, right?
In terms of the data, Gryffindor and Slytherin definitely play a larger role in the Harry Potter stories. However, as the storyline progresses, Slytherin as a house seems to lose its importance. Their downward trend since the Chamber of Secrets results in Ravenclaw being mentioned more often in the final book (Edit – this is likely due to the diadem horcrux, as you will see later on).
I can’t but feel sorry for house Hufflepuff, which never really gets too involved throughout the saga.
Retrieve Reference Words & Data
Let’s dive into the specific words used in combination with each house. The following code retrieves and counts the single words used in the sentences where houses are mentioned.
# IDENTIFY WORDS USED IN COMBINATION WITH HOUSES
words_by_houses <- houses_long %>%
unnest_tokens(word, sentence, token = 'words') %>% # retrieve words
mutate(word = gsub("'s", "", word)) %>% # remove possessive determiners
group_by(house, word) %>%
summarize(word_n = n()) # count words per house
# examine
words_by_houses %>% head()
## # A tibble: 6 x 3
## # Groups: house [1]
## house word word_n
## <chr> <chr> <int>
## 1 Gryffindor 104 1
## 2 Gryffindor 22nd 1
## 3 Gryffindor a 251
## 4 Gryffindor abandoned 1
## 5 Gryffindor abandoning 1
## 6 Gryffindor abercrombie 1
Visualize Word-House Combinations
Now we can visualize which words relate to each of the houses. Because facet_wrap() has trouble reordering the axes (words may relate to multiple houses in different frequencies), I needed some custom functionality, which I happily recycled from dgrtwo’s github. With these reorder_within() and scale_x_reordered() functions we can now make an ordered barplot of the top-20 most frequent words per house.
# custom functions for reordering facet plots
# https://github.com/dgrtwo/drlib/blob/master/R/reorder_within.R
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
new_x <- paste(x, within, sep = sep)
reorder(new_x, by, FUN = fun)
}
scale_x_reordered <- function(..., sep = "___") {
reg <- paste0(sep, ".+$")
ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}
# set plot width & height
w = 10; h = 7;
# PLOT MOST FREQUENT WORDS PER HOUSE
words_per_house = 20 # set number of top words
words_by_houses %>%
group_by(house) %>%
arrange(house, desc(word_n)) %>%
mutate(top = row_number()) %>% # count word top position
filter(top <= words_per_house) %>% # retain specified top number
ggplot(aes(reorder_within(word, -top, house), # reorder by minus top number
word_n, fill = house)) +
geom_col(show.legend = F) +
scale_x_reordered() + # rectify x axis labels
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
facet_wrap(~ house, scales = "free_y") + # facet wrap and free y axis
coord_flip() +
labs(title = default_title,
subtitle = "Words most commonly used together with houses",
caption = default_caption,
x = NULL, y = 'Word Frequency')
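To see what the reordering helper actually does, here is a small sketch on made-up counts. It needs base R only, because reorder_within() relies on paste() and reorder(). The gsub() call at the end mimics the label cleaning that scale_x_reordered() applies on the axis.

```r
# Helper copied from the code above
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  reorder(new_x, by, FUN = fun)
}

# Made-up counts: "harry" occurs with both houses, at different frequencies
toy <- data.frame(
  house  = c("Gryffindor", "Gryffindor", "Slytherin", "Slytherin"),
  word   = c("harry", "tower", "harry", "malfoy"),
  word_n = c(10, 3, 4, 7),
  stringsAsFactors = FALSE
)

# Each word is suffixed with its house, so "harry" gets a separate
# position per facet; levels are sorted by word_n in ascending order
ordered_words <- reorder_within(toy$word, toy$word_n, toy$house)
print(levels(ordered_words))

# Stripping the suffix again yields the labels shown on the axis
axis_labels <- gsub("___.+$", "", levels(ordered_words))
print(axis_labels)
```

The suffix is what lets the same word sit at a different rank in each facet; without it, a factor could only hold one position for "harry".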

Unsurprisingly, several stop words occur most frequently in the data. Intuitively, we would rerun the code but use dplyr::anti_join() on tidytext::stop_words to remove stop words.
# PLOT MOST FREQUENT WORDS PER HOUSE
# EXCLUDING STOPWORDS
words_by_houses %>%
anti_join(stop_words, 'word') %>% # remove stop words
group_by(house) %>%
arrange(house, desc(word_n)) %>%
mutate(top = row_number()) %>% # count word top position
filter(top <= words_per_house) %>% # retain specified top number
ggplot(aes(reorder_within(word, -top, house), # reorder by minus top number
word_n, fill = house)) +
geom_col(show.legend = F) +
scale_x_reordered() + # rectify x axis labels
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
facet_wrap(~ house, scales = "free") + # facet wrap and free scales
coord_flip() +
labs(title = default_title,
subtitle = "Words most commonly used together with houses, excluding stop words",
caption = default_caption,
x = NULL, y = 'Word Frequency')

However, some stop words have a different meaning in the Harry Potter universe. For instance, points is quite informative about the Hogwarts houses but is included in stop_words.
Moreover, many of the most frequent words above occur in relation to multiple or all houses. Take, for instance, Harry and Ron, which are in the top-10 of each house, or words like table, house, and professor.
We are more interested in words that describe one house, but not another. Similarly, we only want to exclude stop words which are really irrelevant. To this end, we compute a ratio-statistic below. This statistic displays how frequently a word occurs in combination with one house rather than with the others. However, we need to adjust this ratio for how often houses occur in the text as more text (and thus words) is used in reference to house Gryffindor than, for instance, Ravenclaw.
words_by_houses <- words_by_houses %>%
group_by(word) %>% mutate(word_sum = sum(word_n)) %>% # counts words overall
group_by(house) %>% mutate(house_n = n()) %>%
ungroup() %>%
# compute ratio of usage in combination with house as opposed to overall
# adjusted for house references frequency as opposed to overall frequency
mutate(ratio = (word_n / (word_sum - word_n + 1) / (house_n / n())))
# examine
words_by_houses %>% select(-word_sum, -house_n) %>% arrange(desc(word_n)) %>% head()
## # A tibble: 6 x 4
## house word word_n ratio
## <chr> <chr> <int> <dbl>
## 1 Gryffindor the 1057 2.373115
## 2 Slytherin the 675 1.467926
## 3 Gryffindor gryffindor 602 13.076218
## 4 Gryffindor and 477 2.197259
## 5 Gryffindor to 428 2.830435
## 6 Gryffindor of 362 2.213186
# PLOT MOST UNIQUE WORDS PER HOUSE BY RATIO
words_by_houses %>%
group_by(house) %>%
arrange(house, desc(ratio)) %>%
mutate(top = row_number()) %>% # count word top position
filter(top <= words_per_house) %>% # retain specified top number
ggplot(aes(reorder_within(word, -top, house), # reorder by minus top number
ratio, fill = house)) +
geom_col(show.legend = F) +
scale_x_reordered() + # rectify x axis labels
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
facet_wrap(~ house, scales = "free") + # facet wrap and free scales
coord_flip() +
labs(title = default_title,
subtitle = "Most informative words per house, by ratio",
caption = default_caption,
x = NULL, y = 'Adjusted Frequency Ratio (house vs. non-house)')

# PS. normally I would make a custom ggplot function
# when I plot three highly similar graphs
This ratio statistic (x-axis) should be interpreted as follows: night is used 29 times more often in combination with Gryffindor than with the other houses.
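A quick worked example may help with reading the formula. The sketch below wraps the same expression in a hypothetical helper, house_ratio(), and feeds it made-up counts chosen to give round numbers; they are not the actual counts behind the plot.

```r
# Same expression as in the pipeline, wrapped in a hypothetical helper.
# word_n   : times the word occurs with this house
# word_sum : times the word occurs with any house
# house_n  : number of rows (distinct words) for this house
# total_n  : number of rows across all houses
house_ratio <- function(word_n, word_sum, house_n, total_n) {
  (word_n / (word_sum - word_n + 1)) / (house_n / total_n)
}

# Made-up counts: 29 of 30 occurrences fall with one house,
# and that house accounts for half of all rows
frequent_word <- house_ratio(word_n = 29, word_sum = 30,
                             house_n = 2500, total_n = 5000)
print(frequent_word)

# A word seen exactly once, with this house only: the + 1 in the
# denominator avoids dividing by zero
rare_word <- house_ratio(word_n = 1, word_sum = 1,
                         house_n = 2500, total_n = 5000)
print(rare_word)
```

The second call shows why rare words can still rank high: a word that occurs once, and only with one house, already scores the inverse of that house's share of rows.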
Do you think the results make sense:
- Gryffindors spent dozens of hours during their afternoons, evenings, and nights in the, often empty, tower room, apparently playing chess? Neville Longbottom and Hermione Granger are Gryffindors, obviously, and Sirius Black is also on the list. The sword of Gryffindor is no surprise here either.
- Hannah Abbott, Ernie Macmillan and Cedric Diggory are Hufflepuffs. Were they mostly hot curly blondes interested in herbology? Nevertheless, wild and aggressive seem unfitting for Hogwarts’ most boring house.
- A lot of names on the list of Rowena Ravenclaw’s house. Roger Davies, Padma Patil, Cho Chang, Miss S. Fawcett, Stewart Ackerley, Terry Boot, and Penelope Clearwater are indeed Ravenclaws, I believe. Ravenclaw’s Diadem was one of Voldemort’s horcruxes. Alecto Carrow, Death Eater by profession, was apparently sent on a mission by Voldemort to surprise Harry in Ravenclaw’s common room (source), which explains what she does on this list. Can anybody tell me how bust, statue and spot relate to Ravenclaw?
- House Slytherin is best represented by Gregory Goyle, one of the members of Draco Malfoy’s gang along with Vincent Crabbe. Pansy Parkinson also represents house Slytherin. Slytherins are famous for speaking Parseltongue and their house’s gem is an emerald. House Gaunt were pure-blood descendants of Salazar Slytherin and apparently Viktor Krum would not have misrepresented the Slytherin values either. Oh, and only the heir of Slytherin could control the monster in the Chamber of Secrets.
Honestly, I was not expecting such good results! However, there is always room for improvement.
We may want to exclude words that only occur once or twice in the books (e.g., Alecto) as well as the house names. Additionally, these barplots are not the optimal visualization if we would like to include more words per house. Fortunately, Hadley Wickham helped me discover treemaps. Let’s draw one using the ggfittext and the treemapify packages.
# set plot width & height
w = 12; h = 8;
# PACKAGES FOR TREEMAP
# devtools::install_github("wilkox/ggfittext")
# devtools::install_github("wilkox/treemapify")
library(ggfittext)
library(treemapify)
# PLOT MOST UNIQUE WORDS PER HOUSE BY RATIO
words_by_houses %>%
filter(word_n > 3) %>% # filter words with few occurrences
filter(!grepl(regex_houses, word)) %>% # exclude house names
group_by(house) %>%
arrange(house, desc(ratio), desc(word_n)) %>%
mutate(top = seq_along(ratio)) %>%
filter(top <= words_per_house) %>% # filter top n words
ggplot(aes(area = ratio, label = word, subgroup = house, fill = house)) +
geom_treemap() + # create treemap
geom_treemap_text(aes(col = house), family = "HP", place = 'center') + # add text
geom_treemap_subgroup_text(aes(col = house), # add house names
family = "HP", place = 'center', alpha = 0.3, grow = T) +
geom_treemap_subgroup_border(colour = 'black') +
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
theme(legend.position = 'none') +
labs(title = default_title,
subtitle = "Most informative words per house, by ratio",
caption = default_caption)

A treemap can display more words for each of the houses and displays their relative proportions better. New words regarding the houses include the following, but do you see any others?
- Slytherin girls laugh out loud whereas Ravenclaw had a few little, pretty girls?
- Gryffindors, at least Harry and his friends, got in trouble often, that is a fact.
- Yellow is the color of house Hufflepuff whereas Slytherin is green indeed.
- Zacharias Smith joined Hufflepuff and Luna Lovegood Ravenclaw.
- Why is Voldemort in camp Ravenclaw?!
In the earlier code, we specified a minimum number of occurrences for words to be included, which is a bit hacky but necessary to make the ratio statistic work as intended. Fortunately, there are other ways to estimate how unique or informative words are to houses that do not require such hacks.
TF-IDF
tf-idf similarly estimates how unique / informative words are for a body of text (for more info: Wikipedia). We can calculate a tf-idf score for each word within each document (in our case house texts) by taking the product of two statistics:
- TF or term frequency, meaning how often the word occurs in a document relative to the total number of words in that document.
- IDF or inverse document frequency, specifically the logarithm of the total number of documents divided by the number of documents the word occurs in.
A high tf-idf score means that a word occurs relatively often in a specific document and not often in other documents. Different weighting schemes can be used to improve tf-idf’s performance in different settings, but we used the simple default of tidytext::bind_tf_idf().
An advantage of tf-idf over the earlier ratio statistic is that we no longer need to specify a minimum frequency: low frequency words will have low tf and thus low tf-idf. A disadvantage is that tf-idf will automatically disregard words that occur together with every house, be it only once: these words have zero idf (log(4/4)) and thus zero tf-idf.
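The zero-idf effect is easy to verify by hand. The following sketch computes tf, idf, and tf-idf in base R on made-up counts, assuming the weighting described above (relative term frequency and a natural-log idf). It is meant as an illustration of the arithmetic, not as a replacement for bind_tf_idf().

```r
# Made-up counts per house "document"; "harry" occurs with all four houses
counts <- data.frame(
  house = c("Gryffindor", "Gryffindor", "Hufflepuff",
            "Ravenclaw", "Slytherin", "Slytherin"),
  word  = c("harry", "sword", "harry", "harry", "harry", "emerald"),
  n     = c(8, 2, 5, 5, 6, 4),
  stringsAsFactors = FALSE
)

n_docs <- length(unique(counts$house))  # four houses, so four documents

# tf: word count divided by the total word count of its house
counts$tf <- counts$n / ave(counts$n, counts$house, FUN = sum)

# idf: log of (number of documents / number of documents containing the word)
docs_with_word <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_docs / docs_with_word)

counts$tf_idf <- counts$tf * counts$idf
print(counts)
```

Here harry gets log(4/4) = 0 and drops out entirely, however often it occurs, while sword and emerald each get an idf of log(4/1) because they occur with a single house.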
Let’s run the treemap again, but now on the computed tf-idf scores.
words_by_houses <- words_by_houses %>%
# compute term frequency and inverse document frequency
bind_tf_idf(word, house, word_n)
# examine
words_by_houses %>% select(-house_n) %>% head()
## # A tibble: 6 x 8
## house word word_n word_sum ratio tf idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Gryffindor 104 1 1 2.671719 6.488872e-05 1.3862944
## 2 Gryffindor 22nd 1 1 2.671719 6.488872e-05 1.3862944
## 3 Gryffindor a 251 628 1.774078 1.628707e-02 0.0000000
## 4 Gryffindor abandoned 1 1 2.671719 6.488872e-05 1.3862944
## 5 Gryffindor abandoning 1 2 1.335860 6.488872e-05 0.6931472
## 6 Gryffindor abercrombie 1 1 2.671719 6.488872e-05 1.3862944
## # ... with 1 more variables: tf_idf <dbl>
# PLOT MOST UNIQUE WORDS PER HOUSE BY TF_IDF
words_per_house = 30
words_by_houses %>%
filter(tf_idf > 0) %>% # filter for zero tf_idf
group_by(house) %>%
arrange(house, desc(tf_idf), desc(word_n)) %>%
mutate(top = seq_along(tf_idf)) %>%
filter(top <= words_per_house) %>%
ggplot(aes(area = tf_idf, label = word, subgroup = house, fill = house)) +
geom_treemap() + # create treemap
geom_treemap_text(aes(col = house), family = "HP", place = 'center') + # add text
geom_treemap_subgroup_text(aes(col = house), # add house names
family = "HP", place = 'center', alpha = 0.3, grow = T) +
geom_treemap_subgroup_border(colour = 'black') +
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
theme(legend.position = 'none') +
labs(title = default_title,
subtitle = "Most informative words per house, by tf-idf",
caption = default_caption)

This plot looks quite different from its predecessor. For instance, Marcus Flint and Adrian Pucey are added to house Slytherin and Hufflepuff’s main color is indeed not just yellow, but canary yellow. Severus Snape’s dual role is also nicely depicted now, with him in both house Slytherin and house Gryffindor. Do you notice any other important differences? Did we lose any important words because they occurred in each of our four documents?
House Personality Profiles (by NRC Sentiment Analysis)
We end this second Harry Plotter blog by examining to what extent the stereotypes that exist of the Hogwarts Houses can be traced back to the books. To this end, we use the NRC sentiment dictionary (see also the previous blog), with which we can estimate to what extent the most informative words for houses (we have over a thousand for each house) relate to emotions such as anger, fear, or trust.
The code below retains only the emotion words in our words_by_houses dataset and multiplies their tf-idf scores by their relative frequency, so that we retrieve one score per house per sentiment.
# PLOT SENTIMENT OF INFORMATIVE WORDS (TFIDF)
words_by_houses %>%
inner_join(get_sentiments("nrc"), by = 'word') %>%
group_by(house, sentiment) %>%
summarize(score = sum(word_n / house_n * tf_idf)) %>% # compute emotion score
ggplot(aes(x = house, y = score, group = house)) +
geom_col(aes(fill = house)) + # create barplots
geom_text(aes(y = score / 2, label = substring(house, 1, 1), col = house),
family = "HP", vjust = 0.5) + # add house letter in middle
facet_wrap(~ Capitalize(sentiment), scales = 'free_y') + # facet and free y axis
scale_fill_manual(values = houses_colors1) +
scale_color_manual(values = houses_colors2) +
theme(legend.position = 'none', # tidy dataviz
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.text.x = element_text(colour = 'black', size = 12)) +
labs(title = default_title,
subtitle = "Sentiment (nrc) related to houses' informative words (tf-idf)",
caption = default_caption,
y = "Sentiment score", x = NULL)
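The scoring step can be traced on a toy example. The sketch below uses base R's merge() and aggregate() as stand-ins for inner_join() and summarize(), with a tiny made-up lexicon in place of the NRC dictionary. All words, counts, and tf-idf values are invented for illustration.

```r
# Made-up word statistics per house
toy_words <- data.frame(
  house   = c("Gryffindor", "Gryffindor", "Slytherin"),
  word    = c("brave", "sword", "hate"),
  word_n  = c(4, 2, 3),
  house_n = c(10, 10, 6),
  tf_idf  = c(0.5, 0.2, 0.4),
  stringsAsFactors = FALSE
)

# Made-up lexicon: a word may carry more than one sentiment
toy_lexicon <- data.frame(
  word      = c("brave", "hate", "hate"),
  sentiment = c("positive", "anger", "negative"),
  stringsAsFactors = FALSE
)

# Inner join: words missing from the lexicon ("sword") are dropped
joined <- merge(toy_words, toy_lexicon, by = "word")

# Weight each word's tf-idf by word_n / house_n, then sum per house and sentiment
joined$score <- joined$word_n / joined$house_n * joined$tf_idf
scores <- aggregate(score ~ house + sentiment, data = joined, FUN = sum)
print(scores)
```

Because a word can be tagged with several sentiments, it contributes its full score to each of them, as hate does here for both anger and negative.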

The results to a large extent confirm the stereotypes that exist regarding the Hogwarts houses:
- Gryffindors are full of anticipation and the most positive and trustworthy.
- Hufflepuffs are the most joyous but not extraordinary on any other front.
- Ravenclaws are distinguished by their low scores. They are super not-angry and relatively not-anticipating, not-negative, and not-sad.
- Slytherins are the angriest, the saddest, and the most feared and disgusting. However, they are also relatively joyous (optimistic?) and very surprising (shocking?).
Conclusion and future work
With this we have come to the end of the second part of the Harry Plotter project, in which we used tf-idf and ratio statistics to examine which words were most informative / unique to each of the houses of Hogwarts. The data was retrieved using the harrypotter package and transformed using tidytext and the tidyverse. Visualizations were made with ggplot2 and treemapify, using a Harry Potter font.
I have several ideas for subsequent posts and I’d love to hear your preferences or suggestions:
- I would like to demonstrate how regular expressions can be used to retrieve (sub)strings that follow a specific format. We could use regex to examine, for instance, when, and by whom, which magical spells are cast.
- I would like to use network analysis to examine the interactions between the characters. We could retrieve networks from the books and conduct sentiment analysis to establish the nature of relationships. Similarly, we could use unsupervised learning / clustering to explore character groups.
- I would like to use topic models, such as latent Dirichlet allocation, to identify the main topics in the books. We could, for instance, try to summarize each book chapter in a single sentence, or examine how topics (e.g., love or death) build or fall over time.
- Finally, I would like to build an interactive application / dashboard in Shiny (another hobby of mine) so that readers like you can explore patterns in the books yourself. Unfortunately, the free tier on shinyapps.io only offers 25 hosting hours per month :(
For now, I hope you enjoyed this blog and that you’ll be back for more. To receive new content first, please subscribe to my website www.paulvanderlaken.com, follow me on Twitter, or add me on LinkedIn.
If you would like to contribute to, collaborate on, or need assistance with a data science project or venture, please feel free to reach out.
R resources (free courses, books, tutorials, & cheat sheets)
Help yourself to these free books, tutorials, packages, cheat sheets, and many more materials for R programming. There’s a separate overview for handy R programming tricks. If you have additions, please comment below or contact me!
LAST UPDATED: 2021-09-24
Table of Contents (clickable)
- Beginner
- Advanced
- Cheat sheets
- Data manipulation
- Data visualization
- Dashboards & Shiny
- Markdown
- Database connections
- Machine learning
- Text mining
- Geospatial analysis
- Bioinformatics
- R IDEs
- Software & language connections
- Help
- Blogs
- Conferences, Events, & Groups
- Jobs
- Other tips & tricks
Completely new to R? → Start learning here!
Introductory R
Introductory Books
- Introduction to R (R Core Team, 1999)
- R Language Definition (Manual) (R Core Team, 2000)
- Data Import/Export (R Core Team, 2000)
- SimpleR (Verzani, 2001-2)
- R for Beginners (Paradis, 2002)
- Introduction to R (Spector, 2004)
- Ecological Models and Data in R (Bolker, 2007)
- Software for Data Analysis: Programming with R (Chambers, 2008)
- Econometrics in R (Farnsworth, 2008)
- The Art of R Programming (Matloff, 2009)
- R in a Nutshell (Adler, 2010)
- R in Action: Data Analysis and Graphics with R (Kabacoff, 2011)
- R for Psychology Experiments and Questionnaires (Baron, 2011)
- The R Inferno (Burns, 2011)
- Cookbook for R (Chang, ???)
- The R Book (Crawley, 2013)
- Introduction to Data Technologies (Murrell, 2013)
- Introduction to Statistical Thought (Lavine, 2013)
- A (very) short introduction to R (Torfs & Bauer, 2014)***
- Advanced R (Wickham, 2014)
- Introduction to R (Vaidyanathan, 2014)
- Learning statistics with R (Navarro, 2014)
- Programming for Psychologists (Crump, 2014)
- IPSUR: Introduction to Probability and Statistics Using R (Kerns, 2014)
- Hands-On Programming with R (Grolemund, 2014)
- Getting used to R, RStudio, and R Markdown (2016)
- Introduction to R (Venables, Smith, & R Core Team, 2017)
- The R Language Definition (R Core Team, 2017)
- Functional Programming and Unit Testing for Data Munging with R (Rodrigues, 2017)
- YaRrr! The Pirate’s Guide to R (Phillips, 2017)***
- R for Data Science (Grolemund & Wickham, 2017)***
- An Introduction to Statistical and Data Sciences via R (Ismay & Kim, 2018) by ModernDive
- Answering questions with data (Crump, 2018)
- Statistical Thinking for the 21st Century (Poldrack, 2018)
- R Notes for Professionals book (Goalkicker, 2018)
- Learning Statistics with R (Navarro, 2019)
- R Graphics Cookbook – 2nd edition (Chang, 2019)
- Introduction to Open Data Science (The Ocean Health Index Team, 2019)
- Data Science with R: A Resource Compendium (Monkman, 2019)
- R in Action: Third Edition (Kabacoff, 2019)
- A Practical Extension of Introductory Statistics in Psychology using R (Pongpipat, Miranda, & Kmiecik, 2019)
- R for Marketing Students (Samuel Franssens, ????)
Online Courses
- swirl()***
- Join #TidyTuesday***
- Try R by Code School
- Learn R by R-Exercises
- R Tutorial by Cyclismo & DataCamp
- 100 Tutorials for Learning R
- Introduction to R by DataCamp
- YaRrr! The Pirate’s Guide to R (Video)
- R for Cats
- Chromebook Data Science (CBDS) – Introduction to R
- Learning R by Doing – A Learning Experiment in RStudio and GitHub
- R Bootcamp by Jared Knowles
- Hands-on Introduction to Statistics with R by DataCamp.com
- R Course in Statistics by PagePiccini.com
- Data Analysis in R by DataQuest.io
- Data Science: R Basics @edX
- Introduction to R for Data Science @edX
- Introduction to R workshop by Chris Bilder
- Data Analysis and Visualization Using R @VarianceExplained
- Programming @Coursera*** by Roger Peng, Jeff Leek, & Brian Caffo
- Intro to R by Bradley Boehmke
- Intermediate R by Bradley Boehmke
- Advanced R by Bradley Boehmke
- Wrangling data in the Tidyverse – useR! 2018 tutorial by Simon Jackson
- Winter R Bootcamp 2015 by Sean Cross
- RStudio’s Data Science Course in a Box
- Data Carpentry Social Science Workshop in R
- Youtube R classes by Chris Bilder
- 37 Youtube R Tutorials by Flavio Azevedo***
- Essential R tutorials by Gilad Feldman
- Data Carpentry Social Science in R
- Statistics and R, by Rafael Irizarry and Michael Love
- Learn R via R-coder.com
- A Psychologist’s Guide to R (pdf) by Sean Chris Murphy
- Learn R for Psychology research: A crash course
- Social Sciences: Critically Analyze Research and Results Using R by Coursera
- LearnR – R Programming for Behavioral Scientists
- Data Science for Social Scientists
- Boston University workshop – Tricks for cleaning your data
- University of Oklahoma – Econometrics lab sessions by Tyler Ransom
- University of British Columbia – STAT 545A and 547M – Data wrangling, exploration, and analysis with R
- University of California – Business Analytics R Programming Guide
- University of Oregon – Summer School 2018 R Bootcamp by Jessica Kosie
- University of Oregon – Data science for economists
- Oregon Health & Science University – CS631 Principles & Practice of Data Visualization
- Brooklyn College of CUNY – Psych7709 Using R for Reproducible Research
- University of Illinois – A language, not a letter: Learning Statistics in R
- University College London – Statistical Computing with R Programming Language: A Gentle Introduction
- University of Glasgow – psyTeachR Teaching Reproducible Research
- GitHub repository rstats-ed, including many additional courses and learning materials
- Gilad Feldman’s How To R Guides
Style Guides
- Google’s R style guide
- Tidyverse style guide by Hadley Wickham
- Advanced R style guide by Hadley Wickham
- R style guide for stat405 by Hadley Wickham
- R style guide by Colin Gillespie
- Best practices for R Coding by Arnaud Amsellem / The R Trader
- The State of Naming Conventions in R (Bååth, 2012)
- A guide for switching from base R to the tidyverse
Advanced R
- Advanced R – 1st ed. (Wickham, 2014)
- Advanced R – 2nd ed. (Wickham, 2018)***
- Efficient R Programming (Gillespie & Lovelace, 2017)
- Writing R extensions
- Happy Git and GitHub for the useR (Jenny Bryan, 2017)
- RStudio addins by Dean Attali
Package Development
- Mastering Software Development in R (Peng, Kross, & Anderson, 2017)
- R Packages (Wickham & Bryan, ???)
- rOpenSci Packages: Development, Maintenance, and Peer Review
- How to develop good R packages (for open science) by Maëlle Salmon
- Tutorial on creating R packages by Friedrich Leisch
- Developing R Packages by Jeff Leek
- Writing an R package from scratch by Hilary Parker
- Write your own R package by STAT545
- Making an R Package, by R.M. Ripley
- Prepare your package for CRAN
- Introduction to roxygen2 by Hadley Wickham
- How to build package vignettes with knitr by Yihui Xie
- knitr in a nutshell: a minimal tutorial by Karl Broman
- Rtools: Building R for Windows by Brian Ripley, Duncan Murdoch, and Jeroen Ooms
- devtools – tools to make an R developer’s life easier
- roxygen2 – tools for describing functions in comments next to their definitions
- Rd2roxygen – tools for converting Rd to roxygen documentation
- testthat – tools that simplify the testing of R packages
Non-standard Evaluation
- Tidy evaluation explained in 5 minutes via YouTube
- Tidy evaluation (Henry & Wickham, 2018)
- Tidy evaluation webinar by RStudio
- IV metaprogramming chapters of Advanced R (Wickham, 2014)
- tidyeval tutorial by Ian Lyttle
Functional Programming
- Writing Functions in R by Hadley Wickham via DataCamp.com
- R for Data Science chapters on Functions and Iteration (Grolemund & Wickham, 2018)***
- Advanced R chapter on Functions (Wickham, 2014)
- Lesson on writing, testing, and documenting custom functions by Software-Carpentry.org
- User-defined R functions tutorial by Carlo Fanara via DataCamp.com
- Functional programming lecture by Duke University
- purrr tutorial by Jenny Bryan***
- Intro to purrr tutorial by Emorie Beck
- Learn purrr tutorial by Dan Ovando
- purrr cheat sheet by RStudio
Cheat Sheets
- Getting started in R, by Saghir Bashir***
- Base R cheat sheet by Mhairi McNeill***
- Base R functions cheat sheet by Tom Short
- Basic R cheat sheet by Quandl.com
- R function abbreviations cheat sheet by Jeromy Anglim
- RStudio cheat sheet by RStudio
- RStudio keyboard shortcuts by RStudio***
- Data management in R cheat sheet
- data.table cheat sheet by Erik Petrovski
- data.table wide cheat sheet by DataCamp
- data.table long cheat sheet by DataCamp
- Advanced R cheat sheet by Arianne Colton & Sean Chen
- tidyverse cheat sheet by DataCamp
- Data import cheat sheet by RStudio with readr, tibble, and tidyr
- Factor manipulation with forcats cheat sheet by Lise Vaudor
- Data transformation cheat sheet by RStudio with dplyr
- Data transformation cheat sheet 2 by Daniel Lüdecke with dplyr and sjmisc
- Data visualization cheat sheet by RStudio with ggplot2
- Data wrangling cheat sheet by RStudio with dplyr and tidyr
- Automate random assignment and sampling cheat sheet with randomizr by Alex Coppock
- Cheat sheet for the mosaic package teaching math, stats, computation, and modelling, by Michael Laviolette
- Character string manipulation cheat sheet by RStudio with stringr
- Dates and times cheat sheet by RStudio with lubridate
- Split-Apply-Combine cheat sheet by Ernest Adrogue Calvera
- purrr functional programming cheat sheet by RStudio
- Tidy evaluation cheat sheet by Edwin Thoen
- cartography cheat sheet by Timothee Giraud
- bayesplot cheat sheet by Edward Roualdes
- R package development cheat sheet with devtools
- R syntax comparison cheat sheet by Amelia McNamara
- xts cheat sheet for time series by DataCamp
- RStudio cheat sheet GitHub
- reticulate cheat sheet by RStudio
Many of the above cheat sheets are hosted in the official RStudio cheat sheet overview.
Data Manipulation
- Introduction to data.table
- Comparison between data.table & dplyr code
- data.table cheat sheet by Erik Petrovski
- data.table wide cheat sheet by DataCamp
- data.table long cheat sheet by DataCamp
- dplyr cheat sheet by RStudio
- Pipes in R Tutorial For Beginners
Data Visualization
- R graph gallery & code examples***
- R charts: A collection of charts and graphs made with R code
- Fundamentals of Data Visualization (Wilke, 2018)
- Exploratory Data Analysis and Visualization (Bogart & Robbins, 2018)
- R base plots wiki reference guide
- Guide to (base) graphics and visualization in R, by StatisticsGlobe
- CRAN Task View – Graphics & Visualization
- R graphical parameters cheat sheet by Flowingdata.com
- MPA 635: Data Visualization course by Andrew Heiss
Colors
- R Color Guide***
- colourpicker – widget that allows users to choose colours
- paletteer – comprehensive collection of color palettes in R***
- ggplot2 colour guide***
- Canva’s 100 color palette included in ggthemes::scale_color_canva
- Wes Anderson color palettes
- Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
- Picular.co – Google, but for colors
Interactive / HTML / JavaScript widgets
- R HTML Widgets Gallery***
- plotly – interactive plots
- billboarder – easy interface to billboard.js, a JavaScript chart library based on D3
- d3heatmap – interactive D3 heatmaps
- altair – Vega-Lite visualizations via Python
- DT – interactive tables
- DiagrammeR – interactive diagrams (DiagrammeR cheat sheet)
- dygraphs – interactive time series plots
- formattable – formattable data structures
- ggvis – interactive ggplot2
- highcharter – interactive Highcharts plots
- leaflet – interactive maps
- metricsgraphics – interactive JavaScript bare-bones line, scatterplot and bar charts
- networkD3 – interactive D3 network graphs
- scatterD3 – interactive scatterplots with D3
- rbokeh – interactive Bokeh plots
- rCharts – interactive Javascript charts
- rcdimple – interactive JavaScript bar charts and others
- rglwidget – interactive 3d plots
- threejs – interactive 3d plots and globes
- visNetwork – interactive network graphs
- wordcloud2 – interface to wordcloud2.js
- timevis – interactive timelines
ggplot2
- Code examples of top-50 ggplot2 visualizations***
- ggplot2 Cheatsheet by RStudio
- ggplot2 Quick Reference Guide
- ggplot2 Code Snippets
- ggplot2 Code Snippets 2
- Hitchhiker’s Guide to ggplot2 in R (Burchell & Vargas, 2016)
- A practical introduction with R and ggplot2 (Healy, 2017)
- Data Visualization: A practical introduction (Healy, 2018)
- Complete ggplot2 Tutorial
- Principles & Practice of Data Visualization CS631 at Oregon Health & Science University
- Data visualization cheat sheet by RStudio with ggplot2
- Setting custom ggplot themes with ggthemr
- Creating custom, reproducible color palettes by Simon Jackson
- Rearranging values within ggplot2 facets
- Combine plots using patchwork or cowplot
- esquisse – RStudio addin to interactively explore data with ggplot2 without coding
ggplot2 extensions
- ggplot2 extensions overview***
- ggthemes – plot style themes
- hrbrthemes – opinionated, typographic-centric themes
- ggmap – maps with Google Maps, Open Street Maps, etc.
- ggiraph – interactive ggplots
- gghighlight – highlight lines or values, see vignette
- ggstance – horizontal versions of common plots
- GGally – scatterplot matrices
- ggalt – additional coordinate systems, geoms, etc.
- ggbeeswarm – column scatter plots or violin scatter plots
- ggforce – additional geoms, see visual guide
- ggrepel – prevent plot labels from overlapping
- ggraph – graphs, networks, trees and more
- ggpmisc – photo-biology related extensions
- geomnet – network visualization
- ggExtra – marginal histograms for a plot
- gganimate – animations, see also the gganimate wiki page
- ggpage – pagestyled visualizations of text based data
- ggpmisc – useful additional geom_* and stat_* functions
- ggstatsplot – include details from statistical tests in plots
- ggspectra – tools for plotting light spectra
- ggnetwork – geoms to plot networks
- ggpointdensity – cross between a scatter plot and a 2D density plot
- ggradar – radar charts
- ggsurvplot (survminer) – survival curves
- ggseas – seasonal adjustment tools
- ggthreed – (evil) 3D geoms
- ggtech – style themes for plots
- ggtern – ternary diagrams
- ggTimeSeries – time series visualizations
- ggtree – tree visualizations
- treemapify – wilcox’s treemaps
- seewave – spectrograms
Miscellaneous
- coefplot – visualizes model statistics
- circlize – circular visualizations for categorical data
- clustree – visualize clustering analysis
- quantmod – candlestick financial charts
- dabestr – Data Analysis using Bootstrap-Coupled ESTimation
- devoutsvg – an SVG graphics device (with pattern fills)
- devoutpdf – a PDF graphics device
- cartography – create and integrate maps in your R workflow
- colorspace – HSL based color palettes
- viridis – Matplotlib viridis color palette for R
- munsell – Munsell color palettes for R
- Cairo – high-quality display output
- igraph – Network Analysis and Visualization
- graphlayouts – new layout algorithms for network visualization
- lattice – Trellis graphics
- tmap – thematic maps
- trelliscopejs – interactive alternative for facet_wrap
- rgl – interactive 3D plots
- corrplot – graphical display of a correlation matrix
- googleVis – Google Charts API
- plotROC – interactive ROC plots
- extrafont – fonts in R graphics
- rvg – produces Vector Graphics that allow further editing in PowerPoint or Excel
- showtext – text using system fonts
- animation – animated graphics using ImageMagick
- misc3d – 3d plots, isosurfaces, etc.
- xkcd – xkcd style graphics
- imager – CImg library to work with images
- ungeviz – tools for visualizing uncertainty
- waffle – square pie charts a.k.a. waffle charts
- Creating spectrograms in R with hht, warbleR, soundgen, signal, seewave, or phonTools
Shiny, Dashboards, & Apps
- Shiny Cheat Sheet by RStudio
- Shiny Tutorial
- A collection of links to Shiny applications that have been shared on Twitter.
- Enterprise-ready dashboards with Shiny and databases
- Several packages to upgrade your Shiny dashboards
- More Shiny Resources by Rob Gilmore
- More Shiny Resources for Statistics by Yingjie Hu
- Building Shiny apps – an interactive tutorial by Dean Attali
- Advanced Shiny tips & tricks by Dean Attali (version 2)
- flexdashboard – dashboard creation simplified
- colourpicker – widget that allows users to choose colours
- brighter – toolbox with helpful functions for shiny development
- DesktopDeployR – self-contained R-based desktop applications
Markdown & Other Output Formats
- R Markdown cheat sheet by RStudio
- R Markdown reference guide by RStudio
- R Markdown Basics
- R Markdown tutorial by RStudio
- R Markdown gallery by RStudio
- The knitr book (Xie, 2015)
- Getting used to R, RStudio, and R Markdown (2016)
- R Markdown: The Definitive Guide (Xie, Allaire, & Grolemund, 2018)
- Introduction to R Markdown (Clark, 2018)
- R Markdown for Scientists (Tierney, 2019)
- R Markdown Tips and Tricks
- Pimp my RMD by Holtz Yan
- Pandoc syntax highlighting examples by Garrick Aden-Buie
- Creating slides with R Markdown (Video) by Brian Caffo
- Introduction to xaringan by Yihui Xie
- A quick demonstration of xaringan
- General Markdown cheat sheet
- blogdown websites with R Markdown (Xie, Thomas, & Hill, 2018)
- blogdown tutorials
- How to build a website with blogdown in R, by Storybench
- radix – online publication format designed for scientific and technical communication
- A template RStudio project with data analysis and manuscript writing by Thomas Julou
- Multiple reports from a single Markdown file (example 1) (example 2)
- tidystats – automating updating of model statistics
- papaja – preparing APA journal articles
- blogdown – build websites with Markdown & Hugo
- huxtable – create Excel, html, & LaTeX tables
- xaringan – make slideshows via remark.js and markdown
- summarytools – produces neat, quick data summary tables
- citr – RStudio Addin to Insert Markdown Citations
Cloud, Server, & Database
- Access and manage Google spreadsheets from R with googlesheets
- Tutorial: Database Queries with R
- Introduction to sparklyr by DataCamp
- Running R on AWS
- AWS EC2 Tutorial For Beginners
- Using RStudio on Amazon EC2 under the Free Usage Tier
- Getting started with databases using R, by RStudio
- RMySQL – connects to MySQL and MariaDB
- RPostgreSQL – connects to Postgres and Redshift
- RSQLite – embeds a SQLite database
- odbc – connects to many commercial databases via the open database connectivity protocol
- bigrquery – connects to Google’s BigQuery
- DBI – separates the connectivity to the DBMS into a “front-end” and a “back-end”
- dbplot – leverages dplyr to process calculations of plot inside database
- dplyr – also works with remote on-disk data stored in databases
- tidypredict – run predictions inside the database
Statistical Modeling & Machine Learning
- Machine Learning with R: An Irresponsibly Fast Tutorial by Will Stanton***
- CRAN Task View – Machine Learning & Statistical Learning
- R Packages for Machine Learning by Joseph Misiti
- Introduction to Data Science with R (Video)
- 100 Tutorials for Learning R
- Machine Learning Algorithms R Implementation by Ajitesh Kumar
- R Data Mining: Examples & Case Studies (Zhao, 2015)
- Statistical modelling in R (Zhao, 2015) @RDataMining
- Predictive modelling in R with caret
- R interface to Keras
- Tensorflow for R gallery
- Image featurization
- R for Data Science Online Learning Community
- R statistical programming resources by Michael Clark
Books
- Elements of Statistical Learning (Hastie, Tibshirani, & Friedman, 2001)
- Introduction to Statistical Learning (James, Witten, Hastie, & Tibshirani, 2013)
- Machine Learning with R (Lantz, 2013)
- Regression Models for Data Science in R (Caffo, 2015)
- R Programming for Data Science (Peng, 2016)
- Data Science Live Book (Casas, 2017)
- Statistical Foundations of Machine Learning (Bontempi & Taieb, 2017)
- R for Data Science (Grolemund & Wickham, 2017)
- Introduction to Data Science (Irizarry, 2018)
Courses
- Introduction to Statistical Learning*** at Stanford University by Trevor Hastie and Rob Tibshirani
- Introduction to R for Data Science @Microsoft
- Introduction to R for Data Science @FutureLearn by Hadley Wickham
- PSY2002: Advanced Statistics at University of Toronto by Elizabeth Page-Gould
- STAT 450/870: Regression Analysis at University of Nebraska-Lincoln by Chris Bilder
- STAT 850: Computing Tools for Statisticians at University of Nebraska-Lincoln by Chris Bilder
- STAT 873: Applied Multivariate Statistical Analysis at University of Nebraska-Lincoln by Chris Bilder
- STAT 875: Categorical Data Analysis at University of Nebraska-Lincoln by Chris Bilder
- STAT 950: Computational Statistics at University of Nebraska-Lincoln by Chris Bilder
- Joint Statistical Meetings: Analysis of Categorical Data by Chris Bilder
Cheat sheets
- R functions for regression analysis cheat sheet by Vito Ricci
- Machine Learning modeling cheat sheet by Arnaud Amsellem
- Machine Learning with mlr cheat sheet by Aaron Coley
- Cheat sheet for h2o’s algorithms for big data and parallel computing in R by Juan Telleria
- Deep Learning with keras cheat sheet by RStudio
- Machine Learning with caret cheat sheet by Max Kuhn
- Nonlinear cointegrating autoregressive distributed lag models with nardl cheat sheet by Taha Zaghdoudi
- R survival analysis with survminer cheat sheet by Przemysław Biecek
- R Data Mining reference card
- R sparklyr cheat sheet by RStudio
Time series
- CRAN Task View – TimeSeries
- R xts cheat sheet
- Forecasting: Principles and Practice (Hyndman & Athanasopoulos, 2017)
- A little book of R for time series (tutorial)
- ARIMA forecasting in R (6-part Youtube series)
- Introduction to the tsfeatures package
- Tutorials: Part 1, Part 2, Part 3, & Part 4 of tidy time series @Business-Science.io with tidyquant
- Packages:
  - xts – extensible time series
  - tsfeatures – methods for extracting various features from time series data
  - tidyquant – tidyverse-style financial analysis
Survival analysis
- CRAN Task View – Survival
- R survival analysis cheat sheet by Przemysław Biecek
- Packages:
  - survival – functionality for survival and hazard models
  - ggsurvplot (survminer) – survival curves
Bayesian
Miscellaneous
- corrr – easier correlation matrix management and exploration
Natural Language Processing & Text Mining
- Text Mining Tutorial with tm
- Tidy Text Mining (Silge & Robinson, 2017) with tidytext
- Text Analysis with R for Students of Literature (Jockers, 2014)
- Tidytext tutorials by computational journalism
- 21 Recipes for Mining Twitter Data (Rudis, 2017) with rtweet
- Emil Hvitfeldt’s R-text-data GitHub repository
- Course: Introduction to Text Analytics with R @DataScienceDojo
- Course: Twitter Text Mining and Social Network Analysis (Zhao, 2016) @RDataMining with twitteR
- Quantitative Analysis of Textual Data with quanteda cheat sheet by Stefan Müller and Kenneth Benoit
- List of resources for NLP & Text Mining by Stephen Thomas
- Packages (for an overview, see CRAN Task View – Natural Language Processing):
  - tm – text mining
  - tidytext – text mining using tidyverse principles
  - quanteda – framework for quantitative text analysis
  - gutenbergr – public domain works (free books to practice on)
  - corpora – statistics and data sets for corpus frequency data
  - tau – Text Analysis Utilities
  - Sentiment140 – headache-free sentiment analysis
  - sentimentr – sentiment analysis using text polarity
  - openNLP – sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, named-entity detector, and maximum entropy models with OpenNLP
  - cleanNLP – natural language processing via tidy data models
  - RSentiment – English lexicon-based sentiment analysis with negation and sarcasm detection functionalities
  - RWeka – data mining tasks with Weka
  - wordnet – a large lexical database of English with WordNet
  - stringi – language processing wrappers
  - textcat – provides support for n-gram based text categorization
  - text2vec – text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities
  - lsa – Latent Semantic Analysis
  - topicmodels – Latent Dirichlet Allocation (LDA) and Correlated Topics Models (CTM)
  - lda – Latent Dirichlet Allocation and related models
Regular Expressions
- R Regular Expression cheat sheet by Lise Vaudor
- R Regular Expression cheat sheet
- R Regular Expression cheat sheet (page 2) by RStudio
- regexplain – interactive RStudio addin for regular expressions
- Regular Expressions in R – Part 1: Introduction and base R functions
- R Regular Expressions by Jon M. Calder in swirl()
- R Regular Expression Video Tutorial by Roger Peng
- General Regular Expression cheat sheet
- General Regular Expression Video Tutorial by Roger Peng
- General Regular Expression cheat sheet by OverAPI.com
Geographic & Spatial mapping
- Making Maps with R (tutorial) with ggmaps, maps, and mapdata
- Importing OpenStreetMap data (tutorial) with osmar
- Geocomputation with R (Lovelace, Nowosad, & Muenchow, 2018)
- Spatial manipulation with Simple Features (sf) cheat sheet by Ryan Garnett
Bioinformatics & Computational Biology
- Applied statistics for Bioinformatics using R (Krijnen, 2009)
- A little book of R for Bioinformatics (Coghlan, ???)
- Bioinformatics and Functional Genomics (Pevsner, 2015)
- Applied Biostatistical Analyses using R (Cox, 2017)
- Molecular data analysis using R (Ortutay & Ortutay, 2017)
- Modern statistics for Biology (Holmes & Huber, 2019)
Integrated Development Environments (IDEs) & Graphical User Interfaces (GUIs)
Descriptions mostly taken from their own websites:
- RStudio*** – Open source and enterprise ready professional software
- Jupyter Notebook*** – open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text across dozens of programming languages.
- Microsoft R tools for Visual Studio – turn Visual Studio into a powerful R IDE
- R Plugins for Vim, Emacs, and Atom editors
- Rattle*** – GUI for data mining
- esquisse – RStudio add-in to interactively explore and visualize data
- R Analytic Flow – data flow diagram-based IDE
- RKWard – easy to use and easily extensible IDE and GUI
- Eclipse StatET – Eclipse-based IDE
- OpenAnalytics Architect – Eclipse-based IDE
- TinnR – open source GUI and IDE
- DisplayR – cloud-based GUI
- BlueSkyStatistics – GUI designed to look like SPSS and SAS
- ducer – GUI for everyone
- R commander (Rcmdr) – easy and intuitive GUI
- JGR – Java-based GUI for R
- jamovi & jmv – free and open statistical software to bridge the gap between researcher and statistician
- Exploratory.io – cloud-based data science focused GUI
- Stagraph – GUI for ggplot2 that allows you to visualize and connect to databases and/or basic file types
- ggraptr – GUI for visualization (Rapid And Pretty Things in R)
- ML Studio – interactive Shiny platform for data visualization, statistical modeling and machine learning
R & other software and languages
R & Excel
- BERT – Basic Excel R Toolkit
- A Comprehensive Guide to Transitioning from Excel to R by Alyssa Columbus
- readxl – package to load in Excel data
- xlsx – package to read and write Excel data
- rvg – produces Vector Graphics which can be modified in Excel
- devoutpdf – a PDF graphics device
- tidyxl – imports non-tabular (e.g., format) data from Excel files into R
- unpivotr – unpivot complex and irregular data layouts in R
- unheadr – handle data with embedded subheaders
R & Python
- Python for R users
- reticulate cheat sheet by RStudio
- reticulate – tools for interoperability between Python and R
R & SQL
- sqldf – running SQL statements on R data frames
R Help, Connect, & Inspiration
- RStudio Community
- R help mailing list
- R seek – search engine for R-related websites
- R site search – search engine for help files, manuals, and mailing lists
- Nabble – mailing list archive and forum
- R User Groups & Conferences
- R for Data Science Online Learning Community
- Stack Overflow – a FAQ for all your R struggles (programming)
- Cross Validated – a FAQ for all your R struggles (statistics)
- CRAN Task Views – discover new packages per topic
- The R Journal – open access, refereed journal of R
- Twitter: #rstats, RStudio, Hadley Wickham, Yihui Xie, Mara Averick, Julia Silge, Jenny Bryan, David Smith, Hilary Parker, R-bloggers
- Facebook: R Users Psychology
- Youtube: Ben Lambert, Roger Peng
- Reddit: rstats, rstudio, statistics, machinelearning, dataisbeautiful
R Blogs
- http://adamleerich.com
- http://njtierney.github.io/
- https://trinkerrstuff.wordpress.com
- https://rollingyours.wordpress.com
- https://r-statistics.com
- https://beckmw.wordpress.com
- http://rgraphgallery.blogspot.com
- http://onertipaday.blogspot.com
- https://learnr.wordpress.com
- http://padamson.github.io
- http://www.r-datacollection.com/blog/
- http://www.thertrader.com
- https://fronkonstin.com
- https://nicercode.github.io
- http://www.rblog.uni-freiburg.de
- https://advanceddataanalytics.net
- http://r4stats.com/blog/
- http://blog.revolutionanalytics.com/
- http://www.r-bloggers.com/
- http://kbroman.org/blog/
- https://juliasilge.com/blog/
- http://andrewgelman.com/
- http://www.statsblogs.com/author/eric-cai-the-chemical-statistician/
- https://www.statmethods.net/
- http://www.stats-et-al.com/search/label/R
- http://www.brodrigues.co/
- https://datasharkie.com/
- https://www.programmingwithr.com/
R Conferences, Events, & Meetups
- Overview of R conferences by JumpingRivers
- Overview of R virtual events by JumpingRivers
- Overview of R user groups by JumpingRivers
- Overview of R-Ladies groups by JumpingRivers


