Woohoo, so tidy! Now comes the fun part: visualization. The following plots how often houses are mentioned overall, and in each book separately.
In terms of the data, Gryffindor and Slytherin definitely play a larger role in the Harry Potter stories. However, as the storyline progresses, Slytherin as a house seems to lose its importance. Their downward trend since the Chamber of Secrets results in Ravenclaw being mentioned more often in the final book (Edit – this is likely due to the diadem horcrux, as you will see later on).
I can’t help but feel sorry for house Hufflepuff, which never really gets too involved throughout the saga.
Visualize Word-House Combinations
Now we can visualize which words relate to each of the houses. Because facet_wrap() has trouble reordering the axes (words may relate to multiple houses in different frequencies), I needed some custom functionality, which I happily recycled from dgrtwo’s GitHub. With these reorder_within() and scale_x_reordered() functions we can now make an ordered barplot of the top-20 most frequent words per house.
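# Paste the facet variable onto each value so every facet gets its own factor levels,
# then order those levels by `by` (helpers recycled from dgrtwo).
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  reorder(new_x, by, FUN = fun)
}

# Strip the "___<facet>" suffix again when the axis labels are drawn.
scale_x_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}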
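To see what these two helpers do, here is a small sketch on made-up data (the words and counts are hypothetical, not taken from the books). It repeats the reorder_within() definition so that it runs on its own in base R, and it mimics the label clean-up of scale_x_reordered() with a plain gsub() instead of calling ggplot2.

```r
# Same helper as in the post, repeated so this sketch is self-contained.
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  stats::reorder(new_x, by, FUN = fun)
}

# Hypothetical toy data: two words, each counted for two houses.
words  <- c("harry", "ron", "harry", "ron")
houses <- c("Gryffindor", "Gryffindor", "Slytherin", "Slytherin")
counts <- c(10, 5, 3, 7)

# Every word/house pair becomes its own factor level, ordered by count.
ordered_words <- reorder_within(words, counts, houses)
levels(ordered_words)
# "harry___Slytherin" "ron___Gryffindor" "ron___Slytherin" "harry___Gryffindor"

# What scale_x_reordered() does to the labels: drop the "___<house>" suffix.
axis_labels <- gsub("___.+$", "", levels(ordered_words))
axis_labels
# "harry" "ron" "ron" "harry"
```

Because the house name is part of each level, "harry" can sit at the top of one facet and at the bottom of another, which a single shared factor could not express.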
w = 10; h = 7;
words_per_house = 20
words_by_houses %>%
  group_by(house) %>%
  arrange(house, desc(word_n)) %>%
  mutate(top = row_number()) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(reorder_within(word, -top, house),
             word_n, fill = house)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  facet_wrap(~ house, scales = "free_y") +
  coord_flip() +
  labs(title = default_title,
       subtitle = "Words most commonly used together with houses",
       caption = default_caption,
       x = NULL, y = 'Word Frequency')

Unsurprisingly, several stop words occur most frequently in the data. Intuitively, we would rerun the code but use dplyr::anti_join() on tidytext::stop_words to remove stop words.
words_by_houses %>%
  anti_join(stop_words, 'word') %>%
  group_by(house) %>%
  arrange(house, desc(word_n)) %>%
  mutate(top = row_number()) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(reorder_within(word, -top, house),
             word_n, fill = house)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  facet_wrap(~ house, scales = "free") +
  coord_flip() +
  labs(title = default_title,
       subtitle = "Words most commonly used together with houses, excluding stop words",
       caption = default_caption,
       x = NULL, y = 'Word Frequency')

However, some stop words have a different meaning in the Harry Potter universe. The word points, for instance, is quite informative about the Hogwarts houses but is included in stop_words.
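One simple way around this, which is not the route taken in the rest of this post, is to take the informative words out of the stop word list before the anti_join(). The sketch below uses small made-up stand-ins for words_by_houses and tidytext::stop_words; the names toy_words, toy_stop_words, keep_words, and custom_stop_words are hypothetical.

```r
library(dplyr)

# Hypothetical stand-ins for words_by_houses and tidytext::stop_words.
toy_words <- data.frame(
  house  = c("Gryffindor", "Gryffindor", "Slytherin", "Slytherin"),
  word   = c("the", "points", "the", "emerald"),
  word_n = c(10L, 4L, 7L, 2L),
  stringsAsFactors = FALSE
)
toy_stop_words <- data.frame(
  word = c("the", "points", "and"),
  stringsAsFactors = FALSE
)

# Words we consider informative even though the stop word list contains them.
keep_words <- c("points")

# Drop the protected words from the stop word list first ...
custom_stop_words <- toy_stop_words %>%
  filter(!word %in% keep_words)

# ... and only then remove the remaining stop words from the data.
filtered <- toy_words %>%
  anti_join(custom_stop_words, by = "word")

filtered$word
# "points" "emerald"
```

This only works for words we already know to protect, which is one reason to prefer a statistic that decides per word how informative it is.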
Moreover, many of the most frequent words above occur in relation to multiple or all houses. Take, for instance, Harry and Ron, which are in the top-10 of each house, or words like table, house, and professor.
We are more interested in words that describe one house, but not another. Similarly, we only want to exclude stop words which are really irrelevant. To this end, we compute a ratio-statistic below. This statistic displays how frequently a word occurs in combination with one house rather than with the others. However, we need to adjust this ratio for how often houses occur in the text as more text (and thus words) is used in reference to house Gryffindor than, for instance, Ravenclaw.
words_by_houses <- words_by_houses %>%
  group_by(word) %>%
  mutate(word_sum = sum(word_n)) %>%
  group_by(house) %>%
  mutate(house_n = n()) %>%
  ungroup() %>%
  mutate(ratio = (word_n / (word_sum - word_n + 1) / (house_n / n())))
words_by_houses %>%
  select(-word_sum, -house_n) %>%
  arrange(desc(word_n)) %>%
  head()
## # A tibble: 6 x 4
##        house       word word_n     ratio
##        <chr>      <chr>  <int>     <dbl>
## 1 Gryffindor        the   1057  2.373115
## 2  Slytherin        the    675  1.467926
## 3 Gryffindor gryffindor    602 13.076218
## 4 Gryffindor        and    477  2.197259
## 5 Gryffindor         to    428  2.830435
## 6 Gryffindor         of    362  2.213186
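To make the formula concrete, here is a minimal base R sketch that redoes the calculation by hand. The function name adjusted_ratio and the argument house_share (standing for house_n / n()) are my own labels. The check values come from the tf-idf output printed further down in this post, which also shows word_sum; Gryffindor's share is inferred from those rows and is not a number the post states directly.

```r
# The ratio from the pipeline above, written as a plain function.
# house_share stands for house_n / n(), the house's share of all rows.
adjusted_ratio <- function(word_n, word_sum, house_share) {
  (word_n / (word_sum - word_n + 1)) / house_share
}

# Assumption: a Gryffindor word with word_n = 1 and word_sum = 1 is printed
# with ratio 2.671719, which implies the following share for Gryffindor.
gryffindor_share <- 1 / 2.671719

ratio_abandoning <- adjusted_ratio(1, 2, gryffindor_share)      # printed as 1.335860
ratio_a          <- adjusted_ratio(251, 628, gryffindor_share)  # printed as 1.774078

round(c(abandoning = ratio_abandoning, a = ratio_a), 5)
```

The + 1 in the denominator keeps the ratio finite for words that occur with only one house, where word_sum - word_n would otherwise be zero.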
words_by_houses %>%
  group_by(house) %>%
  arrange(house, desc(ratio)) %>%
  mutate(top = row_number()) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(reorder_within(word, -top, house),
             ratio, fill = house)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  facet_wrap(~ house, scales = "free") +
  coord_flip() +
  labs(title = default_title,
       subtitle = "Most informative words per house, by ratio",
       caption = default_caption,
       x = NULL, y = 'Adjusted Frequency Ratio (house vs. non-house)')

This ratio statistic (x-axis) should be interpreted as follows: night is used 29 times more often in combination with Gryffindor than with the other houses.
Do you think the results make sense?
- Gryffindors spent dozens of hours during their afternoons, evenings, and nights in the, often empty, tower room, apparently playing chess? Neville Longbottom and Hermione Granger are Gryffindors, obviously, and Sirius Black is also on the list. The sword of Gryffindor is no surprise here either.
- Hannah Abbott, Ernie Macmillan and Cedric Diggory are Hufflepuffs. Were they mostly hot curly blondes interested in herbology? Nevertheless, wild and aggressive seem unfitting for Hogwarts’ most boring house.
- A lot of names are on the list of Rowena Ravenclaw’s house. Roger Davies, Padma Patil, Cho Chang, Miss S. Fawcett, Stewart Ackerley, Terry Boot, and Penelope Clearwater are indeed Ravenclaws, I believe. Ravenclaw’s Diadem was one of Voldemort’s horcruxes. Alecto Carrow, Death Eater by profession, was apparently sent on a mission by Voldemort to surprise Harry in Ravenclaw’s common room (source), which explains what she does on this list. Can anybody tell me what bust, statue and spot have to do with Ravenclaw?
- House Slytherin is best represented by Gregory Goyle, one of the members of Draco Malfoy’s gang along with Vincent Crabbe. Pansy Parkinson also represents house Slytherin. Slytherins are famous for speaking Parseltongue and their house’s gem is an emerald. House Gaunt were pure-blood descendants of Salazar Slytherin and apparently Viktor Krum would not have misrepresented the Slytherin values either. Oh, and only the heir of Slytherin could control the monster in the Chamber of Secrets.
Honestly, I was not expecting such good results! However, there is always room for improvement.
We may want to exclude words that only occur once or twice in the books (e.g., Alecto) as well as the house names. Additionally, these barplots are not the optimal visualization if we would like to include more words per house. Fortunately, Hadley Wickham helped me discover treemaps. Let’s draw one using the ggfittext and treemapify packages.
w = 12; h = 8;
library(ggfittext)
library(treemapify)
words_by_houses %>%
  filter(word_n > 3) %>%
  filter(!grepl(regex_houses, word)) %>%
  group_by(house) %>%
  arrange(house, desc(ratio), desc(word_n)) %>%
  mutate(top = seq_along(ratio)) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(area = ratio, label = word, subgroup = house, fill = house)) +
  geom_treemap() +
  geom_treemap_text(aes(col = house), family = "HP", place = 'center') +
  geom_treemap_subgroup_text(aes(col = house),
                             family = "HP", place = 'center', alpha = 0.3, grow = TRUE) +
  geom_treemap_subgroup_border(colour = 'black') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  theme(legend.position = 'none') +
  labs(title = default_title,
       subtitle = "Most informative words per house, by ratio",
       caption = default_caption)

A treemap can display more words for each of the houses and displays their relative proportions better. New words regarding the houses include the following, but do you see any others?
- Slytherin girls laugh out loud whereas Ravenclaw had a few little, pretty girls?
- Gryffindors, at least Harry and his friends, got in trouble often, that is a fact.
- Yellow is the color of house Hufflepuff whereas Slytherin is green indeed.
- Zacharias Smith joined Hufflepuff and Luna Lovegood Ravenclaw.
- Why is Voldemort in camp Ravenclaw?!
In the earlier code, we specified a minimum number of occurrences for words to be included, which is a bit hacky but necessary to make the ratio statistic work as intended. Fortunately, there are other ways to estimate how unique or informative words are to houses that do not require such hacks.
TF-IDF
tf-idf similarly estimates how unique / informative words are for a body of text (for more info: Wikipedia). We can calculate a tf-idf score for each word within each document (in our case house texts) by taking the product of two statistics:
- TF or term frequency, meaning how often the word occurs in a document. tidytext::bind_tf_idf() expresses this as a share of all words in that document.
- IDF or inverse document frequency, specifically the natural logarithm of the total number of documents divided by the number of documents the word occurs in.
A high tf-idf score means that a word occurs relatively often in a specific document and not often in other documents. Different weighting schemes can be used to improve tf-idf’s performance in different settings, but we used the simple default of tidytext::bind_tf_idf().
An advantage of tf-idf over the earlier ratio statistic is that we no longer need to specify a minimum frequency: low frequency words will have low tf and thus low tf-idf. A disadvantage is that tf-idf will automatically disregard words that occur together with each house, be it only once: these words have zero idf (log(4/4)) and thus zero tf-idf.
Let’s run the treemap again, but now on the computed tf-idf scores.
words_by_houses <- words_by_houses %>%
  bind_tf_idf(word, house, word_n)
words_by_houses %>%
  select(-house_n) %>%
  head()
## # A tibble: 6 x 8
##        house        word word_n word_sum    ratio           tf       idf
##        <chr>       <chr>  <int>    <int>    <dbl>        <dbl>     <dbl>
## 1 Gryffindor         104      1        1 2.671719 6.488872e-05 1.3862944
## 2 Gryffindor        22nd      1        1 2.671719 6.488872e-05 1.3862944
## 3 Gryffindor           a    251      628 1.774078 1.628707e-02 0.0000000
## 4 Gryffindor   abandoned      1        1 2.671719 6.488872e-05 1.3862944
## 5 Gryffindor  abandoning      1        2 1.335860 6.488872e-05 0.6931472
## 6 Gryffindor abercrombie      1        1 2.671719 6.488872e-05 1.3862944
## # ... with 1 more variables: tf_idf <dbl>
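The printed rows can be reproduced by hand, which is a nice sanity check on what tf and idf mean here. The base R sketch below is my own helper, not part of tidytext. The Gryffindor total of 15411 words is an assumption inferred from the printed tf of 6.488872e-05 for a word that occurs once, and the number of houses each word occurs with is inferred from the printed idf values.

```r
# tf-idf for a single word in a single document (here: a house).
tf_idf <- function(n, doc_total, n_docs, docs_with_term) {
  tf  <- n / doc_total                  # share of the document's words
  idf <- log(n_docs / docs_with_term)   # natural log of the inverse document share
  c(tf = tf, idf = idf, tf_idf = tf * idf)
}

n_houses <- 4
# Assumption: inferred from tf = 6.488872e-05 for word_n = 1 in the output above.
gryffindor_total <- 15411

abandoned  <- tf_idf(1, gryffindor_total, n_houses, 1)    # occurs with one house
abandoning <- tf_idf(1, gryffindor_total, n_houses, 2)    # occurs with two houses
a_word     <- tf_idf(251, gryffindor_total, n_houses, 4)  # occurs with all four

rbind(abandoned, abandoning, a_word)
```

The last row shows the disadvantage mentioned earlier: "a" has by far the highest tf of the three, yet its tf-idf is exactly zero because log(4/4) is zero.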
words_per_house = 30
words_by_houses %>%
  filter(tf_idf > 0) %>%
  group_by(house) %>%
  arrange(house, desc(tf_idf), desc(word_n)) %>%
  mutate(top = seq_along(tf_idf)) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(area = tf_idf, label = word, subgroup = house, fill = house)) +
  geom_treemap() +
  geom_treemap_text(aes(col = house), family = "HP", place = 'center') +
  geom_treemap_subgroup_text(aes(col = house),
                             family = "HP", place = 'center', alpha = 0.3, grow = TRUE) +
  geom_treemap_subgroup_border(colour = 'black') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  theme(legend.position = 'none') +
  labs(title = default_title,
       subtitle = "Most informative words per house, by tf-idf",
       caption = default_caption)

This plot looks quite different from its predecessor. For instance, Marcus Flint and Adrian Pucey are added to house Slytherin and Hufflepuff’s main color is indeed not just yellow, but canary yellow. Severus Snape’s dual role is also nicely depicted now, with him in both house Slytherin and house Gryffindor. Do you notice any other important differences? Did we lose any important words because they occurred in each of our four documents?
House Personality Profiles (by NRC Sentiment Analysis)
We end this second Harry Plotter blog by examining to what extent the stereotypes that exist of the Hogwarts Houses can be traced back to the books. To this end, we use the NRC sentiment dictionary, see also the previous blog, with which we can estimate to what extent the most informative words for houses (we have over a thousand for each house) relate to emotions such as anger, fear, or trust.
The code below retains only the emotion words in our words_by_houses dataset and weights their tf-idf scores by word_n / house_n (a word’s count relative to the number of distinct words recorded for its house), so that we retrieve one score per house per sentiment.
words_by_houses %>%
  inner_join(get_sentiments("nrc"), by = 'word') %>%
  group_by(house, sentiment) %>%
  summarize(score = sum(word_n / house_n * tf_idf)) %>%
  ggplot(aes(x = house, y = score, group = house)) +
  geom_col(aes(fill = house)) +
  geom_text(aes(y = score / 2, label = substring(house, 1, 1), col = house),
            family = "HP", vjust = 0.5) +
  facet_wrap(~ Capitalize(sentiment), scales = 'free_y') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  theme(legend.position = 'none',
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        strip.text.x = element_text(colour = 'black', size = 12)) +
  labs(title = default_title,
       subtitle = "Sentiment (nrc) related to houses' informative words (tf-idf)",
       caption = default_caption,
       y = "Sentiment score", x = NULL)

The results to a large extent confirm the stereotypes that exist regarding the Hogwarts houses:
- Gryffindors are full of anticipation and the most positive and trustworthy.
- Hufflepuffs are the most joyous but not extraordinary on any other front.
- Ravenclaws are distinguished by their low scores. They are super not-angry and relatively not-anticipating, not-negative, and not-sad.
- Slytherins are the angriest, the saddest, and the most feared and disgusting. However, they are also relatively joyous (optimistic?) and very surprising (shocking?).
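For readers who want to see the scoring step in isolation, here is a minimal sketch of the join-and-summarize part on made-up data. Everything in it is hypothetical: toy_words stands in for words_by_houses, toy_lexicon stands in for get_sentiments("nrc") (the real lexicon can list several sentiments per word), and the counts and tf-idf values are invented.

```r
library(dplyr)

# Hypothetical stand-in for words_by_houses (invented counts and tf-idf values).
toy_words <- data.frame(
  house   = c("Gryffindor", "Gryffindor", "Slytherin", "Slytherin"),
  word    = c("brave", "trouble", "trouble", "monster"),
  word_n  = c(4, 2, 3, 1),
  house_n = c(10, 10, 5, 5),
  tf_idf  = c(0.02, 0.01, 0.01, 0.04),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the NRC lexicon, one sentiment per word.
toy_lexicon <- data.frame(
  word      = c("brave", "trouble", "monster"),
  sentiment = c("positive", "negative", "fear"),
  stringsAsFactors = FALSE
)

toy_scores <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # keeps only words found in the lexicon
  group_by(house, sentiment) %>%
  summarize(score = sum(word_n / house_n * tf_idf), .groups = "drop")

toy_scores
```

Each row of toy_scores corresponds to one bar in the faceted plot: one house within one sentiment panel.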
Conclusion and future work
With this we have come to the end of the second part of the Harry Plotter project, in which we used tf-idf and ratio statistics to examine which words were most informative / unique to each of the houses of Hogwarts. The data was retrieved using the harrypotter package and transformed using tidytext and the tidyverse. Visualizations were made with ggplot2 and treemapify, using a Harry Potter font.
I have several ideas for subsequent posts and I’d love to hear your preferences or suggestions:
- I would like to demonstrate how regular expressions can be used to retrieve (sub)strings that follow a specific format. We could use regex to examine, for instance, when, and by whom, which magical spells are cast.
- I would like to use network analysis to examine the interactions between the characters. We could retrieve networks from the books and conduct sentiment analysis to establish the nature of relationships. Similarly, we could use unsupervised learning / clustering to explore character groups.
- I would like to use topic models, such as latent Dirichlet allocation, to identify the main topics in the books. We could, for instance, try to summarize each book chapter in a single sentence, or examine how topics (e.g., love or death) build or fall over time.
- Finally, I would like to build an interactive application / dashboard in Shiny (another hobby of mine) so that readers like you can explore patterns in the books yourself. Unfortunately, the free plan on shinyapps.io only includes 25 hosting hours per month :(
For now, I hope you enjoyed this blog and that you’ll be back for more. To receive new content first, please subscribe to my website www.paulvanderlaken.com, follow me on Twitter, or add me on LinkedIn.
If you would like to contribute to, collaborate on, or need assistance with a data science project or venture, please feel free to reach out.