Category: visualization

Summarizing our Daily News: Clustering 100.000+ Articles in Python

Andrew Thompson was interested in what 10 topics a computer would identify in our daily news. He gathered over 140.000 new articles from the archives of 10 different sources, as you can see in the figure below.

The sources of the news articles used in the analysis.

In Python, Andrew converted the text of all these articles into a manageable form (tf-idf document term matrix (see also Harry Plotter: Part 2)), reduced these data to 100 dimensions using latent semantic analysis (singular value decomposition), and ran a k-means clustering to retrieve the 10 main clusters. I included his main results below, but I highly suggest you visit the original article on Medium as Andrew used Plotly to generate interactive plots!

newplot — Most important words per topic (interactive visual in original article)

The topics structure seems quite nice! Topic 0 involves legal issues, such as immigration, whereas topic 1 seems to be more about politics. Topic 8 is clearly sports whereas 9 is education. Next, Andres inspected which media outlet covers which topics most. Again, visit the original article for interactive plots!

newplot (1).png — Media outlets and the topics they cover (interactive version in original article)

In light of the fake news crisis and the developments in (internet) media, I believe Andrew’s conclusions on these data are quite interesting.

I suppose different people could interpret this data and these graphs differently, but I interpret them as the following: when forced into groups, the publications sort into Reuters and everything else.

[…]

Every publication in this dataset except Reuters shares some common denominators. They’re entirely funded on ads and/or subscriptions (Vox and BuzzFeed also have VC funding, but they’re ad-based models), and their existence relies on clicks. By contrast, Reuters’s news product is merely the public face of a massive information conglomerate. Perhaps more importantly, it’s a news wire whose coverage includes deep reporting on the affairs of our financial universe, and therefore is charged with a different mandate than the others — arguably more than the New York Times, it must cover all the news, without getting trapped in the character driven reality-TV spectacle that every other citizen of the dataset appears to so heavily relish in doing. Of them all, its voice tends to maintain the most moderate indoor volume, and no single global event provokes larger-than-life outrage, if outrage can be provoked from Reuters at all. Perhaps this is the product of belonging to the financial press and analyzing the world macroscopically; the narrative of the non-financial press fails to accord equal weight to a change in the LIBOR rate and to the policy proposals of a madman, even though it arguably should. Every other publication here seems to bear intimations of utopia, and the subtext of their content is often that a perfect world would materialize if we mixed the right ingredients in the recipe book, and that the thing you’re outraged about is actually the thing standing between us and paradise. In my experience as a reader, I’ve never felt anything of the sort emanate from Reuters.

This should not be interpreted as asserting that the New York Times and Breitbart are therefore identical cauldrons of apoplexy. I read a beautifully designed piece today in the Times about just how common bioluminescence is among deep sea creatures. It goes without saying that the prospect of finding a piece like that in Breitbart is nonexistent, which is one of the things I find so god damned sad about that territory of the political spectrum, as well as in its diametrical opponents a la Talking Points Memo. But this is the whole point: show an algorithm the number of stories you write about deep sea creatures and it’ll show you who you are. At a finer resolution, we would probably find a chasm between the Times and Fox News, or between NPR and the New York Post. See that third cluster up there, where all the words are kind of compressed with lower TfIdf values and nothing sticks out? It’s actually a whole jungle of other topics, and you can run the algorithm on just that cluster and get new groups and distinctions — and one of those clusters will also be a compression of different kinds of stories, and you can do this over and over in a fractal of machine learning. The distinction here is not the only one, but it is, from the aerial perspective of data, the first.

It would be really interesting to see whether more high-quality media outlets, like the New York Times, could be easily distinguished from more sensational outlets, such as Buzzfeed, when more clusters were used, or potentially other text analytics methodology, like latent Dirichlet allocation.

Harry Plotter: Part 2 – Hogwarts Houses and their Stereotypes

Two weeks ago, I started the Harry Plotter project to celebrate the 20th anniversary of the first Harry Potter book. I could not have imagined that the first blog would be so well received. It reached over 4000 views in a matter of days thanks to the lovely people in the data science and #rstats community that were kind enough to share it (special thanks to MaraAverick and DataCamp). The response from the Harry Potter community, for instance on reddit, was also just overwhelming

Part 2: Hogwarts Houses

All in all, I could not resist a sequel and in this second post we will explore the four houses of Hogwarts: Gryffindor, Hufflepuff, Ravenclaw, and Slytherin. At the end of today’s post we will end up with visualizations like this:

Various stereotypes exist regarding these houses and a textual analysis seemed a perfect way to uncover their origins. More specifically, we will try to identify which words are most unique, informative, important or otherwise characteristic for each house by means of ratio and tf-idf statistics. Additionally, we will try to estime a personality profile for each house using these characteristic words and the emotions they relate to. Again, we rely strongly on ggplot2 for our visualizations, but we will also be using the treemaps of treemapify. Moreover, I have a special surprise this second post, as I found the orginal Harry Potter font, which will definately make the visualizations feel more authentic. Of course, we will conduct all analyses in a tidy manner using tidytext and the tidyverse.

I hope you will enjoy this blog and that you’ll be back for more. To be the first to receive new content, please subscribe to my website www.paulvanderlaken.com, follow me on Twitter, or add me on LinkedIn. Additionally, if you would like to contribute to, collaborate on, or need assistance with a data science project or venture, please feel free to reach out.

R Setup

All analysis were performed in RStudio, and knit using rmarkdown so that you can follow my steps.

In term of setup, we will be needing some of the same packages as last time. Bradley Boehmke gathered the text of the Harry Potter books in his harrypotter package. We need devtools to install that package the first time, but from then on can load it in as usual. We need plyr for ldply(). We load in most other tidyverse packages in a single bundle and add tidytext. Finally, I load the Harry Potter font and set some default plotting options.

# SETUP ####
# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(plyr)
library(tidyverse)
library(tidytext)

# VIZUALIZATION SETTINGS
# custom Harry Potter font
# http://www.fontspace.com/category/harry%20potter
library(extrafont)
font_import(paste0(getwd(),"/fontomen_harry-potter"), prompt = F) # load in custom Harry Potter font
windowsFonts(HP = windowsFont("Harry Potter"))
theme_set(theme_light(base_family = "HP")) # set default ggplot theme to light
default_title = "Harry Plotter: References to the Hogwarts houses" # set default title
default_caption = "www.paulvanderlaken.com" # set default caption
dpi = 600 # set default dpi

Importing and Transforming Data

Before we import and transform the data in one large piping chunk, I need to specify some variables.

First, I tell R the house names, which we are likely to need often, so standardization will help prevent errors. Next, my girlfriend was kind enough to help me (colorblind) select the primary and secondary colors for the four houses. Here, the ggplot2 color guide in my R resources list helped a lot! Finally, I specify the regular expression (tutorials) which we will use a couple of times in order to identify whether text includes either of the four house names.

# DATA PREPARATION ####
houses <- c('gryffindor', 'ravenclaw', 'hufflepuff', 'slytherin') # define house names
houses_colors1 <- c("red3", "yellow2", "blue4", "#006400") # specify primary colors
houses_colors2 <- c("#FFD700", "black", "#B87333", "#BCC6CC") # specify secondary colors
regex_houses <- paste(houses, collapse = "|") # regular expression

Import Data and Tidy

Ok, let’s import the data now. You may recognize pieces of the code below from last time, but this version runs slightly smoother after some optimalization. Have a look at the current data format.

# LOAD IN BOOK TEXT 
houses_sentences <- list(
  `Philosophers Stone` = philosophers_stone,
  `Chamber of Secrets` = chamber_of_secrets,
  `Prisoner of Azkaban` = prisoner_of_azkaban,
  `Goblet of Fire` = goblet_of_fire,
  `Order of the Phoenix` = order_of_the_phoenix,
  `Half Blood Prince` = half_blood_prince,
  `Deathly Hallows` = deathly_hallows
) %>% 
  # TRANSFORM TO TOKENIZED DATASET
  ldply(cbind) %>% # bind all chapters to dataframe
  mutate(.id = factor(.id, levels = unique(.id), ordered = T)) %>% # identify associated book
  unnest_tokens(sentence, `1`, token = 'sentences') %>% # seperate sentences
  filter(grepl(regex_houses, sentence)) %>% # exclude sentences without house reference
  cbind(sapply(houses, function(x) grepl(x, .$sentence)))# identify references
# examine
max.char = 30 # define max sentence length
houses_sentences %>%
  mutate(sentence = ifelse(nchar(sentence) > max.char, # cut off long sentences
                           paste0(substring(sentence, 1, max.char), "..."),
                           sentence)) %>% 
  head(5)

##                  .id                          sentence gryffindor
## 1 Philosophers Stone "well, no one really knows unt...      FALSE
## 2 Philosophers Stone "and what are slytherin and hu...      FALSE
## 3 Philosophers Stone everyone says hufflepuff are a...      FALSE
## 4 Philosophers Stone "better hufflepuff than slythe...      FALSE
## 5 Philosophers Stone "there's not a single witch or...      FALSE
##   ravenclaw hufflepuff slytherin
## 1     FALSE       TRUE      TRUE
## 2     FALSE       TRUE      TRUE
## 3     FALSE       TRUE     FALSE
## 4     FALSE       TRUE      TRUE
## 5     FALSE      FALSE      TRUE

Transform to Long Format

Ok, looking great, but not tidy yet. We need gather the columns and put them in a long dataframe. Thinking ahead, it would be nice to already capitalize the house names for which I wrote a custom Capitalize() function.

# custom capitalization function
Capitalize = function(text){ 
  paste0(substring(text,1,1) %>% toupper(),
         substring(text,2))
}

# TO LONG FORMAT
houses_long <- houses_sentences %>%
  gather(key = house, value = test, -sentence, -.id) %>% 
  mutate(house = Capitalize(house)) %>% # capitalize names
  filter(test) %>% select(-test) # delete rows where house not referenced
# examine
houses_long %>%
  mutate(sentence = ifelse(nchar(sentence) > max.char, # cut off long sentences
                           paste0(substring(sentence, 1, max.char), "..."),
                           sentence)) %>% 
  head(20)

##                   .id                          sentence      house
## 1  Philosophers Stone i've been asking around, and i... Gryffindor
## 2  Philosophers Stone           "gryffindor," said ron. Gryffindor
## 3  Philosophers Stone "the four houses are called gr... Gryffindor
## 4  Philosophers Stone you might belong in gryffindor... Gryffindor
## 5  Philosophers Stone " brocklehurst, mandy" went to... Gryffindor
## 6  Philosophers Stone "finnigan, seamus," the sandy-... Gryffindor
## 7  Philosophers Stone                     "gryffindor!" Gryffindor
## 8  Philosophers Stone when it finally shouted, "gryf... Gryffindor
## 9  Philosophers Stone well, if you're sure -- better... Gryffindor
## 10 Philosophers Stone he took off the hat and walked... Gryffindor
## 11 Philosophers Stone "thomas, dean," a black boy ev... Gryffindor
## 12 Philosophers Stone harry crossed his fingers unde... Gryffindor
## 13 Philosophers Stone resident ghost of gryffindor t... Gryffindor
## 14 Philosophers Stone looking pleased at the stunned... Gryffindor
## 15 Philosophers Stone gryffindors have never gone so... Gryffindor
## 16 Philosophers Stone the gryffindor first years fol... Gryffindor
## 17 Philosophers Stone they all scrambled through it ... Gryffindor
## 18 Philosophers Stone nearly headless nick was alway... Gryffindor
## 19 Philosophers Stone professor mcgonagall was head ... Gryffindor
## 20 Philosophers Stone over the noise, snape said, "a... Gryffindor

Visualize House References

Woohoo, so tidy! Now comes the fun part: visualization. The following plots how often houses are mentioned overall, and in each book seperately.

# set plot width & height
w = 10; h = 6  

# PLOT REFERENCE FREQUENCY
houses_long %>%
  group_by(house) %>%
  summarize(n = n()) %>% # count sentences per house
  ggplot(aes(x = desc(house), y = n)) +
  geom_bar(aes(fill = house), stat = 'identity') +
  geom_text(aes(y = n / 2, label = house, col = house),  # center text
            size = 8, family = 'HP') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = 'none') +
  labs(title = default_title,
       subtitle = "Combined references in all Harry Potter books",
       caption = default_caption,
       x = '', y = 'Name occurence') + 
  coord_flip()

# PLOT REFERENCE FREQUENCY OVER TIME 
houses_long %>%
  group_by(.id, house) %>%
  summarize(n = n()) %>% # count sentences per house per book
  ggplot(aes(x = .id, y = n, group = house)) +
  geom_line(aes(col = house), size = 2) +
  scale_color_manual(values = houses_colors1) +
  theme(legend.position = 'bottom',
        axis.text.x = element_text(angle = 15, hjust = 0.5, vjust = 0.5)) + # rotate x axis text
  labs(title = default_title, 
       subtitle = "References throughout the Harry Potter books",
       caption = default_caption,
       x = NULL, y = 'Name occurence', color = 'House')

The Harry Potter font looks wonderful, right?

In terms of the data, Gryffindor and Slytherin definitely play a larger role in the Harry Potter stories. However, as the storyline progresses, Slytherin as a house seems to lose its importance. Their downward trend since the Chamber of Secrets results in Ravenclaw being mentioned more often in the final book (Edit – this is likely due to the diadem horcrux, as you will see later on).

I can’t but feel sorry for house Hufflepuff, which never really gets to involved throughout the saga.

Retrieve Reference Words & Data

Let’s dive into the specific words used in combination with each house. The following code retrieves and counts the single words used in the sentences where houses are mentioned.

# IDENTIFY WORDS USED IN COMBINATION WITH HOUSES
words_by_houses <- houses_long %>% 
  unnest_tokens(word, sentence, token = 'words') %>% # retrieve words
  mutate(word = gsub("'s", "", word)) %>% # remove possesive determiners
  group_by(house, word) %>% 
  summarize(word_n = n()) # count words per house
# examine
words_by_houses %>% head()

## # A tibble: 6 x 3
## # Groups:   house [1]
##        house        word word_n
##        <chr>       <chr>  <int>
## 1 Gryffindor         104      1
## 2 Gryffindor        22nd      1
## 3 Gryffindor           a    251
## 4 Gryffindor   abandoned      1
## 5 Gryffindor  abandoning      1
## 6 Gryffindor abercrombie      1

Visualize Word-House Combinations

Now we can visualize which words relate to each of the houses. Because facet_wrap() has trouble reordering the axes (because words may related to multiple houses in different frequencies), I needed some custom functionality, which I happily recycled from dgrtwo’s github. With these reorder_within() and scale_x_reordered() we can now make an ordered barplot of the top-20 most frequent words per house.

# custom functions for reordering facet plots
# https://github.com/dgrtwo/drlib/blob/master/R/reorder_within.R
reorder_within <- function(x, by, within, fun = mean, sep = "___", ...) {
  new_x <- paste(x, within, sep = sep)
  reorder(new_x, by, FUN = fun)
}

scale_x_reordered <- function(..., sep = "___") {
  reg <- paste0(sep, ".+$")
  ggplot2::scale_x_discrete(labels = function(x) gsub(reg, "", x), ...)
}

# set plot width & height
w = 10; h = 7; 

# PLOT MOST FREQUENT WORDS PER HOUSE
words_per_house = 20 # set number of top words
words_by_houses %>%
  group_by(house) %>%
  arrange(house, desc(word_n)) %>%
  mutate(top = row_number()) %>% # count word top position
  filter(top <= words_per_house) %>% # retain specified top number
  ggplot(aes(reorder_within(word, -top, house), # reorder by minus top number
             word_n, fill = house)) +
  geom_col(show.legend = F) +
  scale_x_reordered() + # rectify x axis labels 
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) + 
  facet_wrap(~ house, scales = "free_y") + # facet wrap and free y axis
  coord_flip() +
  labs(title = default_title, 
       subtitle = "Words most commonly used together with houses",
       caption = default_caption,
       x = NULL, y = 'Word Frequency')

Unsurprisingly, several stop words occur most frequently in the data. Intuitively, we would rerun the code but use dplyr::anti_join() on tidytext::stop_words to remove stop words.

# PLOT MOST FREQUENT WORDS PER HOUSE
# EXCLUDING STOPWORDS
words_by_houses %>%
  anti_join(stop_words, 'word') %>% # remove stop words
  group_by(house) %>% 
  arrange(house, desc(word_n)) %>%
  mutate(top = row_number()) %>% # count word top position
  filter(top <= words_per_house) %>% # retain specified top number
  ggplot(aes(reorder_within(word, -top, house), # reorder by minus top number
             word_n, fill = house)) +
  geom_col(show.legend = F) +
  scale_x_reordered() + # rectify x axis labels
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) + 
  facet_wrap(~ house, scales = "free") + # facet wrap and free scales
  coord_flip() +
  labs(title = default_title, 
       subtitle = "Words most commonly used together with houses, excluding stop words",
       caption = default_caption,
       x = NULL, y = 'Word Frequency')

However, some stop words have a different meaning in the Harry Potter universe. points are for instance quite informative to the Hogwarts houses but included in stop_words.

Moreover, many of the most frequent words above occur in relation to multiple or all houses. Take, for instance, Harry and Ron, which are in the top-10 of each house, or words like table, house, and professor.

We are more interested in words that describe one house, but not another. Similarly, we only want to exclude stop words which are really irrelevant. To this end, we compute a ratio-statistic below. This statistic displays how frequently a word occurs in combination with one house rather than with the others. However, we need to adjust this ratio for how often houses occur in the text as more text (and thus words) is used in reference to house Gryffindor than, for instance, Ravenclaw.

words_by_houses <- words_by_houses %>%
  group_by(word) %>% mutate(word_sum = sum(word_n)) %>% # counts words overall
  group_by(house) %>% mutate(house_n = n()) %>%
  ungroup() %>%
    # compute ratio of usage in combination with house as opposed to overall
  # adjusted for house references frequency as opposed to overall frequency
  mutate(ratio = (word_n / (word_sum - word_n + 1) / (house_n / n()))) 
# examine
words_by_houses %>% select(-word_sum, -house_n) %>% arrange(desc(word_n)) %>% head()

## # A tibble: 6 x 4
##        house       word word_n     ratio
##        <chr>      <chr>  <int>     <dbl>
## 1 Gryffindor        the   1057  2.373115
## 2  Slytherin        the    675  1.467926
## 3 Gryffindor gryffindor    602 13.076218
## 4 Gryffindor        and    477  2.197259
## 5 Gryffindor         to    428  2.830435
## 6 Gryffindor         of    362  2.213186

# PLOT MOST UNIQUE WORDS PER HOUSE BY RATIO
words_by_houses %>%
  group_by(house) %>%
  arrange(house, desc(ratio)) %>%
  mutate(top = row_number()) %>% # count word top position
  filter(top <= words_per_house) %>% # retain specified top number
  ggplot(aes(reorder_within(word, -top, house), # reorder by minus top number
             ratio, fill = house)) +
  geom_col(show.legend = F) +
  scale_x_reordered() + # rectify x axis labels
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) + 
  facet_wrap(~ house, scales = "free") +  # facet wrap and free scales
  coord_flip() +
  labs(title = default_title, 
       subtitle = "Most informative words per house, by ratio",
       caption = default_caption,
       x = NULL, y = 'Adjusted Frequency Ratio (house vs. non-house)')

# PS. normally I would make a custom ggplot function 
#    when I plot three highly similar graphs

This ratio statistic (x-axis) should be interpreted as follows: night is used 29 times more often in combination with Gryffindor than with the other houses.

Do you think the results make sense:

Gryffindors spent dozens of hours during their afternoons, evenings, and nights in the, often empty, tower room, apparently playing chess? Nevile Longbottom and Hermione Granger are Gryffindors, obviously, and Sirius Black is also on the list. The sword of Gryffindor is no surprise here either.
Hannah Abbot, Ernie Macmillan and Cedric Diggory are Hufflepuffs. Were they mostly hot curly blondes interested in herbology? Nevertheless, wild and aggresive seem unfitting for Hogwarts most boring house.
A lot of names on the list of Helena Ravenclaw’s house. Roger Davies, Padma Patil, Cho Chang, Miss S. Fawcett, Stewart Ackerley, Terry Boot, and Penelope Clearwater are indeed Ravenclaws, I believe. Ravenclaw’s Diadem was one of Voldemort horcruxes. AlectoCarrow, Death Eater by profession, was apparently sent on a mission by Voldemort to surprise Harry in Rawenclaw’s common room (source), which explains what she does on this list. Can anybody tell me what bust, statue and spot have in relation to Ravenclaw?
House Slytherin is best represented by Gregory Goyle, one of the members of Draco Malfoy’s gang along with Vincent Crabbe. Pansy Parkinson also represents house Slytherin. Slytherin are famous for speaking Parseltongue and their house’s gem is an emerald. House Gaunt were pure-blood descendants from Salazar Slytherin and apparently Viktor Krum would not have misrepresented the Slytherin values either. Oh, and only the heir of Slytherin could control the monster in the Chamber of Secrets.

Honestly, I was not expecting such good results! However, there is always room for improvement.

We may want to exclude words that only occur once or twice in the book (e.g., Alecto) as well as the house names. Additionally, these barplots are not the optimal visualization if we would like to include more words per house. Fortunately, Hadley Wickham helped me discover treeplots. Let’s draw one using the ggfittext and the treemapify packages.

# set plot width & height
w = 12; h = 8; 

# PACKAGES FOR TREEMAP
# devtools::install_github("wilkox/ggfittext")
# devtools::install_github("wilkox/treemapify")
library(ggfittext)
library(treemapify)


# PLOT MOST UNIQUE WORDS PER HOUSE BY RATIO
words_by_houses %>%
  filter(word_n > 3) %>% # filter words with few occurances
  filter(!grepl(regex_houses, word)) %>% # exclude house names
  group_by(house) %>%
  arrange(house, desc(ratio), desc(word_n)) %>%
  mutate(top = seq_along(ratio)) %>%
  filter(top <= words_per_house) %>% # filter top n words
  ggplot(aes(area = ratio, label = word, subgroup = house, fill = house)) +
  geom_treemap() + # create treemap
  geom_treemap_text(aes(col = house), family = "HP", place = 'center') + # add text
  geom_treemap_subgroup_text(aes(col = house), # add house names
                             family = "HP", place = 'center', alpha = 0.3, grow = T) +
  geom_treemap_subgroup_border(colour = 'black') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) + 
  theme(legend.position = 'none') +
  labs(title = default_title, 
       subtitle = "Most informative words per house, by ratio",
       caption = default_caption)

treemap_house_word_ratio

A treemap can display more words for each of the houses and displays their relative proportions better. New words regarding the houses include the following, but do you see any others?

Slytherin girls laugh out loud whereas Ravenclaw had a few little, pretty girls?
Gryffindors, at least Harry and his friends, got in trouble often, that is a fact.
Yellow is the color of house Hufflepuff whereas Slytherin is green indeed.
Zacherias Smith joined Hufflepuff and Luna Lovegood Ravenclaw.
Why is Voldemort in camp Ravenclaw?!

In the earlier code, we specified a minimum number of occurances for words to be included, which is a bit hacky but necessary to make the ratio statistic work as intended. Foruntately, there are other ways to estimate how unique or informative words are to houses that do not require such hacks.

TF-IDF

tf-idf similarly estimates how unique / informative words are for a body of text (for more info: Wikipedia). We can calculate a tf-idf score for each word within each document (in our case house texts) by taking the product of two statistics:

TF or term frequency, meaning the number of times the word occurs in a document.
IDF or inverse document frequency, specifically the logarithm of the inverse number of documents the word occurs in.

A high tf-idf score means that a word occurs relatively often in a specific document and not often in other documents. Different weighting schemes can be used to td-idf’s performance in different settings but we used the simple default of tidytext::bind_tf_idf().

An advantage of tf-idf over the earlier ratio statistic is that we no longer need to specify a minimum frequency: low frequency words will have low tf and thus low tf-idf. A disadvantage is that tf-idf will automatically disregard words occur together with each house, be it only once: these words have zero idf (log(4/4)) so zero tf-idf.

Let’s run the treemap gain, but not on the computed tf-idf scores.

words_by_houses <- words_by_houses %>%
  # compute term frequency and inverse document frequency
  bind_tf_idf(word, house, word_n)
# examine
words_by_houses %>% select(-house_n) %>% head()

## # A tibble: 6 x 8
##        house        word word_n word_sum    ratio           tf       idf
##        <chr>       <chr>  <int>    <int>    <dbl>        <dbl>     <dbl>
## 1 Gryffindor         104      1        1 2.671719 6.488872e-05 1.3862944
## 2 Gryffindor        22nd      1        1 2.671719 6.488872e-05 1.3862944
## 3 Gryffindor           a    251      628 1.774078 1.628707e-02 0.0000000
## 4 Gryffindor   abandoned      1        1 2.671719 6.488872e-05 1.3862944
## 5 Gryffindor  abandoning      1        2 1.335860 6.488872e-05 0.6931472
## 6 Gryffindor abercrombie      1        1 2.671719 6.488872e-05 1.3862944
## # ... with 1 more variables: tf_idf <dbl>

# PLOT MOST UNIQUE WORDS PER HOUSE BY TF_IDF
words_per_house = 30
words_by_houses %>%
  filter(tf_idf > 0) %>% # filter for zero tf_idf
  group_by(house) %>%
  arrange(house, desc(tf_idf), desc(word_n)) %>%
  mutate(top = seq_along(tf_idf)) %>%
  filter(top <= words_per_house) %>%
  ggplot(aes(area = tf_idf, label = word, subgroup = house, fill = house)) +
  geom_treemap() + # create treemap
  geom_treemap_text(aes(col = house), family = "HP", place = 'center') + # add text
  geom_treemap_subgroup_text(aes(col = house), # add house names
                             family = "HP", place = 'center', alpha = 0.3, grow = T) +
  geom_treemap_subgroup_border(colour = 'black') +
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) + 
  theme(legend.position = 'none') +
  labs(title = default_title, 
       subtitle = "Most informative words per house, by tf-idf",
       caption = default_caption)

This plot looks quite different from its predecessor. For instance, Marcus Flint and Adrian Pucey are added to house Slytherin and Hufflepuff’s main color is indeed not just yellow, but canary yellow. Severus Snape’s dual role is also nicely depicted now, with him in both house Slytherin and house Gryffindor. Do you notice any other important differences? Did we lose any important words because they occured in each of our four documents?

House Personality Profiles (by NRC Sentiment Analysis)

We end this second Harry Plotter blog by examining to what the extent the stereotypes that exist of the Hogwarts Houses can be traced back to the books. To this end, we use the NRC sentiment dictionary, see also the the previous blog, with which we can estimate to what extent the most informative words for houses (we have over a thousand for each house) relate to emotions such as anger, fear, or trust.

The code below retains only the emotion words in our words_by_houses dataset and multiplies their tf-idf scores by their relative frequency, so that we retrieve one score per house per sentiment.

# PLOT SENTIMENT OF INFORMATIVE WORDS (TFIDF)
words_by_houses %>%
  inner_join(get_sentiments("nrc"), by = 'word') %>%
  group_by(house, sentiment) %>%
  summarize(score = sum(word_n / house_n * tf_idf)) %>% # compute emotion score
  ggplot(aes(x = house, y = score, group = house)) +
  geom_col(aes(fill = house)) + # create barplots
  geom_text(aes(y = score / 2, label = substring(house, 1, 1), col = house), 
            family = "HP", vjust = 0.5) + # add house letter in middle
  facet_wrap(~ Capitalize(sentiment), scales = 'free_y') + # facet and free y axis
  scale_fill_manual(values = houses_colors1) +
  scale_color_manual(values = houses_colors2) + 
  theme(legend.position = 'none', # tidy dataviz
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        strip.text.x = element_text(colour = 'black', size = 12)) +
  labs(title = default_title, 
       subtitle = "Sentiment (nrc) related to houses' informative words (tf-idf)",
       caption = default_caption,
       y = "Sentiment score", x = NULL)

The results to a large extent confirm the stereotypes that exist regarding the Hogwarts houses:

Gryffindors are full of anticipation and the most positive and trustworthy.
Hufflepuffs are the most joyous but not extraordinary on any other front.
Ravenclaws are distinguished by their low scores. They are super not-angry and relatively not-anticipating, not-negative, and not-sad.
Slytherins are the angriest, the saddest, and the most feared and disgusting. However, they are also relatively joyous (optimistic?) and very surprising (shocking?).

Conclusion and future work

With this we have come to the end of the second part of the Harry Plotter project, in which we used tf-idf and ratio statistics to examine which words were most informative / unique to each of the houses of Hogwarts. The data was retrieved using the harrypotter package and transformed using tidytext and the tidyverse. Visualizations were made with ggplot2 and treemapify, using a Harry Potter font.

I have several ideas for subsequent posts and I’d love to hear your preferences or suggestions:

I would like to demonstrate how regular expressions can be used to retrieve (sub)strings that follow a specific format. We could use regex to examine, for instance, when, and by whom, which magical spells are cast.
I would like to use network analysis to examine the interactions between the characters. We could retrieve networks from the books and conduct sentiment analysis to establish the nature of relationships. Similarly, we could use unsupervised learning / clustering to explore character groups.
I would like to use topic models, such as latent dirichlet allocation, to identify the main topics in the books. We could, for instance, try to summarize each book chapter in single sentence, or examine how topics (e.g., love or death) build or fall over time.
Finally, I would like to build an interactive application / dashboard in Shiny (another hobby of mine) so that readers like you can explore patterns in the books yourself. Unfortunately, the free on shinyapps.io only 25 hosting hours per month : (

For now, I hope you enjoyed this blog and that you’ll be back for more. To receive new content first, please subscribe to my website www.paulvanderlaken.com, follow me on Twitter, or add me on LinkedIn.

If you would like to contribute to, collaborate on, or need assistance with a data science project or venture, please feel free to reach out

Mapping Median Household Incomes in the US

The US Census Download Center contains rich information on its countries demographic data. Here you can find a piece of R code that uses the highcharter package in R to create an interactive map showing the median household per country.

Beer-in-hand Data Science

Obviously, analysing beer data in high on everybody’s list of favourite things to do in your weekend. Amanda Dobbyn wanted to examine whether she could provide us with an informative categorization the 45.000+ beers in her data set, without having to taste them all herself.

You can find the full report here but you may also like to interactively discover beer similarities yourself in Amanda’s Beer Clustering Shiny App. Or just have a quick look at some of Amanda’s wonderful visualizations below.

A density map of the bitterness (y-axis) and alcohol percentages (x-axis) in the most popular beer styles.

A k-means clustering of each of the 45000 beers in 10 clusters. Try out other settings in Amanda’s Beer Clustering Shiny App.

The alcohol percentages (x), bitterness (y) and cluster assignments of some popular beer styles.

Modelling beer’s bitterness (y) by the number of used hops (x).

Visualizing Neural Networks in Processing

Coding Train is a Youtube channel by Daniel Shiffman that covers anything from the basics of programming languages like JavaScript (with p5.js) and Java (with Processing) to generative algorithms like physics simulation, computer vision, and data visualization. In particular, these latter topics, which Shiffman bundles under the label “the Nature of Code”, draw me to the channel.

In a recent series, Daniel draws from his free e-book to create his seven-video playlist where he elaborates on the inner workings of neural networks, visualizing the entire process as he programs the algorithm from scratch in Processing (Java). I recommend the two videos below consisting of the actual programming, especially for beginners who want to get an intuitive sense of how a neural network works.

PS. I tend to watch them on double speed.

Part 1:

Part 2:

Short ggplot2 tutorial by MiniMaxir

The following was reposted from minimaxir.com

QUICK INTRODUCTION TO GGPLOT2

ggplot2 uses a more concise setup toward creating charts as opposed to the more declarative style of Python’s matplotlib and base R. And it also includes a few example datasets for practicing ggplot2 functionality; for example, the mpg dataset is a dataset of the performance of popular models of cars in 1998 and 2008.

Let’s say you want to create a scatter plot. Following a great example from the ggplot2 documentation, let’s plot the highway mileage of the car vs. the volume displacement of the engine. In ggplot2, first you instantiate the chart with the ggplot() function, specifying the source dataset and the core aesthetics you want to plot, such as x, y, color, and fill. In this case, we set the core aesthetics to x = displacement and y = mileage, and add a geom_point() layer to make a scatter plot:

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
            geom_point()

As we can see, there is a negative correlation between the two metrics. I’m sure you’ve seen plots like these around the internet before. But with only a couple of lines of codes, you can make them look more contemporary.

ggplot2 lets you add a well-designed theme with just one line of code. Relatively new to ggplot2 is theme_minimal(), which generates a muted style similar to FiveThirtyEight’s modern data visualizations:

p <- p +
    theme_minimal()

But we can still add color. Setting a color aesthetic on a character/categorical variable will set the colors of the corresponding points, making it easy to differentiate at a glance.

p <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) +
            geom_point() +
            theme_minimal()

Adding the color aesthetic certainly makes things much prettier. ggplot2 automatically adds a legend for the colors as well. However, for this particular visualization, it is difficult to see trends in the points for each class. A easy way around this is to add a least squares regression trendline for each class using geom_smooth() (which normally adds a smoothed line, but since there isn’t a lot of data for each group, we force it to a linear model and do not plot confidence intervals)

p <- p +
    geom_smooth(method = "lm", se = F)

Pretty neat, and now comparative trends are much more apparent! For example, pickups and SUVs have similar efficiency, which makes intuitive sense.

The chart axes should be labeled (always label your charts!). All the typical labels, like title, x-axis, and y-axis can be done with the labs() function. But relatively new to ggplot2 are the subtitle and caption fields, both of do what you expect:

p <- p +
    labs(title="Efficiency of Popular Models of Cars",
         subtitle="By Class of Car",
         x="Engine Displacement (liters)",
         y="Highway Miles per Gallon",
         caption="by Max Woolf — minimaxir.com")

That’s a pretty good start. Now let’s take it to the next level.

HOW TO SAVE A GGPLOT2 CHART FOR WEB

Something surprisingly undiscussed in the field of data visualization is how to save a chart as a high quality image file. For example, with Excel charts, Microsoft officially recommends to copy the chart, paste it as an image back into Excel, then save the pasted image, without having any control over image quality and size in the browser (the real best way to save an Excel/Numbers chart as an image for a webpage is to copy/paste the chart object into a PowerPoint/Keynote slide, and export the slideas an image. This also makes it extremely easy to annotate/brand said chart beforehand in PowerPoint/Keynote).

R IDEs such as RStudio have a chart-saving UI with the typical size/filetype options. But if you save an image from this UI, the shapes and texts of the resulting image will be heavily aliased (R renders images at 72 dpi by default, which is much lower than that of modern HiDPI/Retina displays).

The data visualizations used earlier in this post were generated in-line as a part of an R Notebook, but it is surprisingly difficult to extract the generated chart as a separate file. But ggplot2 also has ggsave(), which saves the image to disk using antialiasing and makes the fonts/shapes in the chart look much better, and assumes a default dpi of 300. Saving charts using ggsave(), and adjusting the sizes of the text and geoms to compensate for the higher dpi, makes the charts look very presentable. A width of 4 and a height of 3 results in a 1200x900px image, which if posted on a blog with a content width of ~600px (like mine), will render at full resolution on HiDPI/Retina displays, or downsample appropriately otherwise. Due to modern PNG compression, the file size/bandwidth cost for using larger images is minimal.

p <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) + 
    geom_smooth(method = "lm", se=F, size=0.5) +
    geom_point(size=0.5) +
    theme_minimal(base_size=9) +
    labs(title="Efficiency of Popular Models of Cars",
         subtitle="By Class of Car",
         x="Engine Displacement (liters)",
         y="Highway Miles per Gallon",
         caption="by Max Woolf — minimaxir.com")

ggsave("tutorial-0.png", p, width=4, height=3)

Compare to the previous non-ggsave chart, which is more blurry around text/shapes:

For posterity, here’s the same chart saved at 1200x900px using the RStudio image-saving UI:

Note that the antialiasing optimizations assume that you are not uploading the final chart to a service like Medium or WordPress.com, which will compress the images and reduce the quality anyways. But if you are uploading it to Reddit or self-hosting your own blog, it’s definitely worth it.

FANCY FONTS

Changing the chart font is another way to add a personal flair. Theme functions like theme_minimal()accept a base_family parameter. With that, you can specify any font family as the default instead of the base sans-serif. (On Windows, you may need to install the extrafont package first). Fonts from Google Fonts are free and work easily with ggplot2 once installed. For example, we can use Roboto, Google’s modern font which has also been getting a lot of usage on Stack Overflow’s great ggplot2 data visualizations.

p <- p +
    theme_minimal(base_size=9, base_family="Roboto")

A general text design guideline is to use fonts of different weights/widths for different hierarchies of content. In this case, we can use a bolder condensed font for the title, and deemphasize the subtitle and caption using lighter colors, all done using the theme() function.

p <- p + 
    theme(plot.subtitle = element_text(color="#666666"),
          plot.title = element_text(family="Roboto Condensed Bold"),
          plot.caption = element_text(color="#AAAAAA", size=6))

It’s worth nothing that data visualizations posted on websites should be easily legible for mobile-device users as well, hence the intentional use of larger fonts relative to charts typically produced in the desktop-oriented Excel.

Additionally, all theming options can be set as a session default at the beginning of a script using theme_set(), saving even more time instead of having to recreate the theme for each chart.

THE “GGPLOT2 COLORS”

The “ggplot2 colors” for categorical variables are infamous for being the primary indicator of a chart being made with ggplot2. But there is a science to it; ggplot2 by default selects colors using the scale_color_hue() function, which selects colors in the HSL space by changing the hue [H] between 0 and 360, keeping saturation [S] and lightness [L] constant. As a result, ggplot2 selects the most distinct colors possible while keeping lightness constant. For example, if you have 2 different categories, ggplot2 chooses the colors with h = 0 and h = 180; if 3 colors, h = 0, h = 120, h = 240, etc.

It’s smart, but does make a given chart lose distinctness when many other ggplot2 charts use the same selection methodology. A quick way to take advantage of this hue dispersion while still making the colors unique is to change the lightness; by default, l = 65, but setting it slightly lower will make the charts look more professional/Bloomberg-esque.

p_color <- p +
        scale_color_hue(l = 40)

RCOLORBREWER

Another coloring option for ggplot2 charts are the ColorBrewer palettes implemented with the RColorBrewer package, which are supported natively in ggplot2 with functions such as scale_color_brewer(). The sequential palettes like “Blues” and “Greens” do what the name implies:

p_color <- p +
        scale_color_brewer(palette="Blues")

A famous diverging palette for visualizations on /r/dataisbeautiful is the “Spectral” palette, which is a lighter rainbow (recommended for dark backgrounds)

However, while the charts look pretty, it’s difficult to tell the categories apart. The qualitative palettes fix this problem, and have more distinct possibilities than the scale_color_hue() approach mentioned earlier.

Here are 3 examples of qualitative palettes, “Set1”, “Set2”, and “Set3,” whichever fit your preference.

VIRIDIS AND ACCESSIBILITY

Let’s mix up the visualization a bit. A rarely-used-but-very-useful ggplot2 geom is geom2d_bin(), which counts the number of points in a given 2d spatial area:

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
    geom_bin2d(bins=10) +
    [...theming options...]

We see that the largest number of points are centered around (2,30). However, the default ggplot2 color palette for continuous variables is boring. Yes, we can use the RColorBrewer sequential palettes above, but as noted, they aren’t perceptually distinct, and could cause issues for readers who are colorblind.

The viridis R package provides a set of 4 high-contrast palettes which are very colorblind friendly, and works easily with ggplot2 by extending a scale_fill_viridis()/scale_color_viridis() function.

The default “viridis” palette has been increasingly popular on the web lately:

p_color <- p +
        scale_fill_viridis(option="viridis")

“magma” and “inferno” are similar, and give the data visualization a fiery edge:

Lastly, “plasma” is a mix between the 3 palettes above:

If you’ve been following my blog, I like to use R and ggplot2 for data visualization. A lot.

One of my older blog posts, An Introduction on How to Make Beautiful Charts With R and ggplot2, is still one of my most-trafficked posts years later, and even today I see techniques from that particular post incorporated into modern data visualizations on sites such as Reddit’s /r/dataisbeautiful subreddit.

NEXT STEPS

FiveThirtyEight actually uses ggplot2 for their data journalism workflow in an interesting way; they render the base chart using ggplot2, but export it as as a SVG/PDF vector file which can scale to any size, and then the design team annotates/customizes the data visualization in Adobe Illustrator before exporting it as a static PNG for the article (in general, I recommend using an external image editor to add text annotations to a data visualization because doing it manually in ggplot2 is inefficient).

For general use cases, ggplot2 has very strong defaults for beautiful data visualizations. And certainly there is a lot more you can do to make a visualization beautiful than what’s listed in this post, such as using facets and tweaking parameters of geoms for further distinction, but those are more specific to a given data visualization. In general, it takes little additional effort to make something unique with ggplot2, and the effort is well worth it. And prettier charts are more persuasive, which is a good return-on-investment.

Max Woolf (@minimaxir) is a former Apple Software QA Engineer living in San Francisco and a Carnegie Mellon University graduate. In his spare time, Max uses Python to gather data from public APIs and ggplot2 to plot plenty of pretty charts from that data. You can learn more about Max here, view his data analysis portfolio here, or view his coding portfolio here.