This is reposted from DavisVaughan.com with minor modifications.
Introduction
A while back, I saw a conversation on Twitter about how Hadley uses the word “pleased” very often when introducing a new blog post (I couldn’t find the tweet again. Can anyone help?). Out of curiosity, and to flex my R web scraping muscles a bit, I’ve decided to analyze the 240+ blog posts that RStudio has put out since 2011. This post will do a few things:
- Scrape the RStudio blog archive page to construct URL links to each blog post
- Scrape the blog post text and metadata from each post
- Use a bit of tidytext for some exploratory analysis
- Perform a statistical test to compare Hadley’s use of “pleased” to the other blog post authors
Spoiler alert: Hadley uses “pleased” A LOT.
Required packages
library(tidyverse)
library(tidytext)
library(rvest)
library(xml2)
HTML from each blog post
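(This excerpt picks up after the archive page has been scraped into a vector called links. As a rough sketch of that earlier step, the code below shows one way it might look; the archive URL, the anchor selector, and the href filter are all illustrative assumptions, not the original code.)

```r
# Hedged sketch: build the vector of blog post URLs from the archive page.
# The URL and the "^/20" href pattern are assumptions for illustration.
library(rvest)
library(xml2)

archive_url <- "https://blog.rstudio.com/archives/"

links <- read_html(archive_url) %>%
  html_nodes("a") %>%                      # every anchor tag on the archive page
  html_attr("href") %>%                    # pull out the href attribute
  purrr::keep(~grepl("^/20", .x)) %>%      # keep post-like relative links (assumption)
  paste0("https://blog.rstudio.com", .)    # build absolute URLs
```

However the links are built, the rest of the post only needs a character vector of post URLs.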
Now that we have every link, we’re ready to extract the HTML from each individual blog post. To make things more manageable, we start by creating a tibble, and then use the mutate + map combination to create a column of XML nodesets (we will use this combination a lot). Each nodeset contains the HTML for that blog post (exactly like the HTML for the archive page).
blog_data <- tibble(links)

blog_data <- blog_data %>%
  mutate(main = map(
    .x = links,
    .f = ~read_html(.x) %>%
      html_nodes("main")
  ))

select(blog_data, main)

blog_data$main[1]
## [[1]]
## {xml_nodeset (1)}
## [1] <main><div class="article-meta">\n<h1><span class="title">RStudio 1. ...
Meta information
Before extracting the blog post itself, let’s grab the meta information about each post, specifically:
- Author
- Title
- Date
- Category
- Tags
In the exploratory analysis, we will use author and title, but the other information might be useful for future analysis.
Looking at the first blog post, the Author, Date, and Title are all HTML class names that we can feed into rvest to extract that information.

In the code below, an example of extracting the author information is shown. To select an HTML class (like “author”) as opposed to a tag (like “main”), we have to put a period in front of the class name. Once the HTML node we are interested in has been identified, we can extract the text for that node using html_text().
blog_data$main[[1]] %>%
  html_nodes(".author") %>%
  html_text()
To scale up and grab the author for all posts, we use map_chr(), since we want a character vector of author names returned.
map_chr(.x = blog_data$main,
        .f = ~html_nodes(.x, ".author") %>%
          html_text()) %>%
  head(10)
Finally, notice that if we swap ".author" for ".title" or ".date", we can grab that information as well. That kind of repetition means we should create a function for extracting these pieces of information!
extract_info <- function(html, class_name) {
  map_chr(
    .x = html,
    .f = ~html_nodes(.x, class_name) %>%
      html_text()
  )
}
blog_data <- blog_data %>%
  mutate(
    author = extract_info(main, ".author"),
    title  = extract_info(main, ".title"),
    date   = extract_info(main, ".date")
  )

select(blog_data, author, date)

select(blog_data, title)
The blog post itself
Finally, to extract the blog post itself, notice that each piece of text in the post is inside a paragraph tag (p). Being careful to avoid the ".terms" class that contains the categories and tags (which also happens to be in a paragraph tag), we can extract the full blog posts. To ignore the ".terms" class, use the :not() selector.
blog_data <- blog_data %>%
  mutate(
    text = map_chr(main, ~html_nodes(.x, "p:not(.terms)") %>%
             html_text() %>%
             paste0(collapse = " "))
  )

select(blog_data, text)
Who writes the most posts?
Now that we have all of this data, what can we do with it? To start with, who writes the most posts?
blog_data %>%
  group_by(author) %>%
  summarise(count = n()) %>%
  mutate(author = reorder(author, count)) %>%
  ggplot(mapping = aes(x = author, y = count)) +
  geom_col() +
  coord_flip() +
  labs(title = "Who writes the most RStudio blog posts?",
       subtitle = "By a huge margin, Hadley!") +
  hrbrthemes::theme_ipsum(grid = "Y")

Tidytext
I’ve never used tidytext before today, but to get our feet wet, let’s create a tokenized tidy version of our data. By using unnest_tokens(), the data will be reshaped into a long format holding one word per row for each blog post. This tidy format lends itself to all manner of analysis, a number of which are outlined in Julia Silge and David Robinson’s Text Mining with R.
tokenized_blog <- blog_data %>%
  select(title, author, date, text) %>%
  unnest_tokens(output = word, input = text)

select(tokenized_blog, title, word)
Remove stop words
A number of words like “a” or “the” are included in the blog posts but don’t really add value to a text analysis. These stop words can be removed using an anti_join() with the stop_words dataset that comes with tidytext. After removing stop words, the number of rows was cut in half!
tokenized_blog <- tokenized_blog %>%
  anti_join(stop_words, by = "word") %>%
  arrange(desc(date))

select(tokenized_blog, title, word)
Top 15 words overall
Out of pure curiosity, what are the top 15 words across all of the blog posts?
tokenized_blog %>%
  count(word, sort = TRUE) %>%
  slice(1:15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 15 words overall") +
  hrbrthemes::theme_ipsum(grid = "Y")

Is Hadley more “pleased” than everyone else?
As mentioned at the beginning of the post, Hadley apparently uses the word “pleased” in his blog posts an above average number of times. Can we verify this statistically?
Our null hypothesis is that the proportion of blog posts that use the word “pleased” written by Hadley is less than or equal to the proportion of those written by the rest of the RStudio team.
More simply, our null is that Hadley uses “pleased” less than or the same as the rest of the team.
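Written out in symbols, with $p_H$ as the proportion of Hadley’s posts containing “pleased” and $p_O$ the corresponding proportion for the other authors, the hypotheses stated above are:

```latex
H_0 : p_H \le p_O \qquad \text{vs.} \qquad H_1 : p_H > p_O
```

This is simply the formalization of the null and alternative described in the text; the subscripts are my own notation.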
Let’s check visually to compare the two groups of posts.
pleased <- tokenized_blog %>%
  group_by(title) %>%
  mutate(
    contains_pleased = case_when(
      "pleased" %in% word ~ "Yes",
      TRUE ~ "No"
    ),
    is_hadley = case_when(
      author == "Hadley Wickham" ~ "Hadley",
      TRUE ~ "Not Hadley"
    )
  ) %>%
  distinct(title, contains_pleased, is_hadley)
pleased %>%
  ggplot(aes(x = contains_pleased)) +
  geom_bar() +
  facet_wrap(~is_hadley, scales = "free_y") +
  labs(title = "Does this blog post contain 'pleased'?",
       subtitle = "Nearly half of Hadley's do!",
       x = "Contains 'pleased'",
       y = "Count") +
  hrbrthemes::theme_ipsum(grid = "Y")

Is there a statistical difference here?
To check whether there is a statistically significant difference, we will use a test for difference in proportions, implemented in the R function prop.test(). First, we need a contingency table of the counts. Given the current form of our dataset, this isn’t too hard with the table() function from base R.
contingency_table <- pleased %>%
  ungroup() %>%
  select(is_hadley, contains_pleased) %>%
  mutate(contains_pleased = factor(contains_pleased, levels = c("Yes", "No"))) %>%
  table()
contingency_table
From our null hypothesis, we want to perform a one-sided test. The alternative to our null is that Hadley uses “pleased” more than the rest of the RStudio team. For this reason, we specify alternative = "greater".
test_prop <- contingency_table %>%
  prop.test(alternative = "greater")

test_prop
We could also tidy this up with broom if we were so inclined.
broom::tidy(test_prop)
## estimate1 estimate2 statistic p.value parameter conf.low conf.high
## 1 0.4886364 0.1055901 43.57517 2.039913e-11 1 0.2779818 1
## method
## 1 2-sample test for equality of proportions with continuity correction
## alternative
## 1 greater
Test conclusion
- 48.86% of Hadley’s posts contain “pleased”
- 10.56% of the rest of the RStudio team’s posts contain “pleased”
- With a p-value of 2.04e-11, we reject the null that Hadley uses “pleased” less than or the same as the rest of the team. The evidence supports the idea that he has a much higher preference for it!
Hadley uses “pleased” quite a bit!
About the author
Davis Vaughan is a Master’s student studying Mathematical Finance at the University of North Carolina at Charlotte. He is the other half of Business Science. We develop R packages for financial analysis. Additionally, we have a network of data scientists at our disposal to bring together the best team to work on consulting projects. Check out our website to learn more! He is the coauthor of R packages tidyquant and timetk.