Follow

Todd Vaziri

✔@tvaziri

Every non-hyperbolic tweet is from iPhone (his staff).

Every hyperbolic tweet is from Android (from him).

Our null hypothesis is that the proportion of blog posts that use the word “pleased” written by Hadley is less than or equal to the proportion of those written by the rest of the RStudio team.

library(tidyverse)
library(tidytext)
library(rvest)
library(xml2)
archive_page <- "https://blog.rstudio.com/archives/"

archive_html <- read_html(archive_page)

# Doesn't seem very useful...yet
archive_html
## {xml_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n    <nav class="menu"><svg version="1.1" xmlns="http://www.w ...
links <- archive_html %>%

  # Only the "main" body of the archive
  html_nodes("main") %>%

  # Grab any node that is a link
  html_nodes("a") %>%

  # Extract the hyperlink reference from those link tags
  # The hyperlink is an attribute as opposed to a node
  html_attr("href") %>%

  # Prefix them all with the base URL
  paste0("http://blog.rstudio.com", .)

head(links)
## [1] "http://blog.rstudio.com/2017/08/16/rstudio-preview-connections/"             
## [2] "http://blog.rstudio.com/2017/08/15/contributed-talks-diversity-scholarships/"
## [3] "http://blog.rstudio.com/2017/08/11/rstudio-v1-1-preview-terminal/"           
## [4] "http://blog.rstudio.com/2017/08/10/upcoming-workshops/"                      
## [5] "http://blog.rstudio.com/2017/08/03/rstudio-connect-v1-5-4-plumber/"          
## [6] "http://blog.rstudio.com/2017/07/31/sparklyr-0-6/"
blog_data <- tibble(links)

blog_data <- blog_data %>%
  mutate(main = map(
                    # Iterate through every link
                    .x = links, 

                    # For each link, read the HTML for that page, and return the main section 
                    .f = ~read_html(.) %>%
                            html_nodes("main")
                    )
         )

select(blog_data, main)
## # A tibble: 249 x 1
##                 main
##               <list>
##  1 <S3: xml_nodeset>
##  2 <S3: xml_nodeset>
##  3 <S3: xml_nodeset>
##  4 <S3: xml_nodeset>
##  5 <S3: xml_nodeset>
##  6 <S3: xml_nodeset>
##  7 <S3: xml_nodeset>
##  8 <S3: xml_nodeset>
##  9 <S3: xml_nodeset>
## 10 <S3: xml_nodeset>
## # ... with 239 more rows
blog_data$main[1]
## [[1]]
## {xml_nodeset (1)}
## [1] <main><div class="article-meta">\n<h1><span class="title">RStudio 1. ...
blog_data$main[[1]] %>%
  html_nodes(".author") %>%
  html_text()
## [1] "Jonathan McPherson"
map_chr(.x = blog_data$main,
        .f = ~html_nodes(.x, ".author") %>%
                html_text()) %>%
  head(10)
##  [1] "Jonathan McPherson" "Hadley Wickham"     "Gary Ritchie"      
##  [4] "Roger Oberg"        "Jeff Allen"         "Javier Luraschi"   
##  [7] "Hadley Wickham"     "Roger Oberg"        "Garrett Grolemund" 
## [10] "Hadley Wickham"
extract_info <- function(html, class_name) {
  map_chr(
          # Given the list of main HTMLs
          .x = html,

          # Extract the text we are interested in for each one 
          .f = ~html_nodes(.x, class_name) %>%
                  html_text())
}

# Extract the data
blog_data <- blog_data %>%
  mutate(
     author = extract_info(main, ".author"),
     title  = extract_info(main, ".title"),
     date   = extract_info(main, ".date")
    )

select(blog_data, author, date)
## # A tibble: 249 x 2
##                author       date
##                 <chr>      <chr>
##  1 Jonathan McPherson 2017-08-16
##  2     Hadley Wickham 2017-08-15
##  3       Gary Ritchie 2017-08-11
##  4        Roger Oberg 2017-08-10
##  5         Jeff Allen 2017-08-03
##  6    Javier Luraschi 2017-07-31
##  7     Hadley Wickham 2017-07-13
##  8        Roger Oberg 2017-07-12
##  9  Garrett Grolemund 2017-07-11
## 10     Hadley Wickham 2017-06-27
## # ... with 239 more rows
select(blog_data, title)
## # A tibble: 249 x 1
##                                                                          title
##                                                                          <chr>
##  1                                      RStudio 1.1 Preview - Data Connections
##  2 rstudio::conf(2018): Contributed talks, e-posters, and diversity scholarshi
##  3                                              RStudio v1.1 Preview: Terminal
##  4                                                Building tidy tools workshop
##  5                            RStudio Connect v1.5.4 - Now Supporting Plumber!
##  6                                                                sparklyr 0.6
##  7                                                                 haven 1.1.0
##  8                                   Registration open for rstudio::conf 2018!
##  9                                                          Introducing learnr
## 10                                                                dbplyr 1.1.0
## # ... with 239 more rows
extract_tag_or_cat <- function(html, info_name) {

  # Extract the links under the terms class
  cats_and_tags <- map(.x = html, 
                       .f = ~html_nodes(.x, ".terms") %>%
                              html_nodes("a"))

  # For each link, if the href contains the word categories/tags 
  # return the text corresponding to that link
  map(cats_and_tags, 
    ~if_else(condition = grepl(info_name, html_attr(.x, "href")), 
             true      = html_text(.x), 
             false     = NA_character_) %>%
      .[!is.na(.)])
}

# Apply our new extraction function
blog_data <- blog_data %>%
  mutate(
    categories = extract_tag_or_cat(main, "categories"),
    tags       = extract_tag_or_cat(main, "tags")
  )

select(blog_data, categories, tags)
## # A tibble: 249 x 2
##    categories       tags
##        <list>     <list>
##  1  <chr [1]>  <chr [0]>
##  2  <chr [1]>  <chr [0]>
##  3  <chr [1]>  <chr [3]>
##  4  <chr [3]>  <chr [8]>
##  5  <chr [3]>  <chr [2]>
##  6  <chr [1]>  <chr [3]>
##  7  <chr [2]>  <chr [0]>
##  8  <chr [4]> <chr [13]>
##  9  <chr [2]>  <chr [2]>
## 10  <chr [2]>  <chr [0]>
## # ... with 239 more rows
blog_data$categories[4]
## [[1]]
## [1] "Packages"  "tidyverse" "Training"
blog_data$tags[4]
## [[1]]
## [1] "Advanced R"       "data science"     "ggplot2"         
## [4] "Hadley Wickham"   "R"                "RStudio Workshop"
## [7] "r training"       "tutorial"
blog_data <- blog_data %>%
  mutate(
    text = map_chr(main, ~html_nodes(.x, "p:not(.terms)") %>%
                 html_text() %>%
                 # The text is returned as a character vector. 
                 # Collapse them all into 1 string.
                 paste0(collapse = " "))
  )

select(blog_data, text)
## # A tibble: 249 x 1
##                                                                           text
##                                                                          <chr>
##  1 Today, we’re continuing our blog series on new features in RStudio 1.1. If 
##  2 rstudio::conf, the conference on all things R and RStudio, will take place 
##  3 Today we’re excited to announce availability of our first Preview Release f
##  4 Have you embraced the tidyverse? Do you now want to expand it to meet your 
##  5 We’re thrilled to announce support for hosting Plumber APIs in RStudio Conn
##  6 We’re excited to announce a new release of the sparklyr package, available 
##  7 "I’m pleased to announce the release of haven 1.1.0. Haven is designed to f
##  8 RStudio is very excited to announce that rstudio::conf 2018 is open for reg
##  9 We’re pleased to introduce the learnr package, now available on CRAN. The l
## 10 "I’m pleased to announce the release of the dbplyr package, which now conta
## # ... with 239 more rows
blog_data %>%
  group_by(author) %>%
  summarise(count = n()) %>%
  mutate(author = reorder(author, count)) %>%

  # Create a bar graph of author counts
  ggplot(mapping = aes(x = author, y = count)) + 
  geom_col() +
  coord_flip() +
  labs(title    = "Who writes the most RStudio blog posts?",
       subtitle = "By a huge margin, Hadley!") +
  # Shoutout to Bob Rudis for the always fantastic themes
  hrbrthemes::theme_ipsum(grid = "Y")
tokenized_blog <- blog_data %>%
  select(title, author, date, text) %>%
  unnest_tokens(output = word, input = text)

select(tokenized_blog, title, word)
## # A tibble: 84,542 x 2
##                                     title       word
##                                     <chr>      <chr>
##  1 RStudio 1.1 Preview - Data Connections      today
##  2 RStudio 1.1 Preview - Data Connections      we’re
##  3 RStudio 1.1 Preview - Data Connections continuing
##  4 RStudio 1.1 Preview - Data Connections        our
##  5 RStudio 1.1 Preview - Data Connections       blog
##  6 RStudio 1.1 Preview - Data Connections     series
##  7 RStudio 1.1 Preview - Data Connections         on
##  8 RStudio 1.1 Preview - Data Connections        new
##  9 RStudio 1.1 Preview - Data Connections   features
## 10 RStudio 1.1 Preview - Data Connections         in
## # ... with 84,532 more rows
tokenized_blog <- tokenized_blog %>%
  anti_join(stop_words, by = "word") %>%
  arrange(desc(date))

select(tokenized_blog, title, word)
## # A tibble: 39,768 x 2
##                                     title            word
##                                     <chr>           <chr>
##  1 RStudio 1.1 Preview - Data Connections          server
##  2 RStudio 1.1 Preview - Data Connections          here’s
##  3 RStudio 1.1 Preview - Data Connections           isn’t
##  4 RStudio 1.1 Preview - Data Connections straightforward
##  5 RStudio 1.1 Preview - Data Connections             pro
##  6 RStudio 1.1 Preview - Data Connections         command
##  7 RStudio 1.1 Preview - Data Connections         console
##  8 RStudio 1.1 Preview - Data Connections           makes
##  9 RStudio 1.1 Preview - Data Connections           makes
## 10 RStudio 1.1 Preview - Data Connections          you’re
## # ... with 39,758 more rows
tokenized_blog %>%
  count(word, sort = TRUE) %>%
  slice(1:15) %>%
  mutate(word = reorder(word, n)) %>%

  ggplot(aes(word, n)) +
  geom_col() + 
  coord_flip() + 
  labs(title = "Top 15 words overall") +
  hrbrthemes::theme_ipsum(grid = "Y")
pleased <- tokenized_blog %>%

  # Group by blog post
  group_by(title) %>%

  # If the blog post contains "pleased" put yes, otherwise no
  # Add a column checking if the author was Hadley
  mutate(
    contains_pleased = case_when(
      "pleased" %in% word ~ "Yes",
      TRUE                ~ "No"),

    is_hadley = case_when(
      author == "Hadley Wickham" ~ "Hadley",
      TRUE                       ~ "Not Hadley")
    ) %>%

  # Remove all duplicates now
  distinct(title, contains_pleased, is_hadley)

pleased %>%
  ggplot(aes(x = contains_pleased)) +
  geom_bar() +
  facet_wrap(~is_hadley, scales = "free_y") +
  labs(title    = "Does this blog post contain 'pleased'?", 
       subtitle = "Nearly half of Hadley's do!",
       x        = "Contains 'pleased'",
       y        = "Count") +
  hrbrthemes::theme_ipsum(grid = "Y")
contingency_table <- pleased %>%
  ungroup() %>%
  select(is_hadley, contains_pleased) %>%
  # Order the factor so Yes is before No for easy interpretation
  mutate(contains_pleased = factor(contains_pleased, levels = c("Yes", "No"))) %>%
  table()

contingency_table
##             contains_pleased
## is_hadley    Yes  No
##   Hadley      43  45
##   Not Hadley  17 144
test_prop <- contingency_table %>%
  prop.test(alternative = "greater")

test_prop
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  .
## X-squared = 43.575, df = 1, p-value = 2.04e-11
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.2779818 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.4886364 0.1055901
broom::tidy(test_prop)
##   estimate1 estimate2 statistic      p.value parameter  conf.low conf.high
## 1 0.4886364 0.1055901  43.57517 2.039913e-11         1 0.2779818         1
##                                                                 method
## 1 2-sample test for equality of proportions with continuity correction
##   alternative
## 1     greater
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
            geom_point()

p <- p +
    theme_minimal()

p <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) +
            geom_point() +
            theme_minimal()

p <- p +
    geom_smooth(method = "lm", se = F)

p <- p +
    labs(title="Efficiency of Popular Models of Cars",
         subtitle="By Class of Car",
         x="Engine Displacement (liters)",
         y="Highway Miles per Gallon",
         caption="by Max Woolf — minimaxir.com")

p <- ggplot(mpg, aes(x = displ, y = hwy, color=class)) + 
    geom_smooth(method = "lm", se=F, size=0.5) +
    geom_point(size=0.5) +
    theme_minimal(base_size=9) +
    labs(title="Efficiency of Popular Models of Cars",
         subtitle="By Class of Car",
         x="Engine Displacement (liters)",
         y="Highway Miles per Gallon",
         caption="by Max Woolf — minimaxir.com")

ggsave("tutorial-0.png", p, width=4, height=3)

p <- p +
    theme_minimal(base_size=9, base_family="Roboto")

p <- p + 
    theme(plot.subtitle = element_text(color="#666666"),
          plot.title = element_text(family="Roboto Condensed Bold"),
          plot.caption = element_text(color="#AAAAAA", size=6))

p_color <- p +
        scale_color_hue(l = 40)

p_color <- p +
        scale_color_brewer(palette="Blues")

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
    geom_bin2d(bins=10) +
    [...theming options...]

p_color <- p +
        scale_fill_viridis(option="viridis")

library(dplyr)
library(purrr)
library(twitteR)
# You'd need to set global options with an authenticated app
setup_twitter_oauth(getOption("twitter_consumer_key"),
                    getOption("twitter_consumer_secret"),
                    getOption("twitter_access_token"),
                    getOption("twitter_access_token_secret"))

# We can request only 3200 tweets at a time; it will return fewer
# depending on the API
trump_tweets <- userTimeline("realDonaldTrump", n = 3200)
trump_tweets_df <- tbl_df(map_df(trump_tweets, as.data.frame))
# if you want to follow along without setting up Twitter authentication,
# just use my dataset:
load(url("http://varianceexplained.org/files/trump_tweets_df.rda"))
library(tidyr)

tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
library(lubridate)
library(scales)

tweets %>%
  count(source, hour = hour(with_tz(created, "EST"))) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(hour, percent, color = source)) +
  geom_line() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Hour of day (EST)",
       y = "% of tweets",
       color = "")
tweet_picture_counts <- tweets %>%
  filter(!str_detect(text, '^"')) %>%
  count(source,
        picture = ifelse(str_detect(text, "t.co"),
                         "Picture/link", "No picture/link"))

ggplot(tweet_picture_counts, aes(source, n, fill = picture)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "", y = "Number of tweets", fill = "")
library(tidytext)

reg <- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
tweet_words <- tweets %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|&", "")) %>%
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))

tweet_words
## # A tibble: 8,753 x 4
##                    id source             created                   word
##                                                   
## 1  676494179216805888 iPhone 2015-12-14 20:09:15                 record
## 2  676494179216805888 iPhone 2015-12-14 20:09:15                 health
## 3  676494179216805888 iPhone 2015-12-14 20:09:15 #makeamericagreatagain
## 4  676494179216805888 iPhone 2015-12-14 20:09:15             #trump2016
## 5  676509769562251264 iPhone 2015-12-14 21:11:12               accolade
## 6  676509769562251264 iPhone 2015-12-14 21:11:12             @trumpgolf
## 7  676509769562251264 iPhone 2015-12-14 21:11:12                 highly
## 8  676509769562251264 iPhone 2015-12-14 21:11:12              respected
## 9  676509769562251264 iPhone 2015-12-14 21:11:12                   golf
## 10 676509769562251264 iPhone 2015-12-14 21:11:12                odyssey
## # ... with 8,743 more rows
android_iphone_ratios <- tweet_words %>%
  count(word, source) %>%
  filter(sum(n) >= 5) %>%
  spread(source, n, fill = 0) %>%
  ungroup() %>%
  mutate_each(funs((. + 1) / sum(. + 1)), -word) %>%
  mutate(logratio = log2(Android / iPhone)) %>%
  arrange(desc(logratio))
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)

nrc
## # A tibble: 13,901 x 2
##           word sentiment
##               
## 1       abacus     trust
## 2      abandon      fear
## 3      abandon  negative
## 4      abandon   sadness
## 5    abandoned     anger
## 6    abandoned      fear
## 7    abandoned  negative
## 8    abandoned   sadness
## 9  abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows
sources <- tweet_words %>%
  group_by(source) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, source, total_words)

by_source_sentiment <- tweet_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(sources) %>%
  group_by(source, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  ungroup()

head(by_source_sentiment)
## # A tibble: 6 x 4
##    source    sentiment total_words words
##                     
## 1 Android        anger        4901   321
## 2 Android anticipation        4901   256
## 3 Android      disgust        4901   207
## 4 Android         fear        4901   268
## 5 Android          joy        4901   199
## 6 Android     negative        4901   560
library(broom)

sentiment_differences <- by_source_sentiment %>%
  group_by(sentiment) %>%
  do(tidy(poisson.test(.$words, .$total_words)))

sentiment_differences
## Source: local data frame [10 x 9]
## Groups: sentiment [10]
## 
##       sentiment estimate statistic      p.value parameter  conf.low
##           (chr)    (dbl)     (dbl)        (dbl)     (dbl)     (dbl)
## 1         anger 1.492863       321 2.193242e-05  274.3619 1.2353162
## 2  anticipation 1.169804       256 1.191668e-01  239.6467 0.9604950
## 3       disgust 1.677259       207 1.777434e-05  170.2164 1.3116238
## 4          fear 1.560280       268 1.886129e-05  225.6487 1.2640494
## 5           joy 1.002605       199 1.000000e+00  198.7724 0.8089357
## 6      negative 1.692841       560 7.094486e-13  459.1363 1.4586926
## 7      positive 1.058760       555 3.820571e-01  541.4449 0.9303732
## 8       sadness 1.620044       303 1.150493e-06  251.9650 1.3260252
## 9      surprise 1.167925       159 2.174483e-01  148.9393 0.9083517
## 10        trust 1.128482       369 1.471929e-01  350.5114 0.9597478
## Variables not shown: conf.high (dbl), method (fctr), alternative (fctr)
library(rtweet)
library(dplyr)
library(purrr)
library(igraph)
library(ggraph)

r_users <- search_users("#rstats", n = 1000)

r_users %>% summarise(n_users = n_distinct(screen_name))

##   n_users
## 1     564

r_users_info <- lookup_users(r_users$screen_name)

r_users_info %>% select(dplyr::contains("count")) %>% head()

##   followers_count friends_count listed_count favourites_count
## 1            8311           366          580             9325
## 2           44474            11         1298                3
## 3           11106           524          467            18495
## 4           12481           431          542             7222
## 5           15345          1872          680            27971
## 6            5122           700          549             2796
##   statuses_count
## 1          66117
## 2           1700
## 3           8853
## 4           6388
## 5          22194
## 6          10010

r_users_ranking <- r_users_info %>%
  filter(protected == FALSE) %>% 
  select(screen_name, dplyr::contains("count")) %>% 
  unique() %>% 
  mutate(followers_percentile = ecdf(followers_count)(followers_count),
         friends_percentile = ecdf(friends_count)(friends_count),
         listed_percentile = ecdf(listed_count)(listed_count),
         favourites_percentile = ecdf(favourites_count)(favourites_count),
         statuses_percentile = ecdf(statuses_count)(statuses_count)
         ) %>% 
  group_by(screen_name) %>% 
  summarise(top_score = followers_percentile + friends_percentile + listed_percentile + favourites_percentile + statuses_percentile) %>% 
  ungroup() %>% 
  mutate(ranking = rank(-top_score))

top_30 <- r_users_ranking %>% arrange(desc(top_score)) %>% head(30) %>% arrange(desc(top_score))
top_30 

## # A tibble: 30 x 3
##        screen_name top_score ranking
##              <chr>     <dbl>   <dbl>
##  1          hspter  4.877005       1
##  2    RallidaeRule  4.839572       2
##  3         DEJPett  4.771836       3
##  4 modernscientist  4.752228       4
##  5 nicoleradziwill  4.700535       5
##  6      tomhouslay  4.684492       6
##  7    ChetanChawla  4.639929       7
##  8   TheSmartJokes  4.627451       8
##  9   Physical_Prep  4.625668       9
## 10       Cataranea  4.602496      10
## # ... with 20 more rows

top30_lookup <- r_users_info %>%
  filter(screen_name %in% top_30$screen_name) %>% 
  select(screen_name, user_id)

top30_lookup$gender <- c("M", "F", "F", "F", "F",
                         "M", "M", "M", "F", "F", 
                         "F", "M", "M", "M", "F", 
                         "F", "M", "M", "M", "M", 
                         "M", "M", "M", "F", "M",
                         "M", "M", "M", "M", "M")

table(top30_lookup$gender)

## 
##  F  M 
## 10 20

top_30_usernames <- top30_lookup$screen_name

friends_top30a <-   map(top_30_usernames[1:15 ], get_friends)
names(friends_top30a) <- top_30_usernames[1:15]

# 15 minutes later....
friends_top30b <- map(top_30_usernames[16:30], get_friends)

# turning lists into data frames and putting them together
friends_top30 <- append(friends_top30a, friends_top30b)
names(friends_top30) <- top_30_usernames

# purrr - trick I've been banging on about!
friends_top <- map2_df(friends_top30, names(friends_top30), ~ mutate(.x, twitter_top_user = .y)) %>% 
  rename(friend_id = user_id) %>% select(twitter_top_user, friend_id)

# getting a full list of friends
SJ1 <- get_friends("TheSmartJokes")
SJ2 <- get_friends("TheSmartJokes", page = next_cursor(SJ1))

# putting the data frames together 
SJ_friends <-rbind(SJ1, SJ2) %>%  
  rename(friend_id = user_id) %>% 
  mutate(twitter_top_user = "TheSmartJokes") %>% 
  select(twitter_top_user, friend_id)

# the final results - over 8000 friends, rather than 5000
str(SJ_friends) 

## 'data.frame':    8611 obs. of  2 variables:
##  $ twitter_top_user: chr  "TheSmartJokes" "TheSmartJokes" "TheSmartJokes" "TheSmartJokes" ...
##  $ friend_id       : chr  "390877754" "6085962" "88540151" "108186743" ...

friends_top30 <- friends_top %>% 
  filter(twitter_top_user != "TheSmartJokes") %>% 
  rbind(SJ_friends) 

# select friends that are top30 users
final_friends_top30 <- friends_top  %>% 
  filter(friend_id %in% top30_lookup$user_id)

# add friends' screen_name
final_friends_top30$friend_name <- top30_lookup$screen_name[match(final_friends_top30$friend_id, top30_lookup$user_id)]

# add users' and friends' gender
final_friends_top30$user_gender <- top30_lookup$gender[match(final_friends_top30$twitter_top_user, top30_lookup$screen_name)]
final_friends_top30$friend_gender <- top30_lookup$gender[match(final_friends_top30$friend_name, top30_lookup$screen_name)]

## final product!!!
final <- final_friends_top30 %>% select(-friend_id)

head(final)

##   twitter_top_user     friend_name user_gender friend_gender
## 1         hrbrmstr nicoleradziwill           M             F
## 2         hrbrmstr        kara_woo           M             F
## 3         hrbrmstr      juliasilge           M             F
## 4         hrbrmstr        noamross           M             M
## 5         hrbrmstr      JennyBryan           M             F
## 6         hrbrmstr     thosjleeper           M             M

f1 <- graph_from_data_frame(final, directed = TRUE, vertices = NULL)
V(f1)$Popularity <- degree(f1, mode = 'in')

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') 

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  geom_node_text(aes(label = name, fontface='bold'), 
                 color = 'white', size = 3) +
  theme_graph(background = 'dimgray', text_colour = 'white',title_size = 30) 

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') +
  geom_edge_link(aes(colour = friend_gender)) +
  scale_edge_color_brewer(palette = 'Set1') + 
  labs(title='Top 30 #rstats users and gender of their friends')

ggraph(f1, layout='kk') + 
  geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(size = Popularity)) +
  theme_graph( fg_text_colour = 'black') +
  facet_edges(~user_gender) +
  geom_edge_link(aes(colour = friend_gender)) +
  scale_edge_color_brewer(palette = 'Set1') +
  labs(title='Top 30 #rstats users and gender of their friends', subtitle='Graphs are separated by top user gender, edge colour indicates their friend gender' )

final %>% 
  group_by(user_gender, friend_gender) %>% 
  summarize(n = n()) %>% 
  group_by(user_gender) %>% 
  mutate(sum = sum(n),
         percent = round(n/sum, 2)) 

## # A tibble: 4 x 5
## # Groups:   user_gender [2]
##   user_gender friend_gender     n   sum percent
##         <chr>         <chr> <int> <int>   <dbl>
## 1           F             F    26    57    0.46
## 2           F             M    31    57    0.54
## 3           M             F    55   101    0.54
## 4           M             M    46   101    0.46

# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(tidytext)
library(plyr)
library(tidyverse)
library(wordcloud)

# OPTIONS
options(stringsAsFactors = F, # do not convert upon loading
        scipen = 999, # do not convert numbers to e-values
        max.print = 200) # stop printing after 200 values

# VIZUALIZATION SETTINGS
theme_set(theme_light()) # set default ggplot theme to light
fs = 12 # default plot font size
# LOAD IN BOOK CHAPTERS
# TRANSFORM TO TOKENIZED DATASET
hp_words <- list(
 philosophers_stone = philosophers_stone,
 chamber_of_secrets = chamber_of_secrets,
 prisoner_of_azkaban = prisoner_of_azkaban,
 goblet_of_fire = goblet_of_fire,
 order_of_the_phoenix = order_of_the_phoenix,
 half_blood_prince = half_blood_prince,
 deathly_hallows = deathly_hallows
) %>%
 ldply(rbind) %>% # bind all chapter text to dataframe columns
 mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
 select(-.id) %>% # remove ID column
 gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
 filter(!is.na(text)) %>% # delete the rows/chapters without text
 mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
 unnest_tokens(word, text, token = 'words') # tokenize data frame
# EXAMINE FIRST AND LAST WORDS OF SAGA
hp_words %>% head()
##                   book chapter  word
## 1   philosophers_stone       1   the
## 1.1 philosophers_stone       1   boy
## 1.2 philosophers_stone       1   who
## 1.3 philosophers_stone       1 lived
## 1.4 philosophers_stone       1    mr
## 1.5 philosophers_stone       1   and
# PLOT WORD FREQUENCY PER BOOK
hp_words %>%
  group_by(book, word) %>%
  anti_join(stop_words, by = "word") %>% # delete stopwords
  count() %>% # summarize count per word per book
  arrange(desc(n)) %>% # highest freq on top
  group_by(book) %>% # 
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # retain top 15 frequent words
  # create barplot
  ggplot(aes(x = -top, fill = book)) + 
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # get rid of legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1, size = fs/1.5), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add labels
       title = "Harry Plotter: Most frequent words throughout the saga") +
  facet_grid(. ~ book) + # separate plot for each book
  coord_flip() # flip axes
# EXTRACT SENTIMENT WITH THREE DICTIONARIES
hp_senti <- bind_rows(
  # 1 AFINN 
  hp_words %>% 
    inner_join(get_sentiments("afinn"), by = "word") %>%
    filter(score != 0) %>% # delete neutral words
    mutate(sentiment = ifelse(score < 0, 'negative', 'positive')) %>% # identify sentiment
    mutate(score = sqrt(score ^ 2)) %>% # all scores to positive
    group_by(book, chapter, sentiment) %>% 
    mutate(dictionary = 'afinn'), # create dictionary identifier
  # 2 BING 
  hp_words %>% 
    inner_join(get_sentiments("bing"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'bing'), # create dictionary identifier
  # 3 NRC 
  hp_words %>% 
    inner_join(get_sentiments("nrc"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'nrc') # create dictionary identifier
)

# EXAMINE FIRST SENTIMENT WORDS
hp_senti %>% head()
## # A tibble: 6 x 6
## # Groups:   book, chapter, sentiment [2]
##                 book chapter      word score sentiment dictionary
##                                   
## 1 philosophers_stone       1     proud     2  positive      afinn
## 2 philosophers_stone       1 perfectly     3  positive      afinn
## 3 philosophers_stone       1     thank     2  positive      afinn
## 4 philosophers_stone       1   strange     1  negative      afinn
## 5 philosophers_stone       1  nonsense     2  negative      afinn
## 6 philosophers_stone       1       big     1  positive      afinn
hp_senti %>%
  group_by(word) %>%
  count() %>% # summarize count per word
  mutate(log_n = sqrt(n)) %>% # take root to decrease outlier impact
  with(wordcloud(word, log_n, max.words = 100))
# DELETE SENTIMENT FOR CHARACTER NAMES
hp_senti_sel <- hp_senti %>% filter(!word %in% c("harry","moody"))
# VIZUALIZE MOST FREQUENT WORDS PER SENTIMENT
hp_senti_sel %>% # NAMES EXCLUDED
  group_by(word, sentiment) %>%
  count() %>% # summarize count per word per sentiment
  group_by(sentiment) %>%
  arrange(sentiment, desc(n)) %>% # most frequent on top
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # keep top 15 frequent words
  ggplot(aes(x = -top, fill = factor(sentiment))) + 
  # create barplot
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add manual labels
       title = "Harry Plotter: Words carrying sentiment as counted throughout the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
  facet_grid(. ~ sentiment) + # separate plot for each sentiment
  coord_flip() # flip axes
# VIZUALIZE POSTIVE/NEGATIVE SENTIMENT OVER TIME
plot_sentiment <- hp_senti_sel %>% # NAMES EXCLUDED
  group_by(dictionary, sentiment, book, chapter) %>%
  summarize(score = sum(score), # summarize AFINN scores
            count = n(), # summarize bing and nrc counts
            # move bing and nrc counts to score 
            score = ifelse(is.na(score), count, score))  %>%
  filter(sentiment %in% c('positive','negative')) %>%   # only retain bipolar sentiment
  mutate(score = ifelse(sentiment == 'negative', -score, score)) %>% # reverse negative values
  # create area plot
  ggplot(aes(x = chapter, y = score)) +    
  geom_area(aes(fill = score > 0),stat = 'identity') +
  scale_fill_manual(values = c('red','green')) + # change colors
  # add black smoothed line without standard error
  geom_smooth(method = "loess", se = F, col = "black") + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Sentiment score", # add labels
       title = "Harry Plotter: Sentiment during the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
     # separate plot per book and dictionary and free up x-axes
  facet_grid(dictionary ~ book, scale = "free_x")
plot_sentiment
plot_sentiment + coord_cartesian(ylim = c(-100,50)) # zoom in plot
# VIZUALIZE EMOTIONAL SENTIMENT OVER TIME
hp_senti_sel %>% # NAMES EXCLUDED 
  filter(!sentiment %in% c('negative','positive')) %>% # only retain other sentiments (nrc)
  group_by(sentiment, book, chapter) %>%
  count() %>% # summarize count
  # create area plot
  ggplot(aes(x = chapter, y = n)) +
  geom_area(aes(fill = sentiment), stat = 'identity') + 
  # add black smoothing line without standard error
  geom_smooth(aes(fill = sentiment), method = "loess", se = F, col = 'black') + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Emotion score", # add labels
       title = "Harry Plotter: Emotions during the saga",
       subtitle = "Using tidytext and the nrc sentiment dictionary") +
  # separate plots per sentiment and book and free up x-axes
  facet_grid(sentiment ~ book, scale = "free_x") 

This is reposted from DavisVaughan.com with minor modifications.

Introduction

Required packages

Extract the HTML from the RStudio blog archive

HTML from each blog post

Meta information

Categories and tags

The blog post itself

Who writes the most posts?

Tidytext

Remove stop words

Top 15 words overall

Is Hadley more “pleased” than everyone else?

Is there a statistical difference here?

Test conclusion

About the author

Share this:

The following was reposted from minimaxir.com

QUICK INTRODUCTION TO GGPLOT2

HOW TO SAVE A GGPLOT2 CHART FOR WEB

FANCY FONTS

THE “GGPLOT2 COLORS”

RCOLORBREWER

VIRIDIS AND ACCESSIBILITY

NEXT STEPS

Share this:

Reposted from Variance Explained with minor modifications. Note this post was written in 2016, a follow-up was posted in 2017.

The dataset

Comparison of words

Sentiment analysis: Trump’s tweets are much more negative than his campaign’s

Conclusion: the ghost in the political machine

About the author:

Follow this link to the 2017 sequel to this article.

Share this:

Reposted from Kasia Kulma’s github with minor modifications.

IMPORTING #RSTATS USERS

SCORING AND CHOOSING TOP #RSTATS USERS

GETTING FRIENDS NETWORK

VISUALIZING FRIENDS NETWORKS

About the author:

Share this:

Table of Contents (clickable)

Introductory R

Introductory Books

Online Courses

Style Guides

Advanced R

Package Development

Non-standard Evaluation

Functional Programming

Cheat Sheets

Data Manipulation

Data Visualization

Colors

Interactive / HTML / JavaScript widgets

ggplot2

ggplot2 extensions

Miscellaneous

Shiny, Dashboards, & Apps

Markdown & Other Output Formats

Cloud, Server, & Database

Statistical Modeling & Machine Learning

Books

Courses

Cheat sheets

Time series

Survival analysis

Bayesian

Miscellaneous

Natural Language Processing & Text Mining

Regular Expressions

Geographic & Spatial mapping

Bioinformatics & Computational Biology

Integrated Development Environments (IDEs) & Graphical User Inferfaces (GUIs)

R & other software and languages

R & Excel

R & Python

R & SQL

R Help, Connect, & Inspiration

R Blogs

Reposted from Variance Explained with minor modifications.
Note this post was written in 2016, a follow-up was posted in 2017.

Integrated Development Environments (IDEs) &
Graphical User Inferfaces (GUIs)