Tag: tidyverse

R resources (free courses, books, tutorials, & cheat sheets)

Help yourself to these free books, tutorials, packages, cheat sheets, and many more materials for R programming. There’s a separate overview for handy R programming tricks. If you have additions, please comment below or contact me!

Join 385 other subscribers

LAST UPDATED: 2021-09-24

Table of Contents (clickable)

Beginner
Advanced
Cheat sheets
Data manipulation
Data visualization
Dashboards & Shiny
Markdown
Database connections
Machine learning
Text mining
Geospatial analysis
Bioinformatics
R IDEs
Software & language connections
Help
Blogs
Conferences, Events, & Groups
Jobs
Other tips & tricks

Completely new to R? → Start learning here!

Introductory R

Introductory Books

Online Courses

Youtube R classes by Chris Bilder
37 Youtube R Tutorials by Flavio Azevedo***
Essential R tutorials by Gilad Feldman
Data Carpentry Social Science in R
Statistics and R, by Rafael Irizarry and Michael Love
Learn R via R-coder.com

Style Guides

Google’s R style guide
Tidyverse style guide by Hadley Wickham
Advanced R style guide by Hadley Wickham
R style guide for stat405 by Hadley Wickham
R style guide by Collin Gillespie
Best practices for R Coding by Arnaud Amsellem / The R Trader
The State of Naming Conventions in R (Bååth, 2012)
A guide for switching from base R to the tidyverse

BACK TO TABLE OF CONTENTS

Advanced R

Package Development

Mastering Software Development in R (Peng, Kross, & Anderson, 2017)
R Packages (Wickham & Bryan, ???)
rOpenSci Packages: Development, Maintenance, and Peer Review
How to develop good R packages (for open science) by Maëlle Salmon
Tutorial on creating R packages by Friedrich Leisch
Developing R Packages by Jeff Leek
Writing an R package from scratch by Hilary Parker
Write your own R package by STAT545
Making an R Package, by R.M. Ripley
Prepare your package for CRAN
Introduction to roxygen2 by Hadley Wickham
How to build package vignettes with knitr by Yihui Xie
knitr in a nutshell: a minimal tutorial by Karl Broman
Rtools: Building R for Windows by Brian Ripley, Duncan Murdoch, and Jeroen Ooms
devtools – tools to make an R developer’s life easier
roxygen2 – tools for describing functions in comments next to their definitions
Rd2roxygen – tools for converting Rd to roxygen documentation
testthat – tools that simplify the testing of R packages

Non-standard Evaluation

Functional Programming

Writing Functions in R by Hadley Wickham via DataCamp.com
R for Data Science chapters on Functions and Iteration
(Grolemund & Wickham, 2018)***
Advanced R chapter on Functions (Wickham, 2014)
Lesson on writing, testing, and documenting custom functions by Software-Carpentry.org
User-defined R fuctions tutorial by Carlo Fanara via DataCamp.com
Functional programming lecture by Duke University
purrr tutorial by Jenny Bryan***
Intro to purrr tutorial by Emorie Beck
Learn purrr tutorial by Dan Ovando
purrr cheat sheet by RStudio

BACK TO TABLE OF CONTENTS

Cheat Sheets

Many of the above cheat sheets are hosted in the official RStudio cheat sheet overview.

Data Manipulation

Data Visualization

Colors

R Color Guide***
colourpicker – widget that allows users to choose colours
paletteer – comprehensive collection of color palettes in R***
ggplot2 colour guide***
Canva’s 100 color palette included in ggthemes::scale_color_canva
Wes Anderson color palettes
Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
Picular.co – Google, but for colors

Interactive / HTML / JavaScript widgets

R HTML Widgets Gallery***
plotly – interactive plots
billboarder – easy interface to billboard.js, a JavaScript chart library based on D3
d3heatmap – interactive D3 heatmaps
altair – Vega-Lite visualizations via Python
DT – interactive tables
DiagrammeR – interactive diagrams (DiagrammeR cheat sheet)
dygraphs – interactive time series plots
formattable – formattable data structures
ggvis – interactive ggplot2
highcharter – interactive Highcharts plots
leaflet – interactive maps
metricsgraphics – interactive JavaScript bare-bones line, scatterplot and bar charts
networkD3 – interative D3 network graphs
scatterD3 – interactive scatterplots with D3
rbokeh – interactive Bokeh plots
rCharts – interactive Javascript charts
rcdimple – interactive JavaScript bar charts and others
rglwidget – interactive 3d plots
threejs – interactive 3d plots and globes
visNetwork – interactive network graphs
wordcloud2 – interface to wordcloud2.js.
timevis – interactive timelines

ggplot2

Code examples of top-50 ggplot2 visualizations***
ggplot2 Cheatsheet by RStudio
ggplot2 Quick Reference Guide
ggplot2 Code Snippets
ggplot2 Code Snippets 2
Hitchhiker’s Guide to ggplot2 in R (Burchell & Vargas, 2016)
A practical introduction with R and ggplot2 (Healy, 2017)
Data Vizualization: A practical introduction (Healy, 2018)
Complete ggplot2 Tutorial
Principles & Practice of Data Visualization CS631 at Oregon Health & Science University
Data visualization cheat sheet by RStudio with ggplot2
Setting custom ggplot themes with ggthemr
Creating custom, reproducible color palettes by Simon Jackson
Rearranging values within ggplot2 facets
Combine plots using patchwork or cowplot
equisse – RStudio addin to interactively explore data with ggplot2 without coding

ggplot2 extensions

ggplot2 extensions overview***
ggthemes – plot style themes
hrbrthemes – opinionated, typographic-centric themes
ggmap – maps with Google Maps, Open Street Maps, etc.
ggiraph – interactive ggplots
gghighight – highlight lines or values, see vignette
ggstance – horizontal versions of common plots
GGally – scatterplot matrices
ggalt – additional coordinate systems, geoms, etc.
ggbeeswarm – column scatter plots or voilin scatter plots
ggforce – additional geoms, see visual guide
ggrepel – prevent plot labels from overlapping
ggraph – graphs, networks, trees and more
ggpmisc – photo-biology related extensions
geomnet – network visualization
ggExtra – marginal histograms for a plot
gganimate – animations, see also the gganimate wiki page
ggpage – pagestyled visualizations of text based data
ggpmisc – useful additional geom_* and stat_* functions
ggstatsplot – include details from statistical tests in plots
ggspectra – tools for plotting light spectra
ggnetwork – geoms to plot networks
ggpoindensity – cross between a scatter plot and a 2D density plot
ggradar – radar charts
ggsurvplot (survminer) – survival curves
ggseas – seasonal adjustment tools
ggthreed – (evil) 3D geoms
ggtech – style themes for plots
ggtern – ternary diagrams
ggTimeSeries – time series visualizations
ggtree – tree visualizations
treemapify – wilcox’s treemaps
seewave – spectograms

Miscellaneous

coefplot – visualizes model statistics
circlize – circular visualizations for categorical data
clustree – visualize clustering analysis
quantmod – candlestick financial charts
dabestr– Data Analysis using Bootstrap-Coupled ESTimation
devoutsvg – an SVG graphics device (with pattern fills)
devoutpdf – an PDF graphics device
cartography – create and integrate maps in your R workflow
colorspace – HSL based color palettes
viridis – Matplotlib viridis color pallete for R
munsell – Munsell color palettes for R
Cairo – high-quality display output
igraph – Network Analysis and Visualization
graphlayouts – new layout algorithms for network visualization
lattice – Trellis graphics
tmap – thematic maps
trelliscopejs – interactive alternative for facet_wrap
rgl – interactive 3D plots
corrplot – graphical display of a correlation matrix
googleVis – Google Charts API
plotROC – interactive ROC plots
extrafont – fonts in R graphics
rvg – produces Vector Graphics that allow further editing in PowerPoint or Excel
showtext – text using system fonts
animation – animated graphics using ImageMagick.
misc3d – 3d plots, isosurfaces, etc.
xkcd – xkcd style graphics
imager – CImg library to work with images
ungeviz – tools for visualize uncertainty
waffle – square pie charts a.k.a. waffle charts
Creating spectograms in R with hht, warbleR, soundgen, signal, seewave, or phonTools

BACK TO TABLE OF CONTENTS

Shiny, Dashboards, & Apps

Shiny Cheat Sheet by RStudio
Shiny Tutorial
A collection of links to Shiny applications that have been shared on Twitter.
Enterprise-ready dashboards with Shiny and databases
Several packages to upgrade your Shiny dashboards
More Shiny Resources by Rob Gilmore
More Shiny Resources for Statistics by Yingjie Hu
Building Shiny apps – an interactive tutorial by Dean Attali
Advanced Shiny tips & tricks by Dean Attali (version 2)
flexdashboard – dashboard creation simplified
colourpicker – widget that allows users to choose colours
brighter – toolbox with helpful functions for shiny development
DesktopDeployR – self-contained R-based desktop applications

Markdown & Other Output Formats

R Markdown cheat sheet by RStudio
R Markdown reference guide by RStudio
R Markdown Basics
R Markdown tutorial by RStudio
R Markdown gallery by RStudio
The knitr book (Xie, 2015)
Getting used to R, RStudio, and R Markdown (2016)
R Markdown: The Definitive Guide (Xie, Allaire, & Grolemund, 2018)
Introduction to R Markdown (Clark, 2018)
R Markdown for Scientists (Tierney, 2019)
R Markdown Tips and Tricks
Pimp my RMD by Holtz Yan
Pandoc syntax highlighting examples by Garrick Aden-Buie
Creating slides with R Markdown (Video) by Brian Caffo
Introduction to xaringan by Yihui Xie
A quick demonstration of xarigan
General Markdown cheat sheet
blogdown websites with R Markdown (Xie, Thomas, & Hill, 2018)
blogdown tutorials
How to build a website with blogdown in R, by Storybench
radix – online publication format designed for scientific and technical communication
A template RStudio project with data analysis and manuscript writing by Thomas Julou
Multiple reports from a single Markdown file (example 1) (example2)

tidystats – automating updating of model statistics
papaja – preparing APA journal articles
blogdown – build websites with Markdown & Hugo
huxtable – create Excel, html, & LaTeX tables
xaringan – make slideshows via remark.js and markdown
summarytools – produces neat, quick data summary tables
citr – RStudio Addin to Insert Markdown Citations

Cloud, Server, & Database

Access and manage Google spreadsheets from R with googlesheets
Tutorial: Database Queries with R
Introduction to sparklyr by DataCamp
Running R on AWS
AWS EC2 Tutorial For Beginners
Using RStudio on Amazon EC2 under the Free Usage Tier
Getting started with databases using R, by RStudio
- RMySQL – connects to MySQL and MariaDB
- RPostgreSQL – connects to Postgres and Redshift.
- RSQLite – embeds a SQLite database.
- odbc – connects to many commercial databases via the open database connectivity protocol.
- bigrquery – connects to Google’s BigQuery.
- DBI – separates the connectivity to the DBMS into a “front-end” and a “back-end”.
- dbplot – leverages dplyr to process calculations of plot inside database
- dplyr – also works with remote on-disk data stored in databases
- tidypredict – run predictions inside the database

BACK TO TABLE OF CONTENTS

Statistical Modeling & Machine Learning

Books

Courses

Introduction to Statistical Learning*** at Stanford University by Trevor Hastie and Rob Tibshirani
Introduction to R for Data Science @Microsoft
Introduction to R for Data Science @FutureLearn by Hadley Wickham
PSY2002: Advanced Statistics at University of Toronto by Elizabeth Page-Gould
STAT 450/870: Regression Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 850: Computing Tools for Statisticians at University of Nebraska-Lincoln by Chris Bilder
STAT 873: Applied Multivariate Statistical Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 875: Categorical Data Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 950: Computational Statistics at University of Nebraska-Lincoln by Chris Bilder
Joint Statistical Meetings: Analysis of Categorical Data by Chris Bilder

Cheat sheets

Time series

CRAN Task View – TimeSeries
R xts cheat sheet
Forecasting: Principles and Practice (Hyndman & Athanasopoulos, 2017)
A little book of R for time series (tutorial)
ARIMA forecasting in R (6-part Youtube series)
Introduction to the tsfeatures package
Tutorials: Part 1, Part 2, Part 3, & Part 4 of tidy time series @Business-Science.io with tidyquant
Packages:
- xts – extensible time series
- tsfeatures – methods for extracting various features from time series data
- tidyquant – tidyverse-style financial analysis

Survival analysis

CRAN Task View – Survival
R survival analysis cheat sheet by Przemysław Biecek
Packages:
- survival – functionality for survival and hazard models
- ggsurvplot (survminer) – survival curves

Bayesian

Miscellaneous

corrr – easier correlation matrix management and exploration

BACK TO TABLE OF CONTENTS

Natural Language Processing & Text Mining

Text Mining Tutorial with tm
Tidy Text Mining (Silges & Robinson, 2017) with tidytext
Text Analysis with R for Students of Literature (Jockers, 2014)
Tidytext tutorials by computational journalism
21 Recipes for Mining Twitter Data (Rudis, 2017) with rtweet
Emil Hvitfeldt’s R-text-data GitHub repository
Course: Introduction to Text Analytics with R @DataScienceDojo
Course: Twitter Text Mining and Social Network Analysis (Zhoa, 2016) @RDataMining with twitteR
Quantitative Analysis of Textual Data with quanteda cheat sheet by Stefan Müller and Kenneth Benoit
List of resources for NLP & Text Mining by Stephen Thomas
Packages — for an overview: CRAN Task View – Natural Language Processing:
- tm – text mining.
- tidytext – text mining using tidyverse principles
- quanteda – framework for quantitative text analysis
- gutenbergr – public domain works (free books to practice on)
- corpora – statistics and data sets for corpus frequency data.
- tau – Text Analysis Utilities
- Sentiment140 – headache-free sentiment analysis
- sentimentr – sentiment analysis using text polarity
- openNLP – sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, named-entity detector, and maximum entropy models with OpenNLP.
- cleanNLP – natural language processing via tidy data models
- RSentiment – English lexicon-based sentiment analysis with negation and sarcasm detection functionalities.
- RWeka – data mining tasks with Weka
- wordnet – a large lexical database of English with WordNet .
- stringi – language processing wrappers
- textcat – provides support for n-gram based text categorization.
- text2vec – text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities.
- lsa – Latent Semantic Analysis
- topicmodels -Latent Dirichlet Allocation (LDA) and Correlated Topics Models (CTM)
- lda -Latent Dirichlet Allocation and related models

Regular Expressions

R Regular Expression cheat sheet by Lise Vaudor
R Regular Expression cheat sheet
R Regular Expression cheat sheet (page 2) by RStudio
regexplain – interactive RStudio addin for regular expressions
Regular Expressions in R – Part 1: Introduction and base R functions
R Regular Expressions by Jon M. Calder in swirl()
R Regular Expression Video Tutorial by Roger Peng
General Regular Expression cheat sheet
General Regular Expression Video Tutorial by Roger Peng
General Regular Expression cheat sheet by OverAPI.com

BACK TO TABLE OF CONTENTS

Geographic & Spatial mapping

Making Maps with R (tutorial) with ggmaps, maps, and mapdata
Importing OpenStreetMap data (tutorial) with osmar
Geocomputation with R (Lovelace, Nowosad, & Muenchow, 2018)
Spatial manipulation with Simple Features (sf) cheat sheet by Ryan Garnett

Bioinformatics & Computational Biology

BACK TO TABLE OF CONTENTS

Integrated Development Environments (IDEs) &
Graphical User Inferfaces (GUIs)

Descriptions mostly taken from their own websites:

RStudio*** – Open source and enterprise ready professional software
Jupyter Notebook*** – open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text across dozens of programming languages.
Microsoft R tools for Visual Studio – turn Visual Studio into a powerful R IDE
R Plugins for Vim, Emax, and Atom editors
Rattle*** – GUI for data mining
equisse – RStudio add-in to interactively explore and visualize data
R Analytic Flow – data flow diagram-based IDE
RKWard – easy to use and easily extensible IDE and GUI
Eclipse StatET – Eclipse-based IDE
OpenAnalytics Architect – Eclipse-based IDE
TinnR – open source GUI and IDE
DisplayR – cloud-based GUI
BlueSkyStatistics – GUI designed to look like SPSS and SAS
ducer – GUI for everyone
R commander (Rcmdr) – easy and intuitive GUI
JGR – Java-based GUI for R
jamovi & jmv – free and open statistical software to bridge the gap between researcher and statistician
Exploratory.io – cloud-based data science focused GUI
Stagraph – GUI for ggplot2 that allows you to visualize and connect to databases and/or basic file types
ggraptr – GUI for visualization (Rapid And Pretty Things in R)
ML Studio – interactive Shiny platform for data visualization, statistical modeling and machine learning

R & other software and languages

R & Excel

BERT – Basic Excel R Toolkit
A Comprehensive Guide to Transitioning from Excel to R by Alyssa Columbus
readxl – package to load in Excel data
xlsx – package to read and write Excel data
rvg – produces Vector Graphics which can be modified in Excel
devoutpdf – an PDF graphics device
tidyxl – imports non-tabular (e.g., format) data from Excel files into R
unpivotr – unpivot complex and irregular data layouts in R
unheadr – handle data with embedded subheaders

R & Python

Python for R users
reticulate cheat sheet by RStudio
reticulate – tools for interoperability between Python and R

R & SQL

sqldf – running SQL statements on R data frames

BACK TO TABLE OF CONTENTS

Join 385 other subscribers

R Help, Connect, & Inspiration

RStudio Community
R help mailing list
R seek – search engine for R-related websites
R site search – search engine for help files, manuals, and mailing lists
Nabble – mailing list archive and forum
R User Groups & Conferences
R for Data Science Online Learning Community
Stack Overflow – a FAQ for all your R struggles (programming)
Cross Validated – a FAQ for all your R struggles (statistics)
CRAN Task Views – discover new packages per topic
The R Journal – open access, refereed journal of R
Twitter: #rstats, RStudio, Hadley Wickham, Yihui Xie, Mara Averick, Julia Silge, Jenny Bryan, David Smith, Hilary Parker, R-bloggers
Facebook: R Users Psychology
Youtube: Ben Lambert, Roger Peng
Reddit: rstats, rstudio, statistics, machinelearning, dataisbeautiful

R Blogs

R Conferences, Events, & Meetups

R Jobs

BACK TO TABLE OF CONTENTS

Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R

It has been twenty years since the first Harry Potter novel, the sorcerer’s/philosopher’s stone, was published. To honour the series, I started a text analysis and visualization project, which my other-half wittily dubbed Harry Plotter. In several blogs, I intend to demonstrate how Hadley Wickham’s tidyverse and packages that build on its principles, such as tidytext (free book), have taken programming in R to an all-new level. Moreover, I just enjoy making pretty graphs : )

In this first blog (easier read), we will look at the sentiment throughout the books. In a second blog, we have examined the stereotypes behind the Hogwarts houses.

Setup

First, we need to set up our environment in RStudio. We will be needing several packages for our analyses. Most importantly, Bradley Boehmke was nice enough to gather all Harry Potter books in his harrypotter package on GitHub. We need devtools to install that package the first time, but from then on can load it in normally. Next, we load the tidytext package, which automates and tidies a lot of the text mining functionalities. We also need plyr for a specific function (ldply()). Other tidyverse packages we can load in a single bundle, including ggplot2, dplyr, and tidyr, which I use in almost every of my projects. Finally, we load the wordcloud visualization package which draws on tm.

After loading these packages, I set some additional default options.

# LOAD IN PACKAGES
# library(devtools)
# devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(tidytext)
library(plyr)
library(tidyverse)
library(wordcloud)

# OPTIONS
options(stringsAsFactors = F, # do not convert upon loading
        scipen = 999, # do not convert numbers to e-values
        max.print = 200) # stop printing after 200 values

# VIZUALIZATION SETTINGS
theme_set(theme_light()) # set default ggplot theme to light
fs = 12 # default plot font size

Data preparation

With RStudio set, its time to the text of each book from the harrypotter package which we then “pipe” (%>% – another magical function from the tidyverse – specifically magrittr) along to bind all objects into a single dataframe. Here, each row represents a book with the text for each chapter stored in a separate columns. We want tidy data, so we use tidyr’s gather() function to turn each column into grouped rows. With tidytext’s unnest_tokens() function we can separate the tokens (in this case, single words) from these chapters.

# LOAD IN BOOK CHAPTERS
# TRANSFORM TO TOKENIZED DATASET
hp_words <- list(
 philosophers_stone = philosophers_stone,
 chamber_of_secrets = chamber_of_secrets,
 prisoner_of_azkaban = prisoner_of_azkaban,
 goblet_of_fire = goblet_of_fire,
 order_of_the_phoenix = order_of_the_phoenix,
 half_blood_prince = half_blood_prince,
 deathly_hallows = deathly_hallows
) %>%
 ldply(rbind) %>% # bind all chapter text to dataframe columns
 mutate(book = factor(seq_along(.id), labels = .id)) %>% # identify associated book
 select(-.id) %>% # remove ID column
 gather(key = 'chapter', value = 'text', -book) %>% # gather chapter columns to rows
 filter(!is.na(text)) %>% # delete the rows/chapters without text
 mutate(chapter = as.integer(chapter)) %>% # chapter id to numeric
 unnest_tokens(word, text, token = 'words') # tokenize data frame

Let’s inspect our current data format with head(), which prints the first rows (default n = 6).

# EXAMINE FIRST AND LAST WORDS OF SAGA
hp_words %>% head()

##                   book chapter  word
## 1   philosophers_stone       1   the
## 1.1 philosophers_stone       1   boy
## 1.2 philosophers_stone       1   who
## 1.3 philosophers_stone       1 lived
## 1.4 philosophers_stone       1    mr
## 1.5 philosophers_stone       1   and

Word frequency

A next step would be to examine word frequencies.

# PLOT WORD FREQUENCY PER BOOK
hp_words %>%
  group_by(book, word) %>%
  anti_join(stop_words, by = "word") %>% # delete stopwords
  count() %>% # summarize count per word per book
  arrange(desc(n)) %>% # highest freq on top
  group_by(book) %>% # 
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # retain top 15 frequent words
  # create barplot
  ggplot(aes(x = -top, fill = book)) + 
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # get rid of legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1, size = fs/1.5), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add labels
       title = "Harry Plotter: Most frequent words throughout the saga") +
  facet_grid(. ~ book) + # separate plot for each book
  coord_flip() # flip axes

Unsuprisingly, Harry is the most common word in every single book and Ron and Hermione are also present. Dumbledore’s role as an (irresponsible) mentor becomes greater as the storyline progresses. The plot also nicely depicts other key characters:

Lockhart and Dobby in book 2,
Lupin in book 3,
Moody and Crouch in book 4,
Umbridge in book 5,
Ginny in book 6,
and the final confrontation with He who must not be named in book 7.

Finally, why does J.K. seem obsessively writing about eyes that look at doors?

Estimating sentiment

Next, we turn to the sentiment of the text. tidytext includes three famous sentiment dictionaries:

AFINN: including bipolar sentiment scores ranging from -5 to 5
bing: including bipolar sentiment scores
nrc: including sentiment scores for many different emotions (e.g., anger, joy, and surprise)

The following script identifies all words that occur both in the books and the dictionaries and combines them into a long dataframe:

# EXTRACT SENTIMENT WITH THREE DICTIONARIES
hp_senti <- bind_rows(
  # 1 AFINN 
  hp_words %>% 
    inner_join(get_sentiments("afinn"), by = "word") %>%
    filter(score != 0) %>% # delete neutral words
    mutate(sentiment = ifelse(score < 0, 'negative', 'positive')) %>% # identify sentiment
    mutate(score = sqrt(score ^ 2)) %>% # all scores to positive
    group_by(book, chapter, sentiment) %>% 
    mutate(dictionary = 'afinn'), # create dictionary identifier
  # 2 BING 
  hp_words %>% 
    inner_join(get_sentiments("bing"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'bing'), # create dictionary identifier
  # 3 NRC 
  hp_words %>% 
    inner_join(get_sentiments("nrc"), by = "word") %>%
    group_by(book, chapter, sentiment) %>%
    mutate(dictionary = 'nrc') # create dictionary identifier
)

# EXAMINE FIRST SENTIMENT WORDS
hp_senti %>% head()

## # A tibble: 6 x 6
## # Groups:   book, chapter, sentiment [2]
##                 book chapter      word score sentiment dictionary
##                                   
## 1 philosophers_stone       1     proud     2  positive      afinn
## 2 philosophers_stone       1 perfectly     3  positive      afinn
## 3 philosophers_stone       1     thank     2  positive      afinn
## 4 philosophers_stone       1   strange     1  negative      afinn
## 5 philosophers_stone       1  nonsense     2  negative      afinn
## 6 philosophers_stone       1       big     1  positive      afinn

Wordcloud

Although wordclouds are not my favorite visualizations, they do allow for a quick display of frequencies among a large body of words.

hp_senti %>%
  group_by(word) %>%
  count() %>% # summarize count per word
  mutate(log_n = sqrt(n)) %>% # take root to decrease outlier impact
  with(wordcloud(word, log_n, max.words = 100))

It appears we need to correct for some words that occur in the sentiment dictionaries but have a different meaning in J.K. Rowling’s books. Most importantly, we need to filter two character names.

# DELETE SENTIMENT FOR CHARACTER NAMES
hp_senti_sel <- hp_senti %>% filter(!word %in% c("harry","moody"))

Words per sentiment

Let’s quickly sketch the remaining words per sentiment.

# VIZUALIZE MOST FREQUENT WORDS PER SENTIMENT
hp_senti_sel %>% # NAMES EXCLUDED
  group_by(word, sentiment) %>%
  count() %>% # summarize count per word per sentiment
  group_by(sentiment) %>%
  arrange(sentiment, desc(n)) %>% # most frequent on top
  mutate(top = seq_along(word)) %>% # identify rank within group
  filter(top <= 15) %>% # keep top 15 frequent words
  ggplot(aes(x = -top, fill = factor(sentiment))) + 
  # create barplot
  geom_bar(aes(y = n), stat = 'identity', col = 'black') +
  # make sure words are printed either in or next to bar
  geom_text(aes(y = ifelse(n > max(n) / 2, max(n) / 50, n + max(n) / 50),
                label = word), size = fs/3, hjust = "left") +
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs), # determine fontsize
        axis.text.x = element_text(angle = 45, hjust = 1), # rotate x text
        axis.ticks.y = element_blank(), # remove y ticks
        axis.text.y = element_blank()) + # remove y text
  labs(y = "Word count", x = "", # add manual labels
       title = "Harry Plotter: Words carrying sentiment as counted throughout the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
  facet_grid(. ~ sentiment) + # separate plot for each sentiment
  coord_flip() # flip axes

This seems ok. Let’s continue to plot the sentiment over time.

Positive and negative sentiment throughout the series

As positive and negative sentiment is included in each of the three dictionaries we can to compare and contrast scores.

# VIZUALIZE POSTIVE/NEGATIVE SENTIMENT OVER TIME
plot_sentiment <- hp_senti_sel %>% # NAMES EXCLUDED
  group_by(dictionary, sentiment, book, chapter) %>%
  summarize(score = sum(score), # summarize AFINN scores
            count = n(), # summarize bing and nrc counts
            # move bing and nrc counts to score 
            score = ifelse(is.na(score), count, score))  %>%
  filter(sentiment %in% c('positive','negative')) %>%   # only retain bipolar sentiment
  mutate(score = ifelse(sentiment == 'negative', -score, score)) %>% # reverse negative values
  # create area plot
  ggplot(aes(x = chapter, y = score)) +    
  geom_area(aes(fill = score > 0),stat = 'identity') +
  scale_fill_manual(values = c('red','green')) + # change colors
  # add black smoothed line without standard error
  geom_smooth(method = "loess", se = F, col = "black") + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Sentiment score", # add labels
       title = "Harry Plotter: Sentiment during the saga",
       subtitle = "Using tidytext and the AFINN, bing, and nrc sentiment dictionaries") +
     # separate plot per book and dictionary and free up x-axes
  facet_grid(dictionary ~ book, scale = "free_x")
plot_sentiment

Let’s zoom in on the smoothed average.

plot_sentiment + coord_cartesian(ylim = c(-100,50)) # zoom in plot

Sentiment seems overly negative throughout the series. Particularly salient is that every book ends on a down note, except the Prisoner of Azkaban. Moreover, sentiment becomes more volatile in books four through six. These start out negative, brighten up in the middle, just to end in misery again. In her final book, J.K. Rowling depicts a world about to be conquered by the Dark Lord and the average negative sentiment clearly resembles this grim outlook.

The bing sentiment dictionary estimates the most negative sentiment on average, but that might be due to this specific text.

Other emotions throughout the series

Finally, let’s look at the other emotions that are included in the nrc dictionary.

# VIZUALIZE EMOTIONAL SENTIMENT OVER TIME
hp_senti_sel %>% # NAMES EXCLUDED 
  filter(!sentiment %in% c('negative','positive')) %>% # only retain other sentiments (nrc)
  group_by(sentiment, book, chapter) %>%
  count() %>% # summarize count
  # create area plot
  ggplot(aes(x = chapter, y = n)) +
  geom_area(aes(fill = sentiment), stat = 'identity') + 
  # add black smoothing line without standard error
  geom_smooth(aes(fill = sentiment), method = "loess", se = F, col = 'black') + 
  theme(legend.position = 'none', # remove legend
        text = element_text(size = fs)) + # change font size
  labs(x = "Chapter", y = "Emotion score", # add labels
       title = "Harry Plotter: Emotions during the saga",
       subtitle = "Using tidytext and the nrc sentiment dictionary") +
  # separate plots per sentiment and book and free up x-axes
  facet_grid(sentiment ~ book, scale = "free_x")

This plot is less insightful as either the eight emotions are represented by similar words or J.K. Rowling combines all in her writing simultaneously. Patterns across emotions are highly similar, evidenced especially by the patterns in the Chamber of Secrets. In a next post, I will examine sentiment in a more detailed fashion, testing the differences over time and between characters statistically. For now, I hope you enjoyed these visualizations. Feel free to come back or subscribe to read my subsequent analyses.

The second blog in the Harry Plotter series examines the stereotypes behind the Hogwarts houses.

Text Mining: Shirin’s Twitter Feed

Text mining and analytics, natural language processing, and topic modelling have definitely become sort of an obsession of mine. I am just amazed by the insights one can retrieve from textual information, and with the ever increasing amounts of unstructured data on the internet, recreational analysts are coming up with the most amazing text mining projects these days.

Only last week, I came across posts talking about how the text in the Game of Thrones books to demonstrate a gender bias, how someone created an entire book with weirdly-satisfying computer-generated poems, and how to conduct a rather impressive analysis of your Twitter following. The latter, I copied below, with all props obviously for Shirin – the author.

In Game of Thrones, women cry, scream, and struggle. Men move, lead, and take. A tidy text analysis of book 1 coming soon. #GoT pic.twitter.com/gMKn1DhkFX

— Julian Winternheimer (@JulHeimer) 15 juli 2017

I just published “Writing a poetry generator” https://t.co/itpKTw2Rkr

— Peter Organisciak (@POrg) 17 juli 2017

For those of you who want to learn more about text mining and, specifically, how to start mining in R with tidytext, an new text-mining complement to the tidyverse, I can strongly recommend the new book by Julia Silge and David Robinson. This book has helped me greatly in learning the basics and you can definitely expect some blogs on my personal text mining projects soon.

===== COPIED FROM SHIRIN’S PLAYGROUND =====

Lately, I have been more and more taken with tidy principles of data analysis. They are elegant and make analyses clearer and easier to comprehend. Following the tidyverse and ggraph, I have been quite intrigued by applying tidy principles to text analysis with Julia Silge and David Robinson’s tidytext.

In this post, I will explore tidytext with an analysis of my Twitter followers’ descriptions to try and learn more about the people who are interested in my tweets, which are mainly about Data Science and Machine Learning.

Resources I found useful for this analysis were http://www.rdatamining.com/docs/twitter-analysis-with-r and http://tidytextmining.com/tidytext.html

Retrieving Twitter data

I am using twitteR to retrieve data from Twitter (I have also tried rtweet but for some reason, my API key, secret and token (that worked with twitteR) resulted in a “failed to authorize” error with rtweet’s functions).

library(twitteR)

Once we have set up our Twitter REST API, we get the necessary information to authenticate our access.

consumerKey = "INSERT KEY HERE"
consumerSecret = "INSERT SECRET KEY HERE"
accessToken = "INSERT TOKEN HERE"
accessSecret = "INSERT SECRET TOKEN HERE"

options(httr_oauth_cache = TRUE)

setup_twitter_oauth(consumer_key = consumerKey, 
                    consumer_secret = consumerSecret, 
                    access_token = accessToken, 
                    access_secret = accessSecret)

Now, we can access information from Twitter, like timeline tweets, user timelines, mentions, tweets & retweets, followers, etc.

All the following datasets were retrieved on June 7th 2017, converted to a data frame for tidy analysis and saved for later use:

the last 3200 tweets on my timeline

my_name <- userTimeline("ShirinGlander", n = 3200, includeRts=TRUE)
my_name_df <- twListToDF(my_name)
save(my_name_df, file = "my_name.RData")

my last 3200 mentions and retweets

my_mentions <- mentions(n = 3200)
my_mentions_df <- twListToDF(my_mentions)
save(my_mentions_df, file = "my_mentions.RData")

my_retweets <- retweetsOfMe(n = 3200)
my_retweets_df <- twListToDF(my_retweets)
save(my_retweets_df, file = "my_retweets.RData")

the last 3200 tweets to me

tweetstome <- searchTwitter("@ShirinGlander", n = 3200)
tweetstome_df <- twListToDF(tweetstome)
save(tweetstome_df, file = "tweetstome.RData")

my friends and followers

user <- getUser("ShirinGlander")

friends <- user$getFriends() # who I follow
friends_df <- twListToDF(friends)
save(friends_df, file = "my_friends.RData")

followers <- user$getFollowers() # my followers
followers_df <- twListToDF(followers)
save(followers_df, file = "my_followers.RData")

Analyzing friends and followers

In this post, I will have a look at my friends and followers.

load("my_friends.RData")
load("my_followers.RData")

I am going to use packages from the tidyverse (tidyquant for plotting).

library(tidyverse)
library(tidyquant)

Number of friends (who I follow on Twitter): 225
Number of followers (who follows me on Twitter): 324
Number of friends who are also followers: 97

What languages do my followers speak?

One of the columns describing my followers is which language they have set for their Twitter account. Not surprisingly, English is by far the most predominant language of my followers, followed by German, Spanish and French.

followers_df %>%
  count(lang) %>%
  droplevels() %>%
  ggplot(aes(x = reorder(lang, desc(n)), y = n)) +
    geom_bar(stat = "identity", color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq() +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "language ISO 639-1 code",
         y = "number of followers")

Who are my most “influential” followers (i.e. followers with the biggest network)?

I also have information about the number of followers that each of my followers have (2nd degree followers). Most of my followers are followed by up to ~ 1000 people, while only a few have a very large network.

followers_df %>%
  ggplot(aes(x = log2(followersCount))) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq() +
    labs(x = "log2 of number of followers",
         y = "density")

How active are my followers (i.e. how often do they tweet)

The followers data frame also tells me how many statuses (i.e. tweets) each of followers have. To make the numbers comparable, I am normalizing them by the number of days that they have had their accounts to calculate the average number of tweets per day.

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  ggplot(aes(x = log2(statusesCount_pDay))) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq()

Who are my followers with the biggest network and who tweet the most?

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  select(screenName, followersCount, statusesCount_pDay) %>%
  arrange(desc(followersCount)) %>%
  top_n(10)

##         screenName followersCount statusesCount_pDay
## 1        dr_morton         150937           71.35193
## 2    Scientists4EU          66117           17.64389
## 3       dr_morton_          63467           46.57763
## 4   NewScienceWrld          60092           54.65874
## 5     RubenRabines          42286           25.99592
## 6  machinelearnbot          27427          204.67061
## 7  BecomingDataSci          16807           25.24069
## 8       joelgombin           6566           21.24094
## 9    renato_umeton           1998           19.58387
## 10 FranPatogenLoco            311           28.92593

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  select(screenName, followersCount, statusesCount_pDay) %>%
  arrange(desc(statusesCount_pDay)) %>%
  top_n(10)

##         screenName followersCount statusesCount_pDay
## 1  machinelearnbot          27427          204.67061
## 2        dr_morton         150937           71.35193
## 3   NewScienceWrld          60092           54.65874
## 4       dr_morton_          63467           46.57763
## 5  FranPatogenLoco            311           28.92593
## 6     RubenRabines          42286           25.99592
## 7  BecomingDataSci          16807           25.24069
## 8       joelgombin           6566           21.24094
## 9    renato_umeton           1998           19.58387
## 10   Scientists4EU          66117           17.64389

Is there a correlation between number of followers and number of tweets?

Indeed, there seems to be a correlation that users with many followers also tend to tweet more often.

followers_df %>%
  mutate(date = as.Date(created, format = "%Y-%m-%d"),
         today = as.Date("2017-06-07", format = "%Y-%m-%d"),
         days = as.numeric(today - date),
         statusesCount_pDay = statusesCount / days) %>%
  ggplot(aes(x = followersCount, y = statusesCount_pDay, color = days)) +
    geom_smooth(method = "lm") +
    geom_point() +
    scale_color_continuous(low = palette_light()[1], high = palette_light()[2]) +
    theme_tq()

Tidy text analysis

Next, I want to know more about my followers by analyzing their Twitter descriptions with the tidytext package.

library(tidytext)
library(SnowballC)

To prepare the data, I am going to unnest the words (or tokens) in the user descriptions, convert them to the word stem, remove stop words and urls.

data(stop_words)

tidy_descr <- followers_df %>%
  unnest_tokens(word, description) %>%
  mutate(word_stem = wordStem(word)) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!grepl("\\.|http", word))

What are the most commonly used words in my followers’ descriptions?

The first question I want to ask is what words are most common in my followers’ descriptions.

Not surprisingly, the most common word is “data”. I do tweet mostly about data related topics, so it makes sense that my followers are mostly likeminded. The rest is also related to data science, machine learning and R.

tidy_descr %>%
  count(word_stem, sort = TRUE) %>%
  filter(n > 20) %>%
  ggplot(aes(x = reorder(word_stem, n), y = n)) +
    geom_col(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "count of word stem in all followers' descriptions")

This, we can also show with a word cloud.

library(wordcloud)
library(tm)

tidy_descr %>%
  count(word_stem) %>%
  mutate(word_stem = removeNumbers(word_stem)) %>%
  with(wordcloud(word_stem, n, max.words = 100, colors = palette_light()))

Instead of looking for the most common words, we can also look for the most common ngrams: here, for the most common word pairs (bigrams) in my followers’ descriptions.

tidy_descr_ngrams <- followers_df %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  filter(!grepl("\\.|http", bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigram_counts <- tidy_descr_ngrams %>%
  count(word1, word2, sort = TRUE)

bigram_counts %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(word1, -n), y = reorder(word2, -n), fill = n)) +
    geom_tile(alpha = 0.8, color = "white") +
    scale_fill_gradientn(colours = c(palette_light()[[1]], palette_light()[[2]])) +
    coord_flip() +
    theme_tq() +
    theme(legend.position = "right") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(x = "first word in pair",
         y = "second word in pair")

These, we can also show as a graph:

library(igraph)
library(ggraph)

bigram_graph <- bigram_counts %>%
  filter(n > 5) %>%
  graph_from_data_frame()

set.seed(1)

a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color =  palette_light()[1], size = 5, alpha = 0.8) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 0.5) +
  theme_void()

We can also use bigram analysis to identify negated meanings (this will become relevant for sentiment analysis later). So, let’s look at which words are preceded by “not” or “no”.

bigrams_separated <- followers_df %>%
  unnest_tokens(bigram, description, token = "ngrams", n = 2) %>%
  filter(!grepl("\\.|http", bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not" | word1 == "no") %>%
  filter(!word2 %in% stop_words$word)

not_words <- bigrams_separated %>%
  filter(word1 == "not") %>%
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  count(word2, score, sort = TRUE) %>%
  ungroup()

not_words %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * score, fill = n * score > 0)) +
    geom_col(show.legend = FALSE) +
    scale_fill_manual(values = palette_light()) +
    labs(x = "",
         y = "Sentiment score * number of occurrences",
         title = "Words preceded by \"not\"") +
    coord_flip() +
    theme_tq()

What’s the predominant sentiment in my followers’ descriptions?

For sentiment analysis, I will exclude the words with a negated meaning from nrc and switch their positive and negative meanings from bing (although in this case, there was only one negated word, “endorsement”, so it won’t make a real difference).

tidy_descr_sentiment <- tidy_descr %>%
  left_join(select(bigrams_separated, word1, word2), by = c("word" = "word2")) %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  rename(nrc = sentiment.x, bing = sentiment.y) %>%
  mutate(nrc = ifelse(!is.na(word1), NA, nrc),
         bing = ifelse(!is.na(word1) & bing == "positive", "negative", 
                       ifelse(!is.na(word1) & bing == "negative", "positive", bing)))

tidy_descr_sentiment %>%
  filter(nrc != "positive") %>%
  filter(nrc != "negative") %>%
  gather(x, y, nrc, bing) %>%
  count(x, y, sort = TRUE) %>%
  filter(n > 10) %>%
  ggplot(aes(x = reorder(y, n), y = n)) +
    facet_wrap(~ x, scales = "free") +
    geom_col(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "count of sentiment in followers' descriptions")

Are followers’ descriptions mostly positive or negative?

The majority of my followers have predominantly positive descriptions.

tidy_descr_sentiment %>%
  count(screenName, word, bing) %>%
  group_by(screenName, bing) %>%
  summarise(sum = sum(n)) %>%
  spread(bing, sum, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(x = sentiment)) +
    geom_density(color = palette_light()[1], fill = palette_light()[1], alpha = 0.8) +
    theme_tq()

What are the most common positive and negative words in followers’ descriptions?

library(reshape2)
tidy_descr_sentiment %>%
  count(word, bing, sort = TRUE) %>%
  acast(word ~ bing, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = palette_light()[1:2],
                   max.words = 100)

Topic modeling: are there groups of followers with specific interests?

Topic modeling can be used to categorize words into groups. Here, we can use it to see whether (some) of my followers can be grouped into subgroups according to their descriptions.

library(topicmodels)

dtm_words_count <- tidy_descr %>%
  mutate(word_stem = removeNumbers(word_stem)) %>%
  count(screenName, word_stem, sort = TRUE) %>%
  ungroup() %>%
  filter(word_stem != "") %>%
  cast_dtm(screenName, word_stem, n)

# set a seed so that the output of the model is predictable
dtm_lda <- LDA(dtm_words_count, k = 5, control = list(seed = 1234))

topics_beta <- tidy(dtm_lda, matrix = "beta")

p1 <- topics_beta %>%
  filter(grepl("[a-z]+", term)) %>% # some words are Chinese, etc. I don't want these because ggplot doesn't plot them correctly
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, color = factor(topic), fill = factor(topic))) +
    geom_col(show.legend = FALSE, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    facet_wrap(~ topic, ncol = 5) +
    coord_flip() +
    theme_tq() +
    labs(x = "",
         y = "beta (~ occurrence in topics 1-5)",
         title = "The top 10 most characteristic words describe topic categories.")

user_topic <- tidy(dtm_lda, matrix = "gamma") %>%
  arrange(desc(gamma)) %>%
  group_by(document) %>%
  top_n(1, gamma)

p2 <- user_topic %>%
  group_by(topic) %>%
  top_n(10, gamma) %>%
  ggplot(aes(x = reorder(document, -gamma), y = gamma, color = factor(topic))) +
    facet_wrap(~ topic, scales = "free", ncol = 5) +
    geom_point(show.legend = FALSE, size = 4, alpha = 0.8) +
    scale_color_manual(values = palette_light()) +
    scale_fill_manual(values = palette_light()) +
    theme_tq() +
    coord_flip() +
    labs(x = "",
         y = "gamma\n(~ affiliation with topics 1-5)")

library(grid)
library(gridExtra)
grid.arrange(p1, p2, ncol = 1, heights = c(0.7, 0.3))

The upper of the two plots above show the words that were most strongly grouped to five topics. The lower plots show my followers with the strongest affiliation with these five topics.

Because in my tweets I only cover a relatively narrow range of topics (i.e. related to data), my followers are not very diverse in terms of their descriptions and the five topics are not very distinct.

If you find yourself in any of the topics, let me know if you agree with the topic that was modeled for you!

For more text analysis, see my post about text mining and sentiment analysis of a Stuff You Should Know Podcast.

tidyverse 101: Simplifying life for useRs

Hadley Wickham‘s tidyverse has improved the workflow of analysts / data scientists, makes coding errors less likely and code more transparent. You’ve got to love the figure below, representing a simplified workflow of the average analysis project.

A simplified, standard cycle of data analysis

The tidyverse provides assistance in each of the stages. Various packages provide functionality to perform analytical tasks more effectively, in fewer lines, with fewer errors, and moreover in more transparent code. As a first step, the analyst will need to import (load) the data to his/her working environment (e.g., Excel, SPSS, R, RStudio, Spyder, Jupyter). In order to guarantee that the data are correct, a next step will be to clean up and tidy the data before continuing to the analysis part. In this early stage, the analyst can handle the explicit errors in the dataset, such as missing and nonsensical data points or records. After these preparatory steps, the main process starts. This consists of three interrelated tasks. (1) The analyst will need to transform the data in order to retrieve statistics, descriptives, and/or new features. (2) The analyst will need to visualize statistics, relations, and results. This is essential for storytelling and effective interpretation and communication of the results. (3) The analyst will try out different models to fit, explain, and predict the data. Finally, the results of this main process (leading to “understanding” of the data and the underlying processing) can be communicated to others.

I will run through each of these stages in separate posts, explaining the various packages, their inner workings, and demonstrating how they affect the process of data analysis in R:

Importing data (work in progress)
Tidying data (work in progress)
Transforming data (work in progress)
Visualizing data (work in progress)
Modeling data (work in progress)
Efficient programming (work in progress)

tidyverse1 — Overview of the tidyverse packages that belong to each of the stages.

General tutorials:

tidyverse: Example: Trump Approval Rate

For those of you unfamiliar with the tidyverse, it is a collection of R packages that share common philosophies and are designed to work together. Most if not all, are created by R-god Hadley Wickham, one of the leads at RStudio. I was introduced to the tidyverse-packages such as ggplot2 and dplyr in my second R-course, and they have cleaned and sped up my workflow tremendously ever since.

Although I don’t want to mix in the political debate, I came across such a wonderful example of how the tidyverse has simplified coding in R. On the downside, those unfamiliar with the syntax have trouble understanding what happens in the code the author uses.

Running the following R-code will install the core packages of the tidyverse:

install.packages(‘tidyverse’)

These consist among others of the following:

ggplot2: a more potent way of visualization
tibble: an upgrade to the standard data.frame
dplyr: adds great new functionality for manipulating data frames
tidyr: adds even more new functions for wrangling data frames
magrittr: adds piping functionality to improve code readability and workflow
readr: provides easier functions to load in data
purr: adds new functional programming functionality

There are several other packages included (e.g, stringr), but the above are the ones you are most likely to use in everyday projects.

Now, how about dissecting the code in the post. The author (1) loads some functionality in R, (2) scrapes data on approval rates from the web, (3) cleans it up, and creates a wonderful visualization. S/He does this all in only 35 lines of code! Better even, 2 of these code lines are blank, 3 are setup, 6 have aesthetic purposes, and many others could be combined being only several characters long. Due to the tidyverse syntax, the code is easy to read, transparent, and reproducible (it only consists of two chained code blocks, after loading the packages), and takes only 7 seconds to run!

   user  system elapsed 
   5.67    0.85    6.53

In the rest of this article, I walk you through the code of this post to explain what’s happening:

hrbrthemes includes additional ggplot2 themes (plot colors, etc.)
rvest includes functionalities for web scraping
tidyverse we discussed earlier

library(hrbrthemes) 
library(rvest)
library(tidyverse)

Below, the author then creates a list containing the links to the online data to scrape and run it through a magrittr pipe (%>%) to apply the next bit of code to it.

map_df() comes from the purrr package and applies the subsequent code to every element in the earlier list:

Read in the html files specified earlier in the list %>%
Convert them to a table %>%
Store the name of the list (this is the name of the president) as .id %>%
Store that as a data.frame %>%
Select columns (and rename them) %>%
Use the earlier stored president id and add it as a column (‘who’) %>%
Save the output as a dataframe called ratings.

list(
  Obama="http://m.rasmussenreports.com/public_content/politics/obama_administration/obama_approval_index_history",
  Trump="http://m.rasmussenreports.com/public_content/politics/trump_administration/trump_approval_index_history"
) %>% 
map_df(~{
    read_html(.x) %>%
      html_table() %>%
      .[[1]] %>%
      tbl_df() %>%
      select(date=Date, approve=`Total Approve`, disapprove=`Total Disapprove`)
  }, .id="who") -> ratings

Below, the author then starts a new chained code block. S/He first changes (mutate()), from the ratings dataframe, the approval & disapproval data with a custom function (get rid of the % sign and divide by 100), which is then piped through:

Mutate dates to a data format (lubridate is yet another tidyverse package) %>%
Filter out any missing values %>%
Group by the ‘who’-column (President name) %>%
Sort the data file by earlier specified date %>%
Give every line an id number, from 1 up to the number of records (n() returns the sample size per President due to the earlier group_by()) %>%
Ungroup the data %>%

For readability, I split the code here, but it actually still continues as depicted by the %>% at the end.

mutate_at(ratings, c("approve", "disapprove"), function(x) as.numeric(gsub("%", "", x, fixed=TRUE))/100) %>%
  mutate(date = lubridate::dmy(date)) %>%
  filter(!is.na(approve)) %>%
  group_by(who) %>%
  arrange(date) %>%
  mutate(dnum = 1:n()) %>%
  ungroup() %>%

The output is now entered into the ggplot2 visualization function below:

ggplot() creates a layered plot, where the aes(thetics) (parameters) are defined as
- x = the id number,
- y = the approval rate,
- and the color = the President name

Layers and details to this plot are specified/added using +

The first (bottom) layer of the plot is geom_hline() which creates a horizontal line at [x = 0; y = 0.5] with a size = 0.5. +
The 2nd layer is a scatterplot as geom_point() adds points with size = 0.25 on the x & y predefined in ggplot(aes()) +
Next the limits of the Y-axis are set to run from 0 to 1 +
A custom/manual color scheme is set +
Custom titles and labels are applied to the axis +
A predefined theme for the plot is used, drawn from hrbrthemes-package loading in at the start +
The direction of the legend is set +
The position of the legend is set

  ggplot(aes(dnum, approve, color=who)) +
  geom_hline(yintercept = 0.5, size=0.5) +
  geom_point(size=0.25) +
  scale_y_percent(limits=c(0,1)) +
  scale_color_manual(name=NULL, values=c("Obama"="#313695", "Trump"="#a50026")) +
  labs(x="Day in office", y="Approval Rating",
       title="Presidential approval ratings from day 1 in office",
       subtitle="For fairness, data was taken solely from Trump's favorite polling site (Ramussen)",
       caption="Data Source: \nCode: ") +
  theme_ipsum_rc(grid="XY", base_size = 16) +
  theme(legend.direction = "horizontal") +
  theme(legend.position=c(0.8, 1.05))

Theggplot()command at the start automatically prints the plot when it is finished (when no more + is found). The result is just wonderful, isn’t it? With only 35 lines, 2 chained commands, and 7 seconds runtime.

Found on https://www.r-bloggers.com.

Animated GIFs in R

Sometimes, it can be of interest to examine how two variables correlate over time. For example, how people in a social network (e.g., an organization) behave or move over the course of time. However, it can be hard to display multi-dimensional data in a single plot. Instead of including time as an additional dimension and providing stakeholders with complicated 3-D plots, ggplot2 now has a support package called gganimate, which allows you to create custom GIFs. Particularly helpful when you seek to demonstrate trends over time.

See this recent post by Analytics Vidhya for a tutorial on the implementation.

Table of Contents (clickable)

Introductory R

Introductory Books

Online Courses

Style Guides

Advanced R

Package Development

Non-standard Evaluation

Functional Programming

Cheat Sheets

Data Manipulation

Data Visualization

Colors

Interactive / HTML / JavaScript widgets

ggplot2

ggplot2 extensions

Miscellaneous

Shiny, Dashboards, & Apps

Markdown & Other Output Formats

Cloud, Server, & Database

Statistical Modeling & Machine Learning

Books

Courses

Cheat sheets

Time series

Survival analysis

Bayesian

Miscellaneous

Natural Language Processing & Text Mining

Regular Expressions

Geographic & Spatial mapping

Bioinformatics & Computational Biology

Integrated Development Environments (IDEs) & Graphical User Inferfaces (GUIs)

R & other software and languages

R & Excel

R & Python

R & SQL

R Help, Connect, & Inspiration

R Blogs

R Conferences, Events, & Meetups

R Jobs

Share this:

Setup

Data preparation

Word frequency

Estimating sentiment

Wordcloud

Words per sentiment

Positive and negative sentiment throughout the series

Other emotions throughout the series

Share this:

===== COPIED FROM SHIRIN’S PLAYGROUND =====

Retrieving Twitter data

Analyzing friends and followers

What languages do my followers speak?

Who are my most “influential” followers (i.e. followers with the biggest network)?

How active are my followers (i.e. how often do they tweet)

Who are my followers with the biggest network and who tweet the most?

Is there a correlation between number of followers and number of tweets?

Tidy text analysis

What are the most commonly used words in my followers’ descriptions?

What’s the predominant sentiment in my followers’ descriptions?

Are followers’ descriptions mostly positive or negative?

What are the most common positive and negative words in followers’ descriptions?

Topic modeling: are there groups of followers with specific interests?

Share this:

General tutorials:

Share this:

Share this:

Share this:

Integrated Development Environments (IDEs) &
Graphical User Inferfaces (GUIs)