Tag: regularexpression

Curated Regular Expression Resources

Regular expressions (often abbreviated to regex) really are a power tool any programmer should know. They were and are one of the things I most liked learning, as they provide you with immediate, godlike powers that can speed up your (data science) workflow tenfold.

I’ve covered many regex related topics on this blog already, but thought I’d combine them and others in a nice curated overview — for myself, and for you of course, to use.

If you have any materials you liked, but are missing, please let me know!

Introduction & Learning
Language-specific Resources
Debugging & Testing
Fun

Introduction & Learning

Reading

Tutorials (interactive)

Video

Corey Schafer

The Coding Train

Language-specific

Python

Corey Schafer

R

Roger Peng

Testing & Debugging

debuggex.com

regex101.com

regextester.com | regexpal.com

regexr.com

ExtendsClass.com/regex-tester

rubular.com

pythex.com

Fun

Regular expression crosswords

Debuggex: A regular expression testing tool

I came across this awesome regular expression tool I wanted to share. Debuggex allows you to interactively write, test and visually inspect what your regular expressions match in either Python, JavaScript, or Perl.

Read more about regular expressions here, for instance their implementation in R.

Improved Twitter Mining in R

R users have been using the twitteR package by Jeff Gentry to mine tweets for several years now. However, a recent blog post suggests that a newer package provides a better mining tool: rtweet by Michael Kearney (GitHub).

Both packages use a similar setup and require you to do some prep-work by creating a Twitter “app” (see the package instructions). However, rtweet will save you considerable API-time and post-API munging time. This is demonstrated by the examples below, where Twitter is searched for #rstats-tagged tweets, first using twitteR, then using rtweet.

library(twitteR)

# this relies on you setting up an app in apps.twitter.com
setup_twitter_oauth(
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)

r_folks <- searchTwitter("#rstats", n=300)

str(r_folks, 1)
## List of 300
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant

str(r_folks[1])
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @historying: Wow. This is an enormously helpful tutorial by @vivalosburros for anyone interested in mapping "| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2017-10-22 17:18:31"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "922150185916157952"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "Twitter for Android"
##   ..$ screenName   : chr "jasonrhody"
##   ..$ retweetCount : num 3
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame': 0 obs. of  4 variables:
##   .. ..$ url         : chr(0) 
##   .. ..$ expanded_url: chr(0) 
##   .. ..$ dispaly_url : chr(0) 
##   .. ..$ indices     : num(0) 
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID,
##   ..  getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName,
##   ..  getStatusSource, getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
##   ..  setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount,
##   ..  setRetweeted, setScreenName, setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj

The above operations required only several seconds to complete. The returned data is definitely usable, but not in the most handy format: the package maps the Twitter API onto custom R objects. It’s elegant, but also likely overkill for most operations. Here’s the rtweet version:

library(rtweet)

# this relies on you setting up an app in apps.twitter.com
create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
) -> twitter_token

saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# ideally put this in ~/.Renviron
Sys.setenv(TWITTER_PAT=path.expand("~/.rtweet-oauth.rds"))

rtweet_folks <- search_tweets("#rstats", n=300)

dplyr::glimpse(rtweet_folks)
## Observations: 300
## Variables: 35
## $ screen_name                     "AndySugs", "jsbreker", "__rahulgupta__", "AndySugs", "jasonrhody", "sibanjan...
## $ user_id                         "230403822", "703927710", "752359265394909184", "230403822", "14184263", "863...
## $ created_at                      2017-10-22 17:23:13, 2017-10-22 17:19:48, 2017-10-22 17:19:39, 2017-10-22 17...
## $ status_id                       "922151366767906819", "922150507745079297", "922150470382125057", "9221504090...
## $ text                            "RT:  (Rbloggers)Markets Performance after Election: Day 239  https://t.co/D1...
## $ retweet_count                   0, 0, 9, 0, 3, 1, 1, 57, 57, 103, 10, 10, 0, 0, 0, 34, 0, 0, 642, 34, 1, 1, 1...
## $ favorite_count                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_quote_status                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ quote_status_id                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ is_retweet                      FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, F...
## $ retweet_status_id               NA, NA, "922085241493360642", NA, "921782329936408576", "922149318550843393",...
## $ in_reply_to_status_status_id    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_user_id      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_screen_name  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lang                            "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "ro",...
## $ source                          "IFTTT", "Twitter for iPhone", "GaggleAMP", "IFTTT", "Twitter for Android", "...
## $ media_id                        NA, "922150500237062144", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "92...
## $ media_url                       NA, "http://pbs.twimg.com/media/DMwi_oQUMAAdx5A.jpg", NA, NA, NA, NA, NA, NA,...
## $ media_url_expanded              NA, "https://twitter.com/jsbreker/status/922150507745079297/photo/1", NA, NA,...
## $ urls                            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_display                    "ift.tt/2xe1xrR", NA, NA, "ift.tt/2xe1xrR", NA, "bit.ly/2yAAL0M", "bit.ly/2yA...
## $ urls_expanded                   "http://ift.tt/2xe1xrR", NA, NA, "http://ift.tt/2xe1xrR", NA, "http://bit.ly/...
## $ mentions_screen_name            NA, NA, "DataRobot", NA, "historying vivalosburros", "NoorDinTech ikashnitsky...
## $ mentions_user_id                NA, NA, "622519917", NA, "18521423 304837258", "2511247075 739773414316118017...
## $ symbols                         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ hashtags                        "rstats DataScience", "Rstats ACSmtg", "rstats", "rstats DataScience", "rstat...
## $ coordinates                     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_id                        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code                    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_coordinates        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_type               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

This operation took equal or less time but provides the data in a tidy, immediately usable structure.

On the rtweet website, you can read about the additional functionalities this new package provides. For instance, ts_plot() provides a quick visual of the frequency of tweets. It’s possible to aggregate by the minute, i.e., by = "mins", or by some value of seconds, e.g., by = "15 secs".

## Plot time series of all tweets aggregated by second
ts_plot(rt, by = "secs")

ts_filter() creates a time series-like data structure, which consists of “time” (specific interval of time determined via the by argument), “freq” (the number of observations, or tweets, that fall within the corresponding interval of time), and “filter” (a label representing the filtering rule used to subset the data). If no filter is provided, the returned data object includes a “filter” variable, but all of the entries will be blank "", indicating that no filter was used. Otherwise, ts_filter() uses the regular expressions supplied to the filter argument as values for the filter variable. To make the filter labels pretty, users may also provide a character vector using the key parameter.

## plot multiple time series by first filtering the data using
## regular expressions on the tweet "text" variable
rt %>%
  dplyr::group_by(screen_name) %>%
  ## The pipe operator allows you to combine this with ts_plot
  ## without things getting too messy.
  ts_plot() + 
  ggplot2::labs(
    title = "Tweets during election day for the 2016 U.S. election",
    subtitle = "Tweets collected, parsed, and plotted using `rtweet`"
  )

The developer cautions that these plots often resemble frowny faces: the first and last points appear significantly lower than the rest. This is caused by the first and last intervals of time being artificially shrunken by connection and disconnection processes. To remedy this, users may specify trim = TRUE to drop the first and last observation of each time series.
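To see why trimming helps, here is a minimal base R sketch of the idea. It does not use rtweet and is not how the package implements trim = TRUE; the timestamps are made up, and the names created_at, freq, and trimmed are just illustrative.

```r
# Made-up timestamps (one every 2 seconds), standing in for tweet creation times
start <- as.POSIXct("2017-10-22 17:00:00", tz = "UTC")
created_at <- start + seq(20, 280, by = 2)

# Count observations per one-minute interval
bins <- cut(created_at, breaks = "1 min")
freq <- as.data.frame(table(time = bins), responseName = "freq")
print(freq)

# Drop the first and last interval, which are only partially covered
# because collection started and stopped somewhere mid-minute
trimmed <- freq[-c(1, nrow(freq)), ]
print(trimmed)
```

The first and last rows of freq hold fewer observations than the rows in between, simply because collection did not cover those minutes fully. Dropping them removes the artificial dip at both ends.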

Give rtweet a try and let me know whether you prefer it over twitteR.

Regular Expression Crosswords

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
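To make the wildcard comparison concrete, here is a small sketch in R. The file names are made up for illustration. glob2rx() comes with the utils package (attached by default) and translates a wildcard into a regular expression.

```r
# Hypothetical file names, for illustration only
files <- c("notes.txt", "data.txt", "report.pdf", "txt_archive.zip")

# The regex counterpart of the wildcard *.txt; the backslash is doubled
# because it must itself be escaped inside an R string
txt_files <- grep(".*\\.txt$", files, value = TRUE)
print(txt_files)

# glob2rx() converts a wildcard (glob) pattern into a regular expression
wildcard_regex <- glob2rx("*.txt")
print(wildcard_regex)
```

Note that "txt_archive.zip" is not selected: the pattern requires a literal dot followed by "txt" at the very end of the string.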

Last week I posted a first tutorial on Regular Expressions in R and I am working on its sequels. You may find additional resources on Regular Expressions in the learning overviews (R, Python, Data Science).

Today I came across this website of Regular Expression Crosswords, which proves a great resource to playfully master regular expressions. All puzzles are validated live using the JavaScript regex engine. The figure below explains how it works.

Via the links below you can jump to puzzles that match your expertise level:

Regular Expressions in R – Part 1: Introduction and base R functions

The following is the first part of my introduction to regular expressions (regex) in general, and to the use of regex in R in particular. It is loosely inspired by the swirl() tutorial by Jon Calder. I created it in R Markdown and uploaded it to RPubs for an easier read.

Regular expression

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings (Wikipedia). Regular expressions originated in formal language theory in the 1950s and entered everyday use through Unix tools such as ed and grep; Perl later popularized a richer syntax, and regular expressions have since been implemented in many other languages, including R.

Regular expressions usually involve two parts: a pattern and a text string. The pattern defines what type and/or sequence of characters to look for whereas the text string represents the content in which to search/match this pattern. Patterns are always strings themselves and thus need to be enclosed in (single or double) quotation marks.

Example

An example: the pattern “stat” will match the occurrence of the letters “s”, “t”, “a”, “t” in that specific order, regardless of where in the content (text string) they occur and what other characters may precede the “s” or follow the last “t”.

Base R’s grepl() function returns a logical value reflecting whether the pattern is matched. The example below demonstrates how the pattern “stat” is found in both “statistics” and “estate” but not in “castrate” (which does include the letters, but with an “r” in between), in “catalyst” (which does include the letters, but not in the right order), or in “banana” (which does not include all the letters).

words = c("statistics", "estate", "castrate", "catalyst", "banana")
grepl(pattern = "stat", x = words)

## [1]  TRUE  TRUE FALSE FALSE FALSE

Moreover, regular expressions are case sensitive, so “stat” is not found in “Statistics” unless you specify that case should be ignored via the ignore.case argument (FALSE by default).

grepl(pattern = "stat", x = "Statistics")

## [1] FALSE

grepl(pattern = "stat", x = "Statistics", ignore.case = TRUE)

## [1] TRUE
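If you only want to tolerate a capitalized first letter rather than ignore case everywhere, a character class is one option. Character classes are only covered in a later part, so treat this as a small preview with made-up input:

```r
# Made-up input words
words <- c("Statistics", "statistics", "STATISTICS")

# [Ss] matches either an upper-case or a lower-case "s" in that position
first_letter_only <- grepl("[Ss]tat", words)
print(first_letter_only)

# ignore.case = TRUE ignores case for the whole pattern
any_case <- grepl("stat", words, ignore.case = TRUE)
print(any_case)
```

The all-caps word is matched only by the second call, because the character class relaxes the case of the first letter alone.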

Regular Expressions in Base R

Base R includes seven main functions that use regular expressions with different outcomes. These are grep(), grepl(), regexpr(), gregexpr(), regexec(), sub(), and gsub(). Although they require mostly similar inputs, their returned values are quite different.

`grep()` & `grepl()`

grep() examines each element of a character vector and returns the indices where the pattern is matched.

sentences = c("I like statistics", "I like bananas", "Estates and statues are expensive")
grep("stat", sentences)

## [1] 1 3

By setting the value parameter to TRUE, grep() will return the character element instead of its index.

grep("stat", sentences, value = TRUE)

## [1] "I like statistics"                 "Estates and statues are expensive"

Its logical brother grepl() you’ve seen before. It returns a logical value instead of the index or the element.

grepl("stat", sentences)

## [1]  TRUE FALSE  TRUE
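The three return types are closely related. The following sketch, which reuses the example sentences, shows how the indices and elements returned by grep() can be recovered from grepl()'s logical vector. The variable names are only illustrative.

```r
sentences <- c("I like statistics", "I like bananas",
               "Estates and statues are expensive")

# grep() gives the indices that which() recovers from grepl()'s logical vector
idx <- grep("stat", sentences)
same_idx <- which(grepl("stat", sentences))

# A logical vector can subset directly, like grep(..., value = TRUE)
matched <- sentences[grepl("stat", sentences)]

# Negating the logical vector keeps the non-matching elements instead
unmatched <- sentences[!grepl("stat", sentences)]
print(unmatched)
```

In practice, grepl() tends to be the most convenient of the two when filtering, because its output always has the same length as its input.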

`regexpr()` & `gregexpr()`

regexpr() searches for a pattern in a text and returns an integer vector with two attributes (also vectors). The main integer vector represents the position where the pattern was first matched in the text. Its attribute “match.length” is also an integer vector, representing the length of the match (in this case “stat” is always length 4).

If the pattern is not matched, both the main vector and the length attribute will have a value of -1.

The second attribute (“useBytes”) is always a logical vector of length one. It represents whether matching is done byte-by-byte (TRUE) or character-by-character (FALSE), but you may disregard it for now.

sentences

## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"

regexpr("stat", sentences)

## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE

Note that, for the third sentence, regexpr() only returns the values for the first match (i.e., in “Estates”) but not those of the second match (i.e., in “statues”). For this reason, the function has a brother, gregexpr(), which has the same functionality but performs the matching on a global scale (hence the leading g). This means that the algorithm does not stop after its first match, but continues and reports all matches within the content string.

gregexpr() thus does not return a single vector, but a list of vectors. Each of these vectors corresponds to one input content string, and its length equals the number of matches within that string. For example, the “stat” pattern is matched twice in our third sentence, therefore its vector is length 2, with the starting position of each match as well as their lengths.

sentences

## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"

gregexpr("stat", sentences)

## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE
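Positions and lengths are rarely the end goal; usually you want the matched text itself. Base R's regmatches() accepts the output of regexpr() or gregexpr() and returns the corresponding substrings. It is not covered in the text above, so take this as a hedged sketch:

```r
sentences <- c("I like statistics", "I like bananas",
               "Estates and statues are expensive")

# First match per element; elements without a match are dropped
first <- regmatches(sentences, regexpr("stat", sentences))
print(first)

# All matches per element, returned as a list with one vector per input
all_matches <- regmatches(sentences, gregexpr("stat", sentences))

# Number of matches per sentence
n_matches <- lengths(all_matches)
print(n_matches)

# The same substring, cut out by hand from position and match.length
m <- regexpr("stat", sentences[1])
from <- as.integer(m)
by_hand <- substring(sentences[1], from, from + attr(m, "match.length") - 1)
print(by_hand)
```

Be aware that the regexpr() variant returns a shorter vector than its input whenever some elements do not match, so it cannot be assigned back to a data frame column without care.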

`()`

In order to explain how regexec() differs from gregexpr(), we first need to explain how parentheses work in regex. Most simply speaking, parentheses or round brackets (()) indicate groups. One of the advantages of groups is that logical tests, such as the “or” test (|) used below, can be conducted within regular expressions.

sentences

## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"

grepl("like", sentences)

## [1]  TRUE  TRUE FALSE

grepl("are", sentences)

## [1] FALSE FALSE  TRUE

grepl("(are|like)", sentences)

## [1] TRUE TRUE TRUE
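In the example above the parentheses enclose the whole pattern, so they do not change the result. Their effect becomes visible once the alternation is only part of a longer pattern. A small sketch with made-up words:

```r
# Made-up words, for illustration only
words <- c("gray", "grey", "graph", "they")

# Without parentheses the alternation splits the whole pattern: "gra" OR "ey"
no_group <- grepl("gra|ey", words)
print(no_group)

# With parentheses only the grouped part alternates:
# "gr", then "a" or "e", then "y"
with_group <- grepl("gr(a|e)y", words)
print(with_group)
```

Without the group, “graph” and “they” slip through because each contains one of the two alternatives; with the group, only the two spellings of the colour match.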

`regexec()`

However, these groups can also be useful to extract more detailed information from a regular expression. This is where regexec() comes in.

Like gregexpr(), regexec() returns a list of the same length as the content. This list includes vectors that reflect the starting positions of the overall match, as well as the matches corresponding to parenthesized subpatterns. Similarly, attribute “match.length” reflects the lengths of each of the overall and submatches. In case no match is found, a -1 value is again returned.

The beauty of regexec() becomes clear when we split our pattern into two groups using parentheses: “(st)(at)”. As you can see below, both regexpr() and its global brother gregexpr() disregard this grouping and provide the same output as before, as you would expect for the pattern “stat”. In contrast, regexec() notes that we now have an overall pattern (“stat”) as well as two subpatterns (“st” and “at”). For each of these, the function returns the starting positions as well as the pattern lengths.

sentences

## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"

regexpr("(st)(at)", sentences)

## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE

gregexpr("(st)(at)", sentences)

## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE

regexec("(st)(at)", sentences)

## [[1]]
## [1]  8  8 10
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 2 2 4
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
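As with regexpr(), the positions returned by regexec() can be turned into text with base R's regmatches(). The sketch below first extracts the groups from the example above, and then uses a hypothetical pattern of my own, "like (\\w+)", to capture the word following “like”.

```r
sentences <- c("I like statistics", "I like bananas",
               "Estates and statues are expensive")

# Each list element holds the overall match first, followed by one entry
# per group; sentences without a match yield an empty character vector
groups <- regmatches(sentences, regexec("(st)(at)", sentences))
print(groups)

# Hypothetical pattern: capture the word that follows "like"
liked <- regmatches(sentences, regexec("like (\\w+)", sentences))

# Keep only the captured group (second entry), or NA when nothing matched
liked_word <- vapply(
  liked,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
print(liked_word)
```

Note that regexec(), like regexpr(), stops at the first match in each string, so the second “stat” in the third sentence is not reported.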

`sub()` & `gsub()`

The final two base regex functions are sub() and its global brother gsub(). These, very intuitively, substitute a matched pattern with a specified replacement and then return all inputs, changed or not. For instance, we could replace “I” with “You” in our example sentences.

sub(pattern = "I", replacement = "You", sentences)

## [1] "You like statistics"               "You like bananas"                 
## [3] "Estates and statues are expensive"

Similarly, we might want to replace all spaces with underscores. This requires a global search (i.e., gsub()), as sub() stops after the first match.

sub(pattern = " ", replacement = "_", sentences)

## [1] "I_like statistics"                 "I_like bananas"                   
## [3] "Estates_and statues are expensive"

gsub(pattern = " ", replacement = "_", sentences)

## [1] "I_like_statistics"                 "I_like_bananas"                   
## [3] "Estates_and_statues_are_expensive"
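Groups are also useful in replacements. In sub() and gsub(), "\\1" in the replacement string refers back to whatever the first parenthesized group matched, "\\2" to the second, and so on. This goes slightly beyond the text above, so consider it a sketch:

```r
sentences <- c("I like statistics", "I like bananas",
               "Estates and statues are expensive")

# Wrap every match of the group in angle brackets
marked <- gsub("(stat)", "<\\1>", sentences)
print(marked)

# Swap two captured words (made-up input)
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "like bananas")
print(swapped)
```

Because the replacement reuses the matched text, this works even when the exact characters that will match are not known in advance.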

This was the first part of my introduction to regular expressions in R. For more detailed information about all input parameters of each function, please consult the base R manual. In subsequent parts, I will introduce you to so-called Anchors, Character Classes, Groups, Ranges, and Quantifiers. These will allow you to perform more advanced searches and matches. There, we will also elaborate on lazy, greedy, and possessive regular expressions, which further expand our search capability as well as flexibility.

In the end, I hope to provide you with an overview of several regular expressions that I have found extremely useful in my personal projects, and which should be valuable to anyone who conducts applied research (in organizations).

R resources (free courses, books, tutorials, & cheat sheets)

Help yourself to these free books, tutorials, packages, cheat sheets, and many more materials for R programming. There’s a separate overview for handy R programming tricks. If you have additions, please comment below or contact me!

LAST UPDATED: 2021-09-24

Table of Contents (clickable)

Beginner
Advanced
Cheat sheets
Data manipulation
Data visualization
Dashboards & Shiny
Markdown
Database connections
Machine learning
Text mining
Geospatial analysis
Bioinformatics
R IDEs
Software & language connections
Help
Blogs
Conferences, Events, & Groups
Jobs
Other tips & tricks

Completely new to R? → Start learning here!

Introductory R

Introductory Books

Online Courses

YouTube R classes by Chris Bilder
37 YouTube R Tutorials by Flavio Azevedo***
Essential R tutorials by Gilad Feldman
Data Carpentry Social Science in R
Statistics and R, by Rafael Irizarry and Michael Love
Learn R via R-coder.com

Style Guides

Google’s R style guide
Tidyverse style guide by Hadley Wickham
Advanced R style guide by Hadley Wickham
R style guide for stat405 by Hadley Wickham
R style guide by Colin Gillespie
Best practices for R Coding by Arnaud Amsellem / The R Trader
The State of Naming Conventions in R (Bååth, 2012)
A guide for switching from base R to the tidyverse

Advanced R

Package Development

Mastering Software Development in R (Peng, Kross, & Anderson, 2017)
R Packages (Wickham & Bryan, ???)
rOpenSci Packages: Development, Maintenance, and Peer Review
How to develop good R packages (for open science) by Maëlle Salmon
Tutorial on creating R packages by Friedrich Leisch
Developing R Packages by Jeff Leek
Writing an R package from scratch by Hilary Parker
Write your own R package by STAT545
Making an R Package, by R.M. Ripley
Prepare your package for CRAN
Introduction to roxygen2 by Hadley Wickham
How to build package vignettes with knitr by Yihui Xie
knitr in a nutshell: a minimal tutorial by Karl Broman
Rtools: Building R for Windows by Brian Ripley, Duncan Murdoch, and Jeroen Ooms
devtools – tools to make an R developer’s life easier
roxygen2 – tools for describing functions in comments next to their definitions
Rd2roxygen – tools for converting Rd to roxygen documentation
testthat – tools that simplify the testing of R packages

Non-standard Evaluation

Functional Programming

Writing Functions in R by Hadley Wickham via DataCamp.com
R for Data Science chapters on Functions and Iteration (Grolemund & Wickham, 2018)***
Advanced R chapter on Functions (Wickham, 2014)
Lesson on writing, testing, and documenting custom functions by Software-Carpentry.org
User-defined R functions tutorial by Carlo Fanara via DataCamp.com
Functional programming lecture by Duke University
purrr tutorial by Jenny Bryan***
Intro to purrr tutorial by Emorie Beck
Learn purrr tutorial by Dan Ovando
purrr cheat sheet by RStudio

Cheat Sheets

Many of the above cheat sheets are hosted in the official RStudio cheat sheet overview.

Data Manipulation

Data Visualization

Colors

R Color Guide***
colourpicker – widget that allows users to choose colours
paletteer – comprehensive collection of color palettes in R***
ggplot2 colour guide***
Canva’s 100 color palette included in ggthemes::scale_color_canva
Wes Anderson color palettes
Multicolored annotated text in ggplot2 by Andrew Whitby & Visuelle Data
Picular.co – Google, but for colors

Interactive / HTML / JavaScript widgets

R HTML Widgets Gallery***
plotly – interactive plots
billboarder – easy interface to billboard.js, a JavaScript chart library based on D3
d3heatmap – interactive D3 heatmaps
altair – Vega-Lite visualizations via Python
DT – interactive tables
DiagrammeR – interactive diagrams (DiagrammeR cheat sheet)
dygraphs – interactive time series plots
formattable – formattable data structures
ggvis – interactive ggplot2
highcharter – interactive Highcharts plots
leaflet – interactive maps
metricsgraphics – interactive JavaScript bare-bones line, scatterplot and bar charts
networkD3 – interactive D3 network graphs
scatterD3 – interactive scatterplots with D3
rbokeh – interactive Bokeh plots
rCharts – interactive JavaScript charts
rcdimple – interactive JavaScript bar charts and others
rglwidget – interactive 3d plots
threejs – interactive 3d plots and globes
visNetwork – interactive network graphs
wordcloud2 – interface to wordcloud2.js.
timevis – interactive timelines

ggplot2

Code examples of top-50 ggplot2 visualizations***
ggplot2 Cheatsheet by RStudio
ggplot2 Quick Reference Guide
ggplot2 Code Snippets
ggplot2 Code Snippets 2
Hitchhiker’s Guide to ggplot2 in R (Burchell & Vargas, 2016)
A practical introduction with R and ggplot2 (Healy, 2017)
Data Visualization: A practical introduction (Healy, 2018)
Complete ggplot2 Tutorial
Principles & Practice of Data Visualization CS631 at Oregon Health & Science University
Data visualization cheat sheet by RStudio with ggplot2
Setting custom ggplot themes with ggthemr
Creating custom, reproducible color palettes by Simon Jackson
Rearranging values within ggplot2 facets
Combine plots using patchwork or cowplot
esquisse – RStudio addin to interactively explore data with ggplot2 without coding

ggplot2 extensions

ggplot2 extensions overview***
ggthemes – plot style themes
hrbrthemes – opinionated, typographic-centric themes
ggmap – maps with Google Maps, Open Street Maps, etc.
ggiraph – interactive ggplots
gghighlight – highlight lines or values, see vignette
ggstance – horizontal versions of common plots
GGally – scatterplot matrices
ggalt – additional coordinate systems, geoms, etc.
ggbeeswarm – column scatter plots or violin scatter plots
ggforce – additional geoms, see visual guide
ggrepel – prevent plot labels from overlapping
ggraph – graphs, networks, trees and more
ggpmisc – photo-biology related extensions
geomnet – network visualization
ggExtra – marginal histograms for a plot
gganimate – animations, see also the gganimate wiki page
ggpage – pagestyled visualizations of text based data
ggpmisc – useful additional geom_* and stat_* functions
ggstatsplot – include details from statistical tests in plots
ggspectra – tools for plotting light spectra
ggnetwork – geoms to plot networks
ggpointdensity – cross between a scatter plot and a 2D density plot
ggradar – radar charts
ggsurvplot (survminer) – survival curves
ggseas – seasonal adjustment tools
ggthreed – (evil) 3D geoms
ggtech – style themes for plots
ggtern – ternary diagrams
ggTimeSeries – time series visualizations
ggtree – tree visualizations
treemapify – wilcox’s treemaps
seewave – spectrograms

Miscellaneous

coefplot – visualizes model statistics
circlize – circular visualizations for categorical data
clustree – visualize clustering analysis
quantmod – candlestick financial charts
dabestr – Data Analysis using Bootstrap-Coupled ESTimation
devoutsvg – an SVG graphics device (with pattern fills)
devoutpdf – a PDF graphics device
cartography – create and integrate maps in your R workflow
colorspace – HSL based color palettes
viridis – Matplotlib viridis color palette for R
munsell – Munsell color palettes for R
Cairo – high-quality display output
igraph – Network Analysis and Visualization
graphlayouts – new layout algorithms for network visualization
lattice – Trellis graphics
tmap – thematic maps
trelliscopejs – interactive alternative for facet_wrap
rgl – interactive 3D plots
corrplot – graphical display of a correlation matrix
googleVis – Google Charts API
plotROC – interactive ROC plots
extrafont – fonts in R graphics
rvg – produces Vector Graphics that allow further editing in PowerPoint or Excel
showtext – text using system fonts
animation – animated graphics using ImageMagick.
misc3d – 3d plots, isosurfaces, etc.
xkcd – xkcd style graphics
imager – CImg library to work with images
ungeviz – tools for visualizing uncertainty
waffle – square pie charts a.k.a. waffle charts
Creating spectrograms in R with hht, warbleR, soundgen, signal, seewave, or phonTools

Shiny, Dashboards, & Apps

Shiny Cheat Sheet by RStudio
Shiny Tutorial
A collection of links to Shiny applications that have been shared on Twitter.
Enterprise-ready dashboards with Shiny and databases
Several packages to upgrade your Shiny dashboards
More Shiny Resources by Rob Gilmore
More Shiny Resources for Statistics by Yingjie Hu
Building Shiny apps – an interactive tutorial by Dean Attali
Advanced Shiny tips & tricks by Dean Attali (version 2)
flexdashboard – dashboard creation simplified
colourpicker – widget that allows users to choose colours
brighter – toolbox with helpful functions for shiny development
DesktopDeployR – self-contained R-based desktop applications

Markdown & Other Output Formats

R Markdown cheat sheet by RStudio
R Markdown reference guide by RStudio
R Markdown Basics
R Markdown tutorial by RStudio
R Markdown gallery by RStudio
The knitr book (Xie, 2015)
Getting used to R, RStudio, and R Markdown (2016)
R Markdown: The Definitive Guide (Xie, Allaire, & Grolemund, 2018)
Introduction to R Markdown (Clark, 2018)
R Markdown for Scientists (Tierney, 2019)
R Markdown Tips and Tricks
Pimp my RMD by Yan Holtz
Pandoc syntax highlighting examples by Garrick Aden-Buie
Creating slides with R Markdown (Video) by Brian Caffo
Introduction to xaringan by Yihui Xie
A quick demonstration of xaringan
General Markdown cheat sheet
blogdown websites with R Markdown (Xie, Thomas, & Hill, 2018)
blogdown tutorials
How to build a website with blogdown in R, by Storybench
radix – online publication format designed for scientific and technical communication
A template RStudio project with data analysis and manuscript writing by Thomas Julou
Multiple reports from a single Markdown file (example 1) (example 2)

tidystats – automating updating of model statistics
papaja – preparing APA journal articles
blogdown – build websites with Markdown & Hugo
huxtable – create Excel, html, & LaTeX tables
xaringan – make slideshows via remark.js and markdown
summarytools – produces neat, quick data summary tables
citr – RStudio Addin to Insert Markdown Citations

Cloud, Server, & Database

Access and manage Google spreadsheets from R with googlesheets
Tutorial: Database Queries with R
Introduction to sparklyr by DataCamp
Running R on AWS
AWS EC2 Tutorial For Beginners
Using RStudio on Amazon EC2 under the Free Usage Tier
Getting started with databases using R, by RStudio
- RMySQL – connects to MySQL and MariaDB
- RPostgreSQL – connects to Postgres and Redshift.
- RSQLite – embeds a SQLite database.
- odbc – connects to many commercial databases via the open database connectivity protocol.
- bigrquery – connects to Google’s BigQuery.
- DBI – separates the connectivity to the DBMS into a “front-end” and a “back-end”.
- dbplot – leverages dplyr to process the calculations of a plot inside the database
- dplyr – also works with remote on-disk data stored in databases
- tidypredict – run predictions inside the database
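
The front-end/back-end split that DBI describes can be sketched as follows. This is a minimal illustration, assuming the DBI and RSQLite packages are installed; the in-memory database, the table name, and the query are made up for the example and do not come from any of the linked tutorials.

```r
library(DBI)

# Back-end: RSQLite supplies the driver; ":memory:" creates a throwaway database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Front-end: the same DBI generics are used whichever back-end is plugged in
dbWriteTable(con, "mtcars", mtcars)
res <- dbGetQuery(
  con,
  "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl ORDER BY cyl"
)
print(res)

# Always close the connection when done
dbDisconnect(con)
```

Swapping in another back-end (for example RPostgreSQL or odbc) should mostly come down to changing the `dbConnect()` call, although SQL dialects differ between databases.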

BACK TO TABLE OF CONTENTS

Statistical Modeling & Machine Learning

Books

Courses

Introduction to Statistical Learning*** at Stanford University by Trevor Hastie and Rob Tibshirani
Introduction to R for Data Science @Microsoft
Introduction to R for Data Science @FutureLearn by Hadley Wickham
PSY2002: Advanced Statistics at University of Toronto by Elizabeth Page-Gould
STAT 450/870: Regression Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 850: Computing Tools for Statisticians at University of Nebraska-Lincoln by Chris Bilder
STAT 873: Applied Multivariate Statistical Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 875: Categorical Data Analysis at University of Nebraska-Lincoln by Chris Bilder
STAT 950: Computational Statistics at University of Nebraska-Lincoln by Chris Bilder
Joint Statistical Meetings: Analysis of Categorical Data by Chris Bilder

Cheat sheets

Time series

CRAN Task View – TimeSeries
R xts cheat sheet
Forecasting: Principles and Practice (Hyndman & Athanasopoulos, 2017)
A little book of R for time series (tutorial)
ARIMA forecasting in R (6-part YouTube series)
Introduction to the tsfeatures package
Tutorials: Part 1, Part 2, Part 3, & Part 4 of tidy time series @Business-Science.io with tidyquant
Packages:
- xts – extensible time series
- tsfeatures – methods for extracting various features from time series data
- tidyquant – tidyverse-style financial analysis

Survival analysis

CRAN Task View – Survival
R survival analysis cheat sheet by Przemysław Biecek
Packages:
- survival – functionality for survival and hazard models
- survminer (ggsurvplot) – survival curves

Bayesian

Miscellaneous

corrr – easier correlation matrix management and exploration

BACK TO TABLE OF CONTENTS

Natural Language Processing & Text Mining

Text Mining Tutorial with tm
Tidy Text Mining (Silge & Robinson, 2017) with tidytext
Text Analysis with R for Students of Literature (Jockers, 2014)
Tidytext tutorials by computational journalism
21 Recipes for Mining Twitter Data (Rudis, 2017) with rtweet
Emil Hvitfeldt’s R-text-data GitHub repository
Course: Introduction to Text Analytics with R @DataScienceDojo
Course: Twitter Text Mining and Social Network Analysis (Zhao, 2016) @RDataMining with twitteR
Quantitative Analysis of Textual Data with quanteda cheat sheet by Stefan Müller and Kenneth Benoit
List of resources for NLP & Text Mining by Stephen Thomas
Packages — for an overview: CRAN Task View – Natural Language Processing:
- tm – text mining.
- tidytext – text mining using tidyverse principles
- quanteda – framework for quantitative text analysis
- gutenbergr – public domain works (free books to practice on)
- corpora – statistics and data sets for corpus frequency data.
- tau – Text Analysis Utilities
- Sentiment140 – headache-free sentiment analysis
- sentimentr – sentiment analysis using text polarity
- openNLP – sentence detector, tokenizer, pos-tagger, shallow and full syntactic parser, named-entity detector, and maximum entropy models with OpenNLP.
- cleanNLP – natural language processing via tidy data models
- RSentiment – English lexicon-based sentiment analysis with negation and sarcasm detection functionalities.
- RWeka – data mining tasks with Weka
- wordnet – a large lexical database of English with WordNet.
- stringi – language processing wrappers
- textcat – provides support for n-gram based text categorization.
- text2vec – text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarities.
- lsa – Latent Semantic Analysis
- topicmodels – Latent Dirichlet Allocation (LDA) and Correlated Topics Models (CTM)
- lda – Latent Dirichlet Allocation and related models

Regular Expressions

R Regular Expression cheat sheet by Lise Vaudor
R Regular Expression cheat sheet
R Regular Expression cheat sheet (page 2) by RStudio
regexplain – interactive RStudio addin for regular expressions
Regular Expressions in R – Part 1: Introduction and base R functions
R Regular Expressions by Jon M. Calder in swirl()
R Regular Expression Video Tutorial by Roger Peng
General Regular Expression cheat sheet
General Regular Expression Video Tutorial by Roger Peng
General Regular Expression cheat sheet by OverAPI.com
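
As a quick taste of what these resources cover, here is a minimal sketch of a few base R regular expression functions. The input vector and the patterns are invented for illustration; the cheat sheets above list the full syntax.

```r
# Hypothetical example data
x <- c("apple123", "banana", "cherry7")

# grepl(): does each element contain a digit?
has_digit <- grepl("[0-9]", x)

# gsub(): remove every run of digits
no_digits <- gsub("[0-9]+", "", x)

# regexpr() + regmatches(): extract the first run of digits,
# returned only for the elements that actually match
first_num <- regmatches(x, regexpr("[0-9]+", x))

print(has_digit)
print(no_digits)
print(first_num)
```

Note that `regmatches()` combined with `regexpr()` silently drops non-matching elements, so its result can be shorter than the input.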

BACK TO TABLE OF CONTENTS

Geographic & Spatial mapping

Making Maps with R (tutorial) with ggmap, maps, and mapdata
Importing OpenStreetMap data (tutorial) with osmar
Geocomputation with R (Lovelace, Nowosad, & Muenchow, 2018)
Spatial manipulation with Simple Features (sf) cheat sheet by Ryan Garnett

Bioinformatics & Computational Biology

BACK TO TABLE OF CONTENTS

Integrated Development Environments (IDEs) & Graphical User Interfaces (GUIs)

Descriptions mostly taken from their own websites:

RStudio*** – Open source and enterprise ready professional software
Jupyter Notebook*** – open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text across dozens of programming languages.
Microsoft R tools for Visual Studio – turn Visual Studio into a powerful R IDE
R Plugins for Vim, Emacs, and Atom editors
Rattle*** – GUI for data mining
esquisse – RStudio add-in to interactively explore and visualize data
R Analytic Flow – data flow diagram-based IDE
RKWard – easy to use and easily extensible IDE and GUI
Eclipse StatET – Eclipse-based IDE
OpenAnalytics Architect – Eclipse-based IDE
TinnR – open source GUI and IDE
DisplayR – cloud-based GUI
BlueSkyStatistics – GUI designed to look like SPSS and SAS
Deducer – GUI for everyone
R commander (Rcmdr) – easy and intuitive GUI
JGR – Java-based GUI for R
jamovi & jmv – free and open statistical software to bridge the gap between researcher and statistician
Exploratory.io – cloud-based data science focused GUI
Stagraph – GUI for ggplot2 that allows you to visualize and connect to databases and/or basic file types
ggraptr – GUI for visualization (Rapid And Pretty Things in R)
ML Studio – interactive Shiny platform for data visualization, statistical modeling and machine learning

R & other software and languages

R & Excel

BERT – Basic Excel R Toolkit
A Comprehensive Guide to Transitioning from Excel to R by Alyssa Columbus
readxl – package to load in Excel data
xlsx – package to read and write Excel data
rvg – produces Vector Graphics which can be modified in Excel
devoutpdf – a PDF graphics device
tidyxl – imports non-tabular (e.g., format) data from Excel files into R
unpivotr – unpivot complex and irregular data layouts in R
unheadr – handle data with embedded subheaders

R & Python

Python for R users
reticulate cheat sheet by RStudio
reticulate – tools for interoperability between Python and R

R & SQL

sqldf – running SQL statements on R data frames

BACK TO TABLE OF CONTENTS

R Help, Connect, & Inspiration

RStudio Community
R help mailing list
R seek – search engine for R-related websites
R site search – search engine for help files, manuals, and mailing lists
Nabble – mailing list archive and forum
R User Groups & Conferences
R for Data Science Online Learning Community
Stack Overflow – a FAQ for all your R struggles (programming)
Cross Validated – a FAQ for all your R struggles (statistics)
CRAN Task Views – discover new packages per topic
The R Journal – open access, refereed journal of R
Twitter: #rstats, RStudio, Hadley Wickham, Yihui Xie, Mara Averick, Julia Silge, Jenny Bryan, David Smith, Hilary Parker, R-bloggers
Facebook: R Users Psychology
YouTube: Ben Lambert, Roger Peng
Reddit: rstats, rstudio, statistics, machinelearning, dataisbeautiful

R Blogs

R Conferences, Events, & Meetups

R Jobs

BACK TO TABLE OF CONTENTS
