
Kaggle Data Science Survey 2017: Worldwide Preferences for Python & R

Kaggle conducts industry-wide surveys to assess the state of data science and machine learning. Over 17,000 individuals worldwide participated in the survey, myself included, and 171 countries and territories are represented in the data.

There is an ongoing debate regarding whether R or Python is better suited for Data Science (probably the latter, but I nevertheless prefer the former). The thousands of responses to the Kaggle survey may provide some insights into how the preferences for each of these languages are dispersed over the globe. At least, that was what I thought when I wrote the code below.

View the Kaggle Kernel here.

### PAUL VAN DER LAKEN
### 2017-10-31
### KAGGLE DATA SCIENCE SURVEY
### VISUALIZING WORLD WIDE RESPONSES
### AND PYTHON/R PREFERENCES

# LOAD IN LIBRARIES
library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)
library(maps)    # supplies the map data used by ggplot2::map_data()

# OPTIONS & STANDARDIZATION
options(stringsAsFactors = F)
theme_set(theme_light())
dpi = 600
w = 12
h = 8
wm_cor = 0.8
hm_cor = 0.8
capt = "Kaggle Data Science Survey 2017 by paulvanderlaken.com"

# READ IN KAGGLE DATA
mc <- read.csv("multipleChoiceResponses.csv") %>%
  as.tibble()

# READ IN WORLDMAP DATA
worldMap <- map_data(map = "world") %>% as.tibble()

# ALIGN KAGGLE AND WORLDMAP COUNTRY NAMES
mc$Country[!mc$Country %in% worldMap$region] %>% unique()
worldMap$region %>% unique() %>% sort(F)
mc$Country[mc$Country == "United States"] <- "USA"
mc$Country[mc$Country == "United Kingdom"] <- "UK"
mc$Country[grepl("China|Hong Kong", mc$Country)] <- "China"


# CLEAN UP KAGGLE DATA
lvls = c("","Rarely", "Sometimes", "Often", "Most of the time")
labels = c("NA", lvls[-1])
ind_data <- mc %>% 
  select(Country, WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  mutate(WorkToolsFrequencyR = factor(WorkToolsFrequencyR, 
                                      levels = lvls, labels = labels)) %>% 
  mutate(WorkToolsFrequencyPython = factor(WorkToolsFrequencyPython, 
                                           levels = lvls, labels = labels)) %>% 
  filter(!(Country == "" | is.na(WorkToolsFrequencyR) | is.na(WorkToolsFrequencyPython)))

# AGGREGATE TO COUNTRY LEVEL
# (as.numeric() on these factors returns the level index,
#  1 = "NA" up to 5 = "Most of the time", so R and Python
#  hold sums of ordinal usage scores)
country_data <- ind_data %>%
  group_by(Country) %>%
  summarize(N = n(),
            R = sum(WorkToolsFrequencyR %>% as.numeric()),
            Python = sum(WorkToolsFrequencyPython %>% as.numeric()))

# CREATE THEME FOR WORLDMAP PLOT
theme_worldMap <- theme(
    plot.background = element_rect(fill = "white"),
    panel.border = element_blank(),
    panel.grid = element_blank(),
    panel.background = element_blank(),
    legend.background = element_blank(),
    legend.position = c(0, 0.2),
    legend.justification = c(0, 0),
    legend.title = element_text(colour = "black"),
    legend.text = element_text(colour = "black"),
    legend.key = element_blank(),
    legend.key.size = unit(0.04, "npc"),
    axis.text = element_blank(), 
    axis.title = element_blank(),
    axis.ticks = element_blank()
  )

After aligning some country names (above), I could start visualizing the results. A first step was to look at the responses across the globe: the greener a country, the more responses it contributed, while grey countries are not represented in the dataset. A nice addition would have been to look at the response rate relative to country population… any volunteers?
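For any volunteers, a minimal sketch of that per-capita idea, assuming a hypothetical lookup table pop with columns Country and population (not included here):

# sketch only: `pop` is a hypothetical data frame holding
# a population figure for each Country
country_data %>%
  left_join(pop, by = "Country") %>%
  mutate(responses_per_million = N / (population / 1e6)) %>%
  arrange(desc(responses_per_million))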

# PLOT WORLDMAP OF RESPONSE RATE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = N),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "green", high = "darkgreen", name = "Response") +
  theme_worldMap +
  labs(title = "Worldwide Response Kaggle DS Survey 2017",
       caption = capt) +
  coord_equal()

Worldmap_response.png

Now, let’s look at how frequently respondents use Python and R in their daily work. I created two heatmaps: one of all respondents, and one excluding the majority who indicated using neither Python nor R (probably because they didn’t complete the survey).

# AGGREGATE DATA TO WORKTOOL RESPONSES
worktool_data <- ind_data %>%
  group_by(WorkToolsFrequencyR, WorkToolsFrequencyPython) %>%
  count()

# HEATMAP OF PREFERRED WORKTOOLS
ggplot(worktool_data, aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = log(n))) +
  geom_text(aes(label = n), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage",
       subtitle = "Most respondents indicate not using Python or R (or did not complete the survey)",
       caption = capt, 
       fill = "Log(N)") 

heatmap_worktools.png

# HEATMAP OF PREFERRED WORKTOOLS
# EXCLUDING DOUBLE NA'S
worktool_data %>%
  filter(!(WorkToolsFrequencyPython == "NA" & WorkToolsFrequencyR == "NA")) %>%
  ungroup() %>%
  mutate(perc = n / sum(n)) %>%
  ggplot(aes(x = WorkToolsFrequencyR, y = WorkToolsFrequencyPython)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = paste0(round(perc,3)*100,"%")), col = "black") +
  scale_fill_gradient(low = "red", high = "yellow") +
  labs(title = "Heatmap of Python and R usage (non-users excluded)",
       subtitle = "There is a strong reliance on Python and less users focus solely on R",
       caption = capt, 
       fill = "N") 

heatmap_worktools_usersonly.png

Okay, now let’s map these frequency data onto a worldmap. Because I’m interested in country-level differences in usage, I look at the relative usage of Python compared to R. The redder a country, the more Python is used by its data scientists, whereas R is the preferred tool in the bluer countries. Interestingly, there is no country where respondents use R much more than Python.

# WORLDMAP OF RELATIVE WORKTOOL PREFERENCE
ggplot(country_data) + 
  geom_map(data = worldMap, 
           aes(map_id = region, x = long, y = lat),
           map = worldMap, fill = "grey") +
  geom_map(aes(map_id = Country, fill = Python/R),
           map = worldMap, size = 0.3) +
  scale_fill_gradient(low = "blue", high = "red", name = "Python/R") +
  theme_worldMap +
  labs(title = "Relative usage of Python to R per country",
       subtitle = "Focus on Python in Russia, Israel, Japan, Ukraine, China, Norway & Belarus",
       caption = capt) +
  coord_equal() 
Worldmap_relative_usage.png
Countries are color-coded for their relative preference for Python (red/purple) or R (blue) as a Data Science tool. 167 out of 171 countries (98%) demonstrate a value of > 1, indicating a preference for Python over R.
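If you want to check that number yourself, the share of countries with a ratio above 1 can be computed directly from the country_data object created earlier:

# proportion of countries where the Python score exceeds the R score
with(country_data, mean(Python / R > 1, na.rm = TRUE))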

Thank you for reading my visualization report. Please do try and extract some other interesting insights from the data yourself.

If you liked my analysis, please upvote my Kaggle Kernel here!

Geographical Maps in ggplot2: Rectangle World Map

Maarten Lambrechts posted a tutorial demonstrating the steps through which he created a Eurovision Song Contest map in R.

Maarten’s ggplot2 Songfestival map
Maarten’s ggplot2 worldmap

Inspired by his tutorial, I decided to create a worldmap of my own, the R code for which you may find below.

options(stringsAsFactors = F) # options
library(tidyverse) # packages
# retrieve data file
link = "https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv"

geodata <- read.csv(link) %>% as.tibble() # load in geodata
str(geodata) # examine geodata
## Classes 'tbl_df', 'tbl' and 'data.frame':    192 obs. of  11 variables:
##  $ name           : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ alpha.2        : chr  "AF" "AL" "DZ" "AO" ...
##  $ alpha.3        : chr  "AFG" "ALB" "DZA" "AGO" ...
##  $ country.code   : int  4 8 12 24 10 28 32 51 36 40 ...
##  $ iso_3166.2     : chr  "ISO 3166-2:AF" "ISO 3166-2:AL" "ISO 3166-2:DZ" "ISO 3166-2:AO" ...
##  $ region         : chr  "Asia" "Europe" "Africa" "Africa" ...
##  $ sub.region     : chr  "Southern Asia" "Southern Europe" "Northern Africa" "Middle Africa" ...
##  $ region.code    : int  142 150 2 2 NA 19 19 142 9 150 ...
##  $ sub.region.code: int  34 39 15 17 NA 29 5 145 53 155 ...
##  $ x              : int  22 15 13 13 15 7 6 20 24 15 ...
##  $ y              : int  8 9 11 17 23 4 14 6 19 6 ...
# create worldmap
worldmap <- ggplot(geodata)

# add rectangle grid + labels
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3))

Basic rectangle grid worldmap with country labels

# improve geoms
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 2, 
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse() 

Rectangle worldmap colored by region, with reversed y-axis

# finalize plot look
colors = c('yellow', 'red', 'white', 'pink', 'green', 'orange')
worldmap + 
  geom_rect(aes(xmin = x, ymin = y, 
                xmax = x + 1, ymax = y + 1,
                fill = region)) +
  geom_text(aes(x = x, y = y, 
                label = alpha.3),
            size = 3,
            nudge_x = 0.5, nudge_y = -0.5,
            vjust = 0.5, hjust = 0.5) +
  scale_y_reverse() +
  scale_fill_manual(values = colors) +
  guides(fill = guide_legend(ncol = 2), col = F) +
  theme(plot.background = element_rect(fill = "blue"),
        panel.grid = element_blank(),
        panel.background = element_blank(),
        legend.background = element_blank(),
        legend.position = c(0, 0),
        legend.justification = c(0, 0),
        legend.title = element_text(colour = "white"),
        legend.text = element_text(colour = "white"),
        legend.key = element_blank(),
        legend.key.size = unit(0.06, "npc"),
        axis.text = element_blank(), 
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        text = element_text(colour = "white", size = 16)
        ) +
  labs(title = "ggplot2: Worldmap",
       fill = "Region", 
       caption = "paulvanderlaken.com")

ggplot2 Rectangle Worldmap

What would you add to your worldmap? If you end up making one, please send me a copy on paulvanderlaken@gmail.com!
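If you want a starting point: one natural extension is to color the tiles by an external country-level indicator. A rough sketch, assuming a hypothetical data frame indicator_df with columns alpha.3 and value:

# sketch: fill the grid with an external indicator
# (`indicator_df` is hypothetical and not included here)
geodata %>%
  left_join(indicator_df, by = "alpha.3") %>%
  ggplot() +
  geom_rect(aes(xmin = x, ymin = y,
                xmax = x + 1, ymax = y + 1,
                fill = value)) +
  geom_text(aes(x = x + 0.5, y = y + 0.5, label = alpha.3), size = 2) +
  scale_y_reverse()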

Improved Twitter Mining in R

R users have been using the twitteR package by Geoff Jentry to mine tweets for several years now. However, a recent blog post suggests a novel package provides a better mining tool: rtweet by Michael Kearney (GitHub).

Both packages use a similar setup and require you to do some prep-work by creating a Twitter “app” (see the package instructions). However, rtweet will save you considerable API-time and post-API munging time. This is demonstrated by the examples below, where Twitter is searched for #rstats-tagged tweets, first using twitteR, then using rtweet.

library(twitteR)

# this relies on you setting up an app in apps.twitter.com
setup_twitter_oauth(
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)

r_folks <- searchTwitter("#rstats", n=300)

str(r_folks, 1)
## List of 300
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..and 53 methods, of which 39 are  possibly relevant

str(r_folks[1])
## List of 1
##  $ :Reference class 'status' [package "twitteR"] with 17 fields
##   ..$ text         : chr "RT @historying: Wow. This is an enormously helpful tutorial by @vivalosburros for anyone interested in mapping "| __truncated__
##   ..$ favorited    : logi FALSE
##   ..$ favoriteCount: num 0
##   ..$ replyToSN    : chr(0) 
##   ..$ created      : POSIXct[1:1], format: "2017-10-22 17:18:31"
##   ..$ truncated    : logi FALSE
##   ..$ replyToSID   : chr(0) 
##   ..$ id           : chr "922150185916157952"
##   ..$ replyToUID   : chr(0) 
##   ..$ statusSource : chr "Twitter for Android"
##   ..$ screenName   : chr "jasonrhody"
##   ..$ retweetCount : num 3
##   ..$ isRetweet    : logi TRUE
##   ..$ retweeted    : logi FALSE
##   ..$ longitude    : chr(0) 
##   ..$ latitude     : chr(0) 
##   ..$ urls         :'data.frame': 0 obs. of  4 variables:
##   .. ..$ url         : chr(0) 
##   .. ..$ expanded_url: chr(0) 
##   .. ..$ dispaly_url : chr(0) 
##   .. ..$ indices     : num(0) 
##   ..and 53 methods, of which 39 are  possibly relevant:
##   ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID,
##   ..  getReplyToSN, getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName,
##   ..  getStatusSource, getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
##   ..  setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount,
##   ..  setRetweeted, setScreenName, setStatusSource, setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj
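As an aside: if you stick with twitteR, its list of status objects can be flattened into a single data frame with the package's twListToDF() helper, which removes some of the post-API munging:

# flatten twitteR status objects into one data frame
r_folks_df <- twitteR::twListToDF(r_folks)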

These twitteR operations required only several seconds to complete. The returned data is definitely usable, but not in the handiest format: the package maps the Twitter API onto custom R objects. It’s elegant, but also likely overkill for most operations. Here’s the rtweet version:

library(rtweet)

# this relies on you setting up an app in apps.twitter.com
create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"), 
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
) -> twitter_token

saveRDS(twitter_token, "~/.rtweet-oauth.rds")

# ideally put this in ~/.Renviron
Sys.setenv(TWITTER_PAT=path.expand("~/.rtweet-oauth.rds"))

rtweet_folks <- search_tweets("#rstats", n=300)

dplyr::glimpse(rtweet_folks)
## Observations: 300
## Variables: 35
## $ screen_name                     "AndySugs", "jsbreker", "__rahulgupta__", "AndySugs", "jasonrhody", "sibanjan...
## $ user_id                         "230403822", "703927710", "752359265394909184", "230403822", "14184263", "863...
## $ created_at                      2017-10-22 17:23:13, 2017-10-22 17:19:48, 2017-10-22 17:19:39, 2017-10-22 17...
## $ status_id                       "922151366767906819", "922150507745079297", "922150470382125057", "9221504090...
## $ text                            "RT:  (Rbloggers)Markets Performance after Election: Day 239  https://t.co/D1...
## $ retweet_count                   0, 0, 9, 0, 3, 1, 1, 57, 57, 103, 10, 10, 0, 0, 0, 34, 0, 0, 642, 34, 1, 1, 1...
## $ favorite_count                  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ is_quote_status                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
## $ quote_status_id                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ is_retweet                      FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, F...
## $ retweet_status_id               NA, NA, "922085241493360642", NA, "921782329936408576", "922149318550843393",...
## $ in_reply_to_status_status_id    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_user_id      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ in_reply_to_status_screen_name  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ lang                            "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "en", "ro",...
## $ source                          "IFTTT", "Twitter for iPhone", "GaggleAMP", "IFTTT", "Twitter for Android", "...
## $ media_id                        NA, "922150500237062144", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "92...
## $ media_url                       NA, "http://pbs.twimg.com/media/DMwi_oQUMAAdx5A.jpg", NA, NA, NA, NA, NA, NA,...
## $ media_url_expanded              NA, "https://twitter.com/jsbreker/status/922150507745079297/photo/1", NA, NA,...
## $ urls                            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ urls_display                    "ift.tt/2xe1xrR", NA, NA, "ift.tt/2xe1xrR", NA, "bit.ly/2yAAL0M", "bit.ly/2yA...
## $ urls_expanded                   "http://ift.tt/2xe1xrR", NA, NA, "http://ift.tt/2xe1xrR", NA, "http://bit.ly/...
## $ mentions_screen_name            NA, NA, "DataRobot", NA, "historying vivalosburros", "NoorDinTech ikashnitsky...
## $ mentions_user_id                NA, NA, "622519917", NA, "18521423 304837258", "2511247075 739773414316118017...
## $ symbols                         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ hashtags                        "rstats DataScience", "Rstats ACSmtg", "rstats", "rstats DataScience", "rstat...
## $ coordinates                     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_id                        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_type                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_name                      NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ place_full_name                 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country_code                    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ country                         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_coordinates        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ bounding_box_type               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...

This operation took equal or less time, but provides the data in a tidy, immediately usable structure.

On the rtweet website, you can read about the additional functionality this new package provides. For instance, ts_plot() provides a quick visual of the frequency of tweets. It’s possible to aggregate by the minute, i.e., by = "mins", or by some value of seconds, e.g., by = "15 secs".

## Plot time series of all tweets aggregated by second
## (`rt` is assumed to be a set of collected tweets, e.g.
##  the result of a search_tweets() or stream_tweets() call)
ts_plot(rt, by = "secs")

stream-ts
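The by argument accepts other units as well; for example, binning the same series per minute is a one-word change on the example above:

## the same time series, aggregated per minute instead
ts_plot(rt, by = "mins")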

ts_filter() creates a time series-like data structure, which consists of “time” (a specific interval of time determined via the by argument), “freq” (the number of observations, or tweets, that fall within the corresponding interval of time), and “filter” (a label representing the filtering rule used to subset the data). If no filter is provided, the returned data object still includes a “filter” variable, but all of its entries will be blank "", indicating that no filter was used. Otherwise, ts_filter() uses the regular expressions supplied to the filter argument as values for the filter variable. To make the filter labels pretty, users may also provide a character vector via the key parameter.
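Pieced together from the arguments just described, a sketch of what a ts_filter() call might look like (the filter values here are hypothetical, and the exact signature may differ between rtweet versions):

## sketch: two labeled time series from regex filters
rt_series <- ts_filter(rt, by = "mins",
                       filter = c("rstats", "python"),
                       key = c("R", "Python"))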

## plot multiple time series by first filtering the data using
## regular expressions on the tweet "text" variable
rt %>%
  dplyr::group_by(screen_name) %>%
  ## The pipe operator allows you to combine this with ts_plot
  ## without things getting too messy.
  ts_plot() + 
  ggplot2::labs(
    title = "Tweets during election day for the 2016 U.S. election",
    subtitle = "Tweets collected, parsed, and plotted using `rtweet`"
  )

The developer cautions that these plots often resemble frowny faces: the first and last points appear significantly lower than the rest. This is because the first and last intervals of time are artificially shrunken by connection and disconnection processes. To remedy this, users may specify trim = TRUE to drop the first and last observation for each time series.

stream-filter

Give rtweet a try and let me know whether you prefer it over twitteR.

New to R? Kickstart your learning and career with these 6 steps!

For newcomers, R code can look like old Egyptian hieroglyphs with its weird operators (%in%, <-, ||, or %/%). The R language is said to have a steep learning curve, and although there are many introductory courses and books (see R Resources), it’s hard to decide where to start.
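In case those operators are what’s holding you back, here is what each of them does (a quick demonstration, not part of the six steps themselves):

x <- 5            # <- assigns the value 5 to x
3 %in% c(1, 2, 3) # %in% tests membership: TRUE
TRUE || FALSE     # || is the scalar logical OR: TRUE
7 %/% 2           # %/% is integer division: 3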

Fortunately, I am here to help! Below is a six-step guide on how to learn R, using only open-access (i.e., free!) materials.

Although oriented at complete newcomers, it will have you writing your own practical scripts and programs in no time: just start at #1 and work your way to coding mastery!

If you already feel comfortable with the basics of R — or don’t like basics — you can start at #5 and jump into practical learning via the tidyverse.

Good luck!!!

Step 1: An R Folder (15 min)

Create a directory for your R learning stuff somewhere on your computer. Download this (very) short introduction to R by Paul Torfs and Claudia Bauer and store it in that folder. Now read the introduction and follow the steps. It will help you install all R software on your own computer and familiarize you with the standard data types.

Step 2: Handy Cheat Sheets (15 min)

Many standard functions exist in R, and after a while you will remember them by heart. For now, it’s good to have a dictionary or reference close at hand. Download and read the cheat sheets for base R (Mhairi McNeill) and R base functions (Tom Short). Because you’ll be writing most of your R scripts in RStudio, it’s recommended to keep an RStudio cheat sheet as well as an RStudio keyboard shortcuts cheat sheet at hand too.

Step 3: swirl Away in RStudio (8h)

Now you’re ready to really start learning and we’re going to accelerate via swirl. Open up your RStudio and enter the two lines of code below in your console window.

install.packages('swirl') #download swirl package 
library(swirl) #load in swirl package

swirl (webpage) will start automatically, and after a couple of prompts you will be able to choose the learning course called 1: R Programming: The basics of programming in R (see below). This course consists of 15 modules through which you will master the basics of R in the environment itself. Start with module 1 and complete one to three modules per day, so that you finish the swirl course within a week.

Starting up swirl in RStudio
swirl’s four R learning courses and the 15 modules belonging to the basics of R programming course

Step 4: A Pirate’s Guide to R (10h)

OK, you should now be familiar with the basics of R. However, knowledge is crystallized through repetition. I therefore suggest you walk through the book YaRrr! The Pirate’s Guide to R (Phillips, 2017), starting at chapter 3. It’s a fun book and will provide you with more knowledge on how to program custom functions and loops, plus some basic statistical modelling techniques – the thing R was actually designed for.

Step 5: R for Data Science (16h)

By now, you might say you are an adept R programmer with statistical modelling experience. However, you have mostly been working with base R functions, knowledge of which is a must-have to really understand the language. In practice, however, R programmers rely heavily on contributed packages. A very useful group of packages is commonly referred to as the tidyverse. You will be amazed at how much this set of packages simplifies working in R. The next step, therefore, is to work through the book R for Data Science (Grolemund & Wickham, 2017) (hardcopy here).

Step 6: Specialize (∞)

You are now several steps and a couple of weeks further. You possess basic knowledge of the R language, know how to write scripts in RStudio, are capable of programming in base R as well as using the advanced functionality of the tidyverse, and you have even made a start with some basic statistical modelling.

It’s time to set you loose in the wonderful world of the R community. If you have not done so already, get accounts on Stack Overflow and Cross Validated. You might also want to subscribe to the R Help Mailing List, R Bloggers, and to my website, obviously.


On Twitter, have a look at #rstats and, on reddit, subscribe to the rstats, rstudio, and statistics threads. At this point, I can only advise you to return to the R Resources Overview and continue broadening your R programming skills. Pick materials in the area that interests you.

Functional programming and why not to “grow” vectors in R

For fresh R programmers, vectorization can sound awfully complicated. Consider two math problems, one vectorized, and one not:

math_operations.png
Two ways of doing the same computations in R

Why on earth should R spend more time calculating one over the other? In both cases there are the same three addition operations to perform, so why the difference? This is what we will try to illustrate in this post, which is inspired by work by Noam Ross and the University of Auckland.
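The contrast boils down to something like the following (a reconstruction; the exact numbers in the image may differ):

x <- c(1, 2, 3)

# vectorized: one call that hands the whole vector to compiled code
x + 1

# non-vectorized: three separate interpreted operations
c(x[1] + 1, x[2] + 1, x[3] + 1)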

R behind the scenes:

R is a high-level, interpreted computer language. This means that R takes care of a lot of basic computer tasks for you. For instance, when you type i <- 5.0 you don’t have to tell R:

  • That 5.0 is a floating-point number
  • That i should store numeric-type data
  • To find a place in memory to put 5.0
  • To register i as a pointer to that place in memory

You also don’t have to convert i <- 5.0 to binary code. That’s done automatically when you hit Enter. Although you are probably not aware of this, many R functions are just wrapping your input and passing it to functions of other, compiled computer languages, like C, C++, or FORTRAN, in the back end.

Now here’s the crux: if you need to run a function over a set of values, you can either pass a vector of these values through the R function to the compiled code (math example 1), or you can call the R function repeatedly for each value separately (math example 2). If you do the latter, R has to do the work of figuring out the above bullet points and translating the code each time, repeatedly. However, if you call the function with a vector instead, this figuring-out part happens only once. For instance, R vectors are restricted to a single data type (e.g., numeric, character), so when you use a vector instead of separate values, R only needs to determine the input data type once, which saves time. In short, there’s no advantage to NOT organizing your data as a vector.
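You can see the difference by timing a vectorized addition against an element-by-element loop over the same data (timings are machine-dependent; this is just a sketch):

n <- 1e7
v <- runif(n)

# vectorized: the loop runs in compiled code
system.time(v + 1)

# interpreted loop: R re-evaluates the addition n times
system.time({
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- v[i] + 1
})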

For-loops and functional programming (*ply)

One example of how new R programmers frequently waste computing time is via memory allocation. Take the below for loop, for instance:

iterations = 10000000

system.time({
  j = c()
  for(i in 1:iterations){
    j[i] <- i
  }
})
##    user  system elapsed 
##    3.71    0.23    3.94

Here, vector j starts out empty, and for each iteration of the for loop, R has to re-size the vector and re-allocate memory. Each time, R has to find the vector in its memory, create a new vector that fits more data, copy the old data over, insert the new data, and erase the old vector. This process gets very slow as vectors get big. Nevertheless, it is often how programmers store the results of their iterative processes. Tremendous computing power can be saved by pre-allocating the vector to fit all the values beforehand, so that R does not have to perform these unnecessary actions:

system.time({
  j = rep(NA, iterations)
  for(i in 1:iterations){
    j[i] <- i
  }
})
##    user  system elapsed 
##    0.88    0.01    0.89

More advanced R users tend to optimize further via functionals. This family of functions takes in vectors (or matrices, or lists) of values and applies other functions to each element. Examples are the apply family or plyr’s *ply functions, which speed up R code mainly because the loops they run internally automatically do things like pre-allocating vector size. Functionals additionally prevent side effects: unwanted changes to your working environment, such as the i left behind after for(i in 1:10). Although this does not necessarily improve computing time, it is regarded as best practice because it prevents subsequent coding errors.
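For illustration, here is the fill-a-vector task from the examples above rewritten with a functional (a sketch: vapply pre-allocates its result from the FUN.VALUE template, and no counter variable i lingers in the workspace afterwards):

system.time({
  j <- vapply(seq_len(iterations), function(i) i, numeric(1))
})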

Conclusions

There are cases when you should use for loops. For example, the performance penalty for using a for loop instead of a vector will be small if the number of iterations is relatively small and the functions called inside your for loop are slow. In that case, looping and function-call overhead make up a small fraction of your computational time, and a for loop may be more intuitive or easier to read. Additionally, there are cases where for loops make more sense than vectorized functions: (1) when the functions you seek to apply don’t take vector arguments and (2) when each iteration depends on the results of previous iterations, as in the sketch below.
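A minimal example of case (2), where each value depends on the previous one and the loop genuinely cannot be replaced by a single vectorized call (the logistic map, used here purely as an illustration):

# each x[i] is a nonlinear function of x[i - 1],
# so the iterations must run in order
x <- numeric(50)
x[1] <- 0.5
for (i in 2:50) {
  x[i] <- 3.9 * x[i - 1] * (1 - x[i - 1])
}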

Another general rule: fast R code is short. If you can express what you want to do in R in a line or two, with just a few function calls that are actually calling compiled code, it’ll be more efficient than a long program with the added overhead of many function calls.

Finally, there is a saying that “premature optimization is the root of all evil”. What this means is that you should not re-write your R code unless the computing time you are going to save is worth the time invested. Nevertheless, if you understand how vectorization and functional programming work, you will be able to write faster, safer, and more transparent (short & simple) programs.

 

Regular Expressions in R – Part 1: Introduction and base R functions

The following is the first part of my introduction to regular expressions (regex) in general, and the use of regex in R specifically. It is loosely inspired by the swirl() tutorial by Jon Calder. I created it in R Markdown and uploaded it to RPubs for an easier read.

Regular expression

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string-searching algorithms for “find” or “find and replace” operations on strings (Wikipedia). Regular expressions predate Perl but were popularized by it, and they have since been implemented in many other languages, including R.

Regular expressions usually involve two parts: a pattern and a text string. The pattern defines what type and/or sequence of characters to look for whereas the text string represents the content in which to search/match this pattern. Patterns are always strings themselves and thus need to be enclosed in (single or double) quotation marks.

Example

An example: the pattern “stat” will match the occurrence of the letters “s”, “t”, “a”, “t” in that specific order, regardless of where in the content (text string) they occur and what other characters may precede the “s” or follow the last “t”.

Base R’s grepl() function returns a logical value reflecting whether the pattern is matched. The below demonstrates how the pattern “stat” can be found in both “statistics” and “estate” but not in “castrate” (which does include the letters, but with an r in between), in “catalyst” (which does include the letters, but not in the right order), or in “banana” (which does not include all the letters).

words = c("statistics", "estate", "castrate", "catalyst", "banana")
grepl(pattern = "stat", x = words)
## [1]  TRUE  TRUE FALSE FALSE FALSE

Moreover, regular expressions are case-sensitive, so “stat” is not found in “Statistics” unless it is specified that case should be ignored (ignore.case is FALSE by default).

grepl(pattern = "stat", x = "Statistics")
## [1] FALSE
grepl(pattern = "stat", x = "Statistics", ignore.case = TRUE)
## [1] TRUE

Regular Expressions in Base R

Base R includes seven main functions that use regular expressions, each with a different outcome: grep(), grepl(), regexpr(), gregexpr(), regexec(), sub(), and gsub(). Although they require mostly similar inputs, their returned values are quite different.

grep() & grepl()

grep() examines each element of a character vector and returns the indices where the pattern is matched.

sentences = c("I like statistics", "I like bananas", "Estates and statues are expensive")
grep("stat", sentences)
## [1] 1 3

By setting the value parameter to TRUE, grep() will return the matching character element instead of its index.

grep("stat", sentences, value = TRUE)
## [1] "I like statistics"                 "Estates and statues are expensive"

Its logical brother grepl() you’ve seen before: it returns a logical value instead of the index or the element.

grepl("stat", sentences)
## [1]  TRUE FALSE  TRUE

regexpr() & gregexpr()

regexpr() searches for a pattern in a text and returns an integer vector with two attributes (also vectors). The main integer vector represents the position at which the pattern was first matched in the text. Its attribute “match.length” is also an integer vector, representing the length of the match (in this case “stat” always has length 4).

If the pattern is not matched, both the main vector and the length attribute will have a value of -1.

The second attribute (“useBytes”) is always a logical vector of length one. It represents whether matching is done byte-by-byte (TRUE) or character-by-character (FALSE), but you may disregard it for now.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
regexpr("stat", sentences)
## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE

Note that, for the third sentence, regexpr() only returns the values for the first match (i.e., the “stat” in “Estates”) but not those of the second match (i.e., the one in “statues”). For this reason, the function has a brother, gregexpr(), which has the same functionality but performs the matching on a global scale (hence the leading g). This means that the algorithm does not stop after its first match, but continues and reports all matches within the content string.

gregexpr() thus does not return a single vector, but a list of vectors. Each of these vectors corresponds to one input content string, and its length equals the number of matches within that content. For example, the “stat” pattern is matched twice in our third sentence; its vector therefore has length 2, holding the starting position of each match, with the match lengths in the attribute.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
gregexpr("stat", sentences)
## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE

Parentheses: ()

In order to explain how regexec() differs from gregexpr(), we first need to explain how parentheses work in regex. Most simply speaking, parentheses or round brackets (()) indicate groups. One of the advantages of groups is that logical tests can be conducted within regular expressions.

sentences 
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
grepl("like", sentences)
## [1]  TRUE  TRUE FALSE
grepl("are", sentences)
## [1] FALSE FALSE  TRUE
grepl("(are|like)", sentences)
## [1] TRUE TRUE TRUE

regexec()

However, these groups can also be useful to extract more detailed information from a regular expression. This is where regexec() comes in.

Like gregexpr(), regexec() returns a list of the same length as the content. This list includes vectors that reflect the starting positions of the overall match, as well as the matches corresponding to parenthesized subpatterns. Similarly, the attribute “match.length” reflects the lengths of each of the overall match and submatches. In case no match is found, a -1 value is again returned.

The beauty of regexec() becomes clear when we split our pattern into two groups using parentheses: “(st)(at)”. As you can see below, both regexpr() and its global brother gregexpr() disregard this grouping and provide the same output as before – as you would expect for the pattern “stat”. In contrast, regexec() notes that we now have a global pattern (“stat”) as well as two subpatterns (“st” and “at”). For each of these, the function returns the starting positions as well as the pattern lengths.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
regexpr("(st)(at)", sentences)
## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE
gregexpr("(st)(at)", sentences)
## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE
regexec("(st)(at)", sentences)
## [[1]]
## [1]  8  8 10
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 2 2 4
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE

sub() & gsub()

The final two base regex functions are sub() and its global brother gsub(). These, very intuitively, substitute a matched pattern with a specified replacement and then return all inputs. For instance, we could replace “I” with “You” in our example sentences.

sub(pattern = "I", replacement = "You", sentences)
## [1] "You like statistics"               "You like bananas"                 
## [3] "Estates and statues are expensive"

Similarly, we might want to replace all spaces with underscores. This requires a global search (i.e., gsub()), as sub() stops after the first match.

sub(pattern = " ", replacement = "_", sentences)
## [1] "I_like statistics"                 "I_like bananas"                   
## [3] "Estates_and statues are expensive"
gsub(pattern = " ", replacement = "_", sentences)
## [1] "I_like_statistics"                 "I_like_bananas"                   
## [3] "Estates_and_statues_are_expensive"

This was the first part of my introduction to regular expressions in R. For more detailed information about all input parameters of each function, please consult the base R manual. In subsequent parts, I will introduce you to so-called anchors, character classes, groups, ranges, and quantifiers. These will allow you to perform more advanced searches and matches. There, we will also elaborate on lazy, greedy, and possessive regular expressions, which further expand our search capability as well as flexibility.

In the end, I hope to provide you with an overview of several Regular Expressions that I have found extremely useful in my personal project, and which should be valuable to anyone who conducts applied research (in organizations).