Regular Expression Crosswords

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
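
To see the analogy in base R (a minimal sketch of my own; the file names are invented for illustration), grepl() tests which elements of a character vector end in ".txt", just like the *.txt wildcard would. Note that the backslash has to be doubled inside an R string.

files = c("notes.txt", "report.pdf", "data.txt", "script.R")
grepl("\\.txt$", x = files)
## [1]  TRUE FALSE  TRUE FALSE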

Last week I posted a first tutorial on Regular Expressions in R and I am working on its sequels. You may find additional resources on Regular Expressions in the learning overviews (R, Python, Data Science).

Today I came across this website of Regular Expression Crosswords, which proves a great resource to playfully master regular expressions. All puzzles are validated live using the JavaScript regex engine. The figure below explains how it works.

[Figure: example regex crossword]

Via the links below you can jump to puzzles that match your expertise level:

Regular Expressions in R – Part 1: Introduction and base R functions

The following is the first part of my introduction to regular expressions (regex) in general, and the use of regex in R specifically. It is loosely inspired by the swirl() tutorial by Jon Calder. I created it in R Markdown and uploaded it to RPubs for an easier read.

Regular expression

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings (Wikipedia). Regular expressions were popularized by the Perl language and have since been implemented in many other languages, including R.

Regular expressions usually involve two parts: a pattern and a text string. The pattern defines what type and/or sequence of characters to look for whereas the text string represents the content in which to search/match this pattern. Patterns are always strings themselves and thus need to be enclosed in (single or double) quotation marks.

Example

An example: the pattern “stat” will match the occurrence of the letters “s”, “t”, “a”, “t” in that specific order, regardless of where in the content (text string) they occur and what other characters may precede the “s” or follow the last “t”.

Base R’s grepl() function returns a logical value reflecting whether the pattern is matched. The below demonstrates how the pattern “stat” can be found in both “statistics” and “estate” but not in “castrate” (which does include the letters, but with an “r” in between), in “catalyst” (which does include the letters, but not in the right order), or in “banana” (which does not include all the letters).

words = c("statistics", "estate", "castrate", "catalyst", "banana")
grepl(pattern = "stat", x = words)
## [1]  TRUE  TRUE FALSE FALSE FALSE

Moreover, regular expressions are case sensitive, so “stat” is not found in “Statistics”, unless it is specified that case should be ignored (ignore.case is FALSE by default).

grepl(pattern = "stat", x = "Statistics")
## [1] FALSE
grepl(pattern = "stat", x = "Statistics", ignore.case = TRUE)
## [1] TRUE

Regular Expressions in Base R

Base R includes seven main functions that use regular expressions with different outcomes. These are grep(), grepl(), regexpr(), gregexpr(), regexec(), sub(), and gsub(). Although they require mostly similar inputs, their returned values are quite different.

grep() & grepl()

grep() examines each element of a character vector and returns the indices where the pattern is matched.

sentences = c("I like statistics", "I like bananas", "Estates and statues are expensive")
grep("stat", sentences)
## [1] 1 3

By setting the value parameter to TRUE, grep() will return the character element instead of its index.

grep("stat", sentences, value = TRUE)
## [1] "I like statistics"                 "Estates and statues are expensive"

Its logical brother grepl() you have seen before. It returns a logical value instead of the index or the element.

grepl("stat", sentences)
## [1]  TRUE FALSE  TRUE

regexpr() & gregexpr()

regexpr() searches for a pattern in a text and returns an integer vector with two attributes (also vectors). The main integer vector returned represents the position where the pattern was first matched in the text. Its attribute “match.length” is also an integer vector representing the length of the match (in this case “stat” is always of length 4).

If the pattern is not matched, both the main vector and the length attribute will have a value of -1.

The second attribute (“useBytes”) is always a logical vector of length one. It represents whether matching is done byte-by-byte (TRUE) or character-by-character (FALSE), but you may disregard it for now.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
regexpr("stat", sentences)
## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE

Note that, for the third sentence, regexpr() only returns the values for the first match (i.e., in “Estates”) but not those of the second match (i.e., in “statues”). For this reason, the function has a brother, gregexpr(), which has the same functionality but performs the matching on a global scale (hence the leading g). This means that the algorithm does not stop after its first match, but continues and reports all matches within the content string.

gregexpr() thus does not return a single vector, but a list of vectors. Each of these vectors reflects an input content string and has a length equal to the number of matches within that content. For example, the “stat” pattern is matched twice in our third sentence; therefore its vector has length 2, with the starting position of each match as well as their lengths.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
gregexpr("stat", sentences)
## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE
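
Not covered in the original post, but useful to know: the position vectors returned by regexpr() and gregexpr() can be passed to base R’s regmatches() to extract the matched substrings themselves, rather than just their positions.

regmatches(sentences, gregexpr("stat", sentences))
## [[1]]
## [1] "stat"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "stat" "stat"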

Parentheses

In order to explain how regexec() differs from gregexpr(), we first need to explain how parentheses work in regex. Most simply speaking, parentheses or round brackets (()) indicate groups. One of the advantages of groups is that logical tests can thus be conducted within regular expressions.

sentences 
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
grepl("like", sentences)
## [1]  TRUE  TRUE FALSE
grepl("are", sentences)
## [1] FALSE FALSE  TRUE
grepl("(are|like)", sentences)
## [1] TRUE TRUE TRUE

regexec()

However, these groups can also be useful to extract more detailed information from a regular expression. This is where regexec() comes in.

Like gregexpr(), regexec() returns a list of the same length as the content. This list includes vectors that reflect the starting positions of the overall match, as well as the matches corresponding to parenthesized subpatterns. Similarly, the attribute “match.length” reflects the lengths of each of the overall and submatches. In case no match is found, a -1 value is again returned.

The beauty of regexec() becomes clear when we split our pattern into two groups using parentheses: “(st)(at)”. As you can see below, both regexpr() and its global brother gregexpr() disregard this grouping and provide the same output as before – as you would expect for the pattern “stat”. In contrast, regexec() notes that we now have a global pattern (“stat”) as well as two subpatterns (“st” and “at”). For each of these, the function returns the starting positions as well as the pattern lengths.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
regexpr("(st)(at)", sentences)
## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE
gregexpr("(st)(at)", sentences)
## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE
regexec("(st)(at)", sentences)
## [[1]]
## [1]  8  8 10
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 2 2 4
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
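
As an aside of my own (not part of the original tutorial): combining regexec() with regmatches() returns the text of the overall match and of each parenthesized submatch, which is often more convenient than working with the positions directly.

regmatches(sentences, regexec("(st)(at)", sentences))
## [[1]]
## [1] "stat" "st"   "at"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "stat" "st"   "at"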

sub() & gsub()

The final two base regex functions are sub() and its global brother gsub(). These, very intuitively, substitute a matched pattern with a specified replacement and then return all inputs. For instance, we could replace “I” with “You” in our example sentences.

sub(pattern = "I", replacement = "You", sentences)
## [1] "You like statistics"               "You like bananas"                 
## [3] "Estates and statues are expensive"

Similarly, we could desire to replace all spaces by underscores. This would require a global search (i.e., gsub()), as sub() would stop after the first match.

sub(pattern = " ", replacement = "_", sentences)
## [1] "I_like statistics"                 "I_like bananas"                   
## [3] "Estates_and statues are expensive"
gsub(pattern = " ", replacement = "_", sentences)
## [1] "I_like_statistics"                 "I_like_bananas"                   
## [3] "Estates_and_statues_are_expensive"

This was the first part of my introduction to Regular Expressions in R. For more detailed information about all input parameters of each function, please consult the base R manual. In subsequent parts, I will introduce you to so-called Anchors, Character Classes, Groups, Ranges, and Quantifiers. These will allow you to perform more advanced searches and matches. Here, we will also elaborate on lazy, greedy, and possessive regular expressions, which further expand our search capability as well as flexibility.
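
Not part of this first post, but as a small taste of my own of what those topics look like: the caret (^) anchors a match to the start of a string, square brackets define a character class, and the plus sign (+) is a quantifier meaning “one or more”.

grepl("^stat", c("statistics", "estate"))
## [1]  TRUE FALSE
grepl("[0-9]+", c("R2D2", "banana"))
## [1]  TRUE FALSE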

In the end, I hope to provide you with an overview of several Regular Expressions that I have found extremely useful in my personal projects, and which should be valuable to anyone who conducts applied research (in organizations).

Writing your thesis with R Markdown

R Markdown is a great tool for integrating data analysis and report writing. Rosanna van Hespen wrote a great five-part blog guide on how to write your thesis in R Markdown:

  1. Getting started
  2. Text, Citations, & Equations
  3. Figures, Code, & Tables
  4. Putting it all together
  5. Layout

pix2code: teaching AI to build apps

Last May, Tony Beltramelli of Uizard Technologies presented his latest algorithm pix2code at the NIPS conference. Put simply, the algorithm looks at a picture of a graphical user interface (i.e., the layout of an app) and determines, via an iterative process, what the underlying code likely looks like.

Graphical user interface examples (Google Images)

Please watch Uizard’s pix2code demo video or the third-party summary at the bottom of this blog. My understanding is that pix2code is based on convolutional and recurrent neural networks (long explanation video) in combination with long short-term memory (short explanation video). Based on a single input image, pix2code can generate code that is 77% accurate, and it works for three of the larger platforms (i.e., iOS, Android and web-based technologies).

The input and output of pix2code

Obviously, this is groundbreaking technology. When further developed, pix2code not only increases the speed with which society is automated/robotized, but it also expands automation to more complex and highly needed tasks, such as programming and web/app development.

Here you can read the full academic paper on pix2code.

Below is the official demo reviewed by another data enthusiast with commentary and some additional food for thought.

Read here some of my other blogs on neural networks and robotization:

Robert Coombs and his application robot

Robert Coombs wanted to see whether he could land a new job. He was aware that, these days, organizations often employ applicant tracking systems to advance or reject incoming applications. Hence, Robert concluded that he had two challenges in his search for a new job:

  • He was up against leaders in their field, so his resume wouldn’t simply jump to the top of the pile.
  • Robots would read his application, along with those of his competition.

Being a tech enthusiast and having some programming skills, he decided to build his own application robot, capable of sending a customized cover letter and resume to the thousands of jobs posted online every day, in a matter of seconds. I strongly recommend you read his full story here, but these were his conclusions:

  • It’s not how you apply, it’s who you know. And if you don’t know someone, don’t bother.
  • Companies are trying to fill a position with minimal risk, not discover someone who breaks the mold.
  • The number of jobs you apply to has no correlation to whether you’ll be considered, and you won’t be considered for jobs you don’t get the chance to apply to.

What I found most amusing is that he A/B tested one normal-looking cover letter against a letter that admits right in the second sentence that it was being sent by a robot. “Now, one of those letters should have performed either a lot better or a lot worse than the other. For my purposes, I didn’t care which,” he states. But as far as he could tell from the results of this experiment, it seems that nobody even reads cover letters anymore – not even the robots supposedly used in applicant tracking systems.