Regular Expression Crosswords

A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is .*\.txt$.
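
To see the analogy in base R (a minimal sketch of my own; the file names are invented for illustration), grepl() tests which elements of a character vector end in ".txt", just like the *.txt wildcard would. Note that the backslash has to be doubled inside an R string.

files = c("notes.txt", "report.pdf", "data.txt", "script.R")
grepl("\\.txt$", x = files)
## [1]  TRUE FALSE  TRUE FALSE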

Last week I posted a first tutorial on Regular Expressions in R and I am working on its sequels. You may find additional resources on Regular Expressions in the learning overviews (R, Python, Data Science).

Today I came across this website of Regular Expression Crosswords, which proves a great resource to playfully master regular expressions. All puzzles are validated live using the JavaScript regex engine. The figure below explains how it works.

[Figure: example regex crossword]

Via the links below you can jump to puzzles that match your expertise level:

Regular Expressions in R – Part 1: Introduction and base R functions

The following is the first part of my introduction to regular expressions (regex) in general, and the use of regex in R specifically. It is loosely inspired by the swirl() tutorial by Jon Calder. I created it in R Markdown and uploaded it to RPubs for an easier read.

Regular expression

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings (Wikipedia). Regular expressions were popularized by the Perl language and have since been implemented in many other languages, including R.

Regular expressions usually involve two parts: a pattern and a text string. The pattern defines what type and/or sequence of characters to look for whereas the text string represents the content in which to search/match this pattern. Patterns are always strings themselves and thus need to be enclosed in (single or double) quotation marks.

Example

An example: the pattern “stat” will match the occurrence of the letters “s”, “t”, “a”, “t” in that specific order, regardless of where in the content (text string) they occur and what other characters may precede the “s” or follow the last “t”.

Base R’s grepl() function returns a logical value reflecting whether the pattern is matched. The below demonstrates how the pattern “stat” can be found in both “statistics” and “estate” but not in “castrate” (which does include the letters, but with an “r” in between), in “catalyst” (which does include the letters, but not in the right order), or in “banana” (which does not include all the letters).

words = c("statistics", "estate", "castrate", "catalyst", "banana")
grepl(pattern = "stat", x = words)
## [1]  TRUE  TRUE FALSE FALSE FALSE

Moreover, regular expressions are case sensitive, so “stat” is not found in “Statistics”, unless it is specified that case should be ignored (ignore.case is FALSE by default).

grepl(pattern = "stat", x = "Statistics")
## [1] FALSE
grepl(pattern = "stat", x = "Statistics", ignore.case = TRUE)
## [1] TRUE

Regular Expressions in Base R

Base R includes seven main functions that use regular expressions with different outcomes. These are grep(), grepl(), regexpr(), gregexpr(), regexec(), sub(), and gsub(). Although they require mostly similar inputs, their returned values are quite different.

grep() & grepl()

grep() examines each element of a character vector and returns the indices where the pattern is matched.

sentences = c("I like statistics", "I like bananas", "Estates and statues are expensive")
grep("stat", sentences)
## [1] 1 3

By setting the value parameter to TRUE, grep() will return the character element instead of its index.

grep("stat", sentences, value = TRUE)
## [1] "I like statistics"                 "Estates and statues are expensive"

Its logical brother grepl() you have seen before. It returns a logical value instead of the index or the element.

grepl("stat", sentences)
## [1]  TRUE FALSE  TRUE

regexpr() & gregexpr()

regexpr() searches for a pattern in a text and returns an integer vector with two attributes (also vectors). The main integer vector returned represents the position where the pattern was first matched in the text. Its attribute “match.length” is also an integer vector representing the length of the match (in this case “stat” is always of length 4).

If the pattern is not matched, both the main vector and the length attribute will have a value of -1.

The second attribute (“useBytes”) is always a logical vector of length one. It represents whether matching is done byte-by-byte (TRUE) or character-by-character (FALSE), but you may disregard it for now.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
regexpr("stat", sentences)
## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE

Note that, for the third sentence, regexpr() only returns the values for the first match (i.e., in “Estates”) but not those of the second match (i.e., in “statues”). For this reason, the function has a brother, gregexpr(), which has the same functionality but performs the matching on a global scale (hence the leading g). This means that the algorithm does not stop after its first match, but continues and reports all matches within the content string.

gregexpr() thus does not return a single vector, but a list of vectors. Each of these vectors reflects an input content string and has a length equal to the number of matches within that content. For example, the “stat” pattern is matched twice in our third sentence; therefore its vector has length 2, with the starting position of each match as well as their lengths.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
gregexpr("stat", sentences)
## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE
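
Not covered in the original post, but useful to know: the position vectors returned by regexpr() and gregexpr() can be passed to base R’s regmatches() to extract the matched substrings themselves, rather than just their positions.

regmatches(sentences, gregexpr("stat", sentences))
## [[1]]
## [1] "stat"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "stat" "stat"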

Parentheses

In order to explain how regexec() differs from gregexpr(), we first need to explain how parentheses work in regex. Most simply speaking, parentheses or round brackets (()) indicate groups. One of the advantages of groups is that logical tests can thus be conducted within regular expressions.

sentences 
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
grepl("like", sentences)
## [1]  TRUE  TRUE FALSE
grepl("are", sentences)
## [1] FALSE FALSE  TRUE
grepl("(are|like)", sentences)
## [1] TRUE TRUE TRUE

regexec()

However, these groups can also be useful to extract more detailed information from a regular expression. This is where regexec() comes in.

Like gregexpr(), regexec() returns a list of the same length as the content. This list includes vectors that reflect the starting positions of the overall match, as well as the matches corresponding to parenthesized subpatterns. Similarly, the attribute “match.length” reflects the lengths of each of the overall and submatches. In case no match is found, a -1 value is again returned.

The beauty of regexec() becomes clear when we split our pattern into two groups using parentheses: “(st)(at)”. As you can see below, both regexpr() and its global brother gregexpr() disregard this grouping and provide the same output as before – as you would expect for the pattern “stat”. In contrast, regexec() notes that we now have a global pattern (“stat”) as well as two subpatterns (“st” and “at”). For each of these, the function returns the starting positions as well as the pattern lengths.

sentences
## [1] "I like statistics"                 "I like bananas"                   
## [3] "Estates and statues are expensive"
regexpr("(st)(at)", sentences)
## [1]  8 -1  2
## attr(,"match.length")
## [1]  4 -1  4
## attr(,"useBytes")
## [1] TRUE
gregexpr("(st)(at)", sentences)
## [[1]]
## [1] 8
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  2 13
## attr(,"match.length")
## [1] 4 4
## attr(,"useBytes")
## [1] TRUE
regexec("(st)(at)", sentences)
## [[1]]
## [1]  8  8 10
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] 2 2 4
## attr(,"match.length")
## [1] 4 2 2
## attr(,"useBytes")
## [1] TRUE
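
As an aside of my own (not part of the original tutorial): combining regexec() with regmatches() returns the text of the overall match and of each parenthesized submatch, which is often more convenient than working with the positions directly.

regmatches(sentences, regexec("(st)(at)", sentences))
## [[1]]
## [1] "stat" "st"   "at"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "stat" "st"   "at"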

sub() & gsub()

The final two base regex functions are sub() and its global brother gsub(). These, very intuitively, substitute a matched pattern with a specified replacement and then return all inputs. For instance, we could replace “I” with “You” in our example sentences.

sub(pattern = "I", replacement = "You", sentences)
## [1] "You like statistics"               "You like bananas"                 
## [3] "Estates and statues are expensive"

Similarly, we could desire to replace all spaces by underscores. This would require a global search (i.e., gsub()), as sub() would stop after the first match.

sub(pattern = " ", replacement = "_", sentences)
## [1] "I_like statistics"                 "I_like bananas"                   
## [3] "Estates_and statues are expensive"
gsub(pattern = " ", replacement = "_", sentences)
## [1] "I_like_statistics"                 "I_like_bananas"                   
## [3] "Estates_and_statues_are_expensive"

This was the first part of my introduction to Regular Expressions in R. For more detailed information about all input parameters of each function, please consult the base R manual. In subsequent parts, I will introduce you to so-called Anchors, Character Classes, Groups, Ranges, and Quantifiers. These will allow you to perform more advanced searches and matches. Here, we will also elaborate on lazy, greedy, and possessive regular expressions, which further expand our search capability as well as flexibility.
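
Not part of this first post, but as a small taste of my own of what those topics look like: the caret (^) anchors a match to the start of a string, square brackets define a character class, and the plus sign (+) is a quantifier meaning “one or more”.

grepl("^stat", c("statistics", "estate"))
## [1]  TRUE FALSE
grepl("[0-9]+", c("R2D2", "banana"))
## [1]  TRUE FALSE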

In the end, I hope to provide you with an overview of several Regular Expressions that I have found extremely useful in my personal projects, and which should be valuable to anyone who conducts applied research (in organizations).

Writing your thesis with R Markdown

R Markdown is a great tool for integrating data analysis and report writing. Rosanna van Hespen wrote a great five-part blog guide on how to write your thesis in R Markdown:

  1. Getting started
  2. Text, Citations, & Equations
  3. Figures, Code, & Tables
  4. Putting it all together
  5. Layout

pix2code: teaching AI to build apps

Last May, Tony Beltramelli of Uizard Technologies presented his latest algorithm pix2code at the NIPS conference. Put simply, the algorithm looks at a picture of a graphical user interface (i.e., the layout of an app) and determines, via an iterative process, what the underlying code likely looks like.

Graphical user interface examples (Google Images)

Please watch Uizard’s pix2code demo video or the third-party summary at the bottom of this blog. My understanding is that pix2code is based on convolutional and recurrent neural networks (long explanation video) in combination with long short-term memory (short explanation video). Based on a single input image, pix2code can generate code that is 77% accurate, and it works for three of the larger platforms (i.e., iOS, Android and web-based technologies).

The input and output of pix2code

Obviously, this is groundbreaking technology. When further developed, pix2code not only increases the speed with which society is automated/robotized, but it also expands automation to more complex and highly needed tasks, such as programming and web/app development.

Here you can read the full academic paper on pix2code.

Below is the official demo reviewed by another data enthusiast with commentary and some additional food for thought.

Read here some of my other blogs on neural networks and robotization:

Robert Coombs and his application robot

Robert Coombs wanted to see whether he could land a new job. He was aware that, these days, organizations often employ applicant tracking systems to advance or reject incoming applications. Hence, Robert concluded that he had two challenges in his search for a new job:

  • He was up against leaders in their field, so his resume wouldn’t simply jump to the top of the pile.
  • Robots would read his application, along with those of his competition.

Being a tech enthusiast and having some programming skills, he decided to build his own application robot, capable of sending a customized cover letter and resume to the thousands of jobs posted online every day, in a matter of seconds. I strongly recommend you read his full story here, but these were his conclusions:

  • It’s not how you apply, it’s who you know. And if you don’t know someone, don’t bother.
  • Companies are trying to fill a position with minimal risk, not discover someone who breaks the mold.
  • The number of jobs you apply to has no correlation to whether you’ll be considered, and you won’t be considered for jobs you don’t get the chance to apply to.

What I found most amusing is that he A/B tested one normal-looking cover letter against a letter that admits right in the second sentence that it was being sent by a robot. “Now, one of those letters should have performed either a lot better or a lot worse than the other. For my purposes, I didn’t care which,” he states. But as far as he could tell from the results of this experiment, it seems that nobody even reads cover letters anymore – not even the robots supposedly used in applicant tracking systems.