Lab 12: RegEx Intro

Introduction

Regular expressions are a declarative language for expressing patterns in strings.

  • Often referred to as RegEx (“ree-jex” or “reg-ecks”).
  • The RegEx engine (a program) will read your pattern(s) and attempt to find matches in any strings you provide.

Declarative Patterns

Consider the following strings:

[1] "Mr. Evans"   "Mrs. Evans"  "Mr. Sproul"  "Mrs. Sproul"

In words, if we wanted to obtain the last names (e.g., “Evans” for the first two), what pattern would we look for?

  • The second capital letter followed by letters?
  • A number of letters which come after a space?
  • All connected letters that touch the end of the string?

RegEx Formalizes This

We can express these patterns formally with RegEx so that the engine can utilize them.

For a preview, we can formalize the last expression, “all connected letters which touch the end of the string”, to obtain the last names:

library(stringr)
names <- c("Mr. Evans", "Mrs. Evans", "Mr. Sproul", "Mrs. Sproul")
str_extract(names, "[:alpha:]+$")
[1] "Evans"  "Evans"  "Sproul" "Sproul"

stringr RegEx Functions Preview

To demonstrate, we introduce the following functions from the stringr package:

  • str_detect(string, pattern) returns a logical vector indicating whether pattern is found in each string.
  • str_match(string, pattern) returns the first match of pattern in each string, as a character matrix with one row per string.
  • str_match_all(string, pattern) returns all matches of pattern in each string, as a list of character matrices.
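
As a quick illustration of the three return shapes, here is a minimal sketch using a literal pattern. The vector `fruit` is a made-up example and not part of the lab data.

```r
library(stringr)

# Hypothetical example strings (not from the lab)
fruit <- c("banana", "apple")

# One logical value per string
str_detect(fruit, "an")     # TRUE FALSE

# Character matrix, one row per string; NA where nothing matched
str_match(fruit, "an")      # rows: "an", NA

# List with one matrix per string; one row per match
str_match_all(fruit, "an")  # [[1]] has rows "an", "an"; [[2]] has zero rows
```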

Substring Matches

In the simplest case, we can express a finite sequence of characters to match against.

words <- c("hotel", "hotel", "remotely", "biotelemetry")
str_detect(words, "otel") # otel is found in all of these strings
[1] TRUE TRUE TRUE TRUE

  • While helpful, matching exact substrings is insufficient for most patterns.
  • For this reason, we introduce the ability to specify a type of character (instead of an exact one such as 'a').

Matching Against Digits

Consider the following strings, each containing an ID:

[1] "ID: 4056" "ID: 5149" "ID: 4132" "ID: 5156" "ID: 5145" "ID: 5157" "ID: 5657"

Suppose we wanted only the IDs (not "ID: "). To express this as a pattern, we note that each ID consists of 4 digits.

So our expression should consist of 4 characters, each of type digit. We express this type with \\d (written with a doubled backslash inside an R string) or [:digit:]:

ids <- c("ID: 4056", "ID: 5149", "ID: 4132", "ID: 5156", "ID: 5145", "ID: 5157", "ID: 5657")
str_match(ids, "[:digit:][:digit:][:digit:][:digit:]")
     [,1]  
[1,] "4056"
[2,] "5149"
[3,] "4132"
[4,] "5156"
[5,] "5145"
[6,] "5157"
[7,] "5657"

RegEx Character Types

This works similarly for other types of characters. Here are a few important cases:

Expression          Matches
[:digit:] or \\d    digits
[:alpha:]           letters
[:lower:]           lowercase letters
[:upper:]           uppercase letters
[:punct:]           punctuation
.                   any character except a newline (\n)
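
To make the table concrete, the following sketch applies a few of these character types to the names from earlier. It assumes stringr is installed. The `[, 1]` indexing simply pulls the match column out of the matrix that str_match returns.

```r
library(stringr)

names <- c("Mr. Evans", "Mrs. Evans", "Mr. Sproul", "Mrs. Sproul")

# An uppercase letter followed by one or more lowercase letters: the title
str_match(names, "[:upper:][:lower:]+")[, 1]  # "Mr" "Mrs" "Mr" "Mrs"

# Every name contains a punctuation character (the period)
str_detect(names, "[:punct:]")                # TRUE TRUE TRUE TRUE

# . stands for any single character, so "M.s" only fits "Mrs"
str_detect(names, "M.s")                      # FALSE TRUE FALSE TRUE
```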

RegEx Repetitions

We can express the desired number of repetitions of a pattern with:

Expression        Repetitions of <pattern> matched
<pattern>?        zero or one
<pattern>*        zero or more
<pattern>+        one or more
<pattern>{n}      exactly n
<pattern>{n,}     n or more
<pattern>{n,m}    between n and m (inclusive)
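
The quantifiers can be compared side by side on a small example. This is a sketch only, and the vector `spellings` is made up for illustration. Each quantifier applies to the single character `u` that precedes it.

```r
library(stringr)

# Hypothetical example strings (not from the lab)
spellings <- c("color", "colour", "colouur")

str_detect(spellings, "colou?r")    # zero or one u:  TRUE TRUE FALSE
str_detect(spellings, "colou*r")    # zero or more u: TRUE TRUE TRUE
str_detect(spellings, "colou+r")    # one or more u:  FALSE TRUE TRUE
str_detect(spellings, "colou{2}r")  # exactly two u:  FALSE FALSE TRUE

# Quantifiers are greedy: they take as many repetitions as allowed
str_match("aaaaa", "a{2,3}")[, 1]   # "aaa"
str_match("aaaaa", "a{2,}")[, 1]    # "aaaaa"
```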

Matching Against Digits (Repetition)

We can therefore replace

str_match(ids, "[:digit:][:digit:][:digit:][:digit:]")

with the following:

str_match(ids, "[:digit:]{4}") # could also do [:digit:]+
     [,1]  
[1,] "4056"
[2,] "5149"
[3,] "4132"
[4,] "5156"
[5,] "5145"
[6,] "5157"
[7,] "5657"

Matching Against Allowable Characters

Instead of specifying a character type, we can specify a set of allowable characters.

List these characters one after another, without spaces, inside square brackets ([]).

str_match(ids, "[0123456789]{4}") # equivalent to [:digit:]{4}

Yes/No Example

Suppose we’d like to evaluate a yes/no answer received from a user.

eval_answer <- function(answer) {
  if (str_detect(answer, "[Yy]+[Ee]+[Ss]+"))
    TRUE
  else if (str_detect(answer, "[Nn]+[Oo]+"))
    FALSE
  else
    NA
} 

But wait…

  • What happens if the user enters "Well, yes but actually no."?
  • What about "Yesterday, but not today..." or "I dunno"?
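
To see what goes wrong, we can run the function on those inputs. The sketch below repeats the definition of eval_answer from above so that it runs on its own. The patterns match anywhere in the string, so a "yes" or "no" hidden inside a longer word still counts.

```r
library(stringr)

# Same function as above, repeated so this block is self-contained
eval_answer <- function(answer) {
  if (str_detect(answer, "[Yy]+[Ee]+[Ss]+"))
    TRUE
  else if (str_detect(answer, "[Nn]+[Oo]+"))
    FALSE
  else
    NA
}

eval_answer("Well, yes but actually no.")   # TRUE  (the "yes" check runs first)
eval_answer("Yesterday, but not today...")  # TRUE  ("Yes" inside "Yesterday")
eval_answer("I dunno")                      # FALSE ("nno" inside "dunno")
```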

Unallowed Characters

We can also allow any character except those in a given set with [^<unallowed characters>].

grades <- "Math Grade: B, English Grade: C, Statistics Grade: F, History Grade: D"
str_match_all(grades, "[:alpha:]+ Grade: [^DF]")
[[1]]
     [,1]              
[1,] "Math Grade: B"   
[2,] "English Grade: C"

Ranges

We can specify a range of allowable values. Some examples:

Expression      Matches
[a-z]           any lowercase letter
[a-f]           any lowercase letter between a and f (inclusive)
[A-K]           any capital letter between A and K (inclusive)
[0-5]           any digit between 0 and 5 (inclusive)
[0-5A-Ka-f]     any character matched by one of the previous three
[^0-5A-Ka-f]    any character not matched by the previous three
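
A short sketch of these ranges in use follows. The input strings are made up for illustration and are not part of the lab data.

```r
library(stringr)

# A capital letter in A-K followed by one or more digits in 0-5
str_match("Room B12", "[A-K][0-5]+")[, 1]    # "B12"

# The negated set stops at the first character that IS in one of the ranges
str_match("Room B12", "[^0-5A-Ka-f]+")[, 1]  # "Room " (including the space)

# Four consecutive letters, all between a and f
str_detect(c("cafe", "zoo"), "[a-f]{4}")     # TRUE FALSE
```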

OR Operator

We can allow one pattern OR another with <pattern A>|<pattern B>.

str_detect(names, "(Mr|Mrs)\\. Evans")
[1]  TRUE  TRUE FALSE FALSE

Note the use of parentheses: without them, the pattern would match either "Mr" or "Mrs\\. Evans".

  • Note the use of \\ before the period.
  • Some characters act as symbols (for example, . represents any character). To match the literal character instead, we must precede it with \\.
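
Both points can be checked directly. The sketch below shows what happens when the parentheses or the escape are left out. The string "Mrs Evans" (with no period) is a made-up test case.

```r
library(stringr)

names <- c("Mr. Evans", "Mrs. Evans", "Mr. Sproul", "Mrs. Sproul")

# Without parentheses the alternatives are "Mr" and "Mrs\\. Evans",
# and every name contains "Mr"
str_detect(names, "Mr|Mrs\\. Evans")   # TRUE TRUE TRUE TRUE

# Unescaped, . matches any character, here the "s" in "Mrs"
str_detect("Mrs Evans", "Mr. Evans")   # TRUE

# Escaped, \\. only matches a literal period
str_detect("Mrs Evans", "Mr\\. Evans") # FALSE
```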

Anchors

We can “anchor” our desired patterns to the front or end of the string with ^ and $, respectively.

str_match(ids, "[:digit:]{4}$") # all IDs touch the end of the string
     [,1]  
[1,] "4056"
[2,] "5149"
[3,] "4132"
[4,] "5156"
[5,] "5145"
[6,] "5157"
[7,] "5657"
str_match(ids, "^[:digit:]{4}") # none touch the front
     [,1]
[1,] NA  
[2,] NA  
[3,] NA  
[4,] NA  
[5,] NA  
[6,] NA  
[7,] NA  

Don’t confuse this ^ with the one found inside square brackets, which negates the set!
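
Using both anchors together forces the pattern to cover the whole string, which is one possible way to tighten a yes/no check like the one shown earlier. This is a sketch, and the vector `answers` is made up for illustration.

```r
library(stringr)

# Hypothetical user answers (not from the lab)
answers <- c("yes", "Yes", "Yesterday", "oh yes")

str_detect(answers, "^[Yy][Ee][Ss]")   # starts with yes:  TRUE TRUE TRUE FALSE
str_detect(answers, "[Yy][Ee][Ss]$")   # ends with yes:    TRUE TRUE FALSE TRUE
str_detect(answers, "^[Yy][Ee][Ss]$")  # is exactly yes:   TRUE TRUE FALSE FALSE
```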

Lookahead, Lookbehind

We can check whether a pattern does (or doesn’t) appear before or after another pattern, without including it in the resulting match.

Expression                     Matches <pattern A> when it is
<pattern A>(?=<pattern B>)     followed by <pattern B>
<pattern A>(?!<pattern B>)     not followed by <pattern B>
(?<=<pattern B>)<pattern A>    preceded by <pattern B>
(?<!<pattern B>)<pattern A>    not preceded by <pattern B>
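
The sketch below tries each of the four forms once, reusing the IDs and grades from earlier sections where possible. The "$40" strings are made up for illustration. In every case the text inside the lookaround is checked but left out of the match.

```r
library(stringr)

ids <- c("ID: 4056", "ID: 5149")
grades <- "Math Grade: B, English Grade: C, Statistics Grade: F, History Grade: D"

# Lookbehind: digits preceded by "ID: "
str_match(ids, "(?<=ID: )[:digit:]+")[, 1]                     # "4056" "5149"

# Lookahead: subject names followed by a grade other than D or F
str_match_all(grades, "[:alpha:]+(?= Grade: [^DF])")[[1]][, 1] # "Math" "English"

# Negative lookahead: "Mr" not followed by "s"
str_detect(c("Mr. Evans", "Mrs. Evans"), "Mr(?!s)")            # TRUE FALSE

# Negative lookbehind: "40" not preceded by a dollar sign
str_detect(c("$40", "40"), "(?<!\\$)40")                       # FALSE TRUE
```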

RegEx: Practice

[1] "Saab 900 SE Turbo"           "Plymouth Barracuda"         
[3] "Nissan Skyline 2000GT"       "BMW 2002"                   
[5] "Chevrolet Corvette Stingray" "Porsche 928"                

How do we obtain the car manufacturer?

How do we obtain the cars whose model contains a number?

How do we obtain the cars whose model does not contain a number?

How do we obtain the car model?