Consider the following strings:
[1] "Mr. Evans" "Mrs. Evans" "Mr. Sproul" "Mrs. Sproul"
In words, if we wanted to obtain the last names (e.g., “Evans” for the first two), what pattern would we look for?
We can express these patterns formally with RegEx so that the engine can utilize them.
stringr RegEx Functions Preview
To demonstrate, we introduce the following functions from the stringr library:
str_detect(string, pattern) returns a logical indicating if pattern is found in string.
str_match(string, pattern) returns the first match of pattern found in string.
str_match_all(string, pattern) returns all matches of pattern found in string.
In the simplest case, we can express a finite sequence of characters to match against.
library(stringr)
words <- c("hotel", "hotel", "remotely", "biotelemetry")
str_detect(words, "otel") # otel is found in all of these strings
[1] TRUE TRUE TRUE TRUE
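To see how the three functions differ, a small sketch may help. The vector x below is made up for illustration and is not one of the original examples.

```r
library(stringr)

# Hypothetical input: the second string contains "otel" twice, the third not at all
x <- c("hotel", "motel hotel", "inn")

str_detect(x, "otel")      # one logical per string: TRUE TRUE FALSE
str_match(x, "otel")       # character matrix: first match per string, NA if none
str_match_all(x, "otel")   # list with one matrix per string, one row per match
```

Note that str_match() returns a matrix rather than a plain vector, so the matches are in its first column.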
Consider the following strings, each containing an ID:
[1] "ID: 4056" "ID: 5149" "ID: 4132" "ID: 5156" "ID: 5145" "ID: 5157" "ID: 5657"
Suppose we wanted only the IDs (not "ID: "). To express this as a pattern, we note that each ID consists of 4 digits.
So our expression should consist of 4 characters, each of type digit. We express this type with \\d (the backslash is doubled inside an R string) or [[:digit:]].
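A minimal sketch of matching four digits in a row. The variable name ids is an assumption; the source only shows the printed vector.

```r
library(stringr)

# `ids` is an assumed name for the vector printed above
ids <- c("ID: 4056", "ID: 5149", "ID: 4132", "ID: 5156",
         "ID: 5145", "ID: 5157", "ID: 5657")

# Four digit characters in a row
str_match(ids, "\\d\\d\\d\\d")

# The same pattern written with the POSIX class (note the double brackets)
str_match(ids, "[[:digit:]][[:digit:]][[:digit:]][[:digit:]]")
```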
This works similarly for other types of characters. Here are a few important cases:
Expression | Matches with |
---|---|
[:digit:] or \\d | digits |
[:alpha:] | letters |
[:lower:] | lowercase letters |
[:upper:] | uppercase letters |
[:punct:] | punctuation |
. | any character except a newline (\n) |
Note that the [:name:] classes must themselves appear inside square brackets, e.g. [[:alpha:]].
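The character types above can be tried out on a single string. This is only a sketch; the string x is made up for illustration.

```r
library(stringr)

# Hypothetical string mixing letters, digits and punctuation
x <- "Mr. Evans, ID: 4056"

str_match_all(x, "[[:upper:]]")[[1]][, 1]  # "M" "E" "I" "D"
str_match_all(x, "[[:punct:]]")[[1]][, 1]  # "." "," ":"
str_match(x, "[[:lower:]]")[, 1]           # "r" (first lowercase letter)
str_match(x, ".")[, 1]                     # "M" (first character of any type)
```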
We can express the desired number of repetitions of a pattern with:
Expression | Repetitions of <pattern> matched |
---|---|
<pattern>? | Zero or one |
<pattern>* | Zero or more |
<pattern>+ | One or more |
<pattern>{n} | Exactly n |
<pattern>{n,} | n or more |
<pattern>{n,m} | Between n and m (inclusive) |
We can therefore replace four consecutive \\d expressions with the more compact \\d{4}.
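A short sketch of the repetition operators. The vectors ids and x are illustrative assumptions, chosen so that each operator gives a different result.

```r
library(stringr)

# Two of the ID strings from earlier (variable name assumed)
ids <- c("ID: 4056", "ID: 5149")
identical(str_match(ids, "\\d\\d\\d\\d"), str_match(ids, "\\d{4}"))  # TRUE

# Hypothetical strings: "a" followed by zero to three "b"s
x <- c("a", "ab", "abb", "abbb")

str_match(x, "ab?")[, 1]      # "a"  "ab" "ab"  "ab"
str_match(x, "ab*")[, 1]      # "a"  "ab" "abb" "abbb"
str_match(x, "ab+")[, 1]      # NA   "ab" "abb" "abbb"
str_match(x, "ab{2}")[, 1]    # NA   NA   "abb" "abb"
str_match(x, "ab{2,}")[, 1]   # NA   NA   "abb" "abbb"
str_match(x, "ab{1,2}")[, 1]  # NA   "ab" "abb" "abb"
```

In each pattern the operator applies only to the character just before it (the b); grouping a longer pattern with parentheses would apply it to the whole group.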
Instead of specifying a character type, we can also specify that a character must be one of a set of allowable characters, written [<allowed characters>].
Suppose we’d like to evaluate a yes/no answer received from a user.
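One way this could look, as a sketch: the answers vector is hypothetical, and the pattern accepts either capitalization of the first letter.

```r
library(stringr)

# Hypothetical user responses
answers <- c("yes", "Yes", "no", "No")

# [Yy] allows either "Y" or "y" in the first position
str_detect(answers, "[Yy]es")  # TRUE TRUE FALSE FALSE
```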
We can also allow any character BUT those in a given set with [^<unallowed characters>].
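As a sketch, a negated set can flag strings that contain anything other than digits. The vector x is made up for illustration.

```r
library(stringr)

# Hypothetical inputs: only the first consists purely of digits
x <- c("4056", "40a6", "ID: 4056")

# Is there any character that is NOT one of the listed digits?
str_detect(x, "[^0123456789]")       # FALSE TRUE TRUE

# Which character is the first offender?
str_match(x, "[^0123456789]")[, 1]   # NA "a" "I"
```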
We can specify a range of allowable values. Some examples:
Expression | Matches |
---|---|
[a-z] | All lowercase letters |
[a-f] | All lowercase letters between a and f (inclusive) |
[A-K] | All capital letters between A and K (inclusive) |
[0-5] | All digits between 0 and 5 (inclusive) |
[0-5A-Ka-f] | Any character in the previous three ranges |
[^0-5A-Ka-f] | Any character not in the previous three ranges |
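The ranges in the table can be checked against single characters. This is a sketch with a made-up vector x.

```r
library(stringr)

# Hypothetical single-character inputs
x <- c("a", "g", "B", "M", "3", "7", "?")

str_detect(x, "[a-f]")         # TRUE  FALSE FALSE FALSE FALSE FALSE FALSE
str_detect(x, "[A-K]")         # FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
str_detect(x, "[0-5]")         # FALSE FALSE FALSE FALSE TRUE  FALSE FALSE
str_detect(x, "[0-5A-Ka-f]")   # TRUE  FALSE TRUE  FALSE TRUE  FALSE FALSE
str_detect(x, "[^0-5A-Ka-f]")  # FALSE TRUE  FALSE TRUE  FALSE TRUE  TRUE
```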
We can allow one pattern OR another with <pattern A>|<pattern B>.
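A sketch of alternation using the name strings from the start of this section. The variable name people is an assumption, and the two patterns are chosen to contrast grouped and ungrouped alternatives.

```r
library(stringr)

# The strings from the opening example (variable name assumed)
people <- c("Mr. Evans", "Mrs. Evans", "Mr. Sproul", "Mrs. Sproul")

# Grouped: "Mr" or "Mrs", then a literal period, then " Evans"
str_match(people, "(Mr|Mrs)\\. Evans")[, 1]
# "Mr. Evans" "Mrs. Evans" NA NA

# Ungrouped: either "Mr" on its own, or "Mrs. Evans"
str_match(people, "Mr|Mrs\\. Evans")[, 1]
# "Mr" "Mr" "Mr" "Mr"
```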
Note the use of parentheses in a pattern such as (Mr|Mrs)\\. Evans. Without them, Mr|Mrs\\. Evans would look for "Mr" or "Mrs\\. Evans".
Note also the \\ before the period. On its own, . is a special character (it is used to represent any character). To indicate we would like to use the actual character in the pattern, we must precede it with \\.
We can “anchor” our desired patterns to the front or end of the string with ^ and $, respectively.
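A minimal sketch of anchoring, assuming the ID strings from earlier are stored in a vector called ids (the name is an assumption). The last call returns a one-column character matrix holding the four digits at the end of each string.

```r
library(stringr)

# `ids` is an assumed name for the ID strings shown earlier
ids <- c("ID: 4056", "ID: 5149", "ID: 4132", "ID: 5156",
         "ID: 5145", "ID: 5157", "ID: 5657")

str_detect(ids, "^ID")    # every string starts with "ID": all TRUE
str_detect(ids, "^\\d")   # no string starts with a digit: all FALSE

# Four digits that sit at the very end of the string
str_match(ids, "\\d{4}$")
```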
[,1]
[1,] "4056"
[2,] "5149"
[3,] "4132"
[4,] "5156"
[5,] "5145"
[6,] "5157"
[7,] "5657"
We can check whether a pattern does (or doesn’t) appear before or after another pattern, without needing to include it in the resulting match. These checks are known as lookarounds.
Expression | Matches |
---|---|
<pattern A>(?=<pattern B>) | <pattern A> followed by <pattern B> |
<pattern A>(?!<pattern B>) | <pattern A> not followed by <pattern B> |
(?<=<pattern B>)<pattern A> | <pattern A> preceded by <pattern B> |
(?<!<pattern B>)<pattern A> | <pattern A> not preceded by <pattern B> |
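A sketch of the four lookarounds. Both vectors are made up for illustration; in every case the text inside the lookaround is checked but left out of the match.

```r
library(stringr)

# Hypothetical inputs
titles <- c("Mr. Evans", "Mrs. Evans")
refs   <- c("ID: 4056", "Ref: 4056")

str_match(titles, "Mr(?=s)")[, 1]          # NA   "Mr"  ("Mr" followed by "s")
str_match(titles, "Mr(?!s)")[, 1]          # "Mr" NA    ("Mr" not followed by "s")
str_match(refs, "(?<=ID: )\\d{4}")[, 1]    # "4056" NA  (digits preceded by "ID: ")
str_match(refs, "(?<!ID: )\\d{4}")[, 1]    # NA "4056"  (digits not preceded by "ID: ")
```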
[1] "Saab 900 SE Turbo" "Plymouth Barracuda"
[3] "Nissan Skyline 2000GT" "BMW 2002"
[5] "Chevrolet Corvette Stingray" "Porsche 928"
How do we obtain the car manufacturer?
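One possible approach, offered as a sketch rather than the intended answer: treat the manufacturer as the run of letters at the start of each string. The variable name cars is an assumption.

```r
library(stringr)

# `cars` is an assumed name for the vector printed above
cars <- c("Saab 900 SE Turbo", "Plymouth Barracuda", "Nissan Skyline 2000GT",
          "BMW 2002", "Chevrolet Corvette Stingray", "Porsche 928")

# One or more letters anchored to the start of the string
str_match(cars, "^[[:alpha:]]+")[, 1]
# "Saab" "Plymouth" "Nissan" "BMW" "Chevrolet" "Porsche"
```

This assumes every manufacturer name is a single word made up only of letters, which holds for these six strings.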
How do we obtain the cars whose model contains a number?
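One possible approach, as a sketch: since none of these manufacturer names contains a digit, detecting a digit anywhere in the string is enough. The variable name cars is an assumption.

```r
library(stringr)

# `cars` is an assumed name for the vector printed above
cars <- c("Saab 900 SE Turbo", "Plymouth Barracuda", "Nissan Skyline 2000GT",
          "BMW 2002", "Chevrolet Corvette Stingray", "Porsche 928")

# Keep the strings in which at least one digit is found
cars[str_detect(cars, "\\d")]
# "Saab 900 SE Turbo" "Nissan Skyline 2000GT" "BMW 2002" "Porsche 928"
```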
How do we obtain the cars whose model does not contain a number?
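One possible approach, as a sketch: require that every character from start to end is a non-digit, combining a negated range with both anchors. The variable name cars is an assumption.

```r
library(stringr)

# `cars` is an assumed name for the vector printed above
cars <- c("Saab 900 SE Turbo", "Plymouth Barracuda", "Nissan Skyline 2000GT",
          "BMW 2002", "Chevrolet Corvette Stingray", "Porsche 928")

# From start (^) to end ($), one or more characters that are not digits
cars[str_detect(cars, "^[^0-9]+$")]
# "Plymouth Barracuda" "Chevrolet Corvette Stingray"
```

Negating the logical result of str_detect(cars, "\\d") with ! would give the same selection.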
How do we obtain the car model?
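One possible approach, as a sketch: take everything that comes after the first space, using a lookbehind so the space itself is not part of the match. The variable name cars is an assumption.

```r
library(stringr)

# `cars` is an assumed name for the vector printed above
cars <- c("Saab 900 SE Turbo", "Plymouth Barracuda", "Nissan Skyline 2000GT",
          "BMW 2002", "Chevrolet Corvette Stingray", "Porsche 928")

# One or more characters preceded by a space; the first such position wins
str_match(cars, "(?<= ).+")[, 1]
# "900 SE Turbo" "Barracuda" "Skyline 2000GT" "2002" "Corvette Stingray" "928"
```

As with the manufacturer, this relies on the manufacturer being exactly the first word of each string.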