Regular Expressions basics

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • Use a few basic regular expressions.

Objectives
  • Understand the different elements of a regular expression

Scenario

You’re working with a researcher who needs to locate information within a dataset she has collected. The researcher’s data has been transcribed from field notebooks. Data was collected by a number of different grad students, and consists of site names, dates, and instrument readings.

Each grad student’s field notebook is saved as its own data file. There is little consistency between files, because each grad student recorded their information differently. Let’s take a look:

Notebook 1:

Baker 1       2009-11-17       1223.0
Baker 1       2010-06-24       1122.7
Baker 2       2009-07-24       2819.0
Baker 2       2010-08-25       2971.6
Baker 1       2011-01-05       1410.0
Baker 2       2010-09-04       4671.6

Notebook 1 uses a single tab between each column as a separator. There are also spaces in the site names. The dates are written in the international standard format.
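As a rough sketch of how this layout could be picked apart, the Python snippet below matches two of the Notebook 1 lines with a single regular expression. The `\t` separators in the sample strings are assumed from the description above, and the names `lines`, `pattern`, and `records` are illustrative only.

```python
import re

# Two sample lines from Notebook 1. The "\t" separators are an assumption
# based on the description (a single tab between each column).
lines = [
    "Baker 1\t2009-11-17\t1223.0",
    "Baker 2\t2010-08-25\t2971.6",
]

# site name (anything except a tab, so spaces are allowed),
# then an ISO date split into year-month-day, then the reading
pattern = re.compile(r"^([^\t]+)\t(\d{4})-(\d{2})-(\d{2})\t(\d+\.\d+)$")

records = []
for line in lines:
    m = pattern.match(line)
    if m:
        site, year, month, day, reading = m.groups()
        records.append((site, int(year), int(month), int(day), float(reading)))

print(records)
# [('Baker 1', 2009, 11, 17, 1223.0), ('Baker 2', 2010, 8, 25, 2971.6)]
```

Because the site name is defined as "anything that is not a tab", the space in "Baker 1" does not get in the way.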

Notebook 2:

Davison/May 23, 2010/1724.7
Pertwee/May 24, 2010/2103.8
Davison/June 19, 2010/1731.9
Davison/July 6, 2010/2010.7
Pertwee/Aug 4 2010/1731.3
Davison/Apr 22, 201/2122.2
Pertwee/Sept 3, 2010/3981.0

Notebook 2 uses slashes as separators, and there are no spaces in the site names. Dates are in a non-standard format.
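One possible way to handle this layout is sketched below in Python. The pattern is only one guess at what a "well-formed" Notebook 2 line looks like (a month name, a one- or two-digit day, an optional comma, and a four-digit year); the names `notebook2`, `matched`, and `unmatched` are illustrative.

```python
import re

# The Notebook 2 lines exactly as listed above.
notebook2 = [
    "Davison/May 23, 2010/1724.7",
    "Pertwee/May 24, 2010/2103.8",
    "Davison/June 19, 2010/1731.9",
    "Davison/July 6, 2010/2010.7",
    "Pertwee/Aug 4 2010/1731.3",
    "Davison/Apr 22, 201/2122.2",
    "Pertwee/Sept 3, 2010/3981.0",
]

# site / month-name day[,] four-digit-year / reading
pattern = re.compile(r"^([A-Za-z]+)/([A-Za-z]+) (\d{1,2}),? (\d{4})/(\d+\.\d+)$")

matched = []
unmatched = []
for line in notebook2:
    m = pattern.match(line)
    if m:
        matched.append(m.groups())
    else:
        unmatched.append(line)

print(len(matched))   # 6
print(unmatched)      # ['Davison/Apr 22, 201/2122.2']
```

The optional comma (`,?`) lets "Aug 4 2010" through, while the strict `\d{4}` flags the line whose year has only three digits so that it can be checked by hand.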

Practice

Basics

Characters

Modifiers

Metacharacters & operators

Certain characters have a special meaning in regular expressions: \ | ( ) . [ ] * + ? { } ^ $

These are used in combination with text characters to construct regular expressions.
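To illustrate why the special meaning matters, here is a small Python sketch comparing an unescaped full stop with an escaped one. The filenames are made up for the example, and `re.escape` is Python's standard helper for escaping a literal string.

```python
import re

# Hypothetical filenames, for illustration only.
filenames = ["notes.txt", "notes_txt", "data.csv"]

# Unescaped: "." matches any single character, so "notes_txt" also matches.
unescaped = [name for name in filenames if re.search(r".txt$", name)]

# Escaped: "\." matches only a literal full stop.
escaped = [name for name in filenames if re.search(r"\.txt$", name)]

print(unescaped)  # ['notes.txt', 'notes_txt']
print(escaped)    # ['notes.txt']

# re.escape() adds the backslashes needed to treat a string literally.
literal = re.escape("1+1=2?")
print(re.fullmatch(literal, "1+1=2?") is not None)  # True
```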

Operator   Description
\          Escape character: use it to include a metacharacter as a literal in a regular expression, e.g. \.txt
|          OR (alternation), e.g. wom(a|e)n
( )        Group, e.g. (....)-(..)-(..) for a date
.          Match any single character, e.g. wom.n
[abc]      Match any one of a, b, or c, e.g. [btr]ent
[a-c]      Match any one character in the range a to c, e.g. [a-z]ent
*          The preceding item is matched zero or more times, e.g. teen[a-z]*
+          The preceding item is matched one or more times, e.g. organi.+
?          The preceding item is optional and is matched at most once, e.g. colou?r. Placed after a quantifier, it makes that quantifier ‘lazy’, e.g. .+?
{N}        The preceding item is matched exactly N times, e.g. [0-9]{4}
{N,}       The preceding item is matched N or more times, e.g. [0-9]{4,}
{N,M}      The preceding item is matched at least N times, but not more than M times, e.g. [0-9]{4,6}
^          Anchor: matches only at the start of the string, e.g. ^b. When used as the first character within a character set, it negates the set, i.e. matches any character not in the set, e.g. [^abc]
$          Anchor: matches only at the end of the string, e.g. b$
\b         Anchor: matches at a “word boundary” (a zero-length position), i.e. the transition between word and non-word characters. This allows you to perform “whole word only” searches, e.g. \bword\b
\w         Shorthand character class: “word”. Matches the ASCII word characters [A-Za-z0-9_].
\s         Shorthand character class: “whitespace”. Matches whitespace characters, including space, tab, line break, and form feed.
\d         Shorthand character class: “digit”. Matches numeric characters [0-9].
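The operators in the table above can be tried out with Python's built-in `re` module. The following is a minimal sketch, not an exhaustive tour: the word lists are invented for the example, and `matching` is a hypothetical helper that keeps only the words a pattern matches in full.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

# Alternation, wildcard, and character sets
print(matching(r"wom(a|e)n", ["woman", "women", "womxn"]))     # ['woman', 'women']
print(matching(r"wom.n", ["woman", "women", "womxn"]))         # ['woman', 'women', 'womxn']
print(matching(r"[btr]ent", ["bent", "tent", "rent", "dent"])) # ['bent', 'tent', 'rent']

# Quantifiers
print(matching(r"colou?r", ["color", "colour", "colouur"]))    # ['color', 'colour']
print(matching(r"teen[a-z]*", ["teen", "teenager", "tee"]))    # ['teen', 'teenager']
print(matching(r"[0-9]{4}", ["201", "2010", "20105"]))         # ['2010']
print(matching(r"[0-9]{4,}", ["201", "2010", "20105"]))        # ['2010', '20105']

# Groups capture parts of a match
print(re.match(r"(....)-(..)-(..)", "2009-11-17").groups())    # ('2009', '11', '17')

# Greedy versus lazy quantifiers
print(re.findall(r"<.+>", "<a><b>"))                           # ['<a><b>']
print(re.findall(r"<.+?>", "<a><b>"))                          # ['<a>', '<b>']

# Anchors and word boundaries
print(re.search(r"^b", "baker") is not None)                   # True
print(re.search(r"b$", "baker") is not None)                   # False
print(re.findall(r"\bten\b", "ten tent often ten"))            # ['ten', 'ten']

# Negated set and the digit shorthand
print(re.findall(r"[^abc]", "abcde"))                          # ['d', 'e']
print(re.findall(r"\d+", "Baker 2 2010-09-04"))                # ['2', '2010', '09', '04']
```

Note that regex engines differ in the details. In Python 3, for instance, `\w` and `\d` also match non-ASCII letters and digits in text strings unless the `re.ASCII` flag is given.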

Building your first expressions

Key Points

  • Match literal characters by using those characters as a regex

  • Add modifiers to change default behavior, e.g. case sensitivity

  • Operators
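
The first two key points can be illustrated with a short Python sketch. The sample line is taken from Notebook 1 with an assumed tab separator, and in Python a modifier is passed as a "flag" such as `re.IGNORECASE`; other tools spell modifiers differently.

```python
import re

# A Notebook 1 line; the "\t" separators are assumed.
line = "Baker 1\t2009-11-17\t1223.0"

# Plain characters are already a regex: they match themselves.
literal_hit = re.search(r"Baker", line) is not None

# Matching is case-sensitive by default, so lowercase "baker" is not found...
lowercase_hit = re.search(r"baker", line) is not None

# ...unless a modifier switches case sensitivity off.
ignore_case_hit = re.search(r"baker", line, re.IGNORECASE) is not None

print(literal_hit, lowercase_hit, ignore_case_hit)  # True False True
```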