Regular expressions

Course length: 2 days (16 hours)

Description: Regular expressions (“regexps”) make it possible to find patterns in text. Whether you’re trying to find all of the URLs in a document, the IP addresses in a logfile, or the telephone numbers in an address book, regular expressions can be invaluable, which is why they are included in most modern programming languages and platforms, such as Python, Ruby, and JavaScript, as well as .NET, Java, and C++. Many utilities, such as Unix’s famous “grep” command, owe much of their popularity to regular expressions. However powerful regular expressions might be, they are also famously difficult to write, and even more difficult to read. Many experienced programmers find themselves frustrated by the syntax of regular expressions, and either avoid them entirely or use pre-packaged recipes they find on the Internet.
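
As a small illustration of the kind of search described above, the sketch below pulls IPv4-looking addresses out of a few log lines. The log lines and variable names are invented for this example, and the pattern is deliberately loose: it would also accept out-of-range values such as 999.1.1.1.

```python
import re

# Hypothetical log lines, made up for this example.
log_lines = [
    "192.168.0.1 - - GET /index.html 200",
    "10.0.0.25 - - POST /login 302",
    "malformed line without an address",
]

# Four groups of 1-3 digits separated by literal dots.
# This is a loose approximation of an IPv4 address, not a validator.
ip_pattern = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

addresses = []
for line in log_lines:
    # findall returns every non-overlapping match found in the line.
    addresses.extend(ip_pattern.findall(line))

print(addresses)
```

Even this short pattern shows why regexps can be hard to read: the dots must be escaped, and the repeated part is wrapped in a non-capturing group so that findall returns whole addresses.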

This course introduces regular expressions and provides numerous insights into their many features and uses. It will also point out differences between dialects of regular expressions, ways in which they should (and shouldn’t) be used, and advanced techniques such as named groups and lookahead/lookbehind.
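
To give a flavor of the advanced techniques mentioned here, the following sketch uses Python's re module for one named-group pattern and one lookbehind/lookahead pattern. The sample strings and group names are assumptions chosen for illustration, not course material.

```python
import re

# Named groups: label parts of the match instead of counting parentheses.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_pattern.search("Released on 2021-03-15.")
date_parts = m.groupdict()

# Lookbehind and lookahead: match digits that follow "$" and precede " USD",
# without including either of them in the matched text.
price_pattern = re.compile(r"(?<=\$)\d+(?= USD)")
prices = price_pattern.findall("Costs $40 USD or $55 USD, not 60 EUR.")

print(date_parts)
print(prices)
```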

By the end of the course, participants should feel comfortable using regular expressions to find and analyze text.
A large part of the course will be hands-on exercises, which will help participants to learn and understand the regular expression syntax.

Audience: This course is aimed at experienced programmers who wish to unlock the power of regular expressions in their day-to-day work. While nearly all exercises will be in the Python language, little or no knowledge of Python is necessary; the course will begin with a very brief introduction to the Python features needed to do the exercises.

Course syllabus

Minimal Python for text processing
- Strings
- Lists
- Loops
- Files
Overview of regular expressions
- Python’s “re” module
- re.match
- re.search
- re.findall
- match objects
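
A minimal sketch of how these functions differ, using a made-up sample string: re.match only tries the start of the string, re.search finds the first match anywhere, and re.findall collects all matches. The first two return a match object (or None), which carries the matched text and its position.

```python
import re

text = "cat catalog concat"

# re.match only tries at the very start of the string.
at_start = re.match(r"cat", text)        # a match object
not_at_start = re.match(r"log", text)    # None, since "log" is not at position 0

# re.search scans forward and returns the first match anywhere.
first = re.search(r"log", text)

# re.findall returns every non-overlapping match as a list of strings.
all_cats = re.findall(r"cat", text)

print(at_start.group(), at_start.span())
print(not_at_start)
print(first.group(), first.start(), first.end())
print(all_cats)
```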
Characters and metacharacters
Multiple runs of a character
- +
- *
- {min,max}
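
The three repetition operators can be compared side by side. The sketch below checks a few invented sample strings against each one; re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

samples = ["ac", "abc", "abbc", "abbbbc"]

patterns = [
    r"ab+c",       # one or more "b"
    r"ab*c",       # zero or more "b"
    r"ab{2,3}c",   # at least two and at most three "b"
]

results = {}
for pattern in patterns:
    # Keep only the samples that the pattern matches in full.
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]

for pattern, matched in results.items():
    print(pattern, matched)
```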
Debugging regexps
Character classes
- []
- ^
- $
- -
Built-in character classes
- \w \W
- \s \S
- \d \D
Greediness
Anchors
- Start/end of line
- Start/end of string
- Start/end of word
Options
- Case
- Line endings
- Extended regexps with comments
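
In Python, options like these are passed as flags. The sketch below assumes that the three items map to re.IGNORECASE, re.MULTILINE, and re.VERBOSE respectively; the sample text is invented.

```python
import re

text = "First line\nsecond line\nTHIRD line"

# re.IGNORECASE: letters match regardless of case.
case_hits = re.findall(r"third", text, re.IGNORECASE)

# re.MULTILINE: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# re.VERBOSE: whitespace in the pattern is ignored and "#" starts a comment.
time_pattern = re.compile(r"""
    (\d{1,2})   # hours
    :           # separator
    (\d{2})     # minutes
""", re.VERBOSE)
time_match = time_pattern.search("Meeting at 9:30 today")

print(case_hits)
print(line_starts)
print(time_match.groups())
```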
Alternation
- Parentheses for combining
- Parentheses for capturing
- Non-capturing parentheses
- Named groups
- Backreferences
Raw strings, backslashes, and regexps
Lookahead/lookbehind
Conditionals
Handling Unicode
- Bytes vs. characters
- Matching Unicode characters
- Sets of Unicode characters
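
A short sketch of the bytes-versus-characters distinction in Python 3, using invented sample text. With a str pattern, \w is Unicode-aware by default; with a bytes pattern, or with the re.ASCII flag, it only matches ASCII word characters.

```python
import re

# "naïve café", written with escapes so the code points are unambiguous.
words = "na\u00efve caf\u00e9"

# str pattern: \w matches Unicode word characters.
str_words = re.findall(r"\w+", words)

# bytes pattern against the UTF-8 encoding: \w is ASCII-only, so the
# multi-byte encodings of the accented letters break the words apart.
bytes_words = re.findall(rb"\w+", words.encode("utf-8"))

# re.ASCII restricts \w to ASCII for str patterns as well.
ascii_words = re.findall(r"\w+", words, re.ASCII)

# A set of Unicode characters given as a code point range
# (here U+0370 to U+03FF, the Greek and Coptic block).
greek = re.findall(r"[\u0370-\u03FF]+", "alpha \u03b1, beta \u03b2")

print(str_words)
print(bytes_words)
print(ascii_words)
print(greek)
```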
Compiling regexps
Replacing regexps
Match context
Development and debugging strategies
Regexps in Unix