Regular expressions

Course length: 2 days (16 hours)

Description: Regular expressions (“regexps”) make it possible to find patterns in text. Whether you’re trying to find all of the URLs in a document, the IP addresses in a logfile, or the telephone numbers in an address book, regular expressions can be invaluable, which is why they are included in most modern programming languages, such as Python, Ruby, and JavaScript, as well as .NET, Java, and C++. Many utilities, such as Unix’s famous “grep” command, are popular because they use regular expressions. However powerful regular expressions might be, they are also famously difficult to write, and even more difficult to read. Many experienced programmers find themselves frustrated by the syntax of regular expressions, and either avoid them entirely or rely on pre-packaged recipes they find on the Internet.
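
As a small illustration of the kind of task described above, the following sketch pulls IPv4-style addresses out of a few log lines. The log text and the name ip_pattern are invented for this example, and the pattern is deliberately loose: it does not check that each number is 255 or less.

```python
import re

# Hypothetical log lines, invented for this illustration.
log = """\
192.168.0.17 - - [10/Oct/2023:13:55:36] "GET /index.html" 200
10.0.0.5 - - [10/Oct/2023:13:56:01] "GET /about.html" 404
"""

# One to three digits, then three more dot-separated groups of digits.
# Deliberately loose: it would also accept 999.999.999.999.
ip_pattern = r"\b\d{1,3}(?:\.\d{1,3}){3}\b"

addresses = re.findall(ip_pattern, log)
print(addresses)  # ['192.168.0.17', '10.0.0.5']
```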

This course introduces regular expressions, and provides numerous insights into their many features and uses. It will also point to differences between dialects of regular expressions, ways in which they should (and shouldn’t) be used, and advanced techniques such as named groups and lookahead/lookbehind.
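
To give a flavor of the advanced techniques mentioned here, the sketch below uses Python’s “re” module to show a named group, a lookahead, and a lookbehind. The sample strings and variable names are made up for illustration; other regexp dialects spell some of these constructs differently.

```python
import re

# Named groups: label parts of a match instead of counting parentheses.
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date_pattern.search("Released on 2024-03-15.")
year, month, day = m.group("year"), m.group("month"), m.group("day")
print(year, month, day)  # 2024 03 15

# Lookahead: match digits only when "px" follows, without consuming "px".
sizes = re.findall(r"\d+(?=px)", "width: 120px; height: 80px; z-index: 3")
print(sizes)  # ['120', '80']

# Lookbehind: match digits only when a dollar sign comes right before them.
prices = re.findall(r"(?<=\$)\d+", "cost $45, qty 3")
print(prices)  # ['45']
```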

By the end of the course, participants should feel comfortable using regular expressions to find and analyze text.
A large part of the course will be hands-on exercises, which will help participants to learn and understand the regular expression syntax.

Audience: This course is aimed at experienced programmers who wish to unlock the power of regular expressions in their day-to-day work. While nearly all exercises will be in the Python language, little or no knowledge of Python is necessary; the course will begin with a very brief introduction to the Python features needed to do the exercises.

Course syllabus

  • Minimal Python for text processing
    • Strings
    • Lists
    • Loops
    • Files
  • Overview of regular expressions
    • Python’s “re” module
    • re.search
    • re.findall
    • match objects
  • Characters and metacharacters
  • Multiple runs of a character
    • +
    • *
    • {min,max}
  • Debugging regexps
  • Character classes
    • []
    • ^
    • $
  • Built-in character classes
    • \w \W
    • \s \S
    • \d \D
  • Greediness
  • Anchors
    • Start/end of line
    • Start/end of string
    • Start/end of word
  • Options
    • Case
    • Line endings
    • Extended regexps with comments
  • Alternation
    • Parentheses for combining
    • Parentheses for capturing
    • Non-capturing parentheses
    • Named groups
    • Backreferences
  • Raw strings, backslashes, and regexps
  • Lookahead/lookbehind
  • Conditionals
  • Handling Unicode
    • Bytes vs. characters
    • Matching Unicode characters
    • Sets of Unicode characters
  • Compiling regexps
  • Replacing regexps
  • Match context
  • Development and debugging strategies
  • Efficiency
  • Regexps in JavaScript
  • Regexps in Ruby
  • Regexps in Unix
  • When are regexps not useful?
  • Useful resources
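
A few of the syllabus topics above (greediness, extended regexps with comments, and replacing with backreferences) can be previewed in a short Python sketch. The sample text and the phone-like pattern are assumptions made up for this example, not course material.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first closing angle bracket.
lazy = re.findall(r"<.*?>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Extended (verbose) regexp: whitespace is ignored and comments are allowed.
phone = re.compile(r"""
    (\d{3})   # first three digits
    [-.\s]    # separator: hyphen, dot, or whitespace
    (\d{4})   # last four digits
""", re.VERBOSE)

# Replacement using a backreference to the first captured group.
masked = phone.sub(r"\1-XXXX", "Call 555-1234 or 555.9876")
print(masked)  # Call 555-XXXX or 555-XXXX
```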