Regular Expressions

If you had to find the proverbial needle in a haystack...

It might take you a long time to find it...

Unless you pushed the haystack over and used a Nailhawg!

Nailhawg

Regular expressions

Regular expressions offer a convenient way of finding characters or patterns of characters in a string and manipulating them.

Most programming languages (Python, JavaScript, Perl, and PHP, to name a few) offer a similar-looking syntax for regular expressions.

Getting started with regexes in Python

>>> import re
>>> s = 'Prof of Communication'
>>> re.sub('Prof', 'Professor', s)
'Professor of Communication'

'Prof' is the pattern.
We talk about the string matching the pattern.
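To make "matching" concrete, here is a small sketch using re.search, which is not used on the slide above but is part of the same re module. The variable names are only for illustration.

```python
import re

s = 'Prof of Communication'

# re.search returns a match object if the pattern occurs anywhere in s,
# and None if it does not.
m = re.search('Prof', s)
found = m is not None
where = m.span()          # start and end index of the match

# When the pattern does not match, re.sub returns the string unchanged.
unchanged = re.sub('Dean', 'Director', s)

print(found, where, unchanged)
```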

Beginning and ending

>>> s = 'Dir of Dire Needs Assessment'
>>> re.sub('Dir', 'Director', s)
'Director of Directore Needs Assessment'
>>> 
>>> re.sub('^Dir', 'Director', s)
'Director of Dire Needs Assessment'
>>> s = re.sub('^Dir', 'Director', s)
>>> re.sub('ment$', 'ments', s)
'Director of Dire Needs Assessments'

'^' matches the beginning of the string.
'$' matches the end.
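A short sketch of the two anchors applied to more than one string. The second title is made up for the example.

```python
import re

titles = ['Dir of Dire Needs Assessment', 'Assoc Dir of Marketing']

# '^Dir' only matches when the string starts with 'Dir',
# so the second title is left alone.
expanded = [re.sub('^Dir', 'Director', t) for t in titles]

# 'ment$' only matches when the string ends with 'ment'.
ends_with_ment = [bool(re.search('ment$', t)) for t in titles]

print(expanded)
print(ends_with_ment)
```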

Word boundaries

>>> s = 'Assoc Dir of Dire Needs Assessments'
>>> re.sub('\\bDir\\b','Director',s)
'Assoc Director of Dire Needs Assessments'
>>> re.sub(r'\bDir\b','Director',s)
'Assoc Director of Dire Needs Assessments'

\b matches at any 'word boundary': the zero-width position between a word character (letter, digit, or underscore) and anything else, such as a space, punctuation, or the beginning or end of the string.
Write your patterns as raw strings, r'...', to avoid the "Backslash Plague" of doubled backslashes.
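Why the raw string matters can be seen by looking at what Python does with '\b' before the re module ever sees it. This is a sketch, and the variable names are only for illustration.

```python
import re

# In an ordinary string literal, '\b' is ONE character: a backspace (ASCII 8).
plain_len = len('\b')
plain_code = ord('\b')

# In a raw string literal, r'\b' is TWO characters: a backslash and a 'b'.
raw_len = len(r'\b')

# Doubling the backslash and using a raw string give the same pattern.
same = ('\\bDir\\b' == r'\bDir\b')

s = 'Assoc Dir of Dire Needs Assessments'
# Forgetting the r: the pattern looks for backspace characters, finds none,
# and the string comes back unchanged.
wrong = re.sub('\bDir\b', 'Director', s)
right = re.sub(r'\bDir\b', 'Director', s)

print(wrong)
print(right)
```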

| means 'or'

>>> s = 'prof of Mathematics'
>>> re.sub( r'\b(prof|Prof)\b', 'Professor', s)
'Professor of Mathematics'
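The parentheses around prof|Prof matter, because | splits everything on either side of it. A small sketch with a made-up test string shows the difference:

```python
import re

s = 'profile of a Prof'

# Grouped: \b ( prof OR Prof ) \b, so only whole words match.
grouped = re.sub(r'\b(prof|Prof)\b', 'Professor', s)

# Ungrouped: ( \bprof ) OR ( Prof\b ), so the 'prof' at the start of
# 'profile' matches too.
ungrouped = re.sub(r'\bprof|Prof\b', 'Professor', s)

print(grouped)
print(ungrouped)
```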

Character classes

>>> s = 'prof of Mathematics'
>>> re.sub( r'\b[pP]rof\b', 'Professor', s)
'Professor of Mathematics'

Any single character listed inside the square brackets will match.
Ranges are also allowed, e.g. [a-zA-Z]
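A quick sketch of ranges and explicit lists in a character class. The sample strings are invented, and re.findall (which returns every match as a list) is used here only to make the matches visible.

```python
import re

s = 'Room B12, ext 4077'

# [0-9] matches any one digit; [A-Z] matches any one capital letter.
digits = re.findall(r'[0-9]', s)
capitals = re.findall(r'[A-Z]', s)

# A class can also just list the characters it accepts.
starred = re.sub(r'[aeiou]', '*', 'Mathematics')

print(digits, capitals, starred)
```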

Repetition - *

>>> s = 'Dir. of Giving and dir of Marketing'
>>> re.sub ( r'[Dd]ir', 'Director', s)
'Director. of Giving and Director of Marketing'
>>> re.sub ( r'[Dd]ir[.]*', 'Director', s)
'Director of Giving and Director of Marketing'

The * means "the previous thing repeated 0 or more times".
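To see the "0 or more" part in action, here is a sketch that tries the same pattern on abbreviations with no dot, one dot, and several dots (the samples are made up):

```python
import re

pattern = r'[Dd]ir[.]*'
samples = ['dir', 'Dir.', 'Dir...']

# group(0) is the text the whole pattern matched.
matched = [re.search(pattern, x).group(0) for x in samples]

# Because 0 repetitions is allowed, [.]* on its own happily matches nothing.
empty = re.search(r'[.]*', 'abc').group(0)

print(matched, repr(empty))
```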

Repetition - +

>>> s = 'Assoc Prof   of Ergonomics   '
>>> re.sub ( r'[\s]*', ' ', s)
' A s s o c P r o f o f E r g o n o m i c s '
>>> re.sub ( r'[\s]+', ' ', s)
'Assoc Prof of Ergonomics '

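\s means any whitespace character.
The + means "the previous thing repeated 1 or more times".
The * version also matches the empty string between every pair of characters, which is why a space shows up between all the letters. (The output shown is from an older Python; Python 3.7 and later also replace the empty match that follows each run of whitespace, so the gaps between words come out as two spaces.)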
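The difference between + and * is easiest to see on a string that does not start with what you are looking for. A sketch with an invented string:

```python
import re

s = 'ext 4077 or 12'

# + needs at least one digit, so the search skips ahead to the first number.
first_plus = re.search(r'[0-9]+', s).group(0)

# * is satisfied with zero digits, so it "matches" the empty string
# right at position 0 and never looks further.
first_star = re.search(r'[0-9]*', s).group(0)

# With +, every run of digits is found as one piece.
numbers = re.findall(r'[0-9]+', s)

print(repr(first_plus), repr(first_star), numbers)
```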

What pattern would you use to strip all of the trailing spaces off?

A little biology

How about some chicken?

The notation for DNA consensus sequences already looks pretty close to regular expressions.

An intron is a stretch of a gene that the cell's splicing machinery recognizes, binds to, and then cuts out of the RNA copy of the gene. One characteristic of many introns is that they begin with "GT" and end with "AG".

Matching anything

We'd like to match "GT" followed by anything, followed by "AG".

. is actually an operator that matches any single character (except a newline, by default). It was only because we enclosed the period in the character class brackets [.] before that it was not interpreted that way.
.* matches any number (including 0) of any kind of character.

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> re.sub( r'GT.*AG', '[possible intron]', s)
'TT[possible intron]CCCC'

Do you see the problem with this?

Non-greedy matching

The * operator is described as greedy because it matches as much as possible between the patterns on either side of it.

Use the ? modifier after * (or +) to make it non-greedy, so that it matches as little as possible.

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> re.sub( r'GT.*?AG', '[possible intron]', s)
'TT[possible intron]TTTTAGCCCC'
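It can help to look at exactly what each version matched, rather than at the string after substitution. A sketch using re.search and group(0), the text matched by the whole pattern:

```python
import re

s = 'TTGTCCCCCCAGTTTTAGCCCC'

# Greedy: .* runs on to the LAST 'AG' it can still end on.
greedy = re.search(r'GT.*AG', s).group(0)

# Non-greedy: .*? stops at the FIRST 'AG' that lets the pattern succeed.
lazy = re.search(r'GT.*?AG', s).group(0)

print(greedy)
print(lazy)
```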

It would be nice to see the sequence between GT and AG of the possible intron. Let's show it...

Groups

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> re.sub( r'GT(.*?)AG', r'[GT]->\1<-[AG]', s)
'TT[GT]->CCCCCC<-[AG]TTTTAGCCCC'

In the replacement string (now a raw string), \1 refers to whatever the first set of (...) parentheses captured and makes it available.
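Groups are numbered from left to right, and a match object can hand them back directly. A sketch: the name-swapping string is made up for the example, and group(1) is the match-object counterpart of \1.

```python
import re

s = 'TTGTCCCCCCAGTTTTAGCCCC'

# group(1) on a match object is the same text that \1 refers to
# in a replacement string.
m = re.search(r'GT(.*?)AG', s)
inside = m.group(1)

# With two groups, \1 and \2 can be rearranged in the replacement.
swapped = re.sub(r'([A-Za-z]+), ([A-Za-z]+)', r'\2 \1', 'Monet, Claude')

print(inside, swapped)
```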

Named groups

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> m = re.search( r'GT(?P<intron_seq>.*?)AG', s)
>>> m.group('intron_seq')
'CCCCCC'

(?P<name>...) gives a group a name; m.group('name') returns what that group captured.
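A named group can be read back by name or by number, and can also be used in a replacement string with \g<name>. A sketch of those options, reusing the pattern above:

```python
import re

s = 'TTGTCCCCCCAGTTTTAGCCCC'
pattern = r'GT(?P<intron_seq>.*?)AG'

m = re.search(pattern, s)
by_name = m.group('intron_seq')   # look the group up by its name
by_number = m.group(1)            # a named group still has a number
as_dict = m.groupdict()           # all named groups as a dictionary

# In a replacement string, \g<name> plays the role that \1 played before.
marked = re.sub(pattern, r'[GT]->\g<intron_seq><-[AG]', s)

print(by_name, by_number, as_dict)
print(marked)
```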

Assignment

Work through Dive Into Python, Chapter 7 on regular expressions.
Create a new Django project with a form on the 'index' page of your site that
has one text input box for a phone number, and
one text input box for an e-mail address.

When someone fills in the form and submits it, the page that handles the form input should...

Show the form again, with the user input filled in,
Below the form, display the labelled "parts" of the phone number: area code, trunk, 4-digit, extension... You'll use the regexp that is developed in Chapter 7 to do this.
Display a message if the e-mail address is *not* of the form [liberal variety of characters]@[string perhaps containing a .].[string with no dots]
Either way, display the e-mail host (the part after @).

See also the Python Regular Expression HOWTO.

Image credits

Claude Monet, Noah W, Ian ?