Regular Expressions

If you had to find the proverbial needle in a haystack...

It might take you a long time to find it...

Unless you pushed the haystack over and used a Nailhawg!

Nailhawg

Regular expressions

Regular expressions offer a convenient way of finding characters or patterns of characters in a string and manipulating them.

Most programming languages (Python, JavaScript, Perl, and PHP, to name a few) offer a similar-looking syntax for regular expressions.

Getting started with regexes in Python

>>> import re
>>> s = 'Prof of Communication'
>>> re.sub('Prof', 'Professor', s)
'Professor of Communication'
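
re.sub covers the "manipulating" half; for the "finding" half, re.search is the usual starting point. A small sketch, where the second search term ('Dean') is just an illustrative value that does not occur in the string:

```python
import re

s = 'Prof of Communication'

# re.search returns a match object for the first occurrence, or None.
m = re.search('Prof', s)
found_at = m.span()              # (start, end) offsets of the match
missing = re.search('Dean', s)   # no occurrence, so this is None

print(found_at)   # (0, 4)
print(missing)    # None
```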

Beginning and ending

>>> s = 'Dir of Dire Needs Assessment'
>>> re.sub('Dir', 'Director', s)
'Director of Directore Needs Assessment'
>>> 
>>> re.sub('^Dir', 'Director', s)
'Director of Dire Needs Assessment'
>>> s = re.sub('^Dir', 'Director', s)
>>> re.sub('ment$', 'ments', s)
'Director of Dire Needs Assessments'
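
In the session above, ^ pins the pattern to the start of the string and $ pins it to the end. A short sketch that makes the difference visible by listing where each pattern matches (the variable names are only illustrative):

```python
import re

s = 'Dir of Dire Needs Assessment'

# Without an anchor, 'Dir' is found at the start and again inside 'Dire'.
anywhere = [m.start() for m in re.finditer('Dir', s)]
# With ^, only the occurrence at the very start of the string qualifies.
at_start = [m.start() for m in re.finditer('^Dir', s)]
# With $, 'ment' must be the last thing in the string.
ends_with_ment = re.search('ment$', s) is not None

print(anywhere)        # [0, 7]
print(at_start)        # [0]
print(ends_with_ment)  # True
```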

Word boundaries

>>> s = 'Assoc Dir of Dire Needs Assessments'
>>> re.sub('\\bDir\\b','Director',s)
'Assoc Director of Dire Needs Assessments'
>>> re.sub(r'\bDir\b','Director',s)
'Assoc Director of Dire Needs Assessments'
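
The two calls above are equivalent because of how Python reads string literals: in an ordinary string, \b is the backspace character, so the backslash has to be doubled or the string made raw before the regex engine sees a word-boundary token. A minimal sketch that checks this:

```python
import re

s = 'Assoc Dir of Dire Needs Assessments'

# In a normal string literal, '\b' is a single backspace character (0x08).
assert '\b' == '\x08'
# In a raw string, r'\b' is two characters: a backslash and the letter b.
assert r'\b' == '\\b' and len(r'\b') == 2

# Without the raw prefix, the pattern looks for literal backspaces and finds none.
unchanged = re.sub('\bDir\b', 'Director', s)
# With the raw prefix, the pattern means "Dir as a whole word".
fixed = re.sub(r'\bDir\b', 'Director', s)

print(unchanged)  # Assoc Dir of Dire Needs Assessments
print(fixed)      # Assoc Director of Dire Needs Assessments
```

Using raw strings for every pattern is a common habit that avoids this kind of surprise.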

| means 'or'

>>> s = 'prof of Mathematics'
>>> re.sub( r'\b(prof|Prof)\b', 'Professor', s)
'Professor of Mathematics'

Character classes

>>> s = 'prof of Mathematics'
>>> re.sub( r'\b[pP]rof\b', 'Professor', s)
'Professor of Mathematics'
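
The alternation and the character class are two spellings of the same idea here. A quick sketch comparing them on a few made-up titles (the list is purely illustrative):

```python
import re

titles = ['prof of Mathematics', 'Prof of Physics', 'professor of Art']

# (prof|Prof) chooses between two whole alternatives.
alternation = [re.sub(r'\b(prof|Prof)\b', 'Professor', t) for t in titles]
# [pP] chooses between two single characters, then 'rof' must follow.
char_class = [re.sub(r'\b[pP]rof\b', 'Professor', t) for t in titles]

print(alternation)
print(char_class)
```

Both lists come out the same, and 'professor of Art' is left alone because \b requires the word to end right after 'prof'.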

Repetition - *

>>> s = 'Dir. of Giving and dir of Marketing'
>>> re.sub ( r'[Dd]ir', 'Director', s)
'Director. of Giving and Director of Marketing'
>>> re.sub ( r'[Dd]ir[.]*', 'Director', s)
'Director of Giving and Director of Marketing'

The * means "the previous thing repeated 0 or more times".
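
To see "0 or more times" in action, here is a small sketch that tries the same pattern against zero, one, and several dots (the test strings are made up for illustration):

```python
import re

# [.]* matches zero or more literal dots after 'Dir'.
pattern = re.compile(r'Dir[.]*')

matched = {}
for text in ['Dir', 'Dir.', 'Dir...']:
    m = pattern.match(text)
    matched[text] = m.group(0)
    print(repr(text), '->', repr(m.group(0)))
```

In each case the whole string is matched, including the case with no dots at all.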

Repetition - +

>>> s = 'Assoc Prof   of Ergonomics   '
>>> re.sub ( r'[\s]*', ' ', s)
' A s s o c P r o f o f E r g o n o m i c s '
>>> re.sub ( r'[\s]+', ' ', s)
'Assoc Prof of Ergonomics '
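
The [\s]* version goes wrong because * is also happy with zero whitespace characters, so it "matches" the empty gap before every letter. A sketch that lists what [\s]+ actually finds may help. Note that the exact output of the [\s]* version depends on the Python version: since Python 3.7, re.sub also replaces an empty match that sits right after a non-empty one, so newer interpreters print extra spaces. The [\s]+ version behaves the same everywhere.

```python
import re

s = 'Assoc Prof   of Ergonomics   '

# + requires at least one whitespace character, so only real runs are found.
runs = re.findall(r'[\s]+', s)
# Each run, whatever its length, is replaced by a single space.
collapsed = re.sub(r'[\s]+', ' ', s)

print(runs)             # [' ', '   ', ' ', '   ']
print(repr(collapsed))  # 'Assoc Prof of Ergonomics '
```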

What pattern would you use to strip all of the trailing spaces off?

A little biology

How about some chicken?

The notation for DNA consensus sequences already looks pretty close to regular expressions.

An intron is a sequence within a gene that the cell's splicing machinery recognizes, binds to, and then cuts out of the RNA transcript. One characteristic of many introns is that, written as DNA, they begin with "GT" and end with "AG".

Matching anything

We'd like to match "GT" followed by anything, followed by "AG".

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> re.sub( r'GT.*AG', '[possible intron]', s)
'TT[possible intron]CCCC'

Do you see the problem with this?

Non-greedy matching

The * operator is described as greedy because it matches as much as possible while still letting the rest of the pattern match. Here, .* runs all the way to the last "AG" in the string.

Put the ? modifier after a quantifier such as * to make it non-greedy:

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> re.sub( r'GT.*?AG', '[possible intron]', s)
'TT[possible intron]TTTTAGCCCC'
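
Looking at the matched text itself, rather than the substituted string, makes the greedy and non-greedy behaviour easier to compare. A minimal sketch using re.findall:

```python
import re

s = 'TTGTCCCCCCAGTTTTAGCCCC'

# Greedy: .* runs to the LAST 'AG' that still lets the pattern match.
greedy = re.findall(r'GT.*AG', s)
# Non-greedy: .*? stops at the FIRST 'AG' it can.
non_greedy = re.findall(r'GT.*?AG', s)

print(greedy)      # ['GTCCCCCCAGTTTTAG']
print(non_greedy)  # ['GTCCCCCCAG']
```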

It would be nice to see the sequence between GT and AG of the possible intron. Let's show it...

Groups

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> re.sub( r'GT(.*?)AG', r'[GT]->\1<-[AG]', s)
'TT[GT]->CCCCCC<-[AG]TTTTAGCCCC'

In the replacement string (now a raw string), \1 refers to whatever the first set of (...) parentheses in the pattern captured.
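
Captured groups are also available outside of re.sub, on the match object that re.search returns. A short sketch (the variable names are illustrative):

```python
import re

s = 'TTGTCCCCCCAGTTTTAGCCCC'
m = re.search(r'GT(.*?)AG', s)

whole = m.group(0)   # the entire match, including GT and AG
inner = m.group(1)   # only what the first (...) captured
where = m.span(1)    # (start, end) offsets of group 1 in s

print(whole)  # GTCCCCCCAG
print(inner)  # CCCCCC
print(where)  # (4, 10)
```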

Named groups

>>> s = 'TTGTCCCCCCAGTTTTAGCCCC'
>>> m = re.search( r'GT(?P<intron_seq>.*?)AG', s)
>>> m.group('intron_seq')
'CCCCCC'
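
A named group can be read back in a few ways, and it can also be used in a replacement string with the \g<name> syntax. A minimal sketch reusing the pattern above:

```python
import re

s = 'TTGTCCCCCCAGTTTTAGCCCC'
pattern = r'GT(?P<intron_seq>.*?)AG'
m = re.search(pattern, s)

by_name = m.group('intron_seq')   # look the group up by its name
by_number = m.group(1)            # named groups are still numbered too
as_dict = m.groupdict()           # all named groups as a dictionary

# \g<intron_seq> in the replacement plays the same role as \1.
marked = re.sub(pattern, r'[GT]->\g<intron_seq><-[AG]', s)

print(by_name)   # CCCCCC
print(as_dict)   # {'intron_seq': 'CCCCCC'}
print(marked)    # TT[GT]->CCCCCC<-[AG]TTTTAGCCCC
```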

Assignment

  1. Work through Dive Into Python, Chapter 7 on regular expressions.
  2. Create a new Django project with a form on the 'index' page of your site that has one text input box for a phone number and one text input box for an e-mail address.

When someone fills in the form and submits it, the page that handles the form input should...

  1. Show the form again, with the user input filled in,
  2. Below the form, display the labelled "parts" of the phone number: area code, trunk, 4-digit, extension... You'll use the regexp that is developed in Chapter 7 to do this.
  3. Display a message if the e-mail address is *not* of the form [liberal variety of characters]@[string perhaps containing a .].[string with no dots]
  4. Either way, display the e-mail host (the part after @).
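
As a starting point for the e-mail part only, here is one loose reading of the form described above. Treat it as a sketch under assumptions, not the expected answer: EMAIL_RE and check_email are hypothetical names, and "liberal variety of characters" is interpreted here as "anything except @ and whitespace".

```python
import re

# Hypothetical pattern: [no @ or spaces] @ [may contain dots] . [no dots]
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s.]+$')

def check_email(address):
    """Return (looks_valid, host); host is whatever follows the first '@'."""
    looks_valid = EMAIL_RE.match(address) is not None
    host_match = re.search(r'@(.*)$', address)
    host = host_match.group(1) if host_match else ''
    return looks_valid, host

print(check_email('alice@mail.example.com'))  # (True, 'mail.example.com')
print(check_email('bob@localhost'))           # (False, 'localhost')
```

The host is extracted separately from the validity check so that it can be displayed "either way", as the last step asks.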

See also the Python Regular Expression HOWTO.

Image credits

Claude Monet, Noah W, Ian ?