Wednesday, 29 October 2008

Regular Expressions Tutorial

A nice tutorial can be found here:
http://www.regular-expressions.info/tutorialcnt.html

Short rundown

Character Classes/Sets :
- [ae] matches a or e
- [a-z] matches any character in the range a..z
- [0-9] matches any digit in the range 0..9
- [^u] matches all chars except the character u
- \w stands for [A-Za-z0-9_]
- \s stands for a whitespace character, [ \t\r\n\f] in most flavors
- \d stands for [0-9]
- \W stands for [^\w]
- \D stands for [^\d]
- The only characters that need escaping inside a class are ^, -, ] and \ (all other metacharacters are treated as literals there)
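
The character classes above can be tried out quickly. This is only a small sketch using Python's re module (my choice for illustration, the notes themselves are not tied to a language); the sample strings are made up.

```python
import re

text = "grey gray 42 under_score"

# [ae] matches a single character that is either a or e
gr_y = re.findall(r"gr[ae]y", text)

# \d matches one digit, \D matches one character that is not a digit
digits = re.findall(r"\d", text)
non_digits = re.findall(r"\D", "a1b2")

# [^u] matches any single character except u
not_u = re.findall(r"[^u]", "uvu")

# Inside a class, ] ^ - and \ are escaped here so they are taken literally
specials = re.findall(r"[\]\^\-\\]", r"a]b^c-d\e")

print(gr_y, digits, non_digits, not_u, specials)
```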

The Dot:
- Matches any single character (except \n by default; turning on the SingleLine option makes the dot match \n as well)
- Use negated character sets instead of the dot whenever possible: they say exactly what may be matched, so the match cannot run further than intended
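
A rough sketch of both points, again assuming Python's re module, where the equivalent of the SingleLine option is called re.DOTALL. The sample strings are invented for the example.

```python
import re

two_lines = "and\nmore"

# By default the dot does not match \n, so "and." finds nothing here
default_dot = re.findall(r"and.", two_lines)

# With re.DOTALL (Python's name for SingleLine) the dot also matches \n
dotall_dot = re.findall(r"and.", two_lines, re.DOTALL)

line = 'say "one" and "two"'

# The dot happily runs past the first closing quote
with_dot = re.findall(r'".*"', line)

# A negated character set stops at the first closing quote
with_negated_set = re.findall(r'"[^"]*"', line)

print(default_dot, dotall_dot)
print(with_dot, with_negated_set)
```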

Anchors:
- ^ matches start of string
- $ matches end of string
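- By default these two only match at the start and end of the whole string; with the MultiLine option turned on, ^ also matches right after each line break and $ right before each line break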
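
To see the effect of the MultiLine option, here is a short sketch assuming Python's re module, where the option is called re.MULTILINE. The two-line sample text is made up.

```python
import re

text = "first line\nsecond line"

# Without the option, ^ only matches at the very start of the string
starts_default = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches right after each line break
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# The same holds for $ at the end of the string versus the end of each line
ends_default = re.findall(r"\w+$", text)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(starts_default, starts_multiline)
print(ends_default, ends_multiline)
```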

Word Boundaries:
- \bword\b matches only the standalone word at the end of "I went on wordpress to write a word", not the "word" inside "wordpress"
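
The same sentence can be checked in code. A minimal sketch, assuming Python's re module:

```python
import re

sentence = "I went on wordpress to write a word"

# Without boundaries, "word" is also found inside "wordpress"
unbounded = [m.start() for m in re.finditer(r"word", sentence)]

# With \b on both sides only the standalone word at the end is found
bounded = [m.start() for m in re.finditer(r"\bword\b", sentence)]

print(len(unbounded), len(bounded))
```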

Alternation:
- cat|dog matches either cat or dog, whichever one occurs first in the string being searched
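
A small sketch of this, assuming Python's re module and two invented sentences. The order of the alternatives in the pattern does not decide the result here; the position in the string does.

```python
import re

pattern = re.compile(r"cat|dog")

# The first match is whichever alternative appears earliest in the text
first_in_a = pattern.search("my dog chased a cat").group()
first_in_b = pattern.search("my cat chased a dog").group()

# findall keeps going and returns every match in order of appearance
all_in_a = pattern.findall("my dog chased a cat")

print(first_in_a, first_in_b, all_in_a)
```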

Optional Items:
- colou?r matches both colour and color
- When Feb 23(rd)? is applied to the string "Today is Feb 23rd, 2003", the match will always be Feb 23rd and not Feb 23, because the question mark is greedy. You can make the question mark lazy (i.e. turn off the greediness) by putting a second question mark after the first: Feb 23(rd)??
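
Both examples can be reproduced with a few lines. This is a sketch assuming Python's re module; other flavors should behave the same way for these patterns.

```python
import re

# u? makes the u optional, so both spellings match
spellings = re.findall(r"colou?r", "color colour")

text = "Today is Feb 23rd, 2003"

# Greedy ?: the optional group is taken whenever it can be
greedy = re.search(r"Feb 23(rd)?", text).group()

# Lazy ??: the optional group is skipped whenever the rest still matches
lazy = re.search(r"Feb 23(rd)??", text).group()

print(spellings, greedy, lazy)
```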

Repetition:
- u+ matches one or more u
- u* matches zero or more u
- \b[1-9][0-9]{3}\b matches numbers between 1000 and 9999
- \b[1-9][0-9]{2,4}\b matches numbers between 100 and 99999
- These are greedy operators: they match as much as possible. Put a ? right after them to make them lazy (+?, *?, {2,4}?)
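
A quick check of the number ranges and of greedy versus lazy repetition, sketched with Python's re module. The test strings are made up.

```python
import re

# Exactly four digits, the first one not 0: 1000 to 9999
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000 0123")

# Three to five digits, the first one not 0: 100 to 99999
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

html = "<b>bold</b>"

# Greedy +: runs to the last > in the string
greedy = re.search(r"<.+>", html).group()

# Lazy +?: stops at the first > it can
lazy = re.search(r"<.+?>", html).group()

print(four_digits, three_to_five, greedy, lazy)
```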

Grouping and Backreference:
- ([a-c])x\1x\1 will match axaxa, bxbxb and cxcxc but won't match axbxc
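
The backreference example can be verified like this (a sketch assuming Python's re module; fullmatch is used so the whole string has to fit the pattern):

```python
import re

# \1 must repeat exactly the text captured by the first group
pattern = re.compile(r"([a-c])x\1x\1")

candidates = ["axaxa", "bxbxb", "cxcxc", "axbxc"]
results = [bool(pattern.fullmatch(s)) for s in candidates]

print(dict(zip(candidates, results)))
```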

more to be added when I finish reading :)