
Pattern matching with regular expressions: a quick reminder

Pattern matching is a powerful tool for developers, but writing (or reading) a regular expression can be daunting.
This quick reminder covers the bare necessities of regular expressions, for those who have dabbled with regexps a bit but can’t remember it all.

Regular expressions in Perl

If you take “text” in the widest possible sense, perhaps 90% of what you do is 90% text processing. That’s really what Perl is all about and always has been about—in fact, it’s even part of Perl’s name: Practical Extraction and Report Language.
Larry Wall et al., Programming Perl

There you have it: regular expressions and pattern matching are at the core of Perl, which is why this post starts with Perl. They are an integral part of the language, so much so that there is no function called match or subst in Perl. A match is just a simple expression that you can put in a condition:

$_ = "Is there a doctor here?";
if (/doctor/) {
     print "Doctor who?";
}

Regular expressions in PHP and JavaScript

In PHP, you’ll have to use a dedicated function called preg_match() to get a match from a regular expression.
preg_match() needs at least two arguments: a regular expression, and a string to match it against:

$subject = "Is there a doctor here?";
$pattern = '/doctor/';
if (preg_match($pattern, $subject)) {
   echo "Yes. Yes, there is.";
}
(Source: PHP documentation: preg_match)

JavaScript has its own RegExp object, and strings have a handy match() method that matches them against a RegExp object:

var subject = "Is there a doctor here?";
var pattern = new RegExp('doctor');
console.log(subject.match(pattern));

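A regular expression literal works just as well, and String.prototype.match() creates a new RegExp object on the fly if a string is provided instead:

var subject = "Is there a doctor here?";
var pattern = /doctor/;
console.log(subject.match(pattern));
// A plain string is converted as if by new RegExp('doctor')
console.log(subject.match('doctor'));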

(Source: Mozilla Developer Network)

Meta-characters in regular expressions

Beyond the most elementary expressions, regexps can look like gibberish unless you know what each character stands for. The next few paragraphs are a reminder of the most useful meta-characters you can find in a regular expression:

Boundaries

^
Beginning of input
$
End of input
\b
Word boundary: the zero-width position at the beginning or end of a word
\B
Non-word boundary: any zero-width position that is not a word boundary, such as between two letters
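
The boundaries above can be tried out with a short JavaScript sketch. The sample sentence and the variable names are made up for illustration; only the patterns themselves come from the list above.

```javascript
// Hypothetical sample sentence, chosen only for illustration.
const sentence = "The doctor met the postdoctoral doctors";

// ^ and $ anchor the pattern to the beginning and end of the input.
const startsWithThe = /^The/.test(sentence);        // true
const startsWithDoctor = /^doctor/.test(sentence);  // false
const endsWithDoctors = /doctors$/.test(sentence);  // true

// \b on both sides: "doctor" as a whole word only.
const wholeWords = sentence.match(/\bdoctor\b/g);   // ["doctor"]

// \b on the left only: every word that starts with "doctor".
const wordStarts = sentence.match(/\bdoctor/g);     // ["doctor", "doctor"]

// \B on the left: "doctor" preceded by another word character.
const insideWord = /\Bdoctor/.exec(sentence).index; // 23, inside "postdoctoral"

console.log(startsWithThe, startsWithDoctor, endsWithDoctors);
console.log(wholeWords, wordStarts, insideWord);
```

Note that none of these meta-characters consumes a character: they only constrain where the rest of the pattern is allowed to match.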

Quantifiers

x*
x*? (non-greedy)
Matches x 0 or more times
x+
x+? (non-greedy)
Matches x 1 or more times
x?
Matches x 0 or 1 time
x{n}
Matches x exactly n times
x{n,}
Matches x at least n times
x{n,m}
Matches x between n and m times
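
To see the difference between greedy and non-greedy quantifiers, here is a minimal JavaScript sketch. The sample strings and variable names are assumptions picked for the example.

```javascript
// Hypothetical sample markup, chosen only for illustration.
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ grabs as much as possible, up to the LAST ">".
const greedy = html.match(/<.+>/)[0];   // "<b>bold</b> and <i>italic</i>"

// Non-greedy: .+? stops at the FIRST ">" it can.
const lazy = html.match(/<.+?>/)[0];    // "<b>"

// * accepts zero occurrences, so it can match the empty string.
const zeroTimes = "bbb".match(/a*/)[0]; // ""

// ? makes the preceding character optional.
const bothSpellings = ["color", "colour"].every(w => /^colou?r$/.test(w)); // true

// {n}, {n,} and {n,m} count repetitions.
const exactlyFour = /^\d{4}$/.test("2024");   // true
const tooMany = /^\d{4}$/.test("20245");      // false
const atLeastTwo = /^\d{2,}$/.test("7");      // false
const twoToThree = /^a{2,3}$/.test("aaa");    // true

console.log(greedy, lazy, JSON.stringify(zeroTimes));
console.log(bothSpellings, exactlyFour, tooMany, atLeastTwo, twoToThree);
```

The counted examples are wrapped in ^ and $ on purpose: without the anchors, \d{4} would still find four digits somewhere inside "20245".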

Grouping & alternation

( )
Delimits a group of patterns and captures what it matched. For instance, (n*)(m*) matches n 0 or more times, then m 0 or more times
|
Matches either of the alternative patterns: foo|bar matches either “foo” or “bar”
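
Groups and alternation are easier to grasp with a few concrete matches. The following JavaScript sketch uses made-up sample strings and variable names, and relies on the fact that String.prototype.match() returns the whole match followed by each captured group.

```javascript
// Each pair of parentheses captures what it matched:
// index 0 is the whole match, then one entry per group.
const groups = Array.from("nnm".match(/(n*)(m*)/));   // ["nnm", "nn", "m"]

// Hypothetical date string, chosen only for illustration.
const dateParts = "2024-05-17".match(/(\d{4})-(\d{2})-(\d{2})/);
const year = dateParts[1];    // "2024"
const month = dateParts[2];   // "05"
const day = dateParts[3];     // "17"

// Alternation: either side of | may match.
const isFooOrBar = /^(foo|bar)$/.test("bar");    // true
const isBaz = /^(foo|bar)$/.test("baz");         // false

// | binds very loosely: without a group, ^foo|bar$ reads as
// "starts with foo" OR "ends with bar".
const ungrouped = /^foo|bar$/.test("foobaz");    // true
const grouped = /^(foo|bar)$/.test("foobaz");    // false

console.log(groups, year, month, day);
console.log(isFooOrBar, isBaz, ungrouped, grouped);
```

Wrapping the alternatives in a group, as in the last line, is usually what you want when anchors or quantifiers should apply to the whole choice.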

Character sets and character classes

[xyz]
Matches any one of the characters x, y or z. Ranges are allowed: letters can be matched with a-z and A-Z, and digits with 0-9, as in [a-zA-Z0-9]
The opposite is [^xyz]

 

.
Dot: matches any character except line terminators
\d
Matches a digit
The opposite is \D
\w
Matches an alphanumeric character or an underscore
The opposite is \W
\s
Matches white space characters (tabs, spaces, etc.)
In JavaScript, same as [ \f\n\r\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]
The opposite is \S
[\b]
Matches backspace
Not the same as \b
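
As a quick check of the character sets and classes above, here is a small JavaScript sketch. All sample strings and variable names are hypothetical and only serve the illustration.

```javascript
// [^...] negates a set: strip everything that is not a vowel.
const vowelsOnly = "regexp".replace(/[^aeiou]/g, "");  // "ee"

// Ranges inside a set: letters and digits only.
const isAlnum = /^[a-zA-Z0-9]+$/.test("Room42");       // true

// \d matches digits, \D everything else.
const firstNumber = "Room 42b".match(/\d+/)[0];        // "42"
const digitsOnly = "Room 42b".replace(/\D/g, "");      // "42"

// \w also accepts the underscore, but not the hyphen.
const words = "a_b-c".match(/\w+/g);                   // ["a_b", "c"]

// \s matches tabs as well as spaces.
const tokens = "one\ttwo  three".split(/\s+/);         // ["one", "two", "three"]

// The dot does not match a line terminator.
const dotMatchesNewline = /^.$/.test("\n");            // false

// [\b] is a backspace character, \b is a word boundary.
const backspaceInSet = /[\b]/.test("\b");              // true
const boundaryOnBackspace = /\b/.test("\b");           // false

console.log(vowelsOnly, isAlnum, firstNumber, digitsOnly);
console.log(words, tokens, dotMatchesNewline, backspaceInSet, boundaryOnBackspace);
```

In the last two lines, the JavaScript string "\b" is the backspace character itself, which is why only the [\b] form finds it.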