Tuesday, August 28, 2012

Regex Cheat Sheet

Regexes (regular expressions) are an extremely useful tool, but I find myself getting tripped up by differences in their APIs and capabilities across languages. Here's a regex Rosetta Stone, covering how to use regexes in various programming languages.

Supported syntax

Different languages and libraries support different syntaxes (supported special characters and so on).

Perl
Basic syntax, extended patterns
Python
Library reference, HOWTO

Raw strings are useful to cut down on the number of backslashes you need.
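
As a small illustration of the raw-string point, the two literals below spell the same pattern. This is only a sketch; the variable names are made up for the example.

```python
import re

# Without a raw string, every regex backslash has to be doubled.
plain = 'hello \\d+'
# With a raw string, the literal reads the way the regex engine sees it.
raw = r'hello \d+'

# Both literals produce the identical pattern text.
same_pattern = (plain == raw)
matched = re.search(raw, 'say hello 42') is not None
```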

C++
Boost.Regex syntax (Perl-compatible by default, including support for Perl extended patterns)

Raw string literals are useful to cut down on the number of backslashes you need, if your compiler supports them.

C#
Regular Expression Language Elements

@-quoted string literals are useful to cut down on the number of backslashes you need.

JavaScript
“Writing a Regular Expression Pattern” on the Mozilla Developer Network

Supported modifiers

Regexes generally support modifiers to control case sensitivity, etc.

Perl
Modifiers
Python
re module constants
C++
boost::regex_constants::syntax_option_type
C#
System.Text.RegularExpressions.RegexOptions
JavaScript
See the “flags” section of the parameters to the RegExp object.

At the top of the file

To make regex functionality available in your module or source file:

Perl
Nothing necessary
Python
import re
C++
#include <boost/regex.hpp>

// Optionally:
using namespace boost;
C#
using System.Text.RegularExpressions;
JavaScript
Nothing necessary

Matching an entire string

Perl
if ($text =~ /^hello \d+$/) { ... }

Perl regexes must be explicitly anchored using ^ and $ to match the entire string.

Python
if re.match(r'hello \d+$', text):
    ...

re.match starts at the beginning of the string but requires $ to anchor the match to the end of the string.
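
A quick sketch of what the missing $ would allow. The sample strings are invented for illustration, and the last line assumes Python 3.4 or later, where re.fullmatch is available.

```python
import re

pattern = r'hello \d+'

# re.match anchors only at the start, so trailing text is still accepted.
prefix_only = re.match(pattern, 'hello 42 and more') is not None

# Appending $ rejects the trailing text.
anchored = re.match(pattern + '$', 'hello 42 and more') is not None

# Assumption: Python 3.4+. re.fullmatch requires the whole string to match,
# so no explicit $ is needed.
full = re.fullmatch(pattern, 'hello 42') is not None
```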

C++
if (regex_match(text, boost::regex("hello \\d+"))) { ... }

Use boost::regex_match if you want to require that the entire string match.

C#
if (Regex.IsMatch(text, @"^Hello \d+$")) { ... }

C# regexes must be explicitly anchored using ^ and $ to match the entire string.

JavaScript
if (/^Hello \d+$/.test(text)) { ... }

JavaScript regexes must be explicitly anchored using ^ and $ to match the entire string.

Matching a substring

Perl
if ($text =~ /Hello \d+/) { ... }
Python
if re.search(r'Hello \d+', text):
    ...

Use re.search instead of re.match to search for a substring anywhere within the string.
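
To make the difference concrete, here is a minimal sketch with a made-up input string in which the greeting does not start at position 0:

```python
import re

text = 'Well, Hello 42!'

# re.match only tries at the start of the string, so it finds nothing here.
found_by_match = re.match(r'Hello \d+', text)

# re.search scans forward until it finds a match.
found_by_search = re.search(r'Hello \d+', text)
substring = found_by_search.group(0)
```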

C++
if (regex_search(text, boost::regex("Hello \\d+"))) { ... }

Use boost::regex_search if you want to search for a substring anywhere within the string.

C#
if (Regex.IsMatch(text, @"Hello \d+")) { ... }
JavaScript
if (/Hello \d+/.test(text)) { ... }

Performing a case-insensitive match

Perl
if ($text =~ /hello \d+/i) { ... }
Python
if re.search(r'hello \d+', text, re.I):
    ...
C++
if (regex_search(text, boost::regex("hello \\d+",
    boost::regex_constants::icase))) { ... }
C#
if (Regex.IsMatch(text, @"hello \d+", RegexOptions.IgnoreCase)) { ... }
JavaScript
if (/hello \d+/i.test(text)) { ... }
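
For the Python variant above, a short sketch of the flag's effect. The sample string is invented; re.I is simply the short spelling of re.IGNORECASE.

```python
import re

sample = 'HELLO 7'

# Without the flag, the lowercase pattern does not match uppercase text.
sensitive = re.search(r'hello \d+', sample) is not None

# With re.I, letter case is ignored.
insensitive = re.search(r'hello \d+', sample, re.I) is not None

# The short and long flag names refer to the same value.
same_flag = (re.I == re.IGNORECASE)
```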

Storing a regex for later use

Perl
$r = qr/hello \d+/i;
if ($text =~ $r) { ... }
Python
r = re.compile(r'hello \d+', re.I)
if r.search(text):
    ...

Note that Python automatically caches the most recently used patterns (see here and here), so you won't necessarily see a performance gain by compiling a regex.
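
Even without a speed benefit, a stored pattern keeps the regex and its flags in one place. A minimal sketch, with GREETING and the sample lines being made-up names and data:

```python
import re

# Compile once, with the flag attached to the pattern itself.
GREETING = re.compile(r'hello \d+', re.I)

lines = ['Hello 1', 'nothing here', 'say HELLO 22 now']

# Reuse the same compiled object for every line.
hits = [line for line in lines if GREETING.search(line)]
```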

C++
const boost::regex r("hello \\d+", boost::regex_constants::icase);
if (regex_search(text, r)) { ... }
C#
Regex r = new Regex(@"hello \d+", RegexOptions.IgnoreCase);
if (r.IsMatch(text)) { ... }

Note that .NET automatically caches the most recently used patterns passed to the static Regex methods, so you won't necessarily see a performance gain by storing a regex for later use. Also note that, while most other languages define “compiling” a regex as interpreting it, .NET supports compiling regexes to actual IL, as described here and here.

JavaScript
var r = /hello \d+/i;
// or var r = new RegExp("hello \\d+", "i");
if (r.test(text)) { ... }

Replacing part of a string

Perl
# Replace all occurrences:
$text =~ s/Hello/Goodbye/g;
# Replace the first occurrence only:
$text =~ s/Hello/Goodbye/;
Python
# Replace all occurrences:
text = re.sub('Hello', 'Goodbye', text)
# Replace the first occurrence only:
text = re.sub('Hello', 'Goodbye', text, count=1)
C++
// Replace all occurrences:
text = regex_replace(text, boost::regex("Hello"), "Goodbye");
// Replace the first occurrence only:
text = regex_replace(text, boost::regex("Hello"), "Goodbye",
    boost::regex_constants::format_first_only);
C#
// Replace all occurrences:
text = Regex.Replace(text, "Hello", "Goodbye");
// Replace the first occurrence only:
Regex r = new Regex("Hello");
text = r.Replace(text, "Goodbye", 1);
JavaScript
// Replace all occurrences:
text = text.replace(/Hello/g, 'Goodbye');
// Replace the first occurrence only:
text = text.replace(/Hello/, 'Goodbye');
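
To see the two Python calls from this section side by side, here is a small sketch on an invented input string:

```python
import re

text = 'Hello, Hello, Hello'

# Default: every occurrence is replaced.
all_replaced = re.sub('Hello', 'Goodbye', text)

# count=1 stops after the first replacement.
first_only = re.sub('Hello', 'Goodbye', text, count=1)
```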

Extracting parts of a string

Perl
if (($title, $name) = $text =~ /(Mr\.|Mrs\.|Dr\.) (\w+)/) { ... }
Python
m = re.search(r'(Mr\.|Mrs\.|Dr\.) (\w+)', text)
if m:
    title, name = m.groups()
    ...
C++
boost::smatch m;
if (regex_search(text, m, boost::regex("(Mr\\.|Mrs\\.|Dr\\.) (\\w+)"))) {
    const std::string& title = m[1].str();
    const std::string& name = m[2].str();
    ...
}
C#
Match m = Regex.Match(text, @"(Mr\.|Mrs\.|Dr\.) (\w+)");
if (m.Success) {
    string title = m.Groups[1].Value;
    string name = m.Groups[2].Value;
    ...
}
JavaScript
var match = /(Mr\.|Mrs\.|Dr\.) (\w+)/.exec(text);
if (match !== null) {
    var title = match[1];
    var name = match[2];
    ...
}
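
A runnable sketch of the Python extraction above, using made-up sample sentences. It also shows that re.search returns None when nothing matches, which is why the check on m is needed.

```python
import re

PATTERN = r'(Mr\.|Mrs\.|Dr\.) (\w+)'

m = re.search(PATTERN, 'Please welcome Dr. Smith today')
# groups() returns the captured groups in order, as a tuple.
title, name = m.groups()
# group(0) is the entire matched substring.
whole = m.group(0)

# No title in this string, so there is no match object to unpack.
no_match = re.search(PATTERN, 'Please welcome Smith today')
```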
