18 March 2012

Cross-Platform Regular Expressions

There used to be a lot of blathering about regexes here, but I think nobody would read it anyway, and I want a reference. It's also incomplete, and there is no logic whatsoever to which parts I chose. This is why I wanted colorful code tags.

The patterns in this post are the regex strings themselves, which may need to be escaped further in code. Perl's regexes are the basis of many other flavors, but tiny differences abound.
References: perlre; java.util.regex.Pattern; Python re; vimdoc *pattern*.
As expected, normal alphanumeric characters are regexes that match themselves and only themselves.
^ matches start of string.
$ matches end of string.
. matches any character except newline.
| matches the OR of two regexes (Vim: requires escaping as \|.)
(pattern) groups a regex (Vim: requires escaping as \(pattern\).)
[chars] matches any of a set of characters, e.g. c or h or a or r or s.
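
To make the basics concrete, here is a minimal sketch in Python's re module (one of the flavors referenced above). The sample strings are made up for illustration. Raw strings (r"...") are used so that the pattern reaches the regex engine without another layer of escaping.

```python
import re

# A raw string keeps backslashes intact: r"\d" is the two characters \ and d.
raw_equals_escaped = (r"\d" == "\\d")

# Literal characters match themselves; ^ and $ anchor to the ends of the string.
anchored = re.search(r"^cat$", "cat") is not None        # True
unanchored = re.search(r"^cat$", "concat") is not None   # False

# | is alternation and (...) groups it.
pieces = [m.group(0) for m in re.finditer(r"gr(a|e)y", "gray grey groy")]
# pieces == ['gray', 'grey']

# [chars] matches any single character from the set c, h, a, r, s.
char_set = re.findall(r"[chars]", "crash!")
# char_set == ['c', 'r', 'a', 's', 'h']

# . matches any character except a newline (unless re.DOTALL is given).
dot = re.findall(r"a.c", "abc a\nc a-c")
# dot == ['abc', 'a-c']
```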

Quantifiers

hee hee, h3 tags
Quantifiers (Vim calls them "multis") take the previous piece of regex and allow it to match repeatedly some number of times (possibly none).

*: 0 or more
+: 1 or more
?: 0 or 1
{5}: 5
{5,}: 5 or more
{5,10}: 5 to 10 inclusive
Python: {,10} is allowed (not in Perl before 5.34!)
Vim: \+, \?, and \{ must be escaped for the above meaning (under normal 'magic' setting). You can escape or not escape the }. \= is a synonym for \?, as a second question mark would delimit the offset in backward search. \{} or \{,5} is allowed.
Quantifiers bind tightly to the piece just before them, so 42+ matches strings like 42222, not 42424242.
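
A small Python sketch of the binding rule and the bracket forms, assuming Python 3.4+ for re.fullmatch. The test strings are made up.

```python
import re

# The quantifier applies only to the preceding piece: here, the single "2".
tight = re.fullmatch(r"42+", "42222") is not None          # True
not_repeat = re.fullmatch(r"42+", "42424242") is not None  # False

# Group first to repeat the whole "42".
grouped = re.fullmatch(r"(42)+", "42424242") is not None   # True

# Bracket quantifiers: {5}, {5,}, {5,10}.
exactly = re.fullmatch(r"a{5}", "aaaaa") is not None       # True
at_least = re.fullmatch(r"a{5,}", "aaaaaaa") is not None   # True
too_many = re.fullmatch(r"a{5,10}", "a" * 11) is not None  # False

# Python accepts an omitted lower bound, which means 0.
up_to = re.fullmatch(r"a{,10}", "") is not None            # True
```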

The above quantifiers are greedy; they try to match as much as possible. At the end of any of them, add an extra ? and it becomes reluctant (Java tutorial term), trying to match as little as possible. Alternatively, adding a + makes the quantifier possessive; it matches everything it can and no less, even if it takes up something the rest of the pattern might need.
Python (before 3.11), Vim: no possessives.
Vim: No adding ?s. Instead only bracket quantifiers can be made reluctant, by adding a dash, like \{-5,10}.
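
Greedy versus reluctant is easiest to see side by side. A sketch in Python with a made-up sample string; possessive quantifiers are left out because older versions of Python's re do not accept them.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.+>", text).group(0)
# greedy == "<b>bold</b> and <i>italic</i>"

# Reluctant: .+? stops at the first ">" that lets the match succeed.
reluctant = re.search(r"<.+?>", text).group(0)
# reluctant == "<b>"

# The same suffix works on bracket quantifiers.
bounded_greedy = re.search(r"a{2,4}", "aaaaa").group(0)      # "aaaa"
bounded_reluctant = re.search(r"a{2,4}?", "aaaaa").group(0)  # "aa"
```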

Character Classes

As we've seen [aeiou] matches one of the characters between the square brackets. Then there are a variety of commonly-occurring character-sets that people want to match. Some very common ones:
\w = "word" character (alphanumeric, underscore)
\d = digit
\s = whitespace (space, tab, newlines)
\h = horizontal whitespace (Perl 5.10+; Python's re has no \h, and in Vim \h means a head-of-word character, [A-Za-z_])
Each can be capitalized to match any character not in the class instead. Note that they might include foreign Unicode characters with the same properties; check your language.
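
A quick Python check of the shorthand classes and the Unicode caveat. The sample string is invented, and the re.ASCII flag is Python's way of restricting the classes to ASCII; other languages have their own switches.

```python
import re

sample = "id_42\tok\n"

words = re.findall(r"\w+", sample)      # ['id_42', 'ok']
digits = re.findall(r"\d+", sample)     # ['42']
spaces = re.findall(r"\s", sample)      # ['\t', '\n']

# Capitalized form: anything NOT in the class.
non_words = re.findall(r"\W", sample)   # ['\t', '\n']

# In Python 3 str patterns, \d also matches non-ASCII digits such as
# U+0663 (ARABIC-INDIC DIGIT THREE), unless re.ASCII is given.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None           # True
ascii_digit = re.fullmatch(r"\d", "\u0663", re.ASCII) is not None   # False
```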

Weird Stuff

X is a regex.
(?#X): nothing, merely a comment (not in Java)
(?:X): X, non-capturing
(?=X): X, via zero-width positive lookahead
(?!X): not X, via zero-width negative lookahead
(?<=X): X, via zero-width positive lookbehind
(?<!X): not X, via zero-width negative lookbehind
(?>X): X, as an independent, non-capturing group

As should be expected, Vim has different notation for every single one of these:
\%(X\): X, non-capturing
For other assertions, put one of \@=, \@!, \@<=, \@<!, \@> after a regex, probably a group.

Explanation: Zero-width assertions match an empty substring, but only if the text after (lookahead) or before (lookbehind) that point matches (positive) or does not match (negative) the pattern.

Independent group: It looks for the "first" way this pattern matches and commits to it, never backtracking into the group to try another. Same idea as possessive quantifiers.
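
Here is a sketch of the non-capturing group and the four zero-width assertions in Python, using the Perl-style notation from the first list. The sample strings are made up. The independent group (?>X) is skipped because older versions of Python's re reject it.

```python
import re

# (?:X) groups without capturing: only the (c) group is numbered.
nc = re.search(r"(?:ab)+(c)", "ababc")
whole, captured = nc.group(0), nc.groups()
# whole == "ababc", captured == ("c",)

# (?=X): "foo" only where "bar" follows; "bar" itself is not consumed.
ahead = re.findall(r"foo(?=bar)", "foobar foobaz")
# ahead == ['foo']

# (?!X): "foo" only where "bar" does NOT follow.
not_ahead = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
# not_ahead == [7]

# (?<=X): digits only where a dollar sign comes right before.
behind = re.findall(r"(?<=\$)\d+", "$15 and 20")
# behind == ['15']

# (?<!X): whole numbers NOT preceded by a dollar sign.
not_behind = re.findall(r"(?<!\$)\b\d+", "$15 and 20")
# not_behind == ['20']
```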

Here's a Vim string for a standalone URL from the random rst syntax I'm staring at. Admire.
"\<\%(\%(\%(https\=\|file\|ftp\|gopher\)://\|\%(mailto\|news\):\)[^[:space:]'\"<>]\+\|www[[:alnum:]_-]*\.[[:alnum:]_-]\+\.[^[:space:]'\"<>]\+\)[[:alnum:]/]"
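
For comparison, an approximate hand translation of that pattern into Python syntax. The assumptions: \b stands in for Vim's \< (every alternative starts with a letter, so it behaves the same here), [[:alnum:]] is spelled out as ASCII [A-Za-z0-9], and [^[:space:]...] becomes [^\s...]. The name URL and the sample sentence are mine, not from the rst syntax file.

```python
import re

# Vim's \%( \) becomes (?: ), \| becomes |, \= becomes ?, \+ becomes +.
URL = re.compile(
    r"""\b(?:(?:(?:https?|file|ftp|gopher)://|(?:mailto|news):)[^\s'"<>]+"""
    r"""|www[A-Za-z0-9_-]*\.[A-Za-z0-9_-]+\.[^\s'"<>]+)[A-Za-z0-9/]"""
)

sample = "see http://example.com/path. or www.example.org, mailto:me@example.com"
found = [m.group(0) for m in URL.finditer(sample)]
# found == ['http://example.com/path', 'www.example.org', 'mailto:me@example.com']
```

The final [A-Za-z0-9/] is what keeps trailing punctuation, like the period and comma in the sample, out of the match.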


A weak Vim substitute command for detecting raw ampersands in XML, for some reason.
:%s/&\(amp;\|gt;\|lt;\|quot;\|#\)\@!/\&/gc
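
The same negative-lookahead idea in Python, as a rough sketch. RAW_AMP and the two helper functions are hypothetical names, and replacing with &amp; is my assumption about what a fix would look like. It is just as weak as the Vim version: any entity not in the list (&apos;, for instance) is flagged as a raw ampersand.

```python
import re

# "&" not followed by one of the listed entity names or by "#".
RAW_AMP = re.compile(r"&(?!amp;|gt;|lt;|quot;|#)")


def find_raw_ampersands(text):
    """Return the offsets of ampersands that do not start a listed entity."""
    return [m.start() for m in RAW_AMP.finditer(text)]


def escape_raw_ampersands(text):
    """Replace each raw ampersand with &amp; (assumed fix, see above)."""
    return RAW_AMP.sub("&amp;", text)


sample = "Tom & Jerry &amp; co &#38; x &lt; y"
offsets = find_raw_ampersands(sample)
# offsets == [4]
fixed = escape_raw_ampersands(sample)
# fixed == "Tom &amp; Jerry &amp; co &#38; x &lt; y"
```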

Also. Infamous RFC regex
