
Readable and compositional regexes in Perl

September 29, 2010

Regexes don’t (always!) have to be an unreadable mess. For example, see this HN post about a little Clojure DSL for readable, compositional regexes. Here is the simple Clojure example that was given:

(def datestamp-re
  (let [d {\0 \9}]
    (regex [d d d d :as :year] \- [d d :as :month] \- [d d :as :day])))

And the equivalent Perl regex “DSL” can be equally lucid:

sub datestamp_re {
    qr/ (?<year> \d \d \d \d) - (?<month> \d \d) - (?<day> \d \d ) /x;
}

The two things that provide a little extra help to grok what’s going on here are:

  1. The x modifier on the end of qr//, which allows whitespace and newlines to be sprinkled into your regex pattern without any effect on the pattern matching. See the Modifiers section of perlre.
  2. “Named Capture Buffers”, which were added in Perl 5.10.
    (?<year> \d{4}) # stores pattern matched in "year" buffer

    The above not only gives a name to that capture buffer but also provides an excellent visual placeholder to help describe what you are trying to do with the regex.
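As a minimal sketch of both points together, the same pattern can be spread over several lines, with a # comment beside each named capture (the x modifier permits these too). The sub name datestamp_commented_re is made up purely for this illustration.

```perl
use strict;
use warnings;

# Hypothetical name, used only for this illustration.
sub datestamp_commented_re {
    qr/
        (?<year>  \d{4} )   # four-digit year
        -
        (?<month> \d{2} )   # two-digit month
        -
        (?<day>   \d{2} )   # two-digit day
    /x;
}

if ('2010-09-29' =~ datestamp_commented_re()) {
    print "$+{year}, $+{month}, $+{day}\n";
}

# => 2010, 09, 29
```

One caveat with this layout: a comment inside the pattern must not contain the pattern delimiter (here a forward slash), because Perl finds the end of the pattern before it looks at comments.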

When a regex with named captures matches, the captured text is recorded in the %+ hash variable:

for my $date (qw/2007-10-23 20X7-10-23/) {
    printf "year:%d, month:%d, day:%d\n", @+{qw/year month day/}
        if $date =~ datestamp_re;
}

# => year:2007, month:10, day:23
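One thing worth keeping in mind is that %+ reflects the most recent successful match, so its contents change when another regex matches. A small sketch of one way to hold on to the captures is to copy them into an ordinary hash straight away. The helper name parse_date is an assumption for illustration only.

```perl
use strict;
use warnings;

sub datestamp_re {
    qr/ (?<year> \d{4}) - (?<month> \d{2}) - (?<day> \d{2}) /x;
}

# Hypothetical helper: returns the captures as a hash ref, or undef on no match.
sub parse_date {
    my ($text) = @_;
    return unless $text =~ datestamp_re();
    my %captures = %+;    # copy now, before a later match replaces %+
    return \%captures;
}

my $first  = parse_date('2007-10-23');
my $second = parse_date('2010-09-29');

printf "%s then %s\n", $first->{year}, $second->{year};

# => 2007 then 2010
```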

Named captures are a much more flexible way of dealing with regex captures than positional $1, $2, $3, etc. They are not just more readable but more compositional:

# nice readable regex
sub datestamp_re  {
     my $year  = qr/ (?<year>  \d{4}) /x;  
     my $month = qr/ (?<month> \d{2}) /x;
     my $day   = qr/ (?<day>   \d{2}) /x;
 
     qr/ $year - $month - $day /x;
}

or:

# DRY regex
sub datestamp_re {
    my %re = map { 
        my ($name, $digits) = @$_;
        $name => qr/ (?<$name>  \d{$digits}) /x;
    } [ year  => 4 ], [ month => 2 ], [ day   => 2 ];
    
    qr/ $re{year} - $re{month} - $re{day} /x;
}

and even:

# regex generator
sub re { qr/ (?<$_[0]> $_[1] )/x }

sub regex {
    my $pattern = join q{}, @_;
    qr/ $pattern /x;
}

sub datestamp_re {
    regex re( year => '\d{4}' ), '-', re( month => '\d{2}' ), '-', re( day => '\d{2}' );
}

Now that is a regex DSL 🙂
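To show the compositional side a bit more, here is a sketch that reuses the same two helpers to build a second, unrelated pattern. The timestamp_re sub and its hour/min/sec names are invented for this example, and re() is written with named arguments rather than $_[0] and $_[1] purely for readability.

```perl
use strict;
use warnings;

# Same idea as re() and regex() above, with the arguments unpacked by name.
sub re {
    my ($name, $pattern) = @_;
    qr/ (?<$name> $pattern ) /x;
}

sub regex {
    my $pattern = join q{}, @_;
    qr/ $pattern /x;
}

# Hypothetical second pattern built from the same helpers.
sub timestamp_re {
    regex( re( hour => '\d{2}' ), ':', re( min => '\d{2}' ), ':', re( sec => '\d{2}' ) );
}

if ('at 09:07:42 today' =~ timestamp_re()) {
    printf "hour:%s, min:%s, sec:%s\n", @+{qw/hour min sec/};
}

# => hour:09, min:07, sec:42
```

Because regex() simply joins its arguments into one pattern, a separator is treated as regex text rather than as a literal string. A separator such as '.' or a space would need quotemeta (or a backslash) to match literally under the x modifier.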

Note that when a name is used more than once, the %+ hash variable only holds the first (leftmost defined) capture for that name:

use 5.010;    # enables say

sub numbers_re {
    my $four  = qr/ (?<four> \d{4}) /x;
    my $two   = qr/ (?<two>  \d{2}) /x;
    qr/ $four - $two - $two /x;
}

if ('2007-10-23' =~ numbers_re) {
    say 'four => ', $+{four};
    say 'two  => ', $+{two};
}

# four => 2007
# two  => 10

To get at the second $two (i.e. 23), use the %- hash variable, which stores all the captures for a given named buffer in an array reference:

if ('2007-10-23' =~ numbers_re) {
    say 'two(s) => ', join ',' => @{ $-{two} };
}

# two(s) => 10,23
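As a small sketch along the same lines, %- can also be walked key by key to see how many captures each name collected. The loop and its output format are just one way of laying this out.

```perl
use strict;
use warnings;

sub numbers_re {
    my $four = qr/ (?<four> \d{4}) /x;
    my $two  = qr/ (?<two>  \d{2}) /x;
    qr/ $four - $two - $two /x;
}

if ('2007-10-23' =~ numbers_re()) {
    # Each value in %- is an array ref holding every capture for that name.
    for my $name (sort keys %-) {
        my @all = @{ $-{$name} };
        printf "%-4s has %d capture(s): %s\n", $name, scalar @all, join ',', @all;
    }
}

# => four has 1 capture(s): 2007
# => two  has 2 capture(s): 10,23
```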

/I3az/

PS. Please note that the WordPress syntax highlighter used is unfortunately upper-casing all code comments 😦

4 Comments
  1. DATA
    September 30, 2010 7:05 am

    Thanks for pointing me to that, that’s cool!

  2. Martin
    September 30, 2010 9:07 am

    Nice posting draegtun. I’d seen named pattern captures in 5.10 but without some worked examples showing how much more readable they can be I’ve so far not used them. I will now. Thanks.

    BTW, the form to leave a comment here is almost unreadable in my firefox – the line around the entry boxes is so faint I did not even see it at first.

  3. October 8, 2010 10:02 am

    Many thanks DATA & Martin.

    Martin,

    i) Yes, it’s surprising that a lot of the new Perl features don’t seem to get expanded on much beyond the Perl Delta docs (http://perldoc.perl.org/index-history.html). Still, it gives us the opportunity for something to blog about 🙂

    ii) re “blog form issues in Firefox”: there must be an issue with this WordPress theme. It’s my favourite from a bad bunch that WordPress provide for free 😦

    I’ll have a look to see if WP have provided something better now, but my gut feeling is to move the blog to a self-hosted (probably Perl based) solution. That would be a more configurable and safer long-term solution (and probably lots more work!), especially as I nearly chose Vox.com before settling on WordPress, and look what’s happened to them!

    regards Barry
