
Readable and compositional regexes in Perl

September 29, 2010

Regexes don’t (always!) have to be an unreadable mess. For example, see this HN post on a little Clojure DSL for readable, compositional regexes. Here is the simple Clojure example that was given:

(def datestamp-re
  (let [d {\0 \9}]
    (regex [d d d d :as :year] \- [d d :as :month] \- [d d :as :day])))

And the equivalent Perl regex “DSL” can be equally lucid:

sub datestamp_re {
    qr/ (?<year> \d \d \d \d) - (?<month> \d \d) - (?<day> \d \d ) /x;
}

The two things that provide a little extra help to grok what's going on here are:

  1. The x modifier on the end of qr//, which allows whitespace and newlines to be sprinkled into your regex pattern without any effect on the pattern matching. See the Modifiers section of perlre.
  2. And “Named Capture Buffers”, which were added in Perl 5.10.
    (?<year> \d{4}) # stores pattern matched in "year" buffer

    The above not only gives a name to that capture buffer but also provides an excellent visual placeholder to help describe what you are trying to do with the regex.
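As a small sketch of both features working together: the x modifier also permits # comments inside the pattern, so a regex can be documented line by line. The variable name $datestamp_re and the sample date are only for illustration:

```perl
use strict;
use warnings;

# Same date pattern as above, spread over several lines.
# Needs perl 5.10 or later for the named captures.
my $datestamp_re = qr/
    (?<year>  \d{4} )   # four digit year
    -
    (?<month> \d{2} )   # two digit month
    -
    (?<day>   \d{2} )   # two digit day
/x;

if ('2010-09-29' =~ $datestamp_re) {
    printf "year:%s, month:%s, day:%s\n", @+{qw/year month day/};
}
```

One thing to watch for: a comment inside a qr// pattern must not contain the pattern delimiter (here /), because the delimiter is located before the comments are recognised.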

When processing named capture regexes the matches to patterns are recorded in the %+ hash variable:

for my $date (qw/2007-10-23 20X7-10-23/) {
    printf "year:%d, month:%d, day:%d\n", @+{qw/year month day/}
        if $date =~ datestamp_re;
}

# => year:2007, month:10, day:23
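One caveat worth a sketch: %+ always refers to the last successful match, so it is safest to copy out what you need before running another regex. The helper name parse_date below is hypothetical and not part of the original example:

```perl
use strict;
use warnings;

sub datestamp_re {
    qr/ (?<year> \d{4}) - (?<month> \d{2}) - (?<day> \d{2}) /x;
}

# Hypothetical helper: returns a plain hash ref holding a copy of the
# named captures, or undef when the string contains no datestamp.
sub parse_date {
    my ($date) = @_;
    return unless $date =~ datestamp_re();
    return +{ %+ };    # copy, so later matches cannot change the values
}

for my $date (qw/2007-10-23 20X7-10-23/) {
    my $parts = parse_date($date);
    if ($parts) {
        printf "%s => year:%d, month:%d, day:%d\n",
            $date, @{$parts}{qw/year month day/};
    }
    else {
        print "$date => no match\n";
    }
}
```

Returning a copy means the caller can keep the result around for as long as it likes, whereas %+ itself is only good until the next successful match.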

This is much more flexible for dealing with regex captures than positional $1, $2, $3, etc. So the regex is not just more readable but also more compositional:

# nice readable regex
sub datestamp_re  {
     my $year  = qr/ (?<year>  \d{4}) /x;  
     my $month = qr/ (?<month> \d{2}) /x;
     my $day   = qr/ (?<day>   \d{2}) /x;
 
     qr/ $year - $month - $day /x;
}

or:

# DRY regex
sub datestamp_re {
    my %re = map { 
        my ($name, $digits) = @$_;
        $name => qr/ (?<$name>  \d{$digits}) /x;
    } [ year  => 4 ], [ month => 2 ], [ day   => 2 ];
    
    qr/ $re{year} - $re{month} - $re{day} /x;
}

and even:

# regex generator
sub re { qr/ (?<$_[0]> $_[1] )/x }

sub regex {
    my $pattern = join q{}, @_;
    qr/ $pattern /x;
}

sub datestamp_re {
    regex re( year => '\d{4}' ), '-', re( month => '\d{2}' ), '-', re( day => '\d{2}' );
}

Now that is a regex DSL :)
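To push the composition one step further, here is a rough sketch that builds a timestamp regex out of two smaller generated regexes. The subs time_re and timestamp_re, the capture names hour and minute, and the 'T' separator are my own assumptions for illustration; re and regex are repeated from above so the snippet runs on its own:

```perl
use strict;
use warnings;

# The two generator subs from above.
sub re    { qr/ (?<$_[0]> $_[1] )/x }
sub regex { my $pattern = join q{}, @_; qr/ $pattern /x }

sub datestamp_re {
    regex re( year => '\d{4}' ), '-', re( month => '\d{2}' ), '-', re( day => '\d{2}' );
}

# Hypothetical addition: a time of day built the same way.
sub time_re {
    regex re( hour => '\d{2}' ), ':', re( minute => '\d{2}' );
}

# Whole regexes compose just like the smaller pieces do.
sub timestamp_re {
    regex datestamp_re(), 'T', time_re();
}

if ('2010-09-29T07:05' =~ timestamp_re()) {
    printf "%s-%s-%s at %s:%s\n", @+{qw/year month day hour minute/};
}
```

Because every piece carries its own capture name, the combined regex needs no renumbering of groups, which is exactly what makes positional $1, $2, $3 awkward to compose.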

Note that when the same name is used more than once in a regex, the %+ hash variable only gives you the first (leftmost) defined capture for that name:

use 5.010;  # enables say (and named captures need 5.10 anyway)

sub numbers_re {
    my $four  = qr/ (?<four> \d{4}) /x;
    my $two   = qr/ (?<two>  \d{2}) /x;
    qr/ $four - $two - $two /x;
}

if ('2007-10-23' =~ numbers_re) {
    say 'four => ', $+{four};
    say 'two  => ', $+{two};
}

# four => 2007
# two  => 10

To get to the second $two (i.e. 23), use the %- hash variable, which stores all the captures for the relevant named buffer in an array reference:

if ('2007-10-23' =~ numbers_re) {
    say 'two(s) => ', join ',' => @{ $-{two} };
}

# two(s) => 10,23
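To see the two variables side by side, here is a small sketch that walks over every name in %- and prints the %+ value next to the full list of captures. It repeats numbers_re so that it runs on its own; nothing else is assumed:

```perl
use strict;
use warnings;

sub numbers_re {
    my $four = qr/ (?<four> \d{4}) /x;
    my $two  = qr/ (?<two>  \d{2}) /x;
    qr/ $four - $two - $two /x;
}

if ('2007-10-23' =~ numbers_re()) {
    # keys %- lists every capture name used in the regex
    for my $name (sort keys %-) {
        printf "%s: first=%s all=%s\n",
            $name, $+{$name}, join(',', @{ $-{$name} });
    }
}

# four: first=2007 all=2007
# two: first=10 all=10,23
```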

/I3az/

PS. Please note that the WordPress syntax highlighter used is unfortunately upper-casing all code comments :(

Comments
  1. DATA
    September 30, 2010 7:05 am

    Thanks for pointing me to that, that’s cool!

  2. Martin
    September 30, 2010 9:07 am

    Nice posting draegtun. I’d seen named pattern captures in 5.10 but without some worked examples showing how much more readable they can be I’ve so far not used them. I will now. Thanks.

    BTW, the form to leave a comment here is almost unreadable in my firefox – the line around the entry boxes is so faint I did not even see it at first.

  3. October 8, 2010 10:02 am

    Many thanks DATA & Martin.

    Martin,

    i) Yes, it's surprising that a lot of the new Perl features don’t seem to get expanded on much beyond the Perl Delta docs (http://perldoc.perl.org/index-history.html). Still, it gives us the opportunity for something to blog about :)

    ii) re “blog form issues in Firefox”: there must be an issue with this WordPress theme. It's my favourite from a bad bunch that WordPress provide for free :(

    I’ll have a look to see if WP have provided something better now, but my gut feeling is to move the blog to a self-hosted (probably Perl based) solution. That would be a more configurable and safer long term solution (and probably lots more work!), especially as I nearly chose Vox.com before settling on WordPress, and look what's happened to them!

    regards Barry
