Re: [pcre-dev] determine regular expression specificity

Top Page
Delete this message
Author: Philip Hazel
Date:  
To: Praet Paul
CC: pcre-dev@exim.org
Subject: Re: [pcre-dev] determine regular expression specificity
On Wed, 1 Sep 2010, Praet Paul wrote:

> Given the following regular expressions:
>  - alice@[a-z]+\.[a-z]+
>  - [a-z]+@[a-z]+\.[a-z]+
>  - .*
>
> The string alice@??? will obviously match all three regular
> expressions. In the application I am developing, we are only
> interested in the 'most specific' match. In this case this is
> obviously the first one. Unfortunately there seems no way to do this.


You could proceed as follows:

Write one long pattern with each of yours as a named alternative:

(?<nameA>patternA) | (<?nameB>)patternB | ...

Make patternA the most specific, patternB the next most, etc. Then,
after a match, see which of nameA, nameB, etc is set.

> A possible way would be to keep the regular expressions sorted on
> descending specificity and then simply take the first match. Of course
> then the next question would be how to sort the array of regular
> expressions. It is not an option to give the responsibility to the
> end-user to ensure that the array is sorted.


Ah. That is a much more difficult question. I suspect that specificity
is in the eye of the beholder, that is, it may not be something that can
easily be turned into an algorithm. The only thing I can think of is
perhaps to count up the number of literal matches versus character
classes such as [a-z] versus complete wild cards such as .* and generate
some kind of measure from that. Suppose you count 2 for each literal, 1
for each class, and 0 for anything else. Then your three patterns above
come out at 16, 7, and 0. (I haven't considered repetition.) That's just
an idea off the top of my head; it may not fly in practice.

How to do this for an arbitrary pattern? You *could* make use of
pcretest, which has an option -b for showing the compiled bytecode. This
might be easier to parse with something like a Perl script than
re-interpreting the regex yourself. Or maybe not...

Philip

PS The speed of this response is pure luck. I am NOT online all the
time!

--
Philip Hazel