Re: [pcre-dev] Multisegment matching with pcre_exec()

Author: ND
Date:  
To: pcre-dev
New-Topics: [pcre-dev] Partial matching in PCRE
Subject: Re: [pcre-dev] Multisegment matching with pcre_exec()
> On Sun, 24 May 2009, ND wrote:
>
>> What do you think about adding following PCRE behavior:
>>
>> The return code PCRE_ERROR_MULTISEGMENT raised, and matching abandons
>> immediately if at any time during the matching process PCRE needs to
>> check (not bumpalong) the next symbol of subject string, but discovers
>> an end of string. An extra parameter - last_bumpalong_offset - is
>> returned.
>>
>> IMHO, it will allow to organize true multisegment matching.
>
> No. Multisegment matching is impossible with pcre_exec() because it has
> to be able to backtrack to any part of the string. Consider
>
> To do multi-segment matching, you need a searching strategy that scans
> the data string just once. This is provided by pcre_dfa_exec(). However,
> that imposes restrictions, such as no support for capturing parentheses.
>
> Philip
>


No. I don't mean that multisegment matching can be provided directly and
only by pcre_exec(). I wrote that, with the proposed pcre_exec() behavior,
the main application could organize true multisegment matching itself.

> ^(a.*z|something else)
> If it reads "a", then lots of characters, but no "z", it then has to
> backtrack right to the start of the string so that it can look for
> "something else".


Suppose we have the pattern ^(a.*z|something else) and a subject string
split into two segments: first 'abcd', then 'efz0'.
At present PCRE scans 'a', then scans 'bcd', then wants to check the next
symbol, but discovers the end of the string and backtracks to
<something else>. When the PCRE_MULTISEGMENT option is on, PCRE does not
backtrack; it immediately returns PCRE_ERROR_MULTISEGMENT and 1 (the last
bumpalong offset). The main application saves the part of the first
segment beginning at the last bumpalong offset and waits for the next
segment. When 'efz0' arrives, the main application concatenates it with
the saved part and passes 'abcdefz0' to a new pcre_exec() call.
'abcdefz' is returned, which is the right answer.
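
To make the application-side bookkeeping concrete, here is a minimal
sketch in C. Nothing in it is PCRE API: toy_exec() is a hypothetical
stand-in hard-wired to this one pattern, TOY_NEED_MORE plays the role of
the proposed PCRE_ERROR_MULTISEGMENT, and keep_from plays the role of the
last bumpalong offset. The toy counts offsets from 0, so for this
anchored pattern the whole first segment is retained. It only shows how
an application might retain data and retry; it says nothing about how
PCRE itself would implement the option.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes; neither exists in PCRE. */
enum { TOY_NOMATCH = -1, TOY_NEED_MORE = -2 };

/* Toy matcher hard-wired to ^(a.*z|something else).
 * Returns the match length, TOY_NOMATCH, or TOY_NEED_MORE when more data
 * may follow and the matcher would have to inspect a character beyond
 * the end of the subject. *keep_from receives the 0-based offset from
 * which the caller must retain data (always 0: the pattern is anchored). */
static int toy_exec(const char *s, size_t len, int more_may_follow,
                    size_t *keep_from)
{
    static const char alt2[] = "something else";
    const size_t alt2_len = sizeof alt2 - 1;
    size_t i;

    *keep_from = 0;

    if (len == 0 && more_may_follow)
        return TOY_NEED_MORE;           /* cannot check the first symbol */

    /* Alternative 1: a.*z */
    if (len > 0 && s[0] == 'a') {
        if (more_may_follow)
            return TOY_NEED_MORE;       /* greedy .* hit the segment end */
        for (i = len; i > 1; i--)       /* back off to the last 'z' */
            if (s[i - 1] == 'z')
                return (int)i;
    }

    /* Alternative 2: the literal "something else", tried from offset 0 */
    for (i = 0; i < alt2_len; i++) {
        if (i == len)
            return more_may_follow ? TOY_NEED_MORE : TOY_NOMATCH;
        if (s[i] != alt2[i])
            return TOY_NOMATCH;
    }
    return (int)alt2_len;
}

/* Feeds segments one at a time, retaining the unmatched tail between
 * calls. buf must be large enough for the retained data plus the next
 * segment. Returns the final result of toy_exec(). */
static int match_segments(const char *const *segs, size_t nsegs,
                          char *buf, size_t cap)
{
    size_t used = 0, keep_from = 0, k;
    int rc = TOY_NOMATCH;

    for (k = 0; k < nsegs; k++) {
        size_t n = strlen(segs[k]);
        assert(used + n <= cap);
        memcpy(buf + used, segs[k], n);         /* append new segment */
        used += n;

        rc = toy_exec(buf, used, k + 1 < nsegs, &keep_from);
        if (rc != TOY_NEED_MORE)
            break;

        /* Keep only the data from keep_from onward, then wait for more. */
        memmove(buf, buf + keep_from, used - keep_from);
        used -= keep_from;
    }
    return rc;
}
```

Called with the two segments 'abcd' and 'efz0', match_segments() gets
TOY_NEED_MORE for the first segment, retries on 'abcdefz0', and returns
7, the length of 'abcdefz'.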

The main application cannot organize true multisegment matching with
PCRE_PARTIAL.

> That is the way Perl-style, depth-first, matching works.

My proposal does not deviate from this behavior.

Michael
-- 
Written in the Opera browser's mail client: http://www.opera.com/mail/

Date: Fri, 29 May 2009 10:57:33 +0100 (BST)
From: Philip Hazel <ph10@???>
To: ND <nadenj@???>
Cc: pcre-dev@???
Subject: Re: [pcre-dev] Multisegment matching with pcre_exec()


On Wed, 27 May 2009, ND wrote:

> No. I don't mean that multisegment matching can be provided directly and
> only by pcre_exec(). I wrote that, with the proposed pcre_exec() behavior,
> the main application could organize true multisegment matching itself.


OK, I now understand what you are suggesting. Thanks for the
explanation. I will think about this when I next work on PCRE, but that
will not be for several months. I think that perhaps describing this as
"insufficient data" rather than "multisegment" would be clearer to
users, though of course the documentation could explain how to use it
for multiple segments.

How easy it would be to actually implement it is another matter.

Philip

--
Philip Hazel