[pcre-dev] confirmation on correct use for GlobalReplace

Author: David Byron
Date:  
To: pcre-dev
Subject: [pcre-dev] confirmation on correct use for GlobalReplace
My goal is to remove all instances of the word "the" (and any surrounding
whitespace) from a string. The code I started with is:

    pcrecpp::RE_Options options;
    options.set_utf8(true).set_caseless(true);
    pcrecpp::RE regex("(^|\\s+)The($|\\s+)", options);

    regex.GlobalReplace("", &some_string);


This works for most strings, but not for "The the". It took me a while to
work out why, but I think I understand that much. The first match consumes
"The " (start of string, the word, and the trailing space), which leaves
only "the" to scan. That remainder no longer matches the regex: "the" is
not at the start of the original string, and the whitespace that preceded
it was already consumed by the first match.
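
To make the failure concrete, here is a minimal sketch of a single global
pass. It uses std::regex from the C++ standard library in place of pcrecpp
(an assumption, so that it builds without PCRE), and the function name
remove_the_single_pass is made up for illustration.

```cpp
#include <regex>
#include <string>

// One global pass with the original pattern. std::regex stands in for
// pcrecpp::RE here (assumption: used only to keep the sketch self-contained).
std::string remove_the_single_pass(const std::string& input) {
    static const std::regex pattern(R"re((^|\s+)the($|\s+))re",
                                    std::regex::icase);
    // Each match is replaced with the empty string; scanning resumes
    // right after the text the previous match consumed.
    return std::regex_replace(input, pattern, "");
}
```

With this sketch, "The the" comes back as "the" after one pass, while
"The foo the" comes back as "foo".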

I could remove the "($|\\s+)" from the regex, but that makes other things
fail (e.g. "The foo the" becomes " foo" instead of "foo"). Other
modifications I've come up with cause other failures too.
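
The leftover leading space can be reproduced with a short sketch. As an
assumption, it uses std::regex rather than pcrecpp so it has no external
dependency; strip_without_trailing_group is a hypothetical name.

```cpp
#include <regex>
#include <string>

// Variant of the pattern with the trailing ($|\s+) group removed.
// std::regex is used instead of pcrecpp (assumption, for self-containment).
std::string strip_without_trailing_group(const std::string& input) {
    static const std::regex pattern(R"re((^|\s+)the)re", std::regex::icase);
    return std::regex_replace(input, pattern, "");
}
```

Here "The foo the" becomes " foo": the space after the first "The" is no
longer part of any match. The variant also stops requiring anything after
"the", so a word such as "there" is cut down to "re".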

My tests pass if I call GlobalReplace in a loop, like this:

    int num_replacements;
    do {
        // GlobalReplace returns the number of replacements it made.
        num_replacements = regex.GlobalReplace("", &some_string);
    } while (num_replacements > 0);


but I'm curious whether this is a normal or reasonable thing to do, or
whether there's a smarter regex that does everything in one call (with
GlobalReplace or, I suppose, another function).
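
One candidate for a single-call pattern is to let the regex absorb a whole
run of consecutive "the" words in one match, by adding a "(\s+the)*" group.
The sketch below compares that candidate with the loop. It is untested
against pcrecpp: std::regex is used instead (an assumption, to keep it
self-contained), and both function names are hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>

// Reference behavior: reapply the original pattern until nothing changes.
std::string remove_the_looped(std::string text) {
    static const std::regex pattern(R"re((^|\s+)the($|\s+))re",
                                    std::regex::icase);
    for (;;) {
        std::string next = std::regex_replace(text, pattern, "");
        if (next == text) {
            return text;  // fixed point reached, no more matches
        }
        text = std::move(next);
    }
}

// Candidate one-pass pattern (an assumption, not a verified pcrecpp recipe):
// (\s+the)* extends a match over any directly following "the" words, so the
// whitespace between them is never consumed by a separate earlier match.
std::string remove_the_one_pass(const std::string& text) {
    static const std::regex pattern(R"re((^|\s+)the(\s+the)*($|\s+))re",
                                    std::regex::icase);
    return std::regex_replace(text, pattern, "");
}
```

On "The the", "the the the", and "The foo the" the two functions agree in
this sketch ("", "", and "foo"). They are not guaranteed to agree on every
input: the loop can also remove a "the" that only forms once earlier
removals join two fragments, which the single pass leaves alone.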

Thanks much for your help.

-DB