[sword-devel] diatheke search type regex and the dot ?

Troy A. Griffitts scribe at crosswire.org
Sun May 21 00:52:32 MST 2017


So, I did a little experimenting this weekend and found that the ICU
RegEx engine is actually really capable.

o  It's fast.

o  Its {n,m} quantifiers count characters instead of bytes.

o  It even works (though a little slowly) with lookaheads and
lookbehinds, e.g., for words in any order: (?=.*God)(?=.*world)(?=.*love)

    whereas that either fails to compile or simply doesn't work in our
other regex engine options.
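
To illustrate the last two points, here is a minimal sketch in Python.
It uses Python's built-in re module, not ICU, purely as a stand-in: str
patterns behave like a character-aware engine and bytes patterns like a
byte-oriented one. The sample verse text and all names in it are
assumptions made up for the illustration, and it says nothing about
ICU's speed or exact syntax.

```python
import re

# Words in any order: each lookahead must find its word somewhere
# after the start of the text, so the order of the words is irrelevant.
ANY_ORDER = re.compile(r"(?=.*God)(?=.*world)(?=.*love)")


def matches_any_order(text):
    """Return True if all three words occur somewhere in text."""
    return ANY_ORDER.search(text) is not None


# Characters versus bytes: the en dash is one character but three
# bytes (0xE2 0x80 0x93) in UTF-8.
EN_DASH = "\u2013"
TEXT = "a" + EN_DASH + "b"
TEXT_BYTES = TEXT.encode("utf-8")

# Character-aware pattern: the dot consumes the whole en dash.
CHAR_PATTERN = re.compile(r"a.{1}b")

# Byte-oriented patterns: the dot consumes a single byte, so one
# repetition is not enough and three are needed.
BYTE_PATTERN_1 = re.compile(rb"a.{1}b")
BYTE_PATTERN_3 = re.compile(rb"a.{3}b")

if __name__ == "__main__":
    print(matches_any_order("For God so loved the world"))
    print(matches_any_order("world without love"))
    print(CHAR_PATTERN.search(TEXT) is not None)
    print(BYTE_PATTERN_1.search(TEXT_BYTES) is not None)
    print(BYTE_PATTERN_3.search(TEXT_BYTES) is not None)
```

A byte-oriented engine that is never told its input is UTF-8 behaves
like the bytes patterns here: a single dot cannot span the en dash.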

So, I've added it as an option, --with-icuregex, and actually made it the
default in usrinst.sh.

You can check it out from trunk or else wait for the next RC.

Planning to look at the issues Peter mentioned and then push out another RC.

Troy


On 03/06/2017 06:17 PM, Troy A. Griffitts wrote:
>
> Yeah, so this page shows that C++11 regex is still mostly unsupported
> in gcc:
>
> http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1
>
> (see section 7)
>
> And I don't think the old-school GNU regex we use otherwise knows
> anything about wide chars.  It simply compares bytes and doesn't have a
> clue whether some should be considered part of the same character.  I
> suspect that's because nowhere do we tell it that we're giving it UTF-8.
>
> Ultimately my hope is that gcc will improve eventually and solve our
> problem for us.
>
> We could add an option to use ICU RegexMatcher, but I'm still holding
> out for our compiler.
>
> Troy
>
>
> On 03/06/2017 05:52 PM, Karl Kleinpaste wrote:
>> On 03/06/2017 05:25 PM, Greg Hellings wrote:
>>> being off by 2 would seem strange to me
>> I don't understand this question at all.
>>
>> 0xE2 = 226 = 0342
>> 0x80 = 128 = 0200
>> 0x93 = 147 = 0223
>>
>> There's no off-by error at all.
>>
>> "od" is the "octal dump" tool; given -c, it tries to dump characters,
>> but outside 7-bit ASCII, it still dumps octal.
>>
>> For those familiar with dc(1), this will make sense
>> $ dc
>> 8o
>> 226p
>> 342
>> 128p
>> 200
>> 147p
>> 223
>> 16i
>> 0XE2p
>> 342
>> 0X80p
>> 200
>> 0X93p
>> 223
>>
>> The interesting questions are why C++11 regex can't find /en dash/,
>> and why non-C++11 regex doesn't understand multibyte.
>>
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
