[sword-devel] Coming soon: new improved sword searching

Leon Brooks sword-devel@crosswire.org
Mon, 9 Sep 2002 09:35:14 +0800


On Mon, 9 Sep 2002 04:12, Chris Little wrote:
> On Sun, 8 Sep 2002, Jerry Hastings wrote:
>> At 12:48 AM 9/9/2002 +0800, Leon Brooks wrote:
>>> All verses containing two or more of God, Good or Greed: (g[ore]*d){2,}

>> I don't believe that gives the desired result. At least not in BibleCS.
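
One likely reason, sketched here with Python's re module as a stand-in (an 
assumption; the engine inside BibleCS may differ in other ways): the {2,} 
applies to the group itself, so the pattern only matches when the words sit 
directly next to each other. The "separated" pattern and the sample verse 
text below are hypothetical, for illustration only.

```python
import re

# The pattern as posted: {2,} repeats the group back to back,
# so it needs something like "godgood" with nothing in between.
adjacent = re.compile(r"(g[ore]*d){2,}", re.IGNORECASE)

# Hypothetical alternative: two whole-word occurrences with any
# text between them on the same line.
separated = re.compile(r"\bg[ore]*d\b.*\bg[ore]*d\b", re.IGNORECASE)

# Sample text for illustration, not taken from any Sword module.
verse = "God saw that it was good"


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


print(matches(adjacent, verse))
print(matches(separated, verse))
```

Note that g[ore]*d also accepts strings such as "gored" or "gd", so a real 
search would probably want an explicit alternation instead.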

> FWIW, we need to upgrade our regexp engine.

True.

> First it is GPL--this is
> the last GPL component in the library.  If it were replaced with something
> else, we could license Sword under non-GPL licenses to other entities
> (e.g. Bible societies that don't want to deal with GPL's restrictions) or
> put it out publicly under a license that we write that better meets our
> needs than the GPL.

For Bible Societies, at least, I would have thought that the GPL would be the 
perfect licence. This is predicated on the expectation that the Societies' 
primary goal is dissemination of the word.

> Second (and probably more immediately important) it
> doesn't handle UTF-8.
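
A minimal sketch of what "doesn't handle UTF-8" tends to mean in practice, 
using Python's re module as a stand-in for a byte-oriented engine (an 
assumption; the thread does not show how Sword's engine actually fails). The 
sample word is hypothetical test data.

```python
import re

# Five Greek letters, written as escapes so the code points are unambiguous.
text = "\u03bb\u03cc\u03b3\u03bf\u03c2"
raw = text.encode("utf-8")  # each of these letters takes two bytes

# A byte-oriented matcher treats every byte as one "character".
byte_hits = re.findall(rb".", raw)
# A Unicode-aware matcher treats every code point as one character.
char_hits = re.findall(r".", text)

# "Exactly five characters" fails against the raw bytes...
five_bytes = re.fullmatch(rb".{5}", raw)
# ...but succeeds against the decoded string.
five_chars = re.fullmatch(r".{5}", text)

print(len(byte_hits), len(char_hits))
print(five_bytes is None, five_chars is not None)
```

The same mismatch affects character classes and case folding, which is 
probably where it would hurt most in verse searches.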



> Perl Regexp fixes both of these problems.

There is Rx - http://ftp.gnu.org/pub/gnu/rx/rx-1.5.tar.gz - which fixes the 
parenthesis problem - such as it is - but doesn't mention UTF-8. I regard the 
GPL as a significant feature, not a problem. The archive contains the 
following interesting quote:

begin  quoted text
                     The Regexp Library Cook-off

Rx is, among other things, an implementation of the interface
specified by POSIX for programming with regular expressions.  Some
other implementations are GNU regex.c and Henry Spencer's regex
library.

If you are maintaining a program or library that includes a regexp
matcher, you might want to consider which regexp implementation is
best.  Regexp matchers are very complicated; they are hard to get
right, hard to make fast and efficient, and hard to evaluate.
Therefore, choosing the best implementation for your needs is no easy
task; neither is maintaining an implementation.

To my knowledge, there are no comprehensive, free-software test suites
to help you evaluate regex function implementations.  This release of
Rx includes some tests to try to help fix that.  The release includes
test programs which you can use to measure some aspects of the
correctness and performance of your favorite POSIX regexp library.  If
you use these, please consider adding new tests to the collection and
sending them to the author of Rx.

end  quoted text

Henry Spencer's regex, mentioned therein, is at http://arglist.com/regex/ and 
includes a list of other libraries and resources.

> If there are other quirks in the GNU Regexp implementation like you
> mention, we can pray that Perl Regexp fixes those also.

Umm... given that they're both Open Source, no matter which library one 
chooses, one is able to follow through considerably on one's own prayers.

One positive consideration of the licensing for the Perl regex is that it 
doesn't preclude switching to GPL later.

Cheers; Leon