[sword-devel] Comming soon: new improved sword searching
Lamar Owen
sword-devel@crosswire.org
Mon, 9 Sep 2002 14:30:45 -0400
On Monday 09 September 2002 12:27 pm, Matthew Donadio wrote:
> Lamar Owen wrote:
> > Of course, if the existing one is DFA, and the replacement is NFA,
> > then some regexes may break in subtle ways.... See the O'Reilly regex
> > book for details on how deterministic finite automatons and
> > non-deterministic finite automatons differ from the point of view of
> > crafting regexes.
> Can you elaborate on this? I know the theory behind this, and am a bit
> confused by the statment. It is pretty straightforward to convert a DFA
> into an RE. The textbook method for converting an RE into a DFA is RE
> -> NFA+e moves -> NFA -> DFA. You then can perform minimization of the
> DFA to get the optimum one.
But that's on the engine side. I'm referring to the difference between
crafting the regex itself. Things such as 'Is alternation greedy' are
dependent upon whether the regex engine is DFA, traditional NFA, or POSIX
NFA.
Plus, the exact form of the regex matters more with an NFA engine, thanks to
its regex-directed nature -- you have 'wiggle' room in crafting your regexes.
To the text-driven DFA, the exact from of the regex isn't nearly as
important.
The presence of backreferences pretty well tags the engine as being NFA. The
Perl engine is decidedly NFA. Some might even say it is irregular....
When searching Bible phrases, the difference between the NFA's matching
behavior and the DFA's 'longest leftmost' behavior might cause confusion in
search results. That is, the same regex fed to an NFA might find a different
set of matches than the same regex fed to a DFA engine. Of course, the POSIX
NFA matches longest leftmost, too, further confusing the issue.
For simple searching, where backreferences aren't important, a DFA might be a
better way to go. AFAIK, a DFA engine can be found in original awk,
Kernighan's nawk, and a few versions of GNU awk, as well as Aho's original
egrep -- oddly enough, original grep was an NFA. But GNU grep is mostly DFA.
So, is the existing GNU regex library DFA or NFA? (I don't know -- haven't
looked at the code). If DFA, we would be subbing the Perl engine, which, due
to the feature set Perl needs, is NFA.
Summary: For the DFA, regex 'crafting' isn't so important -- but for an NFA
proper regex crafting can mean substantial speed differences.
For more information, read 'Mastering Regular Expressions' by Jeffrey E F
Friedl from O'Reilly.
And I know my usage of these terms (as well as Mr. Friedl's) is somewhat
irregular, since the NFA and DFA concepts are used for much more than regular
expressions.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11