[sword-devel] search idea

Trevor Jenkins sword-devel@crosswire.org
Thu, 23 Dec 1999 21:29:04 +0000


On Thursday, 23 December, 1999 18:17:40, Daniel Glassey 
<danglassey@yahoo.com> wrote:

> I don't know if any other programs can do this yet but it would be
> great if you could do more complicated searches. Like search for
> words in the same 'sentence' or 'paragraph' or 'section' as well as
> verses and chapters. Boolean searches (I'm not certain but I don't
> think it's in there yet) would be great as well.

Because of my previous job (20+ years as a text retrieval specialist) I
expect these features in any application that searches text. I'm frustrated
by the Bible search programs I use currently (Online Bible for Mac and
Compton's Interactive Bible*) because they don't offer sufficiently flexible
search operators. And grep well it's okay for ad hoc searches but it isn't
good enough for real work.

What I expect to use are word proximity (this within n words of that),
within sentence, within verse, within paragraph, within chapter, within
book, within text. I'm assuming a biblical structure here but the "schema"
could be extended to cover any textual structure. These should all be
combinable with boolean operators of AND, OR, NOT, XOR. It is possible to
see "within paragraph" as a restriction on AND.

I can see a (minor) complication with the "within sentence" and "within
verse" operators as there are a few occurences in the scriptures of a
sentence that extends over several verses and a few occurences of a verse
that extends over several sentences. The UBS NA27 text has a specific
indicator of | to disambiguate this situation. Not insurmountable but could
be troublesome.

There would be design decisions such as what is the extent of the booleans?
Should it be verse? (This is how OLB for Mac does it.) Then should the
proximity measures extend across verse boundaries, sentences, chapters?
(I've come across situations where these are arbitary limits.)

How would search of non-latin texts be done? Greek and Hebrew. If nothing
else how do you type a greek or hebrew glyph into the search request?

Should there be a parallel search operator? That is can one search several
texts in "parallel" and what should the result be? Are all matching verse
found? Ought it to give the union of all matching verses from all (open)
translations. Should commentaries be searched at the same time?

> Does anyone know how these might be done?

A grep-like scheme is not efficient for this task. So it is fundamental when
a text is made available to SWORD for searching that that text must be
scanned and all occurences of words identified. An index file is then
written that lists each such word and all the position information (text,
book, chapter, verse, sentence, word). Can you say "inverted file"? I knew
you could. All lookups are then made in the inverted file rather than the
text itself.

So that searching an index is fast the words are hashed. Yes, I know that
most of the text retrieval systems use B-trees for index files but they are
wrong. Hashing is faster; remember Data Structures 101. Also there is no
reason for the index files to extend as a general text retrieval needs. Even
then it is possible to have extensible hashes. Per Ake Larson and others did
lots of research on this topic in the mid-80s. The text retrieval system
that I worked on Paralog's TRIP has extensible hashed index files. The trick
is to ignore Maurer's stupid suggestion of using a prime number for the size
of the hash table; instead you make it a power of 2.

If the index file is hash then there is a knock on effect to pattern
matching expressions in the search requests. Even that can be done using
hashed index files. I know it can and I know how! That's what we did in
TRIP. Can you say invert the invert file? I knew you could. :-)

> This is a library level thing but would be great for bibletime as well.

Of course, if it becomes a library level thing then none of the bets are
off. It will work. Again we did it in TRIP. What an API level has is a
general structure. The application (in our case SWORD) instantiates a
specific schema (or schemas) for the translations.

The real secret is how do you reduce the size of the inverted file(s). When
I started in text retrieval the expection was that an inverted file involved
a minimum of 150% overhead. It is now possible to have that down to 40%. Can
you say compression? I knew you could. :-) And if you compress the words in
the inverted file then it can be compressed even further---though personally
I don't subscribe to that idea.

There are optimisation to be had in the boolean and proximity operator
implementation to.

Now all this is a first pass and nothing that my "post viral fatigue" has
interferred with. I'll need to be more clear headed to come up with
specification and implementation details but for the moment the above notes
will demonstrate that I've a solid grasp on the issues involved. Whilst
you're all digesting them I'll be getting better. :-)

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living
language. So recognise it now.

--

<>< Re: deemed!