[sword-devel] Fast search -- some ideas

Nathan sword-devel@crosswire.org
Thu, 31 Aug 2000 22:39:57 +0200


Hello,

Good, at least my bit of writing got some response <grin>

Trevor Jenkins wrote:
> (For those who don't know or have forgotten I worked on
> such document indexing systems for 20 years.)

I'm definitely the amateur. :)
Been playing with it for a few years, but not worked with it.

> Anyone wanting to get to grips with the underlying concepts
> I'd recommend the excellent text "Managing Gigabytes" by Witten,
> Moffat and Bell. The second edition was published late last
> year by Morgan Kaufmann.

One sees their names pop up all over when you read about this topic,
esp. in the research papers.

> There are some oddities in the Bible text. For example, paragraphs
> that are split across verses, verses that contain the end of one
> paragraph and the start of the next, verses that include complete
> sentences, sentences that cover several verses. Such issues become
> important for collocation searches (e.g. within sentence, within verse,
> within paragraph, within N words of, word adjacency).

You are definitely taking it a few steps further than I did. I did not
think about that level.

> There are very few instances where case needs to be considered.
> Some "commercial" competitors of Sword distinguish between words like
> LORD and lord, making it very difficult to find passages when one does
> not remember the typographic conventions used in a particular translation.

I fully agree!

> There are some languages (and English is one of them) where
> conjugation of verbs cannot be handled by a simple stemming scheme as
> you describe. When working with other Latin languages (eg Finnish)
> there are real problems with this.

There are plenty of examples one can think of where any program will
just make a mess of it.
-ly: Take "early" -- the program will "stem" it to "ear", which is wrong.
Even a simple trailing -s can cause problems: deciding whether the word
is a plural or not, and what the singular form is.
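
To make the "early" problem concrete, here is a minimal sketch of a
naive suffix stripper in Python. The suffix list and the function name
naive_stem are made up for illustration; this is not a real stemming
algorithm, just the simplest thing that shows the failure.

```python
# A deliberately simplistic, hypothetical stemmer: strip the first
# matching suffix without consulting any dictionary.
SUFFIXES = ("ly", "s")


def naive_stem(word):
    """Return word with the first matching suffix removed."""
    for suffix in SUFFIXES:
        # Keep at least two letters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


for w in ("quickly", "early", "kings", "Moses", "bless"):
    print(w, "->", naive_stem(w))
```

"quickly" and "kings" come out fine, but "early", "Moses" and "bless"
are all cut in the wrong place, which is the kind of mess meant above.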

I was thinking about a "manual" scheme. The KJV has about 12650
distinct words. One only needs a few people going through the complete
list, associating related words with one another. Such work is feasible
because in the case of the Bible you have a limited vocabulary that
stays fixed. That's what I meant by "There is plenty of work one
can do here."  Do you think a manual method would work?
The main problem is probably words with multiple meanings.
Here Hebrew is worse than Finnish or most other languages.
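
A rough sketch of what such a hand-built table could look like, in
Python. The entries and the names MANUAL_STEMS, headword and expand are
invented for illustration; a real table would cover the whole word list
and would need some way to handle words with more than one meaning.

```python
# Hypothetical hand-built table: every surface form is mapped by a
# human editor to a headword. Only a few sample entries are shown.
MANUAL_STEMS = {
    "love": "love", "loved": "love", "loveth": "love", "lovest": "love",
    "go": "go", "went": "go", "goeth": "go", "gone": "go",
    "early": "early",  # stays whole; no program guesses "ear"
}


def headword(word):
    """Look up the headword; unknown words map to themselves."""
    key = word.lower()
    return MANUAL_STEMS.get(key, key)


def expand(query_word):
    """Return every known form that shares a headword with query_word."""
    target = headword(query_word)
    forms = {w for w, h in MANUAL_STEMS.items() if h == target}
    forms.add(query_word.lower())
    return sorted(forms)


print(expand("went"))   # all forms filed under "go"
print(expand("early"))  # just "early"
```

A search for "went" would then be run against all the forms returned by
expand, so irregular verbs cost nothing extra once the table exists.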

Concerning the other languages: absolutely. It is very difficult,
and certainly not something a generic program can do.
I would say it is probably impossible for Hebrew.
In Afrikaans and Dutch, however, you can write some successful routines.

>> Step 2. The concordance
>> The format of the concordance can be in any of the
>> 3 methods...
>> 1. A list of pointers to the verse/paragraph: ...
>> 2. A range list: ...
>> 3. A "bitmap": ...
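
Since the details of the three formats are elided in the quote above,
here is one possible reading of them as a small Python sketch. It
assumes every verse has a sequential number starting at 0; the function
names are made up for illustration.

```python
def to_pointer_list(verses):
    """Format 1: a sorted list of the verse numbers containing the word."""
    return sorted(set(verses))


def to_range_list(verses):
    """Format 2: consecutive verse numbers collapsed to (first, last)."""
    ranges = []
    for v in sorted(set(verses)):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)  # extend the current run
        else:
            ranges.append((v, v))            # start a new run
    return ranges


def to_bitmap(verses):
    """Format 3: one bit per verse, packed into a single integer."""
    bits = 0
    for v in verses:
        bits |= 1 << v
    return bits


hits = [3, 4, 5, 9, 12, 13]  # hypothetical verse numbers for one word
print(to_pointer_list(hits))
print(to_range_list(hits))
print(bin(to_bitmap(hits)))
```

Roughly speaking, the pointer list suits rare words, the range list
suits words that cluster in runs of verses, and the bitmap suits very
common words and makes AND/OR/NOT into plain bit operations.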

> Because most words occur more than once even within a
> single verse/sentence/paragraph you can compress the position
> pointers to indicate that the terms appear in the
> same V/S/P as one another.

Could you explain this some more please?

> Yes but not
>> e.g. into BNF,
> One could express the search language in BNF but what you
> probably meant was RPN (reverse polish notation)

Oops! <blush> Yes, RPN.

>> The NOT part is the difficult one here :)
> Do you consider NOT to be monadic or diadic?

When applied to the "bitmap", it inverts it, so there it
is monadic.
It is in evaluating the search expression that it becomes
trickier.
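
A minimal sketch of evaluating an RPN search expression over bitmaps,
with NOT treated as monadic. It is written in Python; the toy INDEX,
the size of 8 verses and the function names are all assumptions made
for the example.

```python
TOTAL_VERSES = 8                    # hypothetical tiny "Bible"
ALL_BITS = (1 << TOTAL_VERSES) - 1  # mask with one bit per verse

# Hypothetical index: word -> bitmap, bit i set if verse i has the word.
INDEX = {
    "faith": 0b00101101,
    "hope":  0b00001110,
    "love":  0b11001000,
}


def eval_rpn(tokens):
    """Evaluate an RPN token list and return the resulting bitmap."""
    stack = []
    for tok in tokens:
        if tok == "NOT":                    # monadic: one operand
            operand = stack.pop()
            # Python's ~ gives a negative number, so mask it back
            # to the verses that actually exist.
            stack.append(~operand & ALL_BITS)
        elif tok in ("AND", "OR"):          # dyadic: two operands
            right = stack.pop()
            left = stack.pop()
            stack.append(left & right if tok == "AND" else left | right)
        else:                               # a search word
            stack.append(INDEX.get(tok.lower(), 0))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


def verses(bitmap):
    """List the verse numbers whose bit is set."""
    return [i for i in range(TOTAL_VERSES) if bitmap >> i & 1]


# "faith AND NOT hope" written in RPN:
print(verses(eval_rpn(["faith", "hope", "NOT", "AND"])))
```

The tricky part shows up in the mask: a monadic NOT has to know how
many verses exist, whereas a dyadic "AND NOT" could simply clear bits
in its left operand.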

>> Proximity: ...
> As might be guessed from my earlier comments this is an
> area where I have given a lot of thought to what is involved. :-)

I would gladly learn some more about this. This is where I believe
the real value of powerful searching can come in.

>> Spelling mistakes: ...
> But where are the spelling mistakes coming from?

The user. See my example about "color" vs. "colour", which
is probably not strictly a spelling mistake. (It depends on whether
you speak "proper" English <G>)
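
One way to catch this kind of user input, sketched in Python: compare
the query word against the (fixed) word list of the translation and
offer the closest matches. This uses difflib from the Python standard
library, which is my substitution and not necessarily the method I had
in mind originally; the VOCABULARY list and the cutoff of 0.8 are
invented for the example.

```python
import difflib

# Hypothetical slice of the word list of one translation.
VOCABULARY = ["coat", "cold", "collar", "colour", "colours", "comfort"]


def suggest(query_word, cutoff=0.8):
    """Return vocabulary words that closely resemble query_word."""
    word = query_word.lower()
    if word in VOCABULARY:
        return [word]  # exact hit, nothing to correct
    # get_close_matches returns the best matches first.
    return difflib.get_close_matches(word, VOCABULARY, n=3, cutoff=cutoff)


print(suggest("color"))
print(suggest("cold"))
```

Because the vocabulary of a Bible text is small and fixed, such a
lookup only has to run against the word list, never against the text.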

> Thou shalt not. :-)

Oops, ...

>> Step 4. Ranking...
> Hate it. Don't like it. Never use it. Precision/recall studies
> haven't demonstrated that ranking really works (for the end-user).

In spiritual matters a "majority rules" method does not work either.
That's why the Bible speaks about a "Kingdom" (One that rules).
Many times the Scripture you are searching for is the one without
the repetition of words.
Also, with Scripture the person searching has a rough idea of
where in the Bible it is expected. Ranking would just confuse that
"gut feel" they have.

>> Step 5. Optionally compressing the text
> With full collocation information (in a compressed index file) one can do
> away with the text completely. Okay so there is more work to be done when
> displaying the text but it's an option.

Is it strictly speaking still an "index file" in that case? :)

Just out of interest: what are your ideas on the format of such
a file? I haven't really thought about storing the collocation
information as well. But for proximity searching one would need it...
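
To check my own understanding, a small Python sketch of an index that
keeps word positions. With positions stored, a "within N words" search
becomes possible and the text can be rebuilt from the index alone. The
function names are made up, and the sketch ignores case, punctuation
and compression, which a real file format would have to deal with.

```python
from collections import defaultdict


def build_positional_index(text):
    """Map each word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(text.lower().split()):
        index[word].append(pos)
    return index


def within(index, a, b, n):
    """Return position pairs where a and b lie within n words."""
    return [
        (pa, pb)
        for pa in index.get(a, [])
        for pb in index.get(b, [])
        if pa != pb and abs(pa - pb) <= n
    ]


def rebuild_text(index):
    """Reconstruct the (lowercased) text from the positions alone."""
    slots = {pos: word for word, positions in index.items()
             for pos in positions}
    return " ".join(slots[i] for i in range(len(slots)))


text = "In the beginning God created the heaven and the earth"
idx = build_positional_index(text)
print(within(idx, "heaven", "earth", 3))
print(within(idx, "god", "earth", 3))
print(rebuild_text(idx))
```

So strictly speaking the "index" here is just the text stored in a
different order, which is what prompted my question above.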

God bless,
nathan