[sword-devel] Fast search -- some ideas

Nathan sword-devel@crosswire.org
Wed, 30 Aug 2000 09:36:48 +0200


Hello everyone!

There seems to be quite an interest in the fast
index searching, so I thought I would write up
some of my ideas. Maybe you can use this as
discussion points, disagree, improve, change it
or ignore it. :-)


Step 1. Building the wordlist

The first thing to do is get a list of all the
words in the file. This means reading through
each verse/paragraph and identifying each word.
This is fairly easy, as one can just convert
all punctuation characters to spaces, and then
get a list of the remaining words. Keep in mind
other languages have many other accented
characters, that you do not see those as
punctuation! Also keep the ' in mind e.g. "isn't"
is a valid word (else you chop it into "isn"
and "t").

Also convert the words to lower case. You can
always check for case sensitive or insensitive
at run-time (once you have found the verse/paragraph).
You might disagree here, but I found the wordlist
becomes too big, and searching the wordlist becomes
slower. And the index becomes bigger.

You might want to do some "stemming" (reducing the
number of words) by removing all words ending
with 's. You might want to do more stemming
e.g. removing the "-ly" at the end of words, or
eliminating plurals (most ending with "s") A lot
depends on your final view. Most times you will
want to give the person searching both the singular
and plural of a word, without them having to worry
about specifying both. There is plenty of work one
can do here.

Then accumulate these words into a dictionary (list
of the words) and a concordance (in which
verse/paragraph) they occur.
My wordlist for the KJV consists of 12560 unique words.


Step 2. The concordance

The format of the concordance can be in any of the
3 methods that Joe described (use the best method
for that specific word).
1. A list of pointers to the verse/paragraph: This
  works especially well with once-only words (occur
  only once in the document). Then it only uses 2 bytes.
2. A range list: Specifies a range at a time (4 bytes
  specify begin and end verse) e.g. Genesis 1:1-31 would
  4 bytes instead of 62. (Must admit this is a new one
  that I hadn't thought of.)
3. A "bitmap": This is a list of bits specifying
  whether the word occurs in a verse or not. As there
  are 31102 verses in the King James Bible, it would
  imply Genesis 1:1 is bit 0, 1:2 is but 2, etc. If
  the bit is set, then the word occurs in that verse.
  This is especially good for very common words
  e.g. "the". As there are 8 bits per byte, we can
  store 8 verses per byte, so we need 3888 bytes for
  this bitmap.

I used formats 1 and 3, and the size of the KJV index
was 0.89 Mb. [In short: Use a bitmap if the bitmap
becomes shorter than the list of pointers. There are
52 words where this happens. So we have 52x3888 bytes
(for the bitmaps) + 12508 words remaining x 27.16
(avg. occurrence of these words) x 2 bytes per pointer
+ 12560 (to indicate which format we are using)
= 0.89 Mb]
This could improve by using Joe's method 2 as well.

I have looked at storing everything as bitmaps.
Because these are very sparse bitmaps, they compress
very well. The resultant compressed size was 1.32 Mb.
Not worth the effort if I already got 0.89 Mb.

PS. Make sure the wordlist is sorted. It speeds
things up (about 8 times in most cases).


Step 3. The search

Single words: They are easy! Look up the word in
the list, get the pointer list or bitmap, and give
it back to the person.

Multiple words: Get the word from the wordlist. If
the format is not bitmap, convert it to bitmap. Get
next word. Get the bitmap. Then AND the 2 bitmaps.
Resultant is the list of all verses where both occur.
It is very fast, and works as well with frequently
occurring words as with others. Keep going through
the list of search words.

Wildcards e.g. "bapt*" or "s?n": This is almost the
same as the multiple words, except you OR the bitmaps.

NOT: Easy! Just NOT the bitmap.

Expressions: Here some parsing is needed. This is when
the person asks
e.g. "((Jesus and Son) not Father) or Christ*"
The expression will have to be parsed, e.g. into BNF,
and then ANDed, ORed, NOTed in the correct order.
The NOT part is the difficult one here :)

Range: Do the same as for the above. Then just trim the
bitmap to the specified range (might have a bitmap for
the range called "Genesis" which you then AND over this
one.) It will be easy and quick to generate one on the fly.

Proximity: This is when you want to have some words, but
they can span +/- 5 verses. You know the words are
close together, but not in the same verse/paragraph.
This is obviously more difficult, and that is why few
programs have it. I have not looked at this as yet either.
Just mentioning it here for completeness.

Other languages: Something To think about as well is to make
it easier for people searching other languages. This would
imply that people could type in "seen" and get "seën".
(Afrikaans word for "bless"). Sort of accent-insensitive.
Helps a lot with languages that have many accents.

Natural language queries: e.g. "Where in the Bible can I
read the 10 Commandments?" Skip this for now... <grin>

Spelling mistakes: If a word is not found, I was thinking
on using a "soundex" type routine to get the nearest words
to it. (soundex will associate "kayotik" with "chaotic")
I don't know if soundex is the correct one. There are better
routines. (Might even use something like PPM they use with
natural language queries.) Then suggest an alternative to
the person.
You need to look at this, especially with the way people
spell "color" vs. "colour", etc. The KJV uses "colour".
(I could throw in some flame-bait here about
"proper" English <grin>)


Step 4. Ranking

If you want to rank the search results, the way you build
the concordance will be slightly different. I this case
option 1 (list of pointers) will be pointer
+ number of occurrences (1 byte is enough).
Format 1 of the concordance will be 50% bigger
   (pointer = 2 bytes + 1 byte for occurrence)
Format 2 (haven't thought about this one yet)
Format 3 grows 8 times! (the bitmap now becomes
   a bytemap).
Using format 1 and 3, the index size grows to
(3 x 31102) + (43.85 x 3 x 12577) + 12560 = 1.76 Mb

The ranking is then done on a "majority rules" method.
The verse with the most occurrences get the top spot.
A better method would be to look at
  1. The book/chapter with the most hits
  2. Then the verse with the most hits.

Another approach to the searching/ranking would be
like the search engines. The verse/paragraph where
the most words occur get the top position. They also
look at which is the longest document with the most
search words occurring.


Step 5. Optionally compressing the text

Because all searching is basically done in the index,
the text does not have to be in uncompressed format
any more, and can be compressed. When a verse needs to
be displayed on the screen, it is uncompressed on the
fly and then displayed. This way you can get the entire
KJV plus index in just over 2 Mb.


Speed: You can really speed things along by having the
wordlist in memory. It is very small (89.3K). The
concordance part you can leave on disk.


I spent quite some time researching and thinking about
this, so hopefully it saves someone somewhere some time!


God bless you,
nathan
http://www.nathan.co.za

I thought these verses were sort of applicable... :)

"Then shalt thou enquire, and make search, and ask
   diligently; ..." - Deuteronomy 13:14

"... the honour of kings is to search out a matter."
   - Proverbs 25:2

"And ye shall seek me, and find me, when ye shall search
   for me with all your heart." - Jeremiah 29:13