[sword-devel] Some other fast search index options

Brandon Staggs sword-devel@crosswire.org
Mon, 18 Sep 2000 13:45:28 -1000


> I recall reading (perhaps here) that algorithms like zlib and bzip2 would
> be inappropriate to compress Bible (or other) texts for this type of
> application because they are designed to compress and decompress entire
> files.  This in general seems beneficial, but when you want to view a
> single passage you would need to expand the entire file, therefore making
> lookups highly costly for resources and performance.

I don't know about bzip2, but zlib can compress buffers in memory. You can
easily compress each verse individually, and you will get compression (and
error checking in the process with Adler32). Obviously, the shorter the
buffer, the less compression you get, and for some verseslike John 11:35,
the fixed overhead will actually make the compressed text longer. If you
really want it tight, you can compress each chapter instead of each verse,
with some form of header that gives the verse offsets.

With an indexing scheme, this becomes very effective and there is extremely
little performance hit.

_Managing Gigabytes_ (mentioned before in this list) makes the case that
there is basically no excuse for not compressing your data. They show that
the performance of searching a text can actually be improved by compressing
it. And they are talking about texts many times larger than a Bible module.

I just finished implementing zlib compression in my own project
(SwordSearcher 4) and it is effective. It makes it worth the minimal effort.
It is especially effective on commentaries with large entries. But then
again, I also have indexing in SwordSearcher.

Additionally, a simple means of compression is to index the module, then
tokenize any words in the text that appear the exact same way in the word
list (and since your word list should be case-folded, not every word would
be tokenized). I've done this as well, and this makes a huge difference in
filesize. Obviously you only tokenize words long enough to make a
difference. Since you guys are using C, I would imagine you have to use
those horrid null-terminated strings, so you can simply convert each word
index into a base-205 string (which can store values over 40,000 in two
8-bit ascii characters starting with '0') and avoid any problems with
null-chars in the text.

Once you put your mind to do it, this is all fairly simple work, but it
would require a complete overhaul of the module format (or at least a new
format).

-Brandon