[sword-devel] HowTo: create ztext module?

David Cary d.cary+2004 at ieee.org
Thu May 18 22:59:22 MST 2006


Dear SWORD developers,

> From: L.Allan-pbio
...
> I can think of several reasons for rawtext (non-compressed):
...
> 2. Search speed can be significantly faster. ...

That may be true for zText. However, some other compression formats are
faster to search than plain text.

> 3. It is easier to debug/examine a module. You can use a text editor ...

I think this is the overwhelming reason in favor of plain text.
http://c2.com/cgi/wiki?PowerOfPlainText
has convinced me to stick with plain text format (and plain-text-like
formats, such as HTML) if at all possible.

> From: L.Allan-pbio
...
> I defined a sourceforge project BibleDb that would
> be optimized for Bible decompression/decryption/search speed (not
> necessarily for compression ratio).
...
> BibleDB is only in pre-alpha stage.
> http://sourceforge.net/project/admin/?group_id=117234

Interesting. I will look at this soon.

Perhaps we can apply some of the ideas from this article:

"Compression: A Key for Next-Generation Text Retrieval Systems"
by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo
Baeza-Yates
in
_Computer_ magazine, November 2000

Their decompressor takes 1, 2, or 3 whole bytes of compressed data
and expands them (using a vocabulary list) into one whole word. This
makes many kinds of searches *much* faster. One can directly search the
compressed text for words or phrases, which turns out to be faster
than searching uncompressed text.

(Rather than *uncompressing* the entire Bible and comparing the
uncompressed Bible to the search string, we can *compress* just the
search string, then compare the compressed Bible directly to the
compressed search string.)
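
As a rough illustration of searching the compressed text directly, here
is a minimal Python sketch. It does not reproduce the coding used in the
article; it assumes a simpler tagged byte code in which only the first
byte of each codeword has its high bit set. The function names
(build_vocab, codeword, compress, find_phrase) are made up for this
example, and words are split on whitespace only, so punctuation and
capitalization are not handled.

```python
from collections import Counter


def build_vocab(text):
    """Rank words by frequency so the most common words get the shortest codes."""
    counts = Counter(text.split())
    # Descending count, then alphabetical, so the ranking is deterministic.
    return sorted(counts, key=lambda w: (-counts[w], w))


def codeword(rank):
    """Encode a vocabulary rank as 1, 2, or 3 bytes (assumed tagged code).

    Only the first byte has its high bit set, so codeword boundaries can be
    recognized anywhere in the compressed data.
    """
    if rank < 128:
        return bytes([0x80 | rank])
    rank -= 128
    if rank < 128 * 128:
        return bytes([0x80 | (rank >> 7), rank & 0x7F])
    rank -= 128 * 128
    if rank >= 128 ** 3:
        raise ValueError("vocabulary too large for 3-byte codewords")
    return bytes([0x80 | (rank >> 14), (rank >> 7) & 0x7F, rank & 0x7F])


def compress(text, vocab):
    """Replace every whitespace-separated word with its codeword."""
    index = {w: i for i, w in enumerate(vocab)}
    return b"".join(codeword(index[w]) for w in text.split())


def find_phrase(compressed, phrase, vocab):
    """Return byte offsets where the phrase occurs, without decompressing."""
    index = {w: i for i, w in enumerate(vocab)}
    try:
        needle = b"".join(codeword(index[w]) for w in phrase.split())
    except KeyError:
        return []  # a word that is not in the vocabulary cannot occur
    if not needle:
        return []
    hits = []
    pos = compressed.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # The needle starts with a tagged byte, so a match always starts on a
        # codeword boundary. It must also end on one: the next byte has to be
        # the start of another codeword (or the end of the data).
        if end == len(compressed) or compressed[end] >= 0x80:
            hits.append(pos)
        pos = compressed.find(needle, pos + 1)
    return hits
```

With the text "in the beginning god created the heaven and the earth",
compress() produces 10 bytes (one per word), and
find_phrase(data, "the heaven", vocab) returns [5], the byte offset of
the sixth word.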

The article also has lots of other ideas about compressing indexes and
approximate-match searching.

> From: L.Allan-pbio
> My limited experience is that if you don't have a large block of data
> (book), then the compression ratio isn't very good.

That's very true. But I hope you can see that:
* Ziviani's technique *does* have a large block of data, so
potentially the compression ratio can be good. To give the best
compression, the compressor scans the entire Bible (in order to pick
out the most-common words and give them one-byte representations).
* Ziviani's technique lets you point to any word in the text with a
normal (byte) pointer and start decompressing immediately from that
point. The decompressor can decompress a single verse -- it doesn't
need to start at the first verse. (The decompressor needs more
information than just the compressed version of the verse -- it also
needs the global wordlist generated by the compressor.)
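
The two points above can be sketched in Python roughly as follows. This
is only an illustration under stated assumptions: it uses a simplified
tagged byte code (only the first byte of each codeword has its high bit
set), which is not necessarily the coding used in the article, and it
keeps a hypothetical table of verse start offsets. All function names
are invented for this example, and words are split on whitespace only.

```python
from collections import Counter


def codeword(rank):
    """Encode a vocabulary rank as 1, 2, or 3 bytes (assumed tagged code)."""
    if rank < 128:
        return bytes([0x80 | rank])
    rank -= 128
    if rank < 128 * 128:
        return bytes([0x80 | (rank >> 7), rank & 0x7F])
    rank -= 128 * 128
    if rank >= 128 ** 3:
        raise ValueError("vocabulary too large for 3-byte codewords")
    return bytes([0x80 | (rank >> 14), (rank >> 7) & 0x7F, rank & 0x7F])


def compress_verses(verses):
    """Scan the whole text once, then encode it verse by verse.

    Returns the global wordlist, the compressed bytes, and a list of byte
    offsets where each verse starts (plus one final end-of-data offset).
    """
    counts = Counter(w for verse in verses for w in verse.split())
    # The most common words come first and therefore get one-byte codes.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    index = {w: i for i, w in enumerate(vocab)}
    data = bytearray()
    offsets = []
    for verse in verses:
        offsets.append(len(data))
        for word in verse.split():
            data += codeword(index[word])
    offsets.append(len(data))
    return vocab, bytes(data), offsets


def decode_words(data, start, end, vocab):
    """Decode data[start:end]; start must point at the first byte of a codeword."""
    base = (0, 128, 128 + 128 * 128)  # rank offsets for 1-, 2-, 3-byte codes
    words = []
    pos = start
    while pos < end:
        if data[pos] < 0x80:
            raise ValueError("offset %d is not a codeword boundary" % pos)
        value = data[pos] & 0x7F
        nxt = pos + 1
        # Continuation bytes have the high bit clear.
        while nxt < end and data[nxt] < 0x80:
            value = (value << 7) | data[nxt]
            nxt += 1
        words.append(vocab[value + base[nxt - pos - 1]])
        pos = nxt
    return words


def decode_verse(data, offsets, n, vocab):
    """Decompress only verse n (0-based), using the global wordlist."""
    return " ".join(decode_words(data, offsets[n], offsets[n + 1], vocab))
```

Calling decode_verse(data, offsets, n, vocab) touches only the bytes of
verse n, but it does need the vocab list that compress_verses() built
from the entire text.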

I am interested in other ways of decompressing just a verse or so,
without needing to decompress everything from the beginning (and that
still give adequate compression).

-- 
David Cary
http://theconnexion.net/compass/index.php/User:DavidCary
http://groups.google.com/groups/search?q=%22Compressing+the+Bible+for+a+PDA%22


