[sword-devel] HowTo: create ztext module?

DM Smith dmsmith555 at yahoo.com
Thu May 25 19:32:22 MST 2006


I'll reiterate my previous comment. The problem with many compression  
algorithms is that they are adaptive, constantly changing the  
"dictionary" based upon what was previously seen and the current  
"window".

The way to do the compression such that it can be applied to a verse
at a time is to use two-pass compression. The first pass analyzes
all of the data to determine a fixed dictionary; the second pass then
uses that dictionary, unchanged, to compress each piece of the input.
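
As a rough illustration of the two-pass idea (a sketch under stated
assumptions, not how zText or any SWORD driver actually works): the first
pass ranks every token in the whole text by frequency, and the second pass
encodes each verse on its own against that fixed vocabulary. The function
names and the byte layout (the first byte of a code has its high bit set,
continuation bytes do not) are invented for this example.

```python
import re
from collections import Counter

# Split text into alternating word and non-word tokens so that
# joining the tokens reproduces the input exactly.
TOKEN = re.compile(r"\w+|\W+")

def build_vocabulary(verses):
    """Pass 1: rank every token by frequency across the whole text."""
    counts = Counter(tok for v in verses for tok in TOKEN.findall(v))
    return [tok for tok, _ in counts.most_common()]

def encode_rank(rank):
    """Hypothetical byte code: the first byte carries the high bit,
    continuation bytes (7 bits each) do not."""
    out = bytearray([0x80 | (rank & 0x7F)])
    rank >>= 7
    while rank:
        out.append(rank & 0x7F)
        rank >>= 7
    return bytes(out)

def compress(text, codes):
    """Pass 2: encode one verse using only the fixed vocabulary."""
    return b"".join(codes[tok] for tok in TOKEN.findall(text))

def decompress(data, vocab):
    """Decode one verse; needs the vocabulary but no other verse."""
    out = []
    i = 0
    while i < len(data):
        rank = data[i] & 0x7F
        shift = 7
        i += 1
        # Continuation bytes have the high bit clear.
        while i < len(data) and data[i] < 0x80:
            rank |= data[i] << shift
            shift += 7
            i += 1
        out.append(vocab[rank])
    return "".join(out)

verses = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void;",
]
vocab = build_vocabulary(verses)
codes = {tok: encode_rank(rank) for rank, tok in enumerate(vocab)}
packed = [compress(v, codes) for v in verses]
```

Each entry in `packed` can be handed to `decompress` on its own, in any
order, because nothing in the code depends on what was seen earlier.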

Decompression can be applied to any byte sequence. It may be  
necessary to skip some bits to synchronize, but it is easy to  
synchronize. Once synchronized, it is easy to match a byte sequence  
using some of the better string literal matching algorithms.
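
To make the synchronizing and matching steps concrete, here is a minimal
sketch. It assumes a hypothetical byte-oriented code in which the first
byte of every codeword has its high bit set and continuation bytes do not;
that layout, and the names `encode_rank`, `sync` and `find_phrase`, are
assumptions for the example and not part of any SWORD module format. The
search phrase is compressed with the same dictionary and then matched
against the compressed bytes directly.

```python
import re

TOKEN = re.compile(r"\w+|\W+")

def encode_rank(rank):
    """Hypothetical code: tagged first byte, untagged continuation bytes."""
    out = bytearray([0x80 | (rank & 0x7F)])
    rank >>= 7
    while rank:
        out.append(rank & 0x7F)
        rank >>= 7
    return bytes(out)

def sync(data, pos):
    """From an arbitrary offset, skip forward to the next codeword start."""
    while pos < len(data) and data[pos] < 0x80:
        pos += 1
    return pos

def find_phrase(compressed, phrase, codes):
    """Return byte offsets where the compressed phrase occurs on
    codeword boundaries, without decompressing the text."""
    try:
        pattern = b"".join(codes[tok] for tok in TOKEN.findall(phrase))
    except KeyError:
        return []  # a token absent from the dictionary cannot occur
    if not pattern:
        return []
    hits = []
    start = compressed.find(pattern)
    while start != -1:
        end = start + len(pattern)
        # The pattern starts with a tagged byte, so a match always begins
        # on a codeword; it must also end on one.
        if end == len(compressed) or compressed[end] >= 0x80:
            hits.append(start)
        start = compressed.find(pattern, start + 1)
    return hits

# A tiny hand-made dictionary, purely for illustration.
vocab = [" ", "the", "earth", "and", "heaven"]
codes = {tok: encode_rank(rank) for rank, tok in enumerate(vocab)}
text = "the earth and the heaven"
compressed = b"".join(codes[tok] for tok in TOKEN.findall(text))
offsets = find_phrase(compressed, "the heaven", codes)
```

`bytes.find` stands in here for whichever string matching algorithm is
preferred; the boundary check after each hit is what keeps a short
codeword from matching the front of a longer one.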


On May 19, 2006, at 1:59 AM, David Cary wrote:

> Dear SWORD developers,
>
>> From: L.Allan-pbio
> ...
>> I can think of several reasons for rawtext (non-compressed):
> ...
>> 2. Search speed can be significantly faster. ...
>
> That may be true for zText. However, other compression formats are
> faster to search than plain text.
>
>> 3. It is easier to debug/examine a module. You can use a text  
>> editor ...
>
> I think this is the overwhelming reason in favor of plain text.
> http://c2.com/cgi/wiki?PowerOfPlainText
> has convinced me to stick with plain text format (and plain-text-like
> formats, such as HTML) if at all possible.
>
>> From: L.Allan-pbio
> ...
>> I defined a sourceforge project BibleDb that would
>> be optimized for Bible decompression/decryption/search speed (not
>> necessarily for compression ratio).
> ...
>> BibleDB is only in pre-alpha stage.
>> http://sourceforge.net/project/admin/?group_id=117234
>
> Interesting. I will look at this soon.
>
> Perhaps we can apply some of the ideas from this article:
>
> "Compression: A Key for Next-Generation Text Retrieval Systems"
> by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo
> Baeza-Yates
> in
> _Computer_ magazine November 2000
>
> Their decompressor takes 1, 2, or 3 whole bytes of compressed data
> and decompresses (using a vocabulary list) into a whole word. This
> makes many kinds of searches *much* faster. One can directly search the
> compressed text for words or phrases, which turns out to be faster
> than searching uncompressed text.
>
> (Rather than *uncompressing* the entire Bible, and comparing the
> uncompressed Bible to the search string, we can *compress* just the
> search string, then compare the compressed Bible directly to the
> compressed search string).
>
> The article also has lots of other ideas about compressing indexes and
> approximate-match searching.
>
>> From: L.Allan-pbio
>> My limited experience is that if you don't have a large block of data
>> (book), then the compression ratio isn't very good.
>
> That's very true. But I hope you can see that:
> * Ziviani's technique *does* have a large block of data, so
> potentially the compression ratio can be good. To give the best
> compression, the compressor scans the entire Bible (in order to pick
> out the most-common words and give them one-byte representations).
> * Ziviani's technique lets you point to any word in the text with a
> normal (byte) pointer and start decompressing immediately from that
> point. The decompressor can decompress a single verse -- it doesn't
> need to start at the first verse. (The decompressor needs more
> information than just the compressed version of the verse -- it also
> needs the global wordlist generated by the compressor).
>
> I am interested in other ways of decompressing just a verse or so,
> without needing to decompress everything from the beginning (and which
> still gives adequate compression).
>
> -- 
> David Cary
> http://theconnexion.net/compass/index.php/User:DavidCary
> http://groups.google.com/groups/search?q=%22Compressing+the+Bible+for+a+PDA%22
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page


