[sword-devel] HowTo: create ztext module?

Greg Hellings greg.hellings at gmail.com
Fri May 26 08:16:38 MST 2006


DM,

That seems to be the exact type of algorithm that David Cary was
suggesting.  It certainly does seem like it would have great
potential, does it not?

--Greg

On 5/25/06, DM Smith <dmsmith555 at yahoo.com> wrote:
> I'll reiterate my previous comment. The problem with many compression
> algorithms is that they are adaptive, constantly changing the
> "dictionary" based upon what was previously seen and the current
> "window".
>
> The way to do the compression such that it can be applied to a verse
> at a time is to do a two pass compression. The first pass analyzes
> the data to determine the dictionary, then the dictionary is used to
> compress the input.
>
> Decompression can be applied to any byte sequence. It may be
> necessary to skip some bits to synchronize, but it is easy to
> synchronize. Once synchronized, it is easy to match a byte sequence
> using some of the better string literal matching algorithms.
>
>
> On May 19, 2006, at 1:59 AM, David Cary wrote:
>
> > Dear SWORD developers,
> >
> >> From: L.Allan-pbio
> > ...
> >> I can think of several reasons for rawtext (non-compressed):
> > ...
> >> 2. Search speed can be significantly faster. ...
> >
> > That may be true for zText. However, other compression formats are
> > faster to search than plain text.
> >
> >> 3. It is easier to debug/examine a module. You can use a text
> >> editor ...
> >
> > I think this is the overwhelming reason in favor of plain text.
> > http://c2.com/cgi/wiki?PowerOfPlainText
> > has convinced me to stick with plain text format (and plain-text-like
> > formats, such as HTML) if at all possible.
> >
> >> From: L.Allan-pbio
> > ...
> >> I defined a sourceforge project BibleDb that would
> >> be optimized for Bible decompression/decryption/search speed (not
> >> necessarily for compression ratio).
> > ...
> >> BibleDB is only in pre-alpha stage.
> >> http://sourceforge.net/project/admin/?group_id=117234
> >
> > Interesting. I will look at this soon.
> >
> > Perhaps we can apply some of the ideas from this article:
> >
> > "Compression: A Key for Next-Generation Text Retrieval Systems"
> > by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo
> > Baeza-Yates
> > in
> > _Computer_ magazine November 2000
> >
> > Their decompressor takes 1, 2, or 3 whole bytes of compressed data
> > and decompresses (using a vocabulary list) into a whole word. This
> > makes many kinds of searches *much* faster. One can directly search
> > the compressed text for words or phrases, which turns out to be
> > faster than searching uncompressed text.
> >
> > (Rather than *uncompressing* the entire Bible, and comparing the
> > uncompressed Bible to the search string, we can *compress* just the
> > search string, then compare the compressed Bible directly to the
> > compressed search string).
> >
> > The article also has lots of other ideas about compressing indexes and
> > approximate-match searching.
> >
> >> From: L.Allan-pbio
> >> My limited experience is that if you don't have a large block of data
> >> (book), then the compression ratio isn't very good.
> >
> > That's very true. But I hope you can see that:
> > * Ziviani's technique *does* have a large block of data, so
> > potentially the compression ratio can be good. To give the best
> > compression, the compressor scans the entire Bible (in order to pick
> > out the most-common words and give them one-byte representations).
> > * Ziviani's technique lets you point to any word in the text with a
> > normal (byte) pointer and start decompressing immediately from that
> > point. The decompressor can decompress a single verse -- it doesn't
> > need to start at the first verse. (The decompressor needs more
> > information than just the compressed version of the verse -- it also
> > needs the global wordlist generated by the compressor).
> >
> > I am interested in other ways of decompressing just a verse or so,
> > without needing to decompress everything from the beginning (and which
> > still gives adequate compression).
> >
> > --
> > David Cary
> > http://theconnexion.net/compass/index.php/User:DavidCary
> > http://groups.google.com/groups/search?q=%22Compressing+the+Bible+for+a+PDA%22
> >
> > _______________________________________________
> > sword-devel mailing list: sword-devel at crosswire.org
> > http://www.crosswire.org/mailman/listinfo/sword-devel
> > Instructions to unsubscribe/change your settings at above page
>
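
The scheme discussed in the quoted messages (one pass over the whole text to build a global word list, a second pass that replaces each word with a short byte code, per-verse decompression, and searching by compressing only the search string) can be sketched roughly as below. This is an illustrative Python sketch, not SWORD code and not the coding used in the cited article: the tagged base-128 byte code is a simplified stand-in, and every name here (build_vocabulary, encode_rank, compress, decompress, find_phrase) is hypothetical.

```python
import re
from collections import Counter

# Every character is either \w or \W, so the tokens rebuild the text exactly.
TOKEN = re.compile(r"\w+|\W+")

def build_vocabulary(verses):
    """Pass 1: rank every token in the whole text by frequency."""
    counts = Counter(tok for v in verses for tok in TOKEN.findall(v))
    return [tok for tok, _ in counts.most_common()]

def encode_rank(rank):
    """Encode a rank as one or more bytes; only the first byte has bit 7 set."""
    digits = [rank & 0x7F]
    rank >>= 7
    while rank:
        digits.append(rank & 0x7F)
        rank >>= 7
    digits.reverse()
    digits[0] |= 0x80  # tag the first byte so a decoder can find code boundaries
    return bytes(digits)

def compress(verse, codes):
    """Pass 2: replace each token with its fixed byte code."""
    return b"".join(codes[tok] for tok in TOKEN.findall(verse))

def decompress(data, vocab):
    """Decode any byte slice; stray leading continuation bytes are skipped."""
    out, i = [], 0
    while i < len(data) and data[i] < 0x80:  # synchronize to a code boundary
        i += 1
    while i < len(data):
        rank = data[i] & 0x7F
        i += 1
        while i < len(data) and data[i] < 0x80:
            rank = (rank << 7) | data[i]
            i += 1
        out.append(vocab[rank])
    return "".join(out)

def find_phrase(phrase, compressed_verses, codes):
    """Compress only the phrase, then scan the compressed verses for it."""
    try:
        needle = compress(phrase, codes)
    except KeyError:  # some token of the phrase never occurs in the text
        return []
    hits = []
    for n, data in enumerate(compressed_verses):
        pos = data.find(needle)
        while pos != -1:
            end = pos + len(needle)
            # Accept only matches that end on a code boundary.
            if end == len(data) or data[end] >= 0x80:
                hits.append(n)
                break
            pos = data.find(needle, pos + 1)
    return hits

# Tiny sample text; a real module would run pass 1 over the entire Bible.
verses = [
    "In the beginning God created the heaven and the earth.",
    "And the earth was without form, and void; and darkness was upon "
    "the face of the deep.",
    "And God said, Let there be light: and there was light.",
]
vocab = build_vocabulary(verses)
codes = {tok: encode_rank(i) for i, tok in enumerate(vocab)}
packed = [compress(v, codes) for v in verses]

print(decompress(packed[1], vocab))           # one verse, decoded on its own
print(find_phrase("the earth", packed, codes))
```

Because the codes are fixed after the first pass, each verse can be stored and decoded independently as long as the global word list is available. In this sketch the 128 most frequent tokens get one-byte codes, and matching is on whole tokens, so a search for "earth" would not match a longer word that merely contains it.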


