[sword-devel] HowTo: create ztext module?
DM Smith
dsmith555 at yahoo.com
Tue May 9 08:10:14 MST 2006
L.Allan-pbio wrote:
>> While not simpler, there are ways to compress the files that don't
>> use stream compression such that each verse can be handled
>> independently.
>
> I'd be interested in how to do this, and still get decent compression.
The basic, overly simplified technique is to build a mapping from
characters seen to bit patterns output. A streaming (de)compression
starts with a default dictionary and as characters are seen, it adjusts
the dictionary. Rather than doing the analysis on the next character,
the analysis is done on a "window" of input. IIRC, a larger window
provides better compression. Since the dictionary is based on what came
before, it is not possible to synchronize in the middle of the stream.
This kind of compression is not optimal. If it were known in advance
what was being compressed (i.e. two pass or static knowledge) then a
more appropriate dictionary could be built.
It has been a number of years since I have examined compression
techniques, so there may be some recent advances...
IIRC, Huffman encoding seems to produce an optimal compression. The
basic idea is to build a trie with the shortest paths through the trie
being the most frequent patterns. The algorithms that I saw did this on
input assuming a single byte character encoding such as ASCII or
Latin-1. It is readily adaptable to UTF-8, by considering bytes rather
than characters.
Further compression can be achieved by analyzing "word" frequency and
prioritizing them in the dictionary. E.g. The word "the" probably occurs
more frequently than the letter "Z". If the compression of "t", "h", "e"
is bigger (the sum of the lengths of the paths and the time spent
processing them) than "the" would be then it would make sense to put
"the" in the dictionary. (Greatly simplified! The algorithms I saw
defined a word as a sequence of letters whose maximum length is bounded.)
Another gain can be to use a smaller, substitutionary representation for
a markup language. E.g. <hi type="italic">...</hi> could become
<e>...</e>, where the substitution chosen is highly compressible.
(Google: xml compression)
The nature of Huffman decompression is that one can start anywhere in
the bit stream. If the current bit cannot be resolved, it is skipped
until one can be found to resolve to an entry in the dictionary. From
that point forward the stream can be understood.
The building of the dictionary would be a single pass of the input and a
statistical analysis of it. The dictionary would be written as a part of
the module, probably a separate file.
Compressing the module would probably be best to based on atomic units,
say verses, since decompression will be done on these units.
I am not aware of any available code to do this. It might exist. But it
probably would need to be written.
Is it worth the effort? I don't think so at this point and time. My take
on it is that there is enough to do that this gets pushed further down
my list of things to do (it is on my todo list). And unless it makes
sense in the SWORD world as a contribution, it would only be an academic
exercise for me (which I love doing).
I think that in the LCDBible world, it would make lots of sense.
More information about the sword-devel
mailing list