[sword-devel] HowTo: create ztext module?
L.Allan-pbio
paraclete at bibleinverse.org
Tue May 9 09:06:10 MST 2006
> IIRC, Huffman encoding seems to produce an optimal compression. The
> basic idea is to build a trie with the shortest paths through the
> trie being the most frequent patterns. The algorithms that I saw did
> this on input assuming a single byte character encoding such as
> ASCII or Latin-1. It is readily adaptable to UTF-8, by considering
> bytes rather than characters.
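I don't think this is typically true. Huffman coding is optimal only
among codes that give each symbol its own whole number of bits. For
text, LZW-type (dictionary) compression generally achieves a better
compression ratio, though not necessarily better speed, because it
also encodes repeated multi-character strings.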
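
For reference, the byte-level Huffman idea described in the quote could be sketched roughly as follows. This is only an illustration, not code from SWORD; the function names huffman_code_lengths and encoded_bits are made up for the example, and it computes code lengths only rather than writing an actual bitstream.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Map each byte value in data to its Huffman code length in bits."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries are (weight, tie_breaker, {byte: depth_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(n, sym, {sym: 0}) for sym, n in counts.items()]
    heapq.heapify(heap)
    tie = 256  # byte values use 0..255, so merged nodes start at 256
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level down.
        merged = {s: d + 1 for s, d in left.items()}
        merged.update({s: d + 1 for s, d in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encoded_bits(data: bytes) -> int:
    """Total payload size in bits, ignoring the cost of storing the tree."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)

if __name__ == "__main__":
    sample = "In the beginning God created the heaven and the earth."
    raw = sample.encode("utf-8")  # works on bytes, so UTF-8 is handled as-is
    print(len(raw) * 8, "bits raw,", encoded_bits(raw), "bits Huffman-coded")
```

Because it only ever sees single bytes, a coder like this cannot take advantage of whole repeated words, which is where the dictionary-based methods gain their advantage on text.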
> I am not aware of any available code to do this. It might exist. But
> it probably would need to be written.
>
> Is it worth the effort? I don't think so at this point and time. My
> take on it is that there is enough to do that this gets pushed
> further down my list of things to do (it is on my todo list). And
> unless it makes sense in the SWORD world as a contribution, it would
> only be an academic exercise for me (which I love doing).
>
> I think that in the LCDBible world, it would make lots of sense.
A year or so ago, I set up a SourceForge project, BibleDb, that would
be optimized for Bible decompression/decryption/search speed (not
necessarily for compression ratio). The idea was to encode each word
in a variable number of bits (6, 10, or 14), chosen from an analysis
of word frequency. All tags would be stored externally as
lengths/offsets rather than in the actual content, in order to
optimize searching.
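
To make the 6/10/14-bit idea concrete, here is a rough sketch of one way it might work. The BibleDb format itself was never pinned down, so everything here is an assumption for illustration: the split of the 64 possible 6-bit prefixes into three tiers (SHORT, MEDIUM, LONG), the function names, and the use of '0'/'1' strings instead of packed bits.

```python
from collections import Counter

# Hypothetical split of the 64 possible 6-bit prefixes (an assumption):
SHORT = 32    # prefixes 0..31 are complete 6-bit codes
MEDIUM = 24   # prefixes 32..55 are followed by 4 more bits (10 total)
LONG = 8      # prefixes 56..63 are followed by 8 more bits (14 total)
CAPACITY = SHORT + MEDIUM * 16 + LONG * 256  # 32 + 384 + 2048 = 2464

def build_dictionary(words):
    """Rank distinct words by descending frequency (ties alphabetical)."""
    counts = Counter(words)
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    if len(ranked) > CAPACITY:
        raise ValueError("too many distinct words for this tier split")
    return ranked

def encode_rank(rank):
    """Return the code for a frequency rank as a string of '0'/'1'."""
    if rank < SHORT:
        return format(rank, "06b")
    rank -= SHORT
    if rank < MEDIUM * 16:
        return format(SHORT + rank // 16, "06b") + format(rank % 16, "04b")
    rank -= MEDIUM * 16
    return (format(SHORT + MEDIUM + rank // 256, "06b")
            + format(rank % 256, "08b"))

def encode(words, dictionary):
    index = {w: i for i, w in enumerate(dictionary)}
    return "".join(encode_rank(index[w]) for w in words)

def decode(bits, dictionary):
    words, pos = [], 0
    while pos < len(bits):
        prefix = int(bits[pos:pos + 6], 2)
        pos += 6
        if prefix < SHORT:
            rank = prefix
        elif prefix < SHORT + MEDIUM:
            rank = SHORT + (prefix - SHORT) * 16 + int(bits[pos:pos + 4], 2)
            pos += 4
        else:
            rank = (SHORT + MEDIUM * 16
                    + (prefix - SHORT - MEDIUM) * 256
                    + int(bits[pos:pos + 8], 2))
            pos += 8
        words.append(dictionary[rank])
    return words

if __name__ == "__main__":
    text = "in the beginning god created the heaven and the earth".split()
    dictionary = build_dictionary(text)
    bits = encode(text, dictionary)
    print(len(bits), "bits;", decode(bits, dictionary) == text)
```

With this particular split the three tiers hold only 32 + 384 + 2,048 = 2,464 words. A code capped at 14 bits has 16,384 slots at most, so covering a much larger vocabulary would leave room for very few short codes; a real design would probably need a longer tier or an escape mechanism.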
As a group, all English Bibles have a fairly small number of words
(about 16,000 ... give or take a thousand or so, depending on how you
count capitalization, plurals, possessives, contractions, etc.), and
the dictionary is very static. The ESV and WEB would have almost the
exact same dictionary ... the KJV-1769/ASV would be only slightly
different. A single dictionary would suffice for all English
translations (or maybe separate dictionaries for the OT and NT?).
One intent was to have searches integrated into this (sort of like how
Lucene works?), so that dictionaries / concordances would be feasible.
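
Lucene is built around an inverted index (a map from each term to the places it occurs), which is essentially a concordance. A minimal sketch of that structure, assuming verses keyed by a reference string and very naive tokenizing; the names build_concordance and search are invented for this example and are not from BibleDb or Lucene:

```python
from collections import defaultdict

def build_concordance(verses):
    """Map each lowercased word to the list of references containing it.

    verses: dict of reference string -> verse text.
    """
    index = defaultdict(list)
    for ref, text in verses.items():
        # Naive tokenizer (an assumption): split on whitespace and strip
        # common punctuation from both ends of each token.
        words = {w.strip(".,;:!?()").lower() for w in text.split()}
        words.discard("")
        for word in words:
            index[word].append(ref)
    return index

def search(index, *words):
    """Return the sorted references that contain every given word."""
    if not words:
        return []
    hits = [set(index.get(w.lower(), ())) for w in words]
    return sorted(set.intersection(*hits))

if __name__ == "__main__":
    verses = {
        "Gen 1:1": "In the beginning God created the heaven and the earth.",
        "Gen 1:3": "And God said, Let there be light: and there was light.",
        "John 1:1": "In the beginning was the Word, and the Word was with "
                    "God, and the Word was God.",
    }
    index = build_concordance(verses)
    print(search(index, "beginning", "God"))
```

If the text were stored as word numbers from a shared dictionary, the index keys could be those numbers instead of strings, so a search would never need to decompress the text itself.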
After some wrestling with it, I realized I don't have the time or math
background or aptitude to have much of a chance of making it work.
BibleDB is only in pre-alpha stage.
http://sourceforge.net/project/admin/?group_id=117234