[jsword-devel] Efficient Bible Text Storage Formats

Stephen Denne jsword-devel@crosswire.org
Thu, 8 Jan 2004 12:50:28 +1300


Seeing as alternative bible text storage mechanisms are being discussed... I
thought I'd add some information about what I've been up to recently:

Last month I created a bible text format for PalmOS (ie palm database
format), that is extremely efficient, extremely small, and should be
expandable to including tags of various kinds.

It is based on plain text bibles, not OSIS nor XML files, so it is not
suitable for recording the wealth of additional information those formats
add to the text.

The code to create the file is java code, requiring large amounts of memory
(-Xmx256m), but the resulting file could be easily changed from a Palm
database to any file type you wish.

The amount of memory required to read it on the Palm is tiny. I've created a
palm pop-up application that very quickly shows the text of a selected verse
reference. This app is not yet released.

The basic format of the file is:
A continuous string of letters (about 10k)
A Lexicon of words, ordered alphabetically (4 bytes per word, containing the
length of the word, the number of letters to repeat from the previous word,
and the offset into the string of letters to find the remaining letters.)
15000 words take 60k
A phrasebook of pairs of either Lexicon words other phrasebook entries.
The lexicon and phrasebook entries are referred to by a 16 but number,
maximum of 64k of them in total. There are typically only 30000 phrasebook
entries required, = 120k
The bible text is a continuous list of 16 bit lexicon/phrasebook entry
references. Typically 430000 tokens = 860k
A two level index to the start of the list of tokens for a verse, two level
due to Palm database records being restricted to 64k, a benefit is that this
means that the offsets are only 16 bit numbers, and the entire index takes
less than 64k.

Total file size = 10k + 60k + 120k + 860k + 64k = 1100k (roughly)

Best speed is attained when that is all in RAM, but the 860k of text tokens
can be left on much slower expansion memory, with only minor speed decrease,
as only a small amount of it is read for a single verse, and the exact
location is determined quickly.

The small file size is a result of three parts:
1. The decision to format the file as a lexicon, phrasebook, and text.
2. The algorithm for compressing the words letters into overlapping unique
supersets of letter sequences. Meaning that the 100000 letters of 15000
words can be stored in just 10000 letters.
3. The algorithm for determining phrasebook entries, which repeatedly
replaces the most frequently occuring pair of tokens with a single token,
and recalculates the token-pair frequencies. (For example, the second
replacement is of "of the" with a single token - 11000 occurances, then a
dozen replacements later, the token for "of the" followed by the word "Lord"
is replaced by a single token - 1800 occurances, being a higher frequency
than the next replacement "for the" which only occurs 1700 times.)

The length of the word and number of repeated letters don't take a full byte
each, so there are free bits for specifying that the entry is punctuation,
or a special tag, etc.

I might start thinking about how this could be changed around to store more
XML style texts, with strongs numbers, morphology, footnotes, etc. My
lexicon/phrasebook structure could probably be extended to save xml
elements, the repetition of the word used for a particular strongs number
would mean that the compression ratio achieved should not change too much.

Stephen Denne.
Datacute - Acute Information Revelation Tools