[jsword-devel] Efficient Bible Text Storage Formats

Stephen Denne jsword-devel@crosswire.org
Fri, 9 Jan 2004 12:03:08 +1300


Erik,

(I'm finding it a little hard to concentrate at the moment, as 120km/h wind
gusts are threatening to take the roof off my house, so sorry if I've left
any sentences unfinished)

> Thank you Stephen, for your very interesting ideas. This sounds like a
> very useful format indeed. Definitely suitable for what I was looking
> for... Can this format be used freely? (I mean, without patents or so?)

That is a tricky question!

From my perspective, I came up with all the various ideas involved in this
bible compression scheme by myself, but I have since read of techniques
similar to various portions of it in use by others. For example, Dr Tim
Bell's (et al.) book "Managing Gigabytes: Compressing and Indexing Documents
and Images" (http://www.cs.mu.oz.au/mg/), which is currently on loan to a
friend so I can't quote exactly what is said, describes compressing a
lexicon by only storing word endings, but makes no mention of the large
amount of redundancy that there is in the endings; I also use a better
technique for choosing which words to store in full. (I think they suggest
every fourth one.) The book also mentions compressing text by using a
word-based scheme with a phrasebook, but simply says that choosing the
phrases is extremely difficult. It also mentions splitting the text into
words and non-words (punctuation) and not storing spaces (the most common
non-word) between words: because there is a strict repeating sequence of
word / non-word / word / non-word / etc., if the decoder gets word / word
then it places a space in between.
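
The word / non-word idea in the paragraph above can be sketched roughly as
follows. This is only an illustration of the general technique, not the
Datacute format or the book's exact method; the names encode, decode and
is_word are made up for this example, and "word" is assumed to mean a run of
letters, digits or underscores.

```python
import re

# A token is either a maximal run of word characters or a maximal run of
# non-word characters, so tokens strictly alternate word / non-word.
TOKEN = re.compile(r"\w+|\W+")


def is_word(token):
    """True if the token is a word (assumption: starts with a \\w character)."""
    return re.match(r"\w", token) is not None


def encode(text):
    """Split text into tokens and drop every single space that sits between two words."""
    tokens = TOKEN.findall(text)
    last = len(tokens) - 1
    # An interior " " token always has a word on each side, so it can be implied.
    return [t for i, t in enumerate(tokens) if not (t == " " and 0 < i < last)]


def decode(tokens):
    """Rebuild the text, inserting a space wherever two words are adjacent."""
    out = []
    prev_was_word = False
    for token in tokens:
        word = is_word(token)
        if word and prev_was_word:
            out.append(" ")
        out.append(token)
        prev_was_word = word
    return "".join(out)
```

For example, encode("In the beginning, God created") keeps the ", " token but
stores none of the single spaces, and decode restores them.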

I don't know if any of these mentioned techniques or any others that I have
used are already patented. As far as I know they are not, but that is not a
guarantee.

> You seem to sell some software, so I can imagine that you would like to
> make money from this too (which would be quite deserved). I was planning
> to release whatever I would write as GPL, especially if it would be an
> extension to JSword (I would not have a choice). But if we could share the
> file format, and the conversion tools would be freely available, that
> would be good for me. :)

As for the algorithms that I have come up with to produce this format, I
intend for them to be freely usable by anyone to produce bibles and
bible-related texts, whether for profit or for free, while I reserve the
right to use the techniques and algorithms I have come up with for other
kinds of texts (non-bible dictionaries, general book readers, etc.). If
anyone else wishes to make money out of my ideas and algorithms, I'd like
them to pay me whatever royalties they think are appropriate. I intend to
leave the enforcement of just treatment to God.

I would like to make money from this, but not at the expense of limiting the
spread of the good news of Christ.

Publishing code to read the file format under the GPL would mean that the
code could not be re-used by others unless theirs was also a GPL project. I
would prefer the file format reading code to be LGPL. It could then still be
used in a GPL project, but could also be used by commercial vendors who do
not wish to publish their own source code.

Keep in mind that there are three things we are talking about:
1. File format(s) - currently a Palm database, but easily modifiable to
other less structured file formats.
2. Algorithms to produce the file
3. Algorithms to read the file

I think that algorithms to read the file go hand in hand with the file
format. Sure, there are specific tricks to reduce memory use or to cache
indexes, etc. But given a file format, an algorithm to read from it can be
generated with ease.
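
To make the point concrete, here is a minimal sketch using a hypothetical
lexicon layout (not the actual Datacute format): each entry is a pair of
(number of leading characters shared with the previous word, remaining
ending), with every fourth word stored in full as the book is said to
suggest. The function names and the full_every parameter are assumptions
made for this example. Once the layout is fixed, the reader is a few lines.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def encode_lexicon(sorted_words, full_every=4):
    """Writer side: store each word as (shared prefix length, ending).

    Every full_every-th word is stored in full (shared length 0); this fixed
    spacing is the simple rule, not a tuned choice.
    """
    entries = []
    previous = ""
    for i, word in enumerate(sorted_words):
        shared = 0 if i % full_every == 0 else shared_prefix_len(previous, word)
        entries.append((shared, word[shared:]))
        previous = word
    return entries


def decode_lexicon(entries):
    """Reader side: rebuild each word from the previous word plus its ending."""
    words = []
    previous = ""
    for shared, ending in entries:
        word = previous[:shared] + ending
        words.append(word)
        previous = word
    return words
```

The reader needs no knowledge of how the writer chose which words to store in
full; it only follows the stored prefix lengths.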

The algorithm to produce the file is quite a different matter. The file
format contains no information about how to determine the string of ending
letters, how to choose phrasebook entries, etc. Yet it is these techniques
that provide most of the compression.
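
As an illustration of how much freedom the producing side has, here is one
naive way a phrasebook could be chosen: take the most frequent adjacent word
pairs. This is an assumed stand-in heuristic for the sake of example, not the
technique used for the Datacute format; choose_phrases and its parameters are
hypothetical names.

```python
from collections import Counter


def choose_phrases(words, max_entries, min_count=2):
    """Pick up to max_entries adjacent word pairs that occur at least min_count times.

    A deliberately simple heuristic: it ignores longer phrases and the fact
    that overlapping pairs compete for the same words.
    """
    pair_counts = Counter(zip(words, words[1:]))
    return [pair for pair, count in pair_counts.most_common(max_entries)
            if count >= min_count]
```

A reader of the resulting file only ever sees phrasebook indexes; whether the
entries came from this heuristic or a much better one makes no difference to
the format.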

> I would only release something for the P800/P900
> in PersonalJava, but I can imagine that it can run quite easily on other
> devices that run PersonalJava. I hope that is not a problem for you if we
> would use the same file format.

Not a problem.

I hadn't looked into what PersonalJava is until today, but it looks like it
is being discontinued (by Sun) in favour of PBP and PP:
http://developers.sun.com/techtopics/mobility/personal/articles/pbp_pp/index.html

> If you would have an example bible file available, I could have a go at a
> reader for Java. Plain text would be fine for a first release for me, but
> it would of course be even better to be able to use all features of the
> bibles and books that the Sword project has available.

http://www.datacute.co.nz/DatacuteBibleFormat.zip

The wind has eased. I still have a roof.

Stephen Denne.
--
Datacute - Acute Information Revelation Tools
http://www.datacute.co.nz/