[sword-devel] Suggestion

Trevor Jenkins sword-devel@crosswire.org
Sun, 26 Dec 1999 14:17:56 +0000


On Saturday, 25 December, 1999 05:47:55, Chris Little <chrislit@chiasma.org>
wrote:

> I would strongly urge against changing our module format at all.  What we
> have is simple to manage and create, portable, ...

All good features that should not be forgotten.

> ... and sufficiently fast.

One man's sufficiently fast is this man's slow.

> We don't actually index our data content at all.  Indexing is only for
> lookups; searching is done linearly through the whole module text.  IMO,
> this method is acceptably fast.

Here we have to disagree. From my experience of searching text databases a
linear search is unacceptably slow. This is data structures 101 stuff,
linear search average time N/2, binary chop N log N, hash table 1. where N
is the number of terms to search through. Okay so pre-computing the index
files is an overhead for anything above naive linear search. However, it
would be possible to provide index files precomputed along with the raw text
files. That means only 1 system ever need take the setup hit. A linear
search requires that every user system take a performance hit. And every
user take a performance hit too.

>  Searching every one of our modules except
> the losungen (at least 100 modules total) took me 2 minutes (though some of
> that might have been due to network lag since I was doing the search over a
> modem through the cgi interface).  That's about 100 modules, 598mb of data
> searched on a 450mhz machine with an ATA-33 drive, which seems like pretty
> acceptable performance to me.

It depends upon your search criteria as to whether this is fast or not. A
search for a single term through that volume of data could not be considered
fast, if performed on a local copy of the data. Back in 1986 I designed a
multi-file system to host 1,000,000 records. It occupied over 16Gb of disk
space with about 5Gb being overhead for the inverted files; almost
everything was searchable. A search for a single term, which just happened
to be in 99% of the records, took less than 2 seconds to complete (including
screen updates). :-| Back in '86 having that amount of disk space available
on a single machine was still a luxury. Because we had precomputed the
indices the search times were linear irrespective of processor; we favoured
VAX architecture then.

A later port of our text database system to MIPS hardware and DEC Ultrix
improved our performance. We were getting better results on a MIPS 1000
series workstation than we could get from a dedicated VAX 8600!

Called in to undertake a benchmark against the market leader in text
databases we wiped the floor with them. They took several days to invert the
original data; we took only a couple of hours. In the search tests we should
have been second and them first. However, we beat them hands down despite
the fact that the test searches favoured their organisation of index files
over ours. (They sorted on field before term, whereas we sorted on term and
then field.)

In the intervening 15 or so years the power of workstations has improved to
the state where it would be feasible to employ the same schemes for personal
text databases such as SWORD.

Perhaps I'm over ambitious in wanting similar response times with SWORD but
it is achievable and I contend desirable.

The only problem with this system was the port to HP workstations. A (very)
junior programmer was given the task and he neglected to consider big- and
little-endian issues. For all other platforms we could take the same text
and index files and use them "out-of-the-box" with no regard for the
platform they were originally prepared on. I'd love to see the same
compatibility within SWORD.

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living
language. So recognise it now.

--

<>< Re: deemed!