[sword-devel] search idea

Trevor Jenkins sword-devel@crosswire.org
Sat, 15 Jan 2000 14:05:44 +0000


On Saturday, 15 January, 2000 10:31:08, Paul Gear <paulgear@bigfoot.com> 
wrote:

> Hey, i'm no expert.  I've just written a couple of parsers at uni, and i know

Not to boast or anything. I wrote a complete compiler for my Masters,
parser, optimiser and code generator. I used to be the compilers specialist
with a large DECsystem-10 bureau. Over the years I've written about a dozen
"parsers" for commercial products. At the moment I'm messing with an
Algol-68 front-end to GCC but that's simply for pleasure. :-)

> that the difference between
>     if (token == "bt") {
>         ...
>     }
> and
>     if (token == "book title") {
>         ...
>     }
> is trivial, and probably insignificant in the scheme of a program like a web
> browser or digital library.

There is a time penalty in comparing longer strings rather than shorter
ones. It doesn't even depend on whether the programmer of the string
comparison routine has chosen a very efficient algorithm (and programmed it
correctly). Of course, shorter tags will always be processed faster than
longer ones, no matter how poorly the run-time library is implemented.
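To put a rough number on that penalty, here is a minimal C++ sketch that
counts how many character pairs a naive equality test examines. The name
count_compares is hypothetical, and a real run-time library may compare a
word at a time or check the lengths first, so the counts are illustrative
only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Naive equality test that reports how many character pairs it examined.
// Hypothetical helper for illustration; not taken from any library.
std::size_t count_compares(const std::string& token, const std::string& tag,
                           bool& equal)
{
    std::size_t examined = 0;
    std::size_t i = 0;
    while (i < token.size() && i < tag.size()) {
        ++examined;
        if (token[i] != tag[i]) {
            equal = false;          // stop at the first differing character
            return examined;
        }
        ++i;
    }
    // All shared positions matched; equal only if the lengths agree too.
    equal = (token.size() == tag.size());
    return examined;
}
```

A successful match on "bt" examines two characters and a successful match
on "book title" examines ten. A mismatch stops at the first differing
character, whatever the length of the tag.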

But in principle it is better to have longer tag names than shorter ones.
(Except for the very commonest ones, viz. <p> in HTML.) The computer
doesn't care how long they are; this "betterness" is solely for our benefit.
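One way to see why the computer doesn't care: a parser can resolve each tag
name to an internal code once, after which the length of the name no longer
matters. The sketch below is only an assumption about how that might look;
the Tag enum, lookup_tag and the "bt" alias are all invented for
illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Internal codes the rest of the program works with (hypothetical).
enum class Tag { Unknown, BookTitle, Paragraph };

// Resolve a tag name to its internal code once, at parse time.
Tag lookup_tag(const std::string& name)
{
    static const std::unordered_map<std::string, Tag> table = {
        {"book title", Tag::BookTitle},  // long, readable name
        {"bt",         Tag::BookTitle},  // short alias, assumed for illustration
        {"p",          Tag::Paragraph},
    };
    auto it = table.find(name);
    return it == table.end() ? Tag::Unknown : it->second;
}
```

After the lookup, code that handles Tag::BookTitle is the same whichever
spelling appeared in the text.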

> (Incidentally, Craig Rairdin warned me that bsisg.com might not last very
> long, so i took a copy of the site with GNU wget.  If anyone wants a look at
> it, i can provide it.  It's a 700 Kb tarball.)

I tried bsisg.com and never got a connection. There's obviously a DNS entry
for it but the web browser just times out. I'd be interested in the
tar-ball.

> That's quite a popular philosophy.  I must admit i don't like it myself, but
> it's certainly a valid one. I prefer the 'suck it and see' approach - only
> optimize it if you find that it is necessary to do so.  That is not to say i
> think that we shouldn't consider performance - by all means we should make a
> design that is capable of being optimized, but when writing code (or text
> markup), it is much more important to build something that is maintainable by
> others.

Jon Bentley's books "Programming Pearls" and "More Programming Pearls" should
be required reading for anyone who does programming. :-)

>  (I've heard that Donald Knuth talks about the "error of optimizing
> too soon" - he believes that a lot of time and effort is wasted on optimizing
> things that really don't need it.)

There is an account in one of his books of how Knuth wrote a routine using
the timing information provided by the disk manufacturer. He optimised it so
that it would be fast and efficient. However, come the day of live running
there was a 100% degradation. After much investigation it turned out that
the OS driver for this disk drive had a bug in it. (Compare my comments above
and previously about the programming of string comparators.)

Knuth's paper "Structured Programming with go to Statements" is also on my
must-read list for budding programmers. It's a rejoinder to Dijkstra's "Go To
Statement Considered Harmful" letter, which is nothing more than mischief in my
opinion. (But I should be careful saying that as my PhD supervisor is a
personal friend of both Knuth and Dijkstra.) The only time I've ever written
a program without using a goto was with Bliss---it doesn't have a goto
statement; that's something else I'm tinkering with for GCC.

> That's a nice thought, but it doesn't scale.  What if the word for book in
> another language doesn't start with 'b'?  What if there is no equivalent of
> 'b'?  What if it doesn't use a Latin character set?

As all Western languages are Latinate and use the Latin alphabet your
argument would be better if you used Hebrew, Arabic, Chinese or a script not
based on Latin. :-)

In one of the ISO technical reports for SGML we gave a list of national
tags. The problem with this approach was that we did not provide a means to
communicate the language as part of the application. Nor did we have a means
to translate (tags) between any pair of languages. Further we didn't
accommodate parallel texts. ;-)
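To make the gap concrete, here is a minimal C++ sketch of one possible
shape for the missing pieces: each national tag carries its language, and
translation between any pair of languages goes through a shared canonical
name. This is not what the technical report specified. The table entries,
the canonical name "BOOKTITLE" and the function names are all invented for
illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// (language code, local tag name)
typedef std::pair<std::string, std::string> Key;

// Hypothetical table of national tags; entries invented for illustration.
const std::map<Key, std::string>& national_tags()
{
    static const std::map<Key, std::string> table = {
        {Key("en", "book title"),     "BOOKTITLE"},
        {Key("de", "buchtitel"),      "BOOKTITLE"},
        {Key("fr", "titre du livre"), "BOOKTITLE"},
    };
    return table;
}

// Translate a tag from one language to another via the canonical name.
// Returns an empty string when either side is missing from the table.
std::string translate_tag(const std::string& from_lang,
                          const std::string& name,
                          const std::string& to_lang)
{
    const std::map<Key, std::string>& table = national_tags();
    std::map<Key, std::string>::const_iterator src =
        table.find(Key(from_lang, name));
    if (src == table.end())
        return std::string();
    for (std::map<Key, std::string>::const_iterator it = table.begin();
         it != table.end(); ++it) {
        if (it->first.first == to_lang && it->second == src->second)
            return it->first.second;
    }
    return std::string();
}
```

With this table, translate_tag("en", "book title", "de") gives "buchtitel".
Parallel texts are not addressed by this sketch.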

If I'm being incoherent then forgive me; this "post viral fatigue" is still
messing me up.

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living
language. So recognise it now.

--

<>< Re: deemed!