[sword-devel] search idea

Paul Gear sword-devel@crosswire.org
Fri, 14 Jan 2000 10:10:39 +0000


darwin@ichristian.com wrote:

> Paul Gear wrote:
>
> > ...  Far better to do that than to have a markup where you're not
> > sure whether '<bt>' means 'book title' or 'bibliography text'.  Long tag names take up more space, but
> > this can be overcome with compression, and the benefits for understandability are enormous.  (And if
> > you start complaining about too many keystrokes, i'll start talking about macros...  ;-)
>
> I would protest the extra overhead of every read operation needing to parse
> the extra characters.

The overhead of reading those few extra characters would struggle to account for even 5% of the delay in
displaying documents.  When we're talking about parsing, we're talking about something that's done in
memory.  The disk (or network) I/O, and to a lesser extent the display to the screen, are where the main
bottlenecks are.  The parsing and scanning is trivial.

> After all a markup language will usually be read and
> processed by a program where <bt> would be easier to use than <book title>

Please explain to me why this is so.  It makes no difference to the program which you use - it's just a
string of different length.

> and only use about 1/3 of the space, and processing.

1/3 of the space does not necessarily mean 1/3 of the processing.  There are economies of scale, and,
because memory bandwidth is the fastest part of your system outside of the CPU, you're unlikely to even
notice the difference.

> There will be very
> few people that will compose ThML manually, just as there are very few that
> compose HTML manually.

I said that once.  I got shouted down by all the people on the ThML list who _do_.  And now i do it myself
occasionally, when i want more control over the markup process.

> I would doubt that very many people will ever need to read and decode ThML
> so I think that the language should be designed with minimal tag lengths to
> ease parsing.

The length of the tags makes no difference to parsing.  A parser doesn't get easier or harder to write
because the strings it matches are longer.  Did you have to write a parser at uni?  When we had to write
one in software engineering, we learned that the verbosity of the language wasn't the issue, it was the
complexity of the grammar.
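
To make that concrete, here is a rough sketch of a tokenizer in Python.  It is only an illustration, not
anything from a ThML tool: the tag names "bt" and "book-title" are made up (i've put a hyphen in the long
one so it is a single name), and the grammar is deliberately tiny.  The point is that the same code
handles both spellings without change.

```python
import re

# Hypothetical tag syntax for illustration: <name> opens, </name> closes.
# Neither "bt" nor "book-title" is taken from the ThML specification.
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)>")

def tokenize(text):
    """Split text into ('open' | 'close' | 'text', value) tokens."""
    tokens = []
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Plain text between the previous tag and this one.
            tokens.append(("text", text[pos:m.start()]))
        kind = "close" if m.group(1) else "open"
        tokens.append((kind, m.group(2)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

short = tokenize("<bt>Institutes</bt>")
long_ = tokenize("<book-title>Institutes</book-title>")
```

Both calls give the same sequence of token kinds (open, text, close); only the name string differs.  The
work in the parser comes from the rules, not from how many letters are in a name.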

> It is illogical to design a language where the process which is done once
> is made easy at the expense of the process that is performed millions of
> times.

In principle i agree with you, but this case is not an example of that.  The overhead is minimal, and can
be worked around completely if necessary.  Here's how: If you are worried about "<book title>" being longer
than "<bt>", why aren't you worried about "<bt>" being longer than, say, 0x12?  If you're that worried
about it, you can write a binary representation of the markup (i.e. "compile" the document to binary form),
compress it, and write it to disk.  And if we were writing for embedded systems, we might worry about that,
but it's not really that important in the scheme of things.
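
A minimal sketch of that "compile, then compress" idea in Python, under some loud assumptions: the tag
table and the byte codes 0x12 and 0x13 are invented for illustration, and the text is assumed never to
contain those two control characters itself.  A real format would need a proper escaping scheme.

```python
import zlib

# Hypothetical tag table: names and byte codes are made up for illustration.
TAG_CODES = {
    "<book-title>": b"\x12",
    "</book-title>": b"\x13",
}

def compile_markup(text):
    """Replace each verbose tag with its one-byte code."""
    data = text.encode("utf-8")
    for tag, code in TAG_CODES.items():
        data = data.replace(tag.encode("utf-8"), code)
    return data

def decompile_markup(data):
    """Restore the verbose tags from their one-byte codes."""
    for tag, code in TAG_CODES.items():
        data = data.replace(code, tag.encode("utf-8"))
    return data.decode("utf-8")

source = "<book-title>Institutes</book-title>\n" * 100
compiled = compile_markup(source)   # binary form, one byte per tag
packed = zlib.compress(compiled)    # what would be written to disk
```

The source stays readable, and decompile_markup(zlib.decompress(packed)) gets it back unchanged.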

Have you ever used Logos Library System?  The original source of the documents is SGML, but the files on
disk look like meaningless gobbledegook because they use a compiled form of the text.  And you could run it
on an 8 MB 386.  But their documents are still very 'readable' (i.e. what you'd call 'verbose') in the
source form (so i'm led to believe).  The length of tags need not bear any resemblance to the final form of
the text on disk.

Paul
---------
"He must become greater; i must become less." - John 3:30
http://www.bigfoot.com/~paulgear