[sword-devel] Creating a "SWORD-over-network" protocol for remote SWORD repo access?

Fri Jul 19 01:59:05 EDT 2024

On Sun, 14 Jul 2024 21:08:08 +0300
Jaak Ristioja <jaak at ristioja.ee> wrote:

> Hello,
> 
> +1, however this is not a small feat. Having also considered this, I 
> would like to share some thoughts on this topic which I hope you find
> useful.
> 
> As far as I understand libsword, it tries to support both FTP and 
> HTTP(S) repositories.
>    * Libsword seems to include a hand-written parser to parse the 
> non-standardized FTP directory listings in order to figure out the 
> modules present on the remote repository.
>    * Similarly for HTTP(S), libsword expects the web server to 
> provide (Apache HTTPD style?) HTML directory indexes, for which it
> seems to include an overly-simplistic hand-written parser.
> 
> Reliance on these non-standardized server-specific index
> files/directory listings is very fragile, as slight deviations of
> server output might cause the respective parsing in libsword to be
> unreliable. The quality of these (and other[*]) hand-written parsers
> in libsword is questionable, and I would not be surprised to find
> bugs in it which put users in danger. ;(

Hmm, it sounds like you're thinking something along the lines of
remotely accessing the SWORD modules *directly*? My thought was
something more along the lines of a SWORD client making a call to a
specialized SWORD server that read the file and returned the desired
verse references or whatever for it. The server would take some sort of
syntax as input and then spit out basically the same kind of info that
mod2imp spits out, then send it back over the wire for the client's
libsword to parse, mutate, and eventually hand to whatever rendering
engine the frontend uses. Trying to remotely access parts of a SWORD
module sounds like a nightmare, and if the modules were loaded all at
once it would kind of undo the point since we already can download
entire modules at once. I think most of your concerns below are solved
by this way of doing things (although I'm sure it comes with its own
fun set of problems).
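
To make the "mod2imp-style output over the wire" idea a bit more
concrete, here is a rough sketch of what the client-side parsing could
look like. It assumes the imp layout where each entry starts with a
`$$$` key line followed by its content lines; the function name and the
sample payload are made up for illustration, not anything libsword
provides.

```python
def parse_imp(text):
    """Split imp-style text into (key, content) pairs.

    Assumption: every entry begins with a line starting with "$$$",
    followed by the key; all lines up to the next "$$$" line are content.
    """
    entries = []
    key, lines = None, []
    for line in text.splitlines():
        if line.startswith("$$$"):
            if key is not None:
                entries.append((key, "\n".join(lines)))
            key, lines = line[3:].strip(), []
        elif key is not None:
            lines.append(line)
    if key is not None:
        entries.append((key, "\n".join(lines)))
    return entries


# Hypothetical server response for a two-verse request.
SAMPLE = (
    "$$$Genesis 1:1\n"
    "In the beginning God created the heaven and the earth.\n"
    "$$$Genesis 1:2\n"
    "And the earth was without form, and void;\n"
)

entries = parse_imp(SAMPLE)
```

The client would then hand each (key, content) pair to whatever
filtering and rendering it already does for locally installed modules.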

As for the actual issues themselves, on a scale of 1 to 10 how
difficult do you think it would be to have a repository descriptor file
that is located in a predictable place and that contains data about
what modules exist on the server and how to find them? This is
more-or-less what the apt package manager does and given Debian and
Ubuntu's success it seems to work well.
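
As a rough illustration of what I mean, here is a sketch of a
descriptor made of blank-line-separated stanzas of "Key: value" lines,
loosely modelled on apt's package lists. The field names (Module,
Version, Path), the version numbers, and the paths are all invented for
the example; nothing like this exists in SWORD today.

```python
def parse_descriptor(text):
    """Parse stanzas of 'Key: value' lines separated by blank lines."""
    modules = []
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if not line.strip():
                continue
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if fields:
            modules.append(fields)
    return modules


# Hypothetical descriptor contents; field names and values are assumptions.
EXAMPLE = """\
Module: KJV
Version: 2.9
Path: modules/texts/ztext/kjv

Module: WEB
Version: 1.0
Path: modules/texts/ztext/web
"""

modules = parse_descriptor(EXAMPLE)
```

The client would fetch this one file from a fixed path and never need
to look at a directory listing.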

> Cryptographic signing of Sword modules and/or repository index files 
> would only marginally alleviate the situation while also introducing 
> bigger problems such as public key distribution and secure handling
> of private keys. This might still be a good optional feature in some
> later design, but more important things first...
> 
> Another problem is that a single Sword module consists of multiple 
> files: the configuration file and one or more files with the actual 
> content or content indexes (e.g. old testament content, old testament 
> content index, new testament content, new testament content index). 
> These are distributed in different repository directories and require 
> multiple client requests to download. The module file and directory 
> names do not contain a version identifier, nor is there any
> checksumming between the files. So when a server updates a module
> while a client is in the middle of downloading these files, this might
> cause the client to download files pertaining to different versions
> of the module or download partially uploaded files, leading to all
> kinds of nasty problems. Proper versioning in filenames and
> checksumming could help alleviate this.
> 
> It might be a blocker that libsword does not support having multiple 
> versions of a single module installed.
> 
> It might be a blocker that libsword does not have a namespacing
> scheme for modules e.g. there can only be one module named "KJV" and
> it might be problematic if two repositories (vendors) provide their
> own different "KJV" modules. And it would probably be a bad idea to
> try to reserve the use of identifiers like "KJV" to specific vendors
> e.g. by using some kind of registry.
> 
> Another obstacle to defining a new repository format/protocol is that 
> there is no complete and sound formal specification for the module 
> configuration file format and its fields. The descriptions in the
> SWORD wiki are incomplete and contain ambiguity.

This is not something that my server idea overcomes, so I'll think
about that. Perhaps it would be worth digging into just that and
overcoming it by strictly defining the configuration file format?

> While perhaps not strictly a blocker to creating a new repository 
> format/protocol, there are no formal specifications for the
> module content and content index files. I remember these formats
> having been described as internal libsword details which don't
> require specification, because the format and libsword might change.
> However, I think this reasoning is incorrect, because files of these
> formats are exchanged over the wire, used in multiple repositories
> not all of which are managed by Crosswire, and libsword wants to retain
> backwards compatibility with older modules as well.

I agree with you w.r.t. the shortcomings of this. It also makes me
realize that it means that the libsword on the server would have to be
"close enough" to the libsword of the client in order for my server
idea to work, because otherwise the server's libsword will send markup
data that the client can't process. If backwards compatibility is still
maintained, some way of transferring versioning information over the
wire might be enough.
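
A minimal sketch of what carrying version information could look like,
assuming (purely for illustration) a JSON response envelope with a
"markup_version" field that the client checks before trying to process
anything. The field names and the version numbers are hypothetical.

```python
import json

# Hypothetical: the markup versions this client knows how to process.
SUPPORTED_MARKUP_VERSIONS = {1, 2}


def decode_response(raw):
    """Return the entries from a server response, or fail early if the
    server's markup version is one this client cannot handle."""
    msg = json.loads(raw)
    version = msg.get("markup_version")
    if version not in SUPPORTED_MARKUP_VERSIONS:
        raise ValueError(f"unsupported markup version: {version!r}")
    return msg["entries"]


# Example envelope as a server might send it (made-up content).
raw_ok = json.dumps({
    "markup_version": 1,
    "entries": [{"key": "Genesis 1:1", "text": "In the beginning..."}],
})
entries = decode_response(raw_ok)
```

Failing early with a clear error seems better than letting the client
choke on markup it doesn't understand.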

> In my opinion the repository format should not much depend on the 
> underlying transport protocol (HTTP(S), FTP, local filesystem) and 
> should not require special handling on the server side. For HTTP this 
> means that all repository files may be served statically on a regular 
> web server without requiring extra server-side scripting. Just files
> and directories, no parsing of directory indexes, only retrieval of
> regular files by their path.

Hmm, I don't see how this is really possible in a "retrieve part of a
module" situation. I mean it probably would work if you used HTTP
partial downloads to retrieve the blocks of files you want, but that
sounds like it would probably require quite a lot of HTTP requests to
load one chapter from a module, which would probably put undue load on
the server and slow down the client.
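
For reference, this is roughly what one such partial download looks
like from the client side: a plain GET with a standard HTTP Range
header. The sketch only builds the request and does not send it; the
URL, offset, and length are placeholders, and a real client would first
need the offsets from the module's index file, which is where the
extra round trips come from.

```python
import urllib.request


def range_request(url, offset, length):
    """Build a GET request for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, hence the "- 1".
    """
    last = offset + length - 1
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Placeholder URL and byte range, for illustration only.
req = range_request("https://example.org/modules/data-file", 100, 50)
range_header = req.get_header("Range")
```

A server that honours the header answers with 206 Partial Content and
just the requested bytes.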

> In the most simple case, the client would retrieve the (root) index
> file from a fixed location in the repository (e.g. using HTTP GET),
> parse it, and proceed to download selected modules, where each module
> version is a single archive file in the repository. Various specific
> repository (directory) layouts are possible. Since SWORD repositories
> are relatively small it would probably suffice to have only one (root)
> index file which would contain all necessary metadata from all the
> module archives in the repository. I recommend JSON to be used for
> index files (for interoperability), and an extensible versioned JSON
> schema to be defined.

Makes sense to me, though SWORD uses XML-like formats in a lot of
places so maybe XML would be a better fit than JSON. Then again, I hate
XML and would much rather use JSON :P
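
If it went the JSON route, I imagine the client side would be about
this simple. This is only a sketch under my own assumptions: the schema
("schema_version", "modules", "name", "version", "archive", "sha256"),
the archive path, and the version number are invented, and the archive
bytes are a stand-in for a real download.

```python
import hashlib
import json


def load_index(text):
    """Parse the root index and return its modules keyed by name."""
    index = json.loads(text)
    if index.get("schema_version") != 1:
        raise ValueError("unsupported index schema version")
    return {m["name"]: m for m in index["modules"]}


def verify_archive(entry, data):
    """Check downloaded archive bytes against the checksum in the index."""
    return hashlib.sha256(data).hexdigest() == entry["sha256"]


# Stand-in for a downloaded module archive.
archive_bytes = b"placeholder archive contents"

# Hypothetical root index, built here so the checksum matches the stand-in.
index_text = json.dumps({
    "schema_version": 1,
    "modules": [{
        "name": "KJV",
        "version": "2.9",
        "archive": "archives/KJV-2.9.zip",
        "sha256": hashlib.sha256(archive_bytes).hexdigest(),
    }],
})

index = load_index(index_text)
ok = verify_archive(index["KJV"], archive_bytes)
```

With one archive per module version plus a checksum in the index, a
client that catches the server mid-update would see a mismatch instead
of silently installing mixed files.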

Thanks for the feedback!
Aaron

> Best regards,
> Jaak
> 
> 
> [*] Rewriting just the repository logic would not prevent other
> libsword parser bugs from being exploited.