[sword-devel] imp2ld and alphabetization
DM Smith
dmsmith555 at yahoo.com
Sun Oct 28 19:57:10 MST 2007
I'm not sure if I am reading the Sword code correctly, but it appears
that it is sorting at the byte level and not at the character level.
That is, it isn't sorting by code points.
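
As a rough illustration of the symptom (a sketch only, not Sword code), the Python below sorts a few made-up keys by their raw UTF-8 bytes and then by a case- and accent-folded key. The function names and the sample keys are assumptions for the example, and the folding is a crude stand-in for real collation, not what ICU does.

```python
import unicodedata


def byte_order(keys):
    """Sort keys by their raw UTF-8 bytes, as a byte-level compare would."""
    return sorted(keys, key=lambda k: k.encode("utf-8"))


def folded_order(keys):
    """Sort keys ignoring case and accents (a crude stand-in for collation)."""
    def fold(key):
        decomposed = unicodedata.normalize("NFD", key)
        base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return base.casefold()
    # The original key is a tie-breaker so the result is deterministic.
    return sorted(keys, key=lambda k: (fold(k), k))


keys = ["zebra", "Zulu", "apple", "éclair"]
print(byte_order(keys))    # ['Zulu', 'apple', 'zebra', 'éclair']
print(folded_order(keys))  # ['apple', 'éclair', 'zebra', 'Zulu']
```

The first result puts the upper case key first and the accented key after "z", which is the ordering users complain about.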
I think that we discussed this a little bit ago and concluded that
some work needs to be done in the engine.
Here is my thought on the matter, for what it is worth. Today the sort
serves two purposes: order and search. But it is search that
constrains the order to be as it is. I think that if we could search
independently of the order of keys in the module that would be ideal.
One simple way for any application to provide this is to create a
Lucene index for the dictionary (similar to what we do for a Bible)
that consists of the term (stored and indexed), the offset in the
module (stored, so that it can be retrieved and the previous and next
indexes can be found), and the entry for the term (indexed, but not
stored). The application can then create any kind of collation of the
keys (using the excellent facilities of ICU) that suits the user's
needs. Then, using this double handle, it can present the keys in
part (as in BibleCS) or in whole (as in BibleDesktop, MacSword, ...)
in the order that the user expects.
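
To make the "double handle" idea concrete, here is a minimal in-memory sketch in Python. It does not use Lucene or ICU: a plain dict stands in for the index, and collation_key is a simple case- and accent-folding placeholder for a real ICU collator. The names Entry, build_index, display_order and the sample keys are all hypothetical.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    term: str    # the dictionary key (the "stored and indexed" term)
    offset: int  # position of the key in the module's own order (stored)


def collation_key(term):
    """Placeholder for an ICU collator: ignore case and accents."""
    decomposed = unicodedata.normalize("NFD", term)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), term)


def build_index(module_keys):
    """Map each term to its entry; the offset records the module order."""
    return {term: Entry(term, offset) for offset, term in enumerate(module_keys)}


def display_order(index):
    """Order the entries for presentation, independent of module order."""
    return sorted(index.values(), key=lambda e: collation_key(e.term))


# Keys as they might sit in a module that is sorted by bytes.
module_keys = ["Zion", "abba", "zeal", "Ábel"]
index = build_index(module_keys)

for entry in display_order(index):
    print(entry.term, entry.offset)
```

The module keeps whatever order the engine needs, while the application shows the keys in the collated order and uses the stored offset to fetch an entry or step to its neighbours in the module.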
There are some related problems to this:
A user may expect to be able to find a Hebrew word in a Hebrew
dictionary independent of the pointing of the word in the dictionary.
(i.e. a user may wish to search without specifying accents)
A user may expect to find a word by stem not just by prefix.
A user may expect to be able to type "photos" (a transliteration) and
find the real Greek word in a Greek dictionary.
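
A small sketch of the first and third cases, in Python and under loud assumptions: strip_marks drops every Unicode combining mark (which covers Hebrew points and Greek accents), and TOY_GREEK_TO_LATIN is a made-up table covering only the letters in the sample word, not a real transliteration scheme. Stem search is not covered here. All names are hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (Hebrew points, Greek accents) and fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# Toy table, only enough for the sample word. casefold() turns a final
# sigma into a regular sigma, so both forms are listed.
TOY_GREEK_TO_LATIN = {"φ": "ph", "ω": "o", "τ": "t", "ο": "o", "σ": "s", "ς": "s"}


def romanize(text):
    """Transliterate a key with the toy table, after stripping marks."""
    return "".join(TOY_GREEK_TO_LATIN.get(ch, ch) for ch in strip_marks(text))


def build_lookup(keys):
    """Index every key under its unpointed form and its romanized form."""
    lookup = {}
    for key in keys:
        for form in {strip_marks(key), romanize(key)}:
            lookup.setdefault(form, []).append(key)
    return lookup


def find(lookup, query):
    """Look up a query after applying the same folding as the index."""
    return lookup.get(strip_marks(query), [])


# A pointed Hebrew key (shin, shin dot, qamats, lamed, vav, holam, final mem).
POINTED = "\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"
UNPOINTED = "\u05E9\u05DC\u05D5\u05DD"

keys = [POINTED, "φωτός"]
lookup = build_lookup(keys)

print(find(lookup, UNPOINTED) == [POINTED])  # True
print(find(lookup, "photos") == [keys[1]])   # True
```

The point is that the folding is applied both when the lookup table is built and when the query is read, so the keys in the module never have to change.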
I'm cross-posting to J-Sword because this will be of interest there
as well.
In His Service,
DM Smith
On Oct 28, 2007, at 9:13 PM, Frank wrote:
> peter wrote:
>> Is this really only a Vietnamese problem, but will not all latinate
>> scripts with extra signs have exactly the same problem?
>>
>> Or actually all scripts which are treated as derived scripts -
>> Farsi, Urdu and Malay from Arabic, Tajik, Uzbek, Azeri from Russian
>> etc - the code points are initially for the "main" characters and
>> then there is always a bunch of extra characters which are used only
>> in one or other language.
>>
>> But maybe I am just showing my ignorance here. I need to look at some
>> dictionaries - never had any installed.
> Any language that uses letters outside the ASCII range will be
> affected unless they collate the letter after "z"... and if it's
> strictly in Unicode code point order, then all upper case will
> collate before lower case...
>
> --
> Blessings
>
> Frank