<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
My experience is with Perl and Java, but it may have bearing.<br>
Collation is language-dependent: English, French, and German each collate
their accented characters differently. In traditional Spanish collation, "ch"
is treated as a separate letter that sorts after "c" (though this has been changing).<br>
In Java, collation uses the provided locale; failing that, the
program's default locale, which, unless set explicitly, is the user's locale.<br>
I found that the same comparison logic was needed to do a binary search. So if
ICU is needed for sorting, then ICU will also be needed for a binary search.<br>
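For illustration, a minimal sketch of locale-aware sorting and searching with Java's built-in java.text.Collator (the titles here are made up):<br>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        // Under German collation, "Äpfel" sorts with the A's, not after "Zebra".
        Collator collator = Collator.getInstance(Locale.GERMAN);
        String[] words = { "Zebra", "Äpfel", "Apfel", "Banane" };
        Arrays.sort(words, collator);
        System.out.println(Arrays.toString(words)); // [Apfel, Äpfel, Banane, Zebra]
        // The same collator must be used for the binary search;
        // searching with a different ordering would miss entries.
        System.out.println(Arrays.binarySearch(words, "Banane", collator)); // 2
    }
}
```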
<br>
On one project, we had two fundamental requirements for a list of
40K+ international publication titles:<br>
1) For each supported locale, present the lists and sublists of
publications in the order that is appropriate for that locale.<br>
2) Provide efficient searching.<br>
<br>
To accomplish this we first had to normalize the name of each
publication. This requires knowing the language of the title so that that
language's stop words could be used ("Het Dagblad" and "The Podunk Times"
needed to sort under "Dagblad" and "Podunk Times", respectively, because
"Het" and "The" are stop words in their languages). We decided that while
an English speaker might look for "Het Dagblad" under "H", the publication's
locale was more important. We had tried a universal stop-word list, the
union of every language's stop words, but that did not work: "La" could be
the Spanish article or an abbreviation for Los Angeles, and "Die" means
very different things in English and German.<br>
We zero-padded numbers, removed stop words, folded everything to a single
case, removed some punctuation, and collapsed redundant spacing. There were
other normalizations, but these are the obvious ones we can all think
of.<br>
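A rough sketch of those normalization steps in Java; the stop-word list and the padding width here are made-up stand-ins for the real, per-language rules:<br>

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleNormalizer {
    // Hypothetical stop words; the real project kept a separate list per language.
    private static final String[] STOP_WORDS = { "the", "het" };
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static boolean isStopWord(String word) {
        for (String s : STOP_WORDS) if (s.equals(word)) return true;
        return false;
    }

    public static String normalize(String title) {
        String s = title.toLowerCase(Locale.ROOT);      // single-case everything
        s = s.replaceAll("\\p{Punct}+", " ");           // remove some punctuation
        // 0-pad numbers to a fixed width so "Vol 2" sorts before "Vol 10"
        Matcher m = NUMBER.matcher(s);
        s = m.replaceAll(r -> String.format("%08d", Long.parseLong(r.group())));
        // drop stop words and collapse redundant spacing
        StringBuilder out = new StringBuilder();
        for (String word : s.trim().split("\\s+")) {
            if (!isStopWord(word)) {
                if (out.length() > 0) out.append(' ');
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Het Dagblad"));      // dagblad
        System.out.println(normalize("The Podunk Times")); // podunk times
    }
}
```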
We then created a text table with the normalized title and the original
title; the remaining columns were numeric sort keys, one per supported
language.<br>
(This could also have been done with parallel tables.)<br>
The table was sorted on the normalized title, but using an 8-bit ASCII
collation.<br>
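One way to build such a sort-key column with plain Java (ICU offers the same idea through its own CollationKey): each supported locale's collator turns a title into a binary key that can be stored and later compared bytewise, with no collator needed at sort time. A sketch with made-up titles:<br>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SortKeyDemo {
    public static void main(String[] args) {
        String[] titles = { "Zeit", "Österreich heute", "Ostsee Zeitung" };
        Collator de = Collator.getInstance(Locale.GERMAN);
        // Precompute one binary sort key per title; a real table would hold
        // one such column per supported locale.
        byte[][] keys = new byte[titles.length][];
        for (int i = 0; i < titles.length; i++) {
            keys[i] = de.getCollationKey(titles[i]).toByteArray();
        }
        // Comparing the stored keys bytewise reproduces the German ordering
        // without re-running the collator.
        Integer[] order = { 0, 1, 2 };
        Arrays.sort(order, (a, b) -> Arrays.compareUnsigned(keys[a], keys[b]));
        for (int i : order) System.out.println(titles[i]);
        // Österreich heute, Ostsee Zeitung, Zeit
    }
}
```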
<br>
To do a search for an exact match, the user's input was normalized with
the exact same rules, and we then did a binary search.<br>
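In sketch form, assuming a trivial stand-in normalizer (the real rules were the ones described above):<br>

```java
import java.util.Arrays;
import java.util.Locale;

public class ExactMatchSearch {
    // The table's normalized-title column, pre-sorted in plain code-point
    // order (the "8-bit ASCII" collation; the normalized form here is all
    // lowercase ASCII, so String's natural order matches).
    private static final String[] NORMALIZED = {
        "dagblad", "monde", "podunk times", "zeit"
    };

    // Trivial stand-in for the real normalizer: lowercase and trim only.
    static String normalize(String input) {
        return input.toLowerCase(Locale.ROOT).trim();
    }

    public static int find(String userInput) {
        // Normalize the query with the exact same rules used to build the
        // column, then binary-search with the same ordering it was sorted by.
        return Arrays.binarySearch(NORMALIZED, normalize(userInput));
    }

    public static void main(String[] args) {
        System.out.println(find("  PODUNK TIMES ")); // 2
        System.out.println(find("no such title"));   // negative: not found
    }
}
```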
When the user wanted to do a free-text search, we used something like
Lucene to index the titles; the normalized form was stored with each title.<br>
To sort a list of titles in the order the user expects, we used the
appropriate column from the table (falling back to the default column
if the user's locale was not supported).<br>
<br>
We ultimately used Java to do the collation because Perl's UTF-8
support was not quite there (5.6 was the latest version at the time),
and we found that we needed ICU for some of the more specialized rules
that I have not presented here. ICU was not supported for Perl at the
time; I don't know where Perl stands now.<br>
<br>
BTW, this is something I could throw together in Java, if it is OK
to have some Sword tools in something other than C++.<br>
<br>
Daniel Glassey wrote:
<blockquote cite="mid30e46b3d05062215545b53addc@mail.gmail.com"
type="cite">
<pre wrap="">fwiw here's my opinion on what the standards should be. I definitely
agree that there should be standards.
On 22/06/05, Joachim Ansorg <a class="moz-txt-link-rfc2396E" href="mailto:nospam+sword-devel@joachim-ansorg.de"><nospam+sword-devel@joachim-ansorg.de></a> wrote:
</pre>
<blockquote type="cite">
<pre wrap="">Hi,
I'm struggling with the unicode stuff of lexicons and lexicons in general.
Currently a frontend doesn't know whether to expect keys as utf8 or as
something else. because there's no standard defined. The same is valid of
GenBooks.
</pre>
</blockquote>
<pre wrap=""><!---->
It seems reasonable to me that all text, keys, everything in all types
of modules should be in UTF-8.
</pre>
<blockquote type="cite">
<pre wrap="">Secondly, the sort order is not valid for Unicode if Unicode characters are
used in the entry names.
That way Unicode strings like the German "a umlaut" appear at the end, but
they should be among the first entries of the list. Sorting in the frontend
moves the lexicon intro somewhere into the middle of the list and is
slow(er).
</pre>
</blockquote>
<pre wrap=""><!---->
Unicode defines collation (sorting):
<a class="moz-txt-link-freetext" href="http://www.unicode.org/reports/tr10/">http://www.unicode.org/reports/tr10/</a>
The entries should be sorted using something that implements the
algorithm by the module creation app. ICU should do the job and
doesn't have to be linked into the runtime lib to be able to do this.
It only needs to be linked into the module creation app. The way it
collates is language specific so it should get German right.
I think perl and python should also be able to do collation so they
are another option.
</pre>
<blockquote type="cite">
<pre wrap="">Thirdly, the lexicon intro is a hack, it uses a lot of prepended spaces to be
in the first place of the list.
We need to find a better solution for that.
</pre>
</blockquote>
<pre wrap=""><!---->
Agreed (sorry, I don't have one offhand)
</pre>
<blockquote type="cite">
<pre wrap="">I'm missing defined standards for the API and the modules. That would make
frontend development a lot easier.
</pre>
</blockquote>
<pre wrap=""><!---->
Agreed,
Daniel
_______________________________________________
sword-devel mailing list: <a class="moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a>
<a class="moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page
</pre>
</blockquote>
</body>
</html>