[sword-devel] imp2ld and alphabetization
peter
refdoc at gmx.net
Sun Oct 28 17:42:55 MST 2007
Is this really only a Vietnamese problem? Won't all Latin-based
scripts with extra signs have exactly the same problem?
Or actually all scripts which are treated as derived scripts - Farsi,
Urdu and Malay from Arabic, Tajik, Uzbek, Azeri from Russian etc. - the
code points are initially for the "main" characters and then there is
always a bunch of extra characters which are used only in one or another
language.
But maybe I am just showing my ignorance here. I need to look at some
dictionaries - never had any installed.
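
A small Python sketch of the behaviour discussed in the quoted messages
below. It is not SWORD or imp2ld code: the key list and the lookup()
helper are made up for illustration, and only "ác bá" comes from the
thread. It shows that sorting by code point pushes keys starting with
"á" or "đ" behind the plain ASCII ones, and that a binary search which
compares by code point stops being reliable once the keys are kept in
hand-sorted dictionary order instead.

```python
import bisect

# Hypothetical keys, hand-ordered the way a dictionary would list them.
alphabetical = [
    "\u00e1c b\u00e1",   # "ác bá" (the sample entry from the thread)
    "ba",
    "\u0111\u00e1",      # "đá"
    "xa",
]

# Python compares strings code point by code point, so a plain sort
# gives code point order: "á" (U+00E1) and "đ" (U+0111) land after "x".
by_codepoint = sorted(alphabetical)


def lookup(sorted_keys, key):
    """Binary search; assumes sorted_keys is in code point order."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return i
    return -1


if __name__ == "__main__":
    print(by_codepoint)
    print(lookup(by_codepoint, "xa"))   # found
    print(lookup(alphabetical, "xa"))   # missed: list is not in code point order
```

So keeping the IMP order untouched would presumably also need a lookup
that does not rely on code point order, or a separate sort key per
entry.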
Daniel Owens wrote:
> Chris,
>
> I imagine that with most languages, sorting according to unicode codepoint order
> works, but for Vietnamese it doesn't, probably because the majority of letters
> are standard Latin characters, but then some are less usual ("đ" being a good
> example).
>
> This is probably very low on the priority list and I'm not sure how much work
> this would involve, but I would suggest at some point adding an option to the
> command line syntax for imp2ld either to 1. sort the order of keys according to
> unicode (default) or 2. retain the order of the IMP file (not sort at all). That
> way languages that do not alphabetize well according to the codepoint order in
> Unicode can remain in alphabetical order (assuming the module creator sorted
> correctly).
>
> Daniel
>
> Chris Little wrote:
>> Daniel,
>>
>> The order of keys in an LD module is according to the codepoint order in
>> Unicode. The keys are kept in this order in order to permit binary
>> searching. There is currently no way to perform localized collation.
>>
>> The platform and locale shouldn't play a role in this. If they do, it's
>> a bug.
>>
>> --Chris
>>
>> Daniel Owens wrote:
>>
>>> I am working on creating dictionary modules based on the Free Vietnamese
>>> Dictionary Project. The Vietnamese-English dictionary is working, but
>>> some words are not in alphabetical order, and I am trying to find out
>>> how to maintain the original alphabetization.
>>>
>>> I noticed this when all of the words beginning with a vowel having
>>> diacritics/tones or beginning with a "đ" were sorted to the end of the
>>> dictionary. The DAT file maintains the original order, which is more
>>> accurate. It must be that the IDX file generated by imp2ld creates its
>>> own index and alphabetizes according to its own scheme. The entries of
>>> each word are tagged as ThML. Here is a slightly random entry:
>>>
>>> $$$ác bá
>>> <entry key="ác bá" type="main" id="n20"><b>ác bá</b><br />[noun]<br />-
>>> Cruel landlord, village tyrant<br /></entry>
>>>
>>> Is there a way to keep imp2ld from changing the order of the index? I am
>>> happy to send someone the IMP file if that helps. I pasted the CONF file
>>> at the bottom of this message.
>>>
>>> Daniel
>>>
>>> CONF File:
>>>
>>> [VietAnh]
>>> DataPath=./modules/lexdict/rawld4/vietanh/vietanh
>>> ModDrv=RawLD4
>>> Encoding=UTF-8
>>> SourceType=THML
>>> SwordVersionDate=2007-10-27
>>> Version=1.0
>>> Lang=vi
>>> Description=FVDP Vietnamese-English Dictionary
>>> About=- This is the Vietnamese-English dictionary database of the Free
>>> Vietnamese Dictionary Project. It contains more than 23.400 entries with
>>> definitions and illustrative examples.\par\par- This database was
>>> compiled by Ho Ngoc Duc and other members of the Free Vietnamese
>>> Dictionary Project
>>> (http://www.informatik.uni-leipzig.de/~duc/Dict/)\par\par- Copyright (C)
>>> 1997-2003 The Free Vietnamese Dictionary Project\par\par- This program
>>> is free software; you can redistribute it and/or modify it under the
>>> terms of the GNU General Public License as published by the Free
>>> Software Foundation; either version 2 of the License, or (at your
>>> option) any later version. This program is distributed in the hope that
>>> it will be useful, but WITHOUT ANY WARRANTY; without even the implied
>>> warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>>> GNU General Public License for more details.
>>> TextSource=http://www.informatik.uni-leipzig.de/~duc/Dict/
>>>
>>>
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
>>>
>>
>>
>>
>
>