[sword-devel] imp2ld and alphabetization

peter refdoc at gmx.net
Sun Oct 28 17:42:55 MST 2007


Is this really only a Vietnamese problem? Won't all Latin-based scripts
with extra signs have exactly the same problem?

Or actually all scripts which are treated as derived scripts - Farsi,
Urdu and Malay from Arabic; Tajik, Uzbek and Azeri from Russian, etc. - the
code points come first for the "main" characters, and then there is
always a bunch of extra characters which are used only in one or another
language.
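
To make the effect concrete, here is a minimal Python sketch (not SWORD
code; the word list and all function names are made up for illustration).
It shows that plain code point ordering pushes "ác bá" and "đá" behind
"zebra", and that binary search still works as long as lookups use the
same ordering. The "rough" key at the end is only a crude stand-in that
strips diacritics; it is an assumption for the sake of comparison, not
real Vietnamese collation.

```python
import bisect
import unicodedata

# Hypothetical sample keys, not taken from the FVDP data.
WORDS = ["zebra", "ác bá", "đá", "an", "da", "ba"]


def nfc(text):
    """Normalize to NFC so accented letters are single code points."""
    return unicodedata.normalize("NFC", text)


def codepoint_sorted(keys):
    """Sort keys by Unicode code point (Python's default str ordering)."""
    return sorted(nfc(k) for k in keys)


def lookup(sorted_keys, key):
    """Binary search in a code-point-sorted list; return index or -1."""
    key = nfc(key)
    i = bisect.bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1


def rough_alphabetical_key(word):
    """Crude approximation: drop combining marks and fold đ to d.

    This is NOT proper Vietnamese collation, only an illustration.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    base = base.replace("đ", "d").replace("Đ", "D")
    # The original word breaks ties between keys with the same base.
    return (base.casefold(), nfc(word))


def rough_sorted(keys):
    """Sort keys with the crude approximation above."""
    return [nfc(k) for k in sorted(keys, key=rough_alphabetical_key)]


if __name__ == "__main__":
    by_codepoint = codepoint_sorted(WORDS)
    print(by_codepoint)                 # accented entries land at the end
    print(lookup(by_codepoint, "đá"))   # still found by binary search
    print(rough_sorted(WORDS))          # closer to dictionary order
```

The point is only that the index order and the search have to agree; any
other ordering would need a matching comparison at lookup time.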

But maybe I am just showing my ignorance here. I need to look at some
dictionaries - never had any installed.


Daniel Owens wrote:
> Chris,
> 
> I imagine that with most languages, sorting according to unicode codepoint order 
> works, but for Vietnamese it doesn't, probably because the majority of letters 
> are standard Latin characters, but then some are less usual ("đ" being a good 
> example).
> 
> This is probably very low on the priority list and I'm not sure how much work 
> this would involve, but I would suggest at some point adding an option to the 
> command line syntax for imp2ld either to 1. sort the order of keys according to 
> unicode (default) or 2. retain the order of the IMP file (not sort at all). That 
> way languages that do not alphabetize well according to the codepoint order in 
> Unicode can remain in alphabetical order (assuming the module creator sorted 
> correctly).
> 
> Daniel
> 
> Chris Little wrote:
>> Daniel,
>>
>> The order of keys in an LD module is according to the codepoint order in 
>> Unicode. The keys are kept in this order to permit binary 
>> searching. There is currently no way to perform localized collation.
>>
>> The platform and locale shouldn't play a role in this. If they do, it's 
>> a bug.
>>
>> --Chris
>>
>> Daniel Owens wrote:
>>   
>>> I am working on creating dictionary modules based on the Free Vietnamese 
>>> Dictionary Project. The Vietnamese-English dictionary is working, but 
>>> some words are not in alphabetical order, and I am trying to find out 
>>> how to maintain the original alphabetization.
>>>
>>> I noticed this when all of the words beginning with a vowel having 
>>> diacritics/tones or beginning with a "đ" were sorted to the end of the 
>>> dictionary. The DAT file maintains the original order, which is more 
>>> accurate. It must be that the IDX file generated by imp2ld creates its 
>>> own index and alphabetizes according to its own scheme. The entries of 
>>> each word are tagged as ThML. Here is a slightly random entry:
>>>
>>> $$$ác bá
>>> <entry key="ác bá" type="main" id="n20"><b>ác bá</b><br />[noun]<br />- 
>>> Cruel landlord, village tyrant<br /></entry>
>>>
>>> Is there a way to keep imp2ld from changing the order of the index? I am 
>>> happy to send someone the IMP file if that helps. I pasted the CONF file 
>>> at the bottom of this message.
>>>
>>> Daniel
>>>
>>> CONF File:
>>>
>>> [VietAnh]
>>> DataPath=./modules/lexdict/rawld4/vietanh/vietanh
>>> ModDrv=RawLD4
>>> Encoding=UTF-8
>>> SourceType=THML
>>> SwordVersionDate=2007-10-27
>>> Version=1.0
>>> Lang=vi
>>> Description=FVDP Vietnamese-English Dictionary
>>> About=- This is the Vietnamese-English dictionary database of the Free 
>>> Vietnamese Dictionary Project. It contains more than 23.400 entries with 
>>> definitions and illustrative examples.\par\par- This database was 
>>> compiled by Ho Ngoc Duc and other members of the Free Vietnamese 
>>> Dictionary Project 
>>> (http://www.informatik.uni-leipzig.de/~duc/Dict/)\par\par- Copyright (C) 
>>> 1997-2003 The Free Vietnamese Dictionary Project\par\par- This program 
>>> is free software; you can redistribute it and/or modify it under the 
>>> terms of the GNU General Public License as published by the Free 
>>> Software Foundation; either version 2 of the License, or (at your 
>>> option) any later version. This program is distributed in the hope that 
>>> it will be useful, but WITHOUT ANY WARRANTY; without even the implied 
>>> warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 
>>> GNU General Public License for more details.
>>> TextSource=http://www.informatik.uni-leipzig.de/~duc/Dict/
>>>
>>>
>>>
>>> _______________________________________________
>>> sword-devel mailing list: sword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
>>>     
>>
> 
> 



