[sword-devel] imp2ld encoding problem

Troy A. Griffitts scribe at crosswire.org
Tue Nov 22 17:29:36 MST 2005


Yiguang,
	Have you tried to open your imp file with a text editor that 
understands UTF-8?  vi (vim) should do fine.  If the data in the imp 
file looks ok (and indeed is UTF-8) then you should be good to go for 
sword.  Actually, a web browser might be easiest.  Just open your imp 
file with firefox and manually select the UTF-8 encoding and see if it 
shows up ok.
	Hope we can get things working well for you,
		-Troy.
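
A programmatic version of the same check, as a minimal sketch in Java (the language Yiguang mentions using). It decodes the file strictly as UTF-8 and reports whether any malformed byte sequence is found. The class name Utf8Check and the command-line path argument are illustrative assumptions, not part of sword or imp2ld.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true only if the whole byte array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the imp file to check (hypothetical example: dict.imp)
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(data) ? "valid UTF-8" : "NOT valid UTF-8");
    }
}
```

If this reports valid UTF-8 for the imp file, the input side is probably fine and the display side is the next thing to look at.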



Yiguang Hu wrote:
> Thanks Chris. I used UTF-8 in the .conf file. It
> didn't work.
> By trying different encodings, I mean I tried to use
> Word and a text editor to read the generated LD database
> (the *.dat file) by selecting different encodings (GB2312,
> BIG5, UTF-8, CN2202, etc.); none of them worked.
> 
> I am not familiar with the C language (I assume imp2ld is
> a C program) since I have not coded in it for several
> years, so I don't know whether some hidden
> conversion took place. Java does have some hidden
> conversions if encodings are not specified correctly.
> The heart of that problem is:
> String str = new String(bytes, ENCODING) vs. str = new
> String(bytes);
> and byte[] bt = str.getBytes(ENCODING) vs. bt = str.getBytes().
> If ENCODING is not specified, the default encoding is
> picked up from the JVM environment, and it will
> corrupt data if the default encoding is ASCII (for
> example, in an en_US locale) while the data are actually
> DBCS or MBCS characters such as Chinese, whatever
> encoding they are in. The above conversion is very
> common in Java and can cause problems, for example
> when converting stream bytes into a string or writing
> a string to a file using a stream.
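
The Java pitfall described above can be avoided by always naming the charset. A minimal sketch, assuming the data is UTF-8; the class and method names are illustrative only and are not taken from Yiguang's programs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharset {
    // Decode bytes as UTF-8, independent of the JVM's default charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode a string as UTF-8, independent of the JVM's default charset.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three Chinese characters of the "Abbey" entry, written as escapes
        // so the source file itself stays ASCII.
        String term = "\u4fee\u9053\u9662";

        byte[] utf8 = encode(term);
        System.out.println(utf8.length);               // 9 (three bytes per character)
        System.out.println(decode(utf8).equals(term)); // true

        // Lossy path: US-ASCII cannot represent these characters,
        // so getBytes substitutes '?' (byte 63) for each one.
        byte[] ascii = term.getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(ascii));    // [63, 63, 63]
    }
}
```

The same rule applies to readers and writers: construct them with an explicit charset instead of relying on the platform default.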
> 
> Could there be a similar issue in C/C++?
> 
> Thanks
> Yiguang
> 
> 
> --- Chris Little <chrislit at crosswire.org> wrote:
> 
> 
>>imp2ld faithfully converts an IMP file to an LD database. There is no
>>text encoding transformation of the data involved, so what you put in
>>your file is exactly what will be placed in the module and is exactly
>>what you will get back (from a front-end or mod2imp).
>>
>>The invalid character warning can be ignored. The only character
>>transformations that imp2ld performs relate to sorting the dictionary
>>keys, so the worst case would involve entries in the wrong order.
>>(Correct me if I'm wrong about this, Troy.)
>>
>>I'm not sure what you meant about trying different encodings. Which
>>values did you try? The .conf file for your module should include a line
>>that says "Encoding=UTF-8" if you have UTF-8 input.
>>
>>--Chris
>>
>>
>>Yiguang Hu wrote:
>>
>>>I ran into an encoding problem when I tried to use imp2ld
>>>to convert a Chinese dictionary of theology terms (an encyclopedia)
>>>into a module that sword can use. The input text file is
>>>UTF-8 encoded, with the format:
>>>$$$English KeyWord Chinese Translation
>>>The meaning of the term
>>>$$$....
>>>For example:
>>>$$$Abbess 女修道院長
>>>　為女修道院之女領袖，其職任不如男修道院長設立之早，其權亦不如男修道院長之大。有時亦管理男修道院。
>>>$$$Abbey 修道院
>>>　又稱*Monastery。原為一修道士團之名稱，由一位院長管理。以後他們所居住之屋宇、禮拜堂等，概稱為修道院。
>>>$$$Abbot 修道院長
>>>　為修道院領袖之稱，意即父也。修道院長原係平信徒，從第七世紀起，教會定為聖職。通常為其本院弟兄所選舉，其職任乃終身。
>>>$$$Abbot, George 阿波特（1562-1633）
>>>　英國教宗；坎特布里大主教；*聖經欽定本的合編者。
>>>$$$Abelard, Peter or Abailard 亞比拉（1079-1142）
>>>
>>>I used imp2ld to generate the module. There were many
>>>errors about invalid characters, but it nevertheless
>>>generated the module. The problem is that the module's
>>>characters are saved in the wrong encoding. I tried
>>>different encodings to read them, and none of them makes the
>>>characters understandable, as shown below:
>>>Abbey 修道院
>>>
>>>
>>
>  又稱*Monastery。原為一修道士團之名稱,由一位院長管理。以後他們所居住之屋宇、禮拜å
> 
>>‚等,概稱為修道院。
>>
>>>Abbot 修道院長
>>>
>>> 為修道院é
>>
> ˜è¢–之稱,意即父也。修道院長原係平信徒,從第七世紀起,教會定為聖職。通常為其本院弟兄所選舉,其職任乃終身。
> 
>>>Abbot, George 阿波特(1562-1633)
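
One possible reading of the sample above is that the module holds intact UTF-8 bytes which the viewer displays through a single-byte Western encoding. Under that assumption the garbling is reversible. The sketch below uses ISO-8859-1 as a stand-in for the viewer's encoding, which is not actually known from this thread; the class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulate a viewer that shows UTF-8 bytes as ISO-8859-1: one char per byte.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the misreading: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The three Chinese characters of the "Abbey" entry, as ASCII-safe escapes.
        String original = "\u4fee\u9053\u9662";

        String garbled = misread(original);
        System.out.println(garbled.length());                 // 9 (one char per UTF-8 byte)
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

If a round trip like this recovers readable text from what a viewer shows, the stored bytes are fine and only the viewer's encoding choice is wrong.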
>>>
>>>Has anyone experienced this and knows how to solve
>>>this problem?
>>>
>>>BTW, I have a couple of short Java programs that
>>>generate dictionary files and Bible text in the above
>>>format, so you can use imp2vs and imp2ld to convert
>>>them into sword modules. I will be glad to put the
>>>code somewhere to share if anyone is interested in it.
>>>
>>>Thanks
>>>Yiguang
>>
>>
> 
> 
> 
> 		
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page


