[sword-devel] imp2ld encoding problem

Yiguang Hu yighu at yahoo.com
Tue Nov 22 06:05:18 MST 2005


Thanks, Chris. I set the encoding to UTF-8 in the .conf file,
but it didn't work.
By "trying different encodings" I mean that I opened the generated
LD database (the *.dat file) in Word and in a text editor and
selected different encodings (GB2312, BIG5, UTF-8, CN2202, etc.).
None of them worked.
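
One way to take Word and the text editor out of the picture is to
test the bytes directly. The following is only a minimal sketch,
assuming Java 7 or later; the class and method names (Utf8Check,
isValidUtf8, decodeAsLatin1) are made up for illustration and are
not part of SWORD. It checks strictly whether a byte array is
well-formed UTF-8, and shows what UTF-8 bytes turn into when a
viewer decodes them as a single-byte encoding such as ISO-8859-1.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting characters.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes the bytes the way a viewer set to ISO-8859-1 would:
    // every byte becomes one character, so multi-byte text turns into garbage.
    static String decodeAsLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Three Chinese characters, written as escapes so that the
        // encoding of this source file does not matter.
        byte[] utf8 = "\u4FEE\u9053\u9662".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(utf8));             // true
        System.out.println(decodeAsLatin1(utf8).length()); // 9: one char per byte

        // A two-byte sequence that is not well-formed UTF-8.
        byte[] notUtf8 = {(byte) 0xC4, (byte) 0xE3};
        System.out.println(isValidUtf8(notUtf8));          // false
    }
}
```

To try this on a real file, the bytes could be read with
java.nio.file.Files.readAllBytes and passed to isValidUtf8. This
assumes the .dat file holds the entry text unmodified, so treat the
result as a rough indication only.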

I am not familiar with C (I assume imp2ld is a C program), since
I have not coded in it for several years, so I don't know whether
a hidden conversion could be taking place. Java does have hidden
conversions if encodings are not specified correctly.
The heart of that problem is the difference between
  String str = new String(bytes, ENCODING)  and  str = new String(bytes)
and between
  byte[] bt = str.getBytes(ENCODING)  and  bt = str.getBytes()
If ENCODING is not specified, the default encoding is picked up
from the JVM environment, and the data will be corrupted if that
default is ASCII (for example, under an en_US locale) while the
data actually consists of DBCS or MBCS characters such as Chinese,
whatever encoding they are in. These conversions are very common
in Java and can cause problems, for example when converting stream
bytes into a string or writing a string to a file through a stream.
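
The difference can be sketched as follows. This is only an
illustration, assuming Java 8 or later; ExplicitCharsetDemo,
roundTrip and writeUtf8 are hypothetical names, not code from my
converter programs or from SWORD. It shows that encoding and
decoding with an explicitly named charset preserves Chinese text,
that an ASCII charset loses it, and that a stream writer can be
given the charset explicitly instead of relying on the default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encodes and decodes with the same explicitly named charset.
    static String roundTrip(String text, Charset cs) {
        byte[] bytes = text.getBytes(cs);
        return new String(bytes, cs);
    }

    // Writes text through a stream with an explicit charset, so the
    // result does not depend on the JVM's default encoding.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Escapes keep the example independent of the source file encoding.
        String original = "\u4FEE\u9053\u9662";

        // What new String(bytes) and str.getBytes() would silently use.
        System.out.println("default charset: " + Charset.defaultCharset());

        // UTF-8 can represent the text, so the round trip is lossless.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // true

        // US-ASCII cannot; each character is replaced by '?'.
        System.out.println(roundTrip(original, StandardCharsets.US_ASCII)); // ???

        // Three characters at three bytes each in UTF-8.
        System.out.println(writeUtf8(original).length); // 9
    }
}
```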

Could there be a similar issue in C/C++?

Thanks
Yiguang


--- Chris Little <chrislit at crosswire.org> wrote:

> imp2ld faithfully converts an IMP file to an LD database. There is no
> text encoding transformation of the data involved, so what you put in
> your file is exactly what will be placed in the module and is exactly
> what you will get back (from a front-end or mod2imp).
>
> The invalid character warning can be ignored. The only character
> transformations that imp2ld performs relate to sorting the dictionary
> keys, so the worst case would involve entries in the wrong order.
> (Correct me if I'm wrong about this Troy.)
>
> I'm not sure what you meant about trying different encodings. Which
> values did you try? The .conf file for your module should include a
> line that says "Encoding=UTF-8" if you have UTF-8 input.
> 
> --Chris
> 
> 
> Yiguang Hu wrote:
> > I ran into an encoding problem when I tried to use imp2ld
> > to convert a Chinese theology terms/encyclopedia file into
> > a module that SWORD can use. The input text file is UTF-8
> > encoded, with the format:
> > $$$English KeyWord Chinese Translation
> > The meaning of the term
> > $$$....
> > For example:
> > $$$Abbess &#22899;&#20462;&#36947;&#38498;&#38263;
> > 
> >
>
&#12288;&#28858;&#22899;&#20462;&#36947;&#38498;&#20043;&#22899;&#38936;&#34966;&#65292;&#20854;&#32887;&#20219;&#19981;&#22914;&#30007;&#20462;&#36947;&#38498;&#38263;&#35373;&#31435;&#20043;&#26089;&#65292;&#20854;&#27402;&#20134;&#19981;&#22914;&#30007;&#20462;&#36947;&#38498;&#38263;&#20043;&#22823;&#12290;&#26377;&#26178;&#20134;&#31649;&#29702;&#30007;&#20462;&#36947;&#38498;&#12290;
> > $$$Abbey &#20462;&#36947;&#38498;
> > 
> >
>
&#12288;&#21448;&#31281;*Monastery&#12290;&#21407;&#28858;&#19968;&#20462;&#36947;&#22763;&#22296;&#20043;&#21517;&#31281;&#65292;&#30001;&#19968;&#20301;&#38498;&#38263;&#31649;&#29702;&#12290;&#20197;&#24460;&#20182;&#20497;&#25152;&#23621;&#20303;&#20043;&#23627;&#23431;&#12289;&#31150;&#25308;&#22530;&#31561;&#65292;&#27010;&#31281;&#28858;&#20462;&#36947;&#38498;&#12290;
> > $$$Abbot &#20462;&#36947;&#38498;&#38263;
> > 
> >
>
&#12288;&#28858;&#20462;&#36947;&#38498;&#38936;&#34966;&#20043;&#31281;&#65292;&#24847;&#21363;&#29238;&#20063;&#12290;&#20462;&#36947;&#38498;&#38263;&#21407;&#20418;&#24179;&#20449;&#24466;&#65292;&#24478;&#31532;&#19971;&#19990;&#32000;&#36215;&#65292;&#25945;&#26371;&#23450;&#28858;&#32854;&#32887;&#12290;&#36890;&#24120;&#28858;&#20854;&#26412;&#38498;&#24351;&#20804;&#25152;&#36984;&#33289;&#65292;&#20854;&#32887;&#20219;&#20035;&#32066;&#36523;&#12290;
> > $$$Abbot, George
> > &#38463;&#27874;&#29305;&#65288;1562-1633&#65289;
> > 
> >
>
&#12288;&#33521;&#22283;&#25945;&#23447;&#65307;&#22350;&#29305;&#24067;&#37324;&#22823;&#20027;&#25945;&#65307;*&#32854;&#32147;&#27453;&#23450;&#26412;&#30340;&#21512;&#32232;&#32773;&#12290;
> > $$$Abelard, Peter or Abailard
> > &#20126;&#27604;&#25289;&#65288;1079-1142&#65289;
> > 
> > I used imp2ld to generate the module. There were many
> > errors about invalid characters, but it nevertheless
> > generated the module. The problem is that the module's
> > characters are saved in the wrong encoding. I tried
> > different encodings to read it, and none of them makes the
> > characters understandable, as shown below:
> > Abbey 修道院
> > 
> >
>
 又稱*Monastery。原為一修道士團之名稱,由一位院長管理。以後他們所å±
住之屋宇、禮拜å
> ‚等,概稱為修道院。
> > 
> > Abbot 修道院長
> > 
> >  為修道院é
>
˜è¢–之稱,意即父也。修道院長原係平信徒,從第七世紀起,教會定為聖職。通常為å
¶æœ¬é™¢å¼Ÿå
„所選舉,å
¶è·ä»»ä¹ƒçµ‚身。
> > 
> > Abbot, George 阿波特(1562-1633)
> > 
> > Has anyone experienced this, and does anyone know how to
> > solve this problem?
> >
> > BTW, I have a couple of short Java programs that
> > generate the above dictionary format and Bible
> > text, so you can use imp2vs and imp2ld to convert
> > them into SWORD modules. I will be glad to put the
> > code somewhere to share if anyone is interested in it.
> > Thanks
> > Yiguang
> 
> 



		