[sword-devel] Chinese "words"

sword-devel@crosswire.org sword-devel@crosswire.org
Fri, 27 Jun 2003 17:51:40 EDT



In a message dated 27/06/2003 10:41:34 Pacific Daylight Time, 
crenz-swordproject@web42.com writes:
Sorry for being away for most of this month... am working my way
through 200+ sword-related e-mails and saw this one:

>NEW CHINESE TEXTS:  It seems in our older Union texts, we added spaces 
>between every character to help with line wraps and word breaks. 
I think the right thing to do is to change your layout engine to support
correct Chinese line wrapping, instead of adding spaces (which should not be
there) to work around the limitation in the layout engine.
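
To illustrate what wrapping without inserted spaces could look like, here is a rough sketch in Python. It assumes a break is allowed between any two characters, except that a line should not start with closing punctuation. The function name wrap_cjk and the small punctuation set are made up for this example, and a real layout engine would need fuller rules (mixed Latin text, opening brackets, and so on).

```python
# Punctuation that should not begin a line (an illustrative subset only).
NO_LINE_START = set("，。、；：？！）」』》")


def wrap_cjk(text, width):
    """Split text into lines of at most `width` characters, with no spaces added.

    A break may fall between any two characters, but if the next line
    would start with closing punctuation the break is moved back so the
    punctuation stays attached to the character before it.
    """
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Pull the break back while the next line would start with punctuation,
        # always keeping at least one character on the current line.
        while end < len(text) and end - start > 1 and text[end] in NO_LINE_START:
            end -= 1
        lines.append(text[start:end])
        start = end
    return lines


if __name__ == "__main__":
    for line in wrap_cjk("起初，神创造天地。", 4):
        print(line)
```

Joining the returned lines gives back the original text unchanged, so the module text itself never needs extra spaces.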
Is 
>this needed in the new NCV texts?  It seems they have spaces included at 
>certain places.  

Chinese texts usually don't have spaces except after punctuation
marks. 
There are no spaces after punctuation either. No spaces, period.
I'll install NCV and take a look at the spaces it has.

>I noticed this using the Hanzi dictionary which always 
>tried to lookup a 'word' instead of an individual glyph.

Chinese does have the concept of a "word", but it is very different from the
concept of a word in Latin-script languages.
First, spaces are not used to separate words.
Second, there is no easy way to parse a word.
Third, a word can be a single character or composed of 2-6 characters.
Fourth, there are compound words, so sometimes there is no easy way to tell
the boundary of a word even if you are a native Chinese speaker.
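
As a rough illustration of why this is hard, here is a sketch of one simple approach, greedy "forward maximum matching" against a word list. This is not how SWORD or any particular search engine does it; the lexicon below is a tiny made-up example, and the 6-character limit just follows the 2-6 character range mentioned above.

```python
MAX_WORD_LEN = 6  # assumption: words are at most 6 characters long


def segment(text, lexicon, max_len=MAX_WORD_LEN):
    """Split text into words by always taking the longest lexicon match.

    Characters that start no known word are emitted on their own.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


if __name__ == "__main__":
    # Hypothetical toy lexicon, for illustration only.
    lexicon = {"研究", "研究生", "生命", "起源"}
    print(segment("研究生命起源", lexicon))
    # prints ['研究生', '命', '起源']
```

The intended reading of that phrase is 研究 / 生命 / 起源 ("study the origin of life"), but the greedy match grabs 研究生 ("graduate student") first and gets it wrong. That is the kind of boundary ambiguity a dictionary lookup has to deal with.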

Google implements very good Chinese search. Maybe you should look at how they
handle searching.

I didn't do anything to make it look up a 'word'; in fact I don't know
how to make it look up an individual glyph only ;-). It is often not
very useful to look up only one character (imagine looking up "foot"
and "ball" vs. looking up "football". The first lets you guess at
the meaning, but the second gives the exact information). So it should
be possible to select a few characters and look them up in the
dictionary with the mouse or keyboard. However, for "standard lookup"
(i.e. without text being selected) looking up only the current character
instead of the whole 'word' would probably be more useful, since
with most modules the 'word' is going to be the whole line.
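
The lookup rule described here could be sketched roughly as follows. The function name lookup_key and the (start, end) selection tuple are hypothetical, not part of any existing frontend.

```python
def lookup_key(text, cursor, selection=None):
    """Return the string to send to the dictionary.

    If the user has selected a non-empty range, look up that range;
    otherwise fall back to the single character at the cursor.
    """
    if selection is not None:
        start, end = selection
        if start < end:
            return text[start:end]
    return text[cursor]


if __name__ == "__main__":
    line = "足球比赛"
    print(lookup_key(line, 1))          # no selection: just the character at index 1
    print(lookup_key(line, 1, (0, 2)))  # selection covering the first two characters
```

With 足球 ("football") selected the whole word is looked up; with nothing selected only 球 ("ball") is.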

Greetings,
   Christian
