[sword-devel] Chinese "words"

Mon, 30 Jun 2003 12:34:43 EDT

In a message dated 6/27/2003 5:23:34 PM Pacific Daylight Time, 
crenz-swordproject@web42.com writes:
>I think the right thing to do is to change your layout engine to support
>correct Chinese line wrapping, instead of adding space (which should not be
>there) to work around the limitation in the layout engine.

I second that.

>Neither have space after punctuation. No space, period.

Whoops... you're right. Thanks for the correction. I noticed people
don't always seem to do it the "correct" way, though, but I guess it
also depends on the font being used (i.e. the glyphs being the correct
width for punctuation marks).

>Second, there is no easy way to parse a word.

That's why I think it would be too complicated to build Chinese word
splitting into Sword, unless e.g. ICU starts to come with a nice
built-in option we can just use. It's just not worth the effort. It's
easier to just let the user make the guess himself.

Mozilla has a Unicode-based line breaker that can easily be ported to other
environments. ICU also has a line breaker, and its interface is very close to
the line breaker interface in Java.
These line breakers tell the app where (as an offset into the character
buffer) the next line break opportunity is. The app then calls the OS to find
out the width of the window and the width of the text, and decides whether to
break there or at the next opportunity. This basically replaces the "find me
the next space character" operation used in Western-only layout.
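
As a rough illustration of that loop, here is a small Python sketch. It is
not the Mozilla or ICU code: break_opportunities(), measure(), wrap() and the
NO_BREAK_BEFORE set are names made up for this example. The break rules are a
heavy simplification (break after a space or next to a wide character, never
before closing punctuation), and measure() just counts character cells where
a real app would ask the OS for the rendered width.

```python
import unicodedata

# Closing punctuation that must not start a line (small illustrative subset).
NO_BREAK_BEFORE = set("，。、；：？！）」』")


def is_wide(ch):
    """True for wide/fullwidth characters such as CJK ideographs."""
    return unicodedata.east_asian_width(ch) in ("W", "F")


def break_opportunities(text):
    """Return the buffer offsets at which a line break is allowed."""
    offsets = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur in NO_BREAK_BEFORE:
            continue  # keep closing punctuation on the same line as its text
        if prev == " " or is_wide(prev) or is_wide(cur):
            offsets.append(i)
    offsets.append(len(text))  # the end of the buffer is always a break
    return offsets


def measure(s):
    """Stand-in for the OS text-width call: wide characters take 2 cells."""
    return sum(2 if is_wide(ch) else 1 for ch in s)


def wrap(text, max_width):
    """Greedy wrap: break at the last opportunity that still fits."""
    lines = []
    start = 0        # offset where the current line begins
    last_fit = None  # most recent opportunity seen on the current line
    for off in break_opportunities(text):
        too_wide = measure(text[start:off].rstrip()) > max_width
        if too_wide and last_fit is not None:
            lines.append(text[start:last_fit].rstrip())
            start = last_fit
        last_fit = off
    lines.append(text[start:].rstrip())
    return lines
```

With these toy rules, wrap("the quick brown fox", 10) gives
["the quick", "brown fox"], and wrap("你好世界，我是中文。", 8) gives
["你好世", "界，我是", "中文。"]. No spaces had to be added to the Chinese
text, and the comma never lands at the start of a line.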

The Gecko-based line breaker interface is at
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/public/nsILineBreaker.h
The implementation is at
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp
(based on the Japanese layout standard)

>google implement very good Chinese search. Maybe you should look at how they
>do the search job.

Again, I think it's overkill for Sword.

Greetings,
   Christian
