[sword-devel] French ligatures in Louis SÉGOND’s text
Chris Little
chrislit at crosswire.org
Sun Jul 15 23:16:03 MST 2007
Leandro Guimarães Faria Corcete DUTRA wrote:
> Chris Little <chrislit at crosswire.org> writes:
>
>> We could change oe to oe-ligature where appropriate in Louis Segond.
>> That would be simple enough since editions exist online that use
>> oe-ligature correctly.
>
> Also, it is not that many words using that… cœur, sœur, mœur…
>
> Is there anyone to do it already, or should I do it?
WikiSource already has a copy with oe-lig that we could use. No need to
repeat the work.
>> However, since we won't be doing language-specific search tweaks
>
> That is not what I meant — I mean a general fix, where ligatures at
> the search box would find expanded characters, and vice‐versa. Just like
> Google does it, with all kinds of European ligatures.
There's a simplistic solution for searching like you suggest:
decompose ligatures into their components as part of the strip filter
process. That will work fine for French, I suppose, and Latin, but it
would return incorrect results in other languages. In Norwegian,
ae-ligature is a letter on its own, not related to a or e. In Swedish
the same letter is written as a-umlaut. In Icelandic, oe-ligature
shouldn't be decomposed to oe either.
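To make the language dependence concrete, here is a minimal sketch of a ligature fold that is only applied when the text's language treats the ligature as a pair of letters. It is not SWORD's strip filter: the names `LIGATURE_FOLDS`, `FOLD_LANGUAGES` and `fold_for_search` are hypothetical, and the list of "safe" languages is an assumption based on the French and Latin cases above.

```python
# Hypothetical sketch; not SWORD code. All names here are illustrative.
# Map each ligature to its expanded spelling.
LIGATURE_FOLDS = str.maketrans({"œ": "oe", "Œ": "Oe", "æ": "ae", "Æ": "Ae"})

# Assumption: only these language codes treat œ/æ as ligatures of two letters.
FOLD_LANGUAGES = {"fr", "la"}


def fold_for_search(text: str, lang: str) -> str:
    """Expand œ/æ only when the language treats them as ligatures."""
    if lang in FOLD_LANGUAGES:
        return text.translate(LIGATURE_FOLDS)
    # Norwegian, Danish, Icelandic, etc.: leave the letter untouched.
    return text
```

With this, a French query for "soeur" and a text containing "sœur" fold to the same string, while Norwegian "bær" is left alone. The cost is that the filter needs to know the language of the text, which a language-independent strip filter does not.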
Should umlauted letters be decomposed also? So a-umlaut becomes ae,
o-umlaut becomes oe, u-umlaut becomes ue--which works fine for German,
but I doubt it would for many other languages. And what about i-umlaut and
e-umlaut? And what about letters with accents? Some languages would
simply drop the accent, others would double the letter, and there may be
other behaviors I don't know about.
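The two behaviors mentioned here (German-style expansion versus simply dropping the mark) can be sketched side by side. This is only an illustration under stated assumptions: `german_fold` and `strip_marks` are hypothetical helper names, and neither is presented as what any SWORD filter actually does.

```python
import unicodedata

# Hypothetical German-style expansion table (illustrative only).
GERMAN_UMLAUTS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def german_fold(text: str) -> str:
    """Expand umlauted vowels the way German spelling conventions allow."""
    return text.translate(GERMAN_UMLAUTS)


def strip_marks(text: str) -> str:
    """Drop diacritics: decompose to NFD, then remove combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The same input gives different search keys depending on which rule is chosen: "Müller" becomes "Mueller" under `german_fold` but "Muller" under `strip_marks`. Which one is right depends on the language, which is the point of the paragraph above.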
The only ligatures that we could safely decompose without reference to
language are typographic ligatures, and we would never encode those as
ligatures in the first place.
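The distinction can be checked against Unicode's own normalization data. A small sketch using Python's standard `unicodedata` module (the helper name `compat_fold` is made up for illustration): compatibility decomposition (NFKD) expands a typographic ligature such as U+FB01, but leaves oe-ligature (U+0153) and ae-ligature (U+00E6) unchanged, because they carry no decomposition mapping.

```python
import unicodedata


def compat_fold(text: str) -> str:
    """Apply Unicode compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", text)


# U+FB01 LATIN SMALL LIGATURE FI is a typographic ligature: it decomposes.
print(compat_fold("\ufb01n"))                    # prints "fin"

# U+0153 LATIN SMALL LIGATURE OE has no decomposition: it is left as-is.
print(compat_fold("c\u0153ur") == "c\u0153ur")   # prints True
```

Note that NFKD also splits accented letters into a base letter plus a combining mark, so it is not by itself a ligature-only filter.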
I don't know how Google does what they do. They may do language
identification and language-specific processing of documents. But they
have a lot more data and horsepower at their disposal than we do.
> In the end it is a Unicode question, I guess?
It's not a Unicode question because Unicode doesn't deal with this
issue. The decomposition of oe-ligature to oe would be a
language-specific detail and is not encoded in any of Unicode's data sets.
>> since oe-ligature basically can't be typed on French keyboards
>
> Yes, but regardless of keyboards, we GNU/Linux users who love
> typography (admittedly a small subset) have it mapped and use it quite often.
I'm understandably more concerned with Windows users who would lose
functionality.
--Chris