[sword-devel] Languages without a space between words

Greg Hellings greg.hellings at gmail.com
Mon Apr 17 12:51:21 EDT 2023


I don't believe you're going to get that sort of feature directly in the
engine's simple search.

However, if you're using a build of the library that utilizes CLucene or
Xapian, then word segmentation should be the job of those libraries. They are
supposed to handle that kind of functionality, provided the language has a
corresponding contribution to the library. It might be better to check with
those projects.

--Greg

On Mon, Apr 17, 2023 at 11:46 AM David Haslam <dfhdfh at protonmail.com> wrote:

> Unlike Hebrew, Arabic, etc., none of the names of the Thai Unicode characters
> contain the word FINAL. Likewise for Myanmar letters.
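
The claim about character names can be checked against the Unicode character database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `names_containing` is made up for illustration, and the block ranges are the Hebrew (U+0590 to U+05FF) and Thai (U+0E00 to U+0E7F) blocks. Other blocks can be passed in the same way.

```python
import unicodedata


def names_containing(word, first, last):
    """Return the names of assigned code points in [first, last]
    whose Unicode name contains `word` as a whole word."""
    found = []
    for cp in range(first, last + 1):
        # The second argument is returned for unassigned code points.
        name = unicodedata.name(chr(cp), "")
        if word in name.split():
            found.append(name)
    return found


hebrew = names_containing("FINAL", 0x0590, 0x05FF)
thai = names_containing("FINAL", 0x0E00, 0x0E7F)

print(len(hebrew), "Hebrew names contain FINAL, for example:", hebrew[:2])
print(len(thai), "Thai names contain FINAL")
```

The result depends on the Unicode version bundled with the Python build in use.
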
>
> A possible way forward might be to run one of the several Word
> Segmentation programs on the text of the ThaiKJV.
>
> Examples: KuCut, DeepCut, AttaCut
>
> This should insert a Unicode zero width non-joiner (ZWNJ) as a word
> separator.
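
As a rough illustration of the idea, here is a sketch that segments text and joins the words with an invisible separator. It does not call KuCut, DeepCut, or AttaCut, since their interfaces are not described in this thread; a toy greedy longest-match segmenter over a hypothetical word list stands in for them. The Thai sample and lexicon are invented for the example and are not taken from the ThaiKJV. The separator defaults to ZWNJ (U+200C) as proposed above; it is a parameter, so U+200B ZERO WIDTH SPACE, which Unicode describes as marking invisible word boundaries, could be passed instead.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, the separator proposed in the thread
ZWSP = "\u200b"  # ZERO WIDTH SPACE, an alternative invisible separator


def segment(text, lexicon):
    """Toy greedy longest-match segmenter (a stand-in for a real tool).
    Characters not covered by the lexicon become one-character words."""
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        limit = max(1, min(max_len, len(text) - i))
        # Try the longest candidate first; fall back to a single character.
        size = next(
            (n for n in range(limit, 0, -1) if text[i:i + n] in lexicon), 1
        )
        words.append(text[i:i + size])
        i += size
    return words


def mark_boundaries(text, lexicon, sep=ZWNJ):
    """Return `text` with `sep` inserted between segmented words."""
    return sep.join(segment(text, lexicon))


# Hypothetical lexicon and sample sentence, for illustration only.
lexicon = {"พระเจ้า", "ทรง", "รัก", "โลก"}
sample = "พระเจ้าทรงรักโลก"

marked = mark_boundaries(sample, lexicon)
print(segment(sample, lexicon))
print(marked.replace(ZWNJ, "") == sample)  # the visible characters are unchanged
```

A real segmenter would use a trained model or a much larger dictionary; greedy matching is only the simplest approach that shows where the separators would go.
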
>
> NB. The module would have to be updated using the segmented source text.
>
> Visually, the resulting text would display the same as the original, but
> the module would be amenable to indexing for word searches.
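
To show why segmented text would be amenable to indexing, here is a small sketch of a word index built by splitting on the invisible separators as well as on ordinary whitespace. This is not how the SWORD engine, CLucene, or Xapian builds an index; the function names and the verse keys are hypothetical, and Latin letters stand in for segmented Thai text.

```python
import re
from collections import defaultdict

# Split on whitespace plus ZWSP (U+200B) and ZWNJ (U+200C).
# The zero-width characters are not whitespace, so they are listed explicitly.
_BOUNDARY = re.compile(r"[\s\u200b\u200c]+")


def tokens(text):
    """Split text into words at whitespace or zero-width separators."""
    return [t for t in _BOUNDARY.split(text) if t]


def build_index(verses):
    """Map each word to the set of verse keys in which it occurs."""
    index = defaultdict(set)
    for key, text in verses.items():
        for word in tokens(text):
            index[word].add(key)
    return index


# Hypothetical pre-segmented verses.
verses = {
    "verse-1": "alpha\u200cbeta\u200cgamma",
    "verse-2": "beta\u200cdelta",
}
index = build_index(verses)
print(sorted(index["beta"]))
```
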
>
> A difficulty that might then arise is how the front-end user might enter
> the search query for an exact phrase search type (containing more than one
> word). Other search types (all words, any word) might be OK as is.
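
One possible way around the phrase-entry difficulty, sketched here under the assumption that the front-end can access the module text, is to strip the invisible separators from both the query and the verse before comparing, so the user can type the phrase without any separators. The function names are hypothetical. The caveat is that a plain substring test ignores word boundaries, so it can also match in the middle of a word.

```python
# Translation table that deletes ZWSP (U+200B) and ZWNJ (U+200C).
_INVISIBLE = {0x200B: None, 0x200C: None}


def strip_separators(text):
    """Remove zero-width word separators from text."""
    return text.translate(_INVISIBLE)


def contains_phrase(segmented_text, query):
    """True if the query occurs in the text once separators are ignored.
    An empty query never matches."""
    needle = strip_separators(query)
    return bool(needle) and needle in strip_separators(segmented_text)


verse = "alpha\u200cbeta\u200cgamma"  # hypothetical segmented verse
print(contains_phrase(verse, "betagamma"))   # phrase typed without separators
print(contains_phrase(verse, "gammaalpha"))  # words in the wrong order
```

An alternative would be to run the query through the same segmenter used for the module and then match consecutive words, which respects word boundaries but requires the segmenter to be available at search time.
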
>
> Aside: The KuCut method, developed in 2004, was originally trained using the
> text of the ThaiKJV.
>
> Regards,
>
> David
>
> Sent from Proton Mail for iOS
>
>
> On Mon, Apr 17, 2023 at 17:16, Peter Von Kaehne <refdoc at gmx.net> wrote:
>
> Do Thai, Burmese, etc. use end forms for letters? If so, are these
> encoded as such?
>
> Peter
>
>
> *Sent:* Monday, 17 April 2023 at 16:47
> *From:* "David Haslam" <dfhdfh at protonmail.com>
> *To:* sword-devel at crosswire.org
> *Subject:* [sword-devel] Languages without a space between words
> How (if at all) does the SWORD API generate a search index for a module
> that is for a language without a space between words?
>
> Please consider how best to generate a useful search index for modules that are
> for Bible translations in languages that have no spaces between words.
>
> Example: CrossWire module ThaiKJV
>
> See https://en.wikipedia.org/wiki/Category:Writing_systems_without_word_boundaries
>
> Has this ever been considered before?
>
> Best regards,
>
> David
>
> Sent from Proton Mail for iOS
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page