[sword-devel] Text preparation for searching in SWORD [was: Soft hyphens]

Troy A. Griffitts scribe at crosswire.org
Thu Nov 2 04:47:35 MST 2017


SWORD has a number of filtering stages which run at different points
and in response to different events.

Specifically interesting for this discussion are "strip filters".  These
are called immediately before searching and should be called on the
search string before passing it to search:

ListKey results = module.search(module.stripText(searchTerm));

I am pretty sure most of our frontends do this.

This ensures not only that the module text is normalized for searching,
but also that the search term itself is normalized using the same rules.

We strip markup and other things from the module buffer before doing the
comparison.  We obviously aren't stripping soft hyphens but I suggest we
simply add the soft hyphen character to the list of characters we are
removing.
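
To illustrate the idea, here is a minimal standalone sketch, not the
actual SWORD filter code.  It assumes UTF-8 text, where the soft hyphen
(U+00AD) is encoded as the two bytes 0xC2 0xAD.  The names
stripSoftHyphens and matchesNormalized are hypothetical helpers invented
for this example and are not part of the SWORD API:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of SWORD): returns a copy of `in` with
// every UTF-8 encoded soft hyphen (U+00AD, bytes 0xC2 0xAD) removed.
std::string stripSoftHyphens(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c == 0xC2 && i + 1 < in.size() &&
            static_cast<unsigned char>(in[i + 1]) == 0xAD) {
            ++i;  // also skip the second byte of the soft hyphen
            continue;
        }
        out.push_back(in[i]);
    }
    return out;
}

// Hypothetical helper: a plain substring match performed after applying
// the same normalization to both the text and the search term.
bool matchesNormalized(const std::string &text, const std::string &term) {
    return stripSoftHyphens(text).find(stripSoftHyphens(term)) !=
           std::string::npos;
}
```

With this, a search for "moto" would match text containing "mo", a soft
hyphen, then "to", because both sides pass through the same stripping
step before they are compared.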

Additionally, each module can specify its own additional strip filters
with a conf entry:

LocalStripFilter=

But that would require a filter which can strip out soft hyphens, and I
don't believe one exists.

I have committed to trunk the addition of stripping out soft hyphens to
the strip filter for OSIS SourceType modules, for now, if you'd like to
have a test.

A legacy issue which we'd like to eventually get rid of is that we often
use the same filter to do double duty, both as the strip filter (before
searching) and as the render filter for plain text.  This means that, as
a side effect, soft hyphens will no longer be present if you ask
diatheke for plain text output.

Let me know if you have a chance to test,

Troy


On 11/02/2017 03:28 AM, David Haslam wrote:
> I am recommending the complete removal of soft hyphens because their use is a
> typographical kludge not semantic construction.
>
> See https://crosswire.org/wiki/Converting_SFM_Bibles_to_OSIS#Soft_hyphens
>
> Being a kludge, there could never be any possibility that any particular
> word would always have the soft hyphen. 
>
> They result because the USFM files were derived retrospectively from files
> exported from Quark XPress.
>
> It's been a useful discussion, prompted by my assistance to Fr Cyrille with
> his LinVB repository in GitLab.
>
> Best regards,
>
> David
>
>
>
> --
> Sent from: http://sword-dev.350566.n4.nabble.com/
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page




