[sword-devel] Searching for hyphenated words?

DM Smith dmsmith at crosswire.org
Sat Mar 2 14:38:41 MST 2013


A couple of corrections to the list. David pointed out some problems:
Hazar–maveth should not be on the list. It appears without the hyphen in the text.
These should be on the list:
Abi–ezrites
Beth–elite
Beth–lehemite
Rab–mag

The following are programmatic "typos":
KirhereKir–heres (should be Kir–heres)
RabsariRab–saris (should be Rab–saris)


Also, if you are looking for Gibeah–haaraloth, you'll find it in a note in Joshua 5:3.


Thanks,
	DM

On Mar 2, 2013, at 11:42 AM, DM Smith <dmsmith at crosswire.org> wrote:

> I see two different questions being posed:
> a) The correctness of using an ndash within a word.
> b) The ability to search for words containing ndash or any kind of dash, including a simple hyphen.
> 
> I'll start with my conclusion: Changing the ndash to a simple hyphen does not really address the questions.
> 
> Regarding correctness:
> The usage of ndash in the KJV is within names only. At the bottom, I've included a list of the names having an ndash. In the 2003 version of the 1769 KJV, these words were not hyphenated. They were hyphenated with an ndash in the 2006 cleanup. As an interesting aside, I looked at some of the non-name words that are hyphenated in the 1769 KJV and compared them to a photocopy of the 1611. These are words such as God-ward, us-ward, thee-ward, joint-heirs, .... My search was not exhaustive, but the 1611 didn't have hyphens; it either concatenated the words, as with the -ward suffixes, or separated them with a space, as in joint heirs. The other thing I noticed was that in each case where the KJV (either 1769 or 1611) had a hyphenated name, it was a Hebrew transliteration of some sort and had an attached note to at least one of the instances.
> 
> One question is whether they should be taken as a whole or as parts. So, is Beth–el equivalent to Beth el or to Bethel? Another question: does a dash (hyphen, ndash, mdash, ...) have the same meaning today as it did hundreds of years ago? The same question applies across languages: do different languages use a dash with different semantics than modern English?
> 
> Regarding search:
> This regards several issues:
> How does Lucene handle these different characters?
> What does an end user want/expect?
> Can we leverage that to meet user expectation?
> 
> Lucene's handling:
> Lucene uses an Analyzer to split text into words on punctuation for indexing and for search. JSword uses SimpleAnalyzer because it makes no further assumptions about the text. The SWORD library uses StandardAnalyzer, which does. I think the StandardAnalyzer has special rules for hyphens. In Lucene 3.6 the StandardAnalyzer behavior changes to use UAX 29 rules for splitting the text. This is a huge step forward. I don't know whether it handles '-' differently than other punctuation. (JSword switched from the StandardAnalyzer to the SimpleAnalyzer very early on because of the extra assumptions that StandardAnalyzer makes about what the user wants to index and not index, and because it was significantly slower.)
> 
> With the SimpleAnalyzer, a dash (hyphen, ndash, mdash) splits a word into parts that are then searched as a phrase. As such, Beth–el, Beth-el and "Beth el" are equivalent. (This is with Lucene 3.0.3; earlier versions may differ.) Note that it really doesn't matter that it's a dash; any punctuation will do. I don't think this is the case with the StandardAnalyzer.
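
To make the equivalence concrete, here is a rough Python imitation of that splitting. It is not Lucene code; it only assumes that SimpleAnalyzer amounts to breaking on every non-letter character and lowercasing the pieces. The function name simple_tokens is made up for this illustration.

```python
def simple_tokens(text):
    """Split text on every non-letter character and lowercase the pieces.

    This only imitates the letter-based splitting described above;
    it is not Lucene's implementation.
    """
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            # Any non-letter (hyphen, ndash, space, ...) ends the token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# "\u2013" is the ndash used in the KJV module.
for form in ("Beth\u2013el", "Beth-el", "Beth el", "Bethel"):
    print(repr(form), simple_tokens(form))
```

The first three forms produce the same two tokens, while the unhyphenated Bethel stays a single token, which is why it does not match the hyphenated spelling.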
> 
> One of the impacts of having hyphenated words is that searching for Bethlehem won't find Beth–lehem. (The NT and OT differ on the spelling in the KJV.) It doesn't matter what kind of dash is used. The user cannot omit the hyphen to concatenate the words.
> 
> Another impact of hyphenated words is that it is much harder to do a wildcard search. It doesn't matter what kind of dash is used. If the search request has a dash, a * cannot be used.
> 
> So Lucene can do the right thing wrt the ndash and hyphen. They are identical wrt indexing and searching. The user does not have to know the form that is used in the file and match that.
> 
> The other feature that Lucene offers out of the box is fuzzy searching. It will find close approximations to the word that you are requesting. All that needs to be done is to append a ~ to the end of the word. For example, Abimelek~ finds Abimael, Abimelech, Abiezer and Ahimelech. This is not a Soundex search, so the results are often surprising. Bethelham~ finds Meshullam and Bethlehem~ finds betrothed but not Bethlehem.
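
Lucene's fuzzy matching is based on edit distance rather than on how a word sounds, which goes some way to explaining the surprises. The Python sketch below computes a plain Levenshtein distance for the Abimelek example. It does not reproduce the similarity threshold or scoring that Lucene applies on top of the distance, and the function name edit_distance is only illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


query = "abimelek"
for candidate in ("abimael", "abimelech", "abiezer", "ahimelech"):
    print(candidate, edit_distance(query, candidate))
```

A name that is only a few edits away is a candidate match whether or not it sounds similar.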
> 
> Some front-ends don't use Lucene for indexing. Some use an older version. So the behavior can differ.
> Also, SWORD doesn't require indexing for "slow" search. I don't know whether the SWORD "slow" search treats the various dashes the same or differently. (I think this is the Multi-word search mentioned by David.)
> 
> User expectation:
> The hyphenation of these names is not common in other translations. I think that most users would expect Bethel and not Beth–el or Beth-el. Together this makes searching multiple Bibles at the same time very difficult.
> 
> I think that a user can reasonably be expected not to know the proper spelling of more than a few of them, let alone that they are hyphenated.
> 
> Leveraging:
> I think that if StandardAnalyzer does not give expected behavior then SimpleAnalyzer should be used.
> 
> I think that hyphenated words should also be indexed as unhyphenated.
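
As a sketch of what indexing both forms could look like, the Python below emits the separate parts of a dashed name plus the concatenated form. It is not tied to Lucene's API; the name index_terms and the particular set of dash characters are assumptions made for the illustration.

```python
import re

# Hyphen-minus, Unicode hyphens, figure dash, ndash, mdash and minus sign.
# The hyphen comes first in the class, so it is taken literally.
DASH_CLASS = "[-\u2010\u2011\u2012\u2013\u2014\u2212]"


def index_terms(word):
    """Return the terms to index for one word: its dash-separated parts
    and, when there is more than one part, the parts joined together."""
    parts = [p.lower() for p in re.split(DASH_CLASS, word) if p]
    if len(parts) > 1:
        return parts + ["".join(parts)]
    return parts


print(index_terms("Beth\u2013lehem"))
print(index_terms("Bethlehem"))
```

With the joined form in the index, a search for Bethlehem would also reach the verses that spell it Beth–lehem.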
> 
> Adding a simple filter to change different forms of dashes into a single form for both search and index is a good solution, but it would break backward compatibility with existing indexes. Changing from StandardAnalyzer to SimpleAnalyzer would be as much of a pain, and it is a better solution (at least until 3.6, which I have not evaluated to see if it changes the behavior sufficiently).
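
The filter idea can be sketched in a few lines of Python, assuming the same function is applied to the text at index time and to the user's query at search time. The name normalize_dashes and the list of dash characters are illustrative only and are not taken from SWORD or JSword.

```python
# Map the common Unicode dash characters to a plain hyphen-minus.
DASH_MAP = {ord(c): "-" for c in "\u2010\u2011\u2012\u2013\u2014\u2212"}


def normalize_dashes(text):
    """Replace every dash variant in text with '-'."""
    return text.translate(DASH_MAP)


indexed = normalize_dashes("Maher\u2013shalal\u2013hash\u2013baz")  # module text (ndash)
query = normalize_dashes("Maher-shalal-hash-baz")                   # typed with hyphens
print(indexed == query)
```

Because the stored form changes, an index built before the filter was added would have to be rebuilt, which is the backward-compatibility cost mentioned above.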
> 
> Conclusion: Changing the ndash to a simple hyphen does not really address the problems.
> 
> In Him,
> 	DM
> 
> Abed–nego
> Abel–beth–maachah
> Abel–maim
> Abel–meholah
> Abel–mizraim
> Abel–shittim
> Abi–albon
> Abi–ezer
> Abi–ezrite
> Adoni–bezek
> Adoni–zedek
> Allon–bachuth
> Almon–diblathaim
> Ashdoth–pisgah
> Ataroth–adar
> Ataroth–addar
> Aznoth–tabor
> Baalath–beer
> Baal–berith
> Baal–gad
> Baal–hamon
> Baal–hanan
> Baal–hazor
> Baal–hermon
> Baal–meon
> Baal–peor
> Baal–perazim
> Baal–shalisha
> Baal–tamar
> Baal–zebub
> Baal–zephon
> Bamoth–baal
> Bashan–havoth–jair
> Bath–rabbim
> Bath–sheba
> Bath–shua
> Beer–elim
> Beer–lahai–roi
> Beer–sheba
> Beesh–terah
> Ben–ammi
> Bene–berak
> Bene–jaakan
> Ben–hadad
> Ben–hail
> Ben–hanan
> Ben–oni
> Ben–zoheth
> Berodach–baladan
> Beth–anath
> Beth–anoth
> Beth–arabah
> Beth–aram
> Beth–arbel
> Beth–aven
> Beth–azmaveth
> Beth–baal–meon
> Beth–barah
> Beth–birei
> Beth–car
> Beth–dagon
> Beth–diblathaim
> Beth–el
> Beth–emek
> Beth–ezel
> Beth–gader
> Beth–gamul
> Beth–haccerem
> Beth–haran
> Beth–hoglah
> Beth–hogla
> Beth–horon
> Beth–jeshimoth
> Beth–jesimoth
> Beth–lebaoth
> Beth–lehem–judah
> Beth–lehem
> Beth–maachah
> Beth–marcaboth
> Beth–meon
> Beth–nimrah
> Beth–palet
> Beth–pazzez
> Beth–peor
> Beth–phelet
> Beth–rapha
> Beth–rehob
> Beth–shan
> Beth–shean
> Beth–shemesh
> Beth–shemite
> Beth–shittah
> Beth–tappuah
> Beth–zur
> Caleb–ephratah
> Chephar–haammonai
> Chisloth–tabor
> Chor–ashan
> Chushan–rishathaim
> Col–hozeh
> Dan–jaan
> Dibon–gad
> Ebed–melech
> Eben–ezer
> El–beth–el
> El–elohe–Israel
> El–elohe–Israel
> Elon–beth–hanan
> El–paran
> En–eglaim
> En–gannim
> En–gedi
> En–haddah
> En–hakkore
> En–hazor
> En–mishpat
> En–rimmon
> En–rogel
> En–shemesh
> En–tappuah
> Ephes–dammim
> Esar–haddon
> Esh–baal
> Evil–merodach
> Ezion–gaber
> Ezion–geber
> Gath–hepher
> Gath–rimmon
> Gibeah–haaraloth
> Gittah–hepher
> Gur–baal
> Hamath–zobah
> Hammoth–dor
> Hamon–gog
> Havoth–jair
> Hazar–addar
> Hazar–enan
> Hazar–gaddah
> Hazar–hatticon
> Hazar–maveth
> Hazar–shual
> Hazar–susah
> Hazar–susim
> Hazazon–tamar
> Hazezon–tamar
> Helkath–hazzurim
> Hephzi–bah
> Hor–hagidgad
> I–chabod
> Ije–abarim
> Ir–nahash
> Ir–shemesh
> Ishbi–benob
> Ish–bosheth
> Ish–tob
> Ittah–kazin
> Jaare–oregim
> Jabesh–gilead
> Jashubi–lehem
> Jegar–sahadutha
> Jehovah–jireh
> Jehovah–nissi
> Jehovah–shalom
> Jiphthah–el
> Jushab–hesed
> Kadesh–barnea
> Kedesh–naphtali
> Keren–happuch
> Kibroth–hattaavah
> Kir–haraseth
> Kir–hareseth
> Kir–haresh
> KirhereKir–heres
> Kirjath–arba
> Kirjath–arim
> Kirjath–baal
> Kirjath–huzoth
> Kirjath–jearim
> Kirjath–sannah
> Kirjath–sepher
> Lahai–roi
> Lo–ammi
> Lo–debar
> Lo–ruhamah
> Maaleh–acrabbim
> Magor–missabib
> Mahaneh–dan
> Maher–shalal–hash–baz
> Malchi–shua
> Me–jarkon
> Melchi–shua
> Meribah–Kadesh
> Merib–baal
> Merodach–baladan
> Metheg–ammah
> Migdal–el
> Migdal–gad
> Misrephoth–maim
> Moresheth–gath
> Nathan–melech
> Nebuzar–adan
> Nergal–sharezer
> Obed–edom
> Padan–aram
> Pahath–moab
> Pas–dammim
> Perez–uzzah
> Perez–uzza
> Pharaoh–hophra
> Pharaoh–nechoh
> Pharaoh–necho
> Pi–beseth
> Pi–hahiroth
> Poti–pherah
> RabsariRab–saris
> Rab–shakeh
> Ramathaim–zophim
> Ramath–lehi
> Ramath–mizpeh
> Ramoth–gilead
> Regem–melech
> Remmon–methoar
> Rimmon–parez
> Romamti–ezer
> Ru–hamah
> Samgar–nebo
> Sela–hammahlekoth
> Shear–jashub
> Shethar–boznai
> Shihor–libnath
> Shimron–meron
> Succoth–benoth
> Syria–damascus
> Syria–maachah
> Taanath–shiloh
> Tahtim–hodshi
> Tel–abib
> Tel–haresha
> Tel–harsa
> Tel–melah
> Tiglath–pileser
> Tilgath–pilneser
> Timnath–heres
> Timnath–serah
> Tob–adonijah
> Tubal–cain
> Uzzen–sherah
> Zareth–shahar
> Zaphnath–paaneah
> 
> 
> On Mar 2, 2013, at 6:01 AM, Chris Burrell <chris at burrell.me.uk> wrote:
> 
>> Can't this be done with a simple filter, i.e. always change the '-' to one kind regardless of the length, and do the same when the user input comes in?
>> Chris
>> 
>> 
>> On 2 March 2013 02:36, Nic Carter <niccarter at mac.com> wrote:
>> 
>> Do you have a proposed solution to this, David?
>> 
>> I know that on my iPhone it is very simple to use a proper ndash & so I will always use the correct type of dash according to what I am writing. (same on a Mac!)
>> However, the more significant issue is simply that people don't know there is a difference (or why they are different lengths, etc)...  ;)
>> 
>> On 25/02/2013, at 2:48 AM, David Haslam <dfhmch at googlemail.com> wrote:
>> 
>> > In the KJV module, if you want to search for [say] the hyphenated name
>> > "Maher–shalal–hash–baz", you first have to be aware that this module uses
>> > the ndash in place of the hyphen.
>> >
>> > btw.  It's not so easy to enter the ndash from a keyboard, and probably even
>> > harder in an Android tablet or mobile.
>> >
>> > If you use ordinary hyphen/minus for the search key hyphen for this module,
>> > you don't find anything with "Exact phrase".
>> > If you use "Multi-word", you do find "Maher" highlighted in the found verse.
>> > (e.g. using Xiphos).
>> >
>> > For modules in general, however, the user cannot usually know in advance
>> > whether hyphenated words use the ndash, the hyphen or something else.
>> >
>> > Has anyone else looked into this aspect of the search feature?
>> >
>> > David
>> >
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://sword-dev.350566.n4.nabble.com/Searching-for-hyphenated-words-tp4652016.html
>> > Sent from the SWORD Dev mailing list archive at Nabble.com.
>> >
>> > _______________________________________________
>> > sword-devel mailing list: sword-devel at crosswire.org
>> > http://www.crosswire.org/mailman/listinfo/sword-devel
>> > Instructions to unsubscribe/change your settings at above page
>> 
>> 
