[sword-devel] indexed search discrepancy

Matthew Talbert ransom1982 at gmail.com
Sat Aug 29 13:25:30 MST 2009


I'm attaching a patch to fix several issues with indexed search.

Issue 1: large text fields weren't getting indexed due to a low MAX_CONV_SIZE
     Resolution: change MAX_CONV_SIZE to 1024 * 1024, and add a call to
the writer's setMaxFieldLength() to raise its maximum field length

Issue 2: search causes segfault when searching for stop words
     Resolution: set analyzer stop words to NULL for both index
creation and search. Possibly this only needs to be done for search,
and stop words could be left enabled during indexing to keep the index
smaller.

Issue 3: index causes segfault *after indexing* when module location
isn't writable.
     Resolution: check the return value of
FileMgr::createParent(target + "/dummy"); if return value is -1, abort
indexing

In addition, this patch adds fields for footnotes, morphology, and
headers. I *really* would like to see this added to the default
indexing. The reason is that with indexed search it is possible to
combine fields in one search, something that SWORD attribute search
doesn't allow (AFAIK). And indexed search is much faster, of course.
My patch only covers one of the three spots this would apparently need
to be added. I didn't understand why there was so much duplicated
code, nor was I entirely comfortable with the code I had written, so I
didn't expand it to cover all cases. It appears that the code for
adding fields like strongs is the same in 3 different spots. Surely
this could be condensed somehow?
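
One way the duplicated collection loops might be condensed is a single
helper that walks an attribute type and gathers one named value from each
entry. The following is only a sketch: collectAttribute is a hypothetical
name, and the typedefs are std::string stand-ins for SWORD's nested
attribute maps so the example compiles on its own. It covers the simple
"one value per entry" case used for the morphology and footnote fields in
the patch, not any extra handling the existing lemma code may do.

```cpp
#include <map>
#include <string>

// Stand-ins for SWORD's nested attribute maps, modeled with std::string
// instead of SWBuf so the sketch is self-contained (assumption).
typedef std::map<std::string, std::string> AttributeValue;
typedef std::map<std::string, AttributeValue> AttributeList;
typedef std::map<std::string, AttributeList> AttributeTypeList;

// Hypothetical helper: for every entry under attrs[type], append the value
// stored under `name` (if present) to a space-separated buffer.
std::string collectAttribute(const AttributeTypeList &attrs,
                             const std::string &type,
                             const std::string &name) {
	std::string out;
	AttributeTypeList::const_iterator list = attrs.find(type);
	if (list == attrs.end())
		return out;	// no attributes of this type on the entry
	for (AttributeList::const_iterator entry = list->second.begin();
	     entry != list->second.end(); ++entry) {
		AttributeValue::const_iterator val = entry->second.find(name);
		if (val != entry->second.end()) {
			out.append(val->second);
			out.append(1, ' ');
		}
	}
	return out;
}
```

With something like this, each of the three spots could call
collectAttribute(attrs, "Word", "Morph") or
collectAttribute(attrs, "Footnote", "body") instead of repeating the loop.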

I would really like to see the first 3 issues fixed immediately (i.e.,
before the next release). Issue 1 makes most genbook indexed search
pointless, while Issues 2 and 3 have both been reported as bugs
against Xiphos. Of course, we can't prevent the segfault from within
Xiphos in either case. As for the extra fields, that will need some
more work, but I feel it's really important as well. At some point, I am going to
redo the search functionality in Xiphos, and my plan is to implement
indexing myself if these fields aren't in SWORD by then.
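
To illustrate what combining fields in one search could look like, here is
a small sketch that builds a Lucene-style "field:term AND field:term" query
string. buildFieldQuery is a hypothetical helper, not part of SWORD, and it
assumes the terms need no escaping. The field names in the usage note are
the ones this patch indexes.

```cpp
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> FieldTerm;

// Hypothetical helper: join (field, term) pairs into a query of the form
// "field:term AND field:term". Terms are inserted as-is; query syntax
// characters are not escaped (assumption: callers pass plain words).
std::string buildFieldQuery(const std::vector<FieldTerm> &clauses) {
	std::string query;
	for (std::size_t i = 0; i < clauses.size(); ++i) {
		if (i > 0)
			query += " AND ";
		query += clauses[i].first + ":" + clauses[i].second;
	}
	return query;
}
```

For example, pairing ("content", "love") with ("footnote", "charity")
yields the query string content:love AND footnote:charity, which could then
be handed to the indexed search as the search string.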

I have been meaning to address these issues for some time, but hadn't
gotten around to it yet. The bug report we had forced the issue. While
we're at it, I'd like to bring up two more issues.

1. If the module location isn't writable, there isn't a way for the
user to create an index. I would like to see indexes created somewhere
else in this case, e.g., ~/.sword/indexes. I believe BT does something
like this already.
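
A rough sketch of how such a fallback might be chosen is below. The names
chooseIndexDir and isWritableDir are hypothetical, the per-module
subdirectory under ~/.sword/indexes is an assumption, and the writability
test uses POSIX access(), which only checks an existing path rather than
creating directories the way FileMgr::createParent does.

```cpp
#include <string>
#include <unistd.h>	// access(), W_OK (POSIX only)

typedef bool (*WritableCheck)(const std::string &);

// Default check: can the current user write to this existing path?
bool isWritableDir(const std::string &path) {
	return access(path.c_str(), W_OK) == 0;
}

// Hypothetical helper: prefer <modulePath>/lucene, as the current code does;
// otherwise fall back to a per-module directory under the user's home.
std::string chooseIndexDir(const std::string &modulePath,
                           const std::string &moduleName,
                           const std::string &homeDir,
                           WritableCheck writable = isWritableDir) {
	if (writable(modulePath)) {
		std::string target = modulePath;
		if (!target.empty() && target[target.size() - 1] != '/')
			target += '/';
		return target + "lucene";
	}
	// Assumed layout: one subdirectory per module name.
	return homeDir + "/.sword/indexes/" + moduleName;
}
```

The search side would need the same lookup so that it opens the index from
wherever it was actually written.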

2. We currently have no way of notifying the user if the indexes are
no longer valid, or if they should be updated. I would like to see a
versioning scheme for indexes. For example, with the changes here, and
the changes for Hebrew search, all Hebrew indexes previously created
are now useless. How do we tell the user that he needs to re-create
the index? Along the same lines, all genbook indexes, and many
commentary indexes are incorrect. With the next release of SWORD,
hopefully with this issue resolved, it would be nice to be able to
notify the user that the indexes are now out-of-date or incorrect and
need to be rebuilt.
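
One possible shape for such a versioning scheme is a small stamp file
written next to the index and compared against the version the library
expects. Everything named here is an assumption for illustration: the file
name sword-index-version, the constant INDEX_FORMAT_VERSION, and the three
helper functions.

```cpp
#include <fstream>
#include <string>

// Hypothetical format version; bump it whenever the index layout changes
// (new fields, different analyzer settings, and so on).
const int INDEX_FORMAT_VERSION = 2;

// Write the stamp file into the index directory after a successful build.
bool writeIndexVersion(const std::string &indexDir,
                       int version = INDEX_FORMAT_VERSION) {
	std::ofstream out((indexDir + "/sword-index-version").c_str());
	if (!out)
		return false;
	out << version << '\n';
	return out.good();
}

// Return the stored version, or -1 if the stamp is missing or unreadable.
int readIndexVersion(const std::string &indexDir) {
	std::ifstream in((indexDir + "/sword-index-version").c_str());
	int version = -1;
	if (!(in >> version))
		return -1;
	return version;
}

// An index without a stamp, or with a different version, should be rebuilt.
bool indexNeedsRebuild(const std::string &indexDir) {
	return readIndexVersion(indexDir) != INDEX_FORMAT_VERSION;
}
```

A frontend could call indexNeedsRebuild() before searching and prompt the
user to re-create the index when it returns true. Indexes created before
the stamp existed would have no file and so would be flagged as well.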

Finally, I would like to point out a great tool for examining
lucene/clucene indexes. You can get it here:
http://www.getopt.org/luke/

Matthew

PS I'm going to send this without the attachment. I'll send the patch
later, but here it is below:

 #ifdef USELUCENE
 	if (searchType == -4) {	// lucene
 		//Buffers for the wchar<->utf8 char* conversion
-		const unsigned short int MAX_CONV_SIZE = 2047;
+		const unsigned int MAX_CONV_SIZE = 1024 * 1024;
 		wchar_t wcharBuffer[MAX_CONV_SIZE + 1];
 		char utfBuffer[MAX_CONV_SIZE + 1];
 		
@@ -510,10 +510,11 @@
 			ir = IndexReader::open(target);
 			is = new IndexSearcher(ir);
 			(*percent)(10, percentUserData);
-
-			standard::StandardAnalyzer analyzer;
+			
+			const TCHAR* stop_words[] = { NULL };
+			standard::StandardAnalyzer *analyzer = new standard::StandardAnalyzer( (const TCHAR**)stop_words );
 			lucene_utf8towcs(wcharBuffer, istr, MAX_CONV_SIZE); //TODO Is istr always utf8?
-			q = QueryParser::parse(wcharBuffer, _T("content"), &analyzer);
+			q = QueryParser::parse(wcharBuffer, _T("content"), analyzer);
 			(*percent)(20, percentUserData);
 			h = is->search(q);
 			(*percent)(80, percentUserData);
@@ -1026,21 +1027,27 @@
 	IndexWriter *coreWriter = NULL;
 	IndexWriter *fsWriter = NULL;
 	Directory *d = NULL;
-
-	standard::StandardAnalyzer *an = new standard::StandardAnalyzer();
+	const unsigned int MAX_CONV_SIZE = 1024 * 1024;
+	
+	const TCHAR* stop_words[] = { NULL };
+	standard::StandardAnalyzer *an = new standard::StandardAnalyzer( (const TCHAR**)stop_words );
 	SWBuf target = getConfigEntry("AbsoluteDataPath");
 	bool includeKeyInSearch = getConfig().has("SearchOption", "IncludeKeyInSearch");
 	char ch = target.c_str()[strlen(target.c_str())-1];
 	if ((ch != '/') && (ch != '\\'))
 		target.append('/');
 	target.append("lucene");
-	FileMgr::createParent(target+"/dummy");
+	int iswritable = FileMgr::createParent(target+"/dummy");
+	if (iswritable == -1)
+		return -1;

 	ramDir = new RAMDirectory();
 	coreWriter = new IndexWriter(ramDir, an, true);
+	coreWriter->setMaxFieldLength(MAX_CONV_SIZE);



+
 	char perc = 1;
 	VerseKey *vkcheck = 0;
 	vkcheck = SWDYNAMIC_CAST(VerseKey, key);
@@ -1066,8 +1073,11 @@
 	SWBuf proxBuf;
 	SWBuf proxLem;
 	SWBuf strong;
+	SWBuf morph;
+	SWBuf footnote;
+	SWBuf heading;

-	const short int MAX_CONV_SIZE = 2047;
+	
 	wchar_t wcharBuffer[MAX_CONV_SIZE + 1];

 	char err = Error();
@@ -1104,8 +1114,15 @@
 			AttributeTypeList::iterator words;
 			AttributeList::iterator word;
 			AttributeValue::iterator strongVal;
+			AttributeValue::iterator morphVal;
+			AttributeValue::iterator headings;

+			AttributeTypeList::iterator footnotes;
+			AttributeList::iterator footList;
+			AttributeValue::iterator footVal;
+
 			strong="";
+			morph="";
 			words = getEntryAttributes().find("Word");
 			if (words != getEntryAttributes().end()) {
 				for (word = words->second.begin();word != words->second.end(); word++) {
@@ -1124,10 +1141,38 @@
 							strong.append(strongVal->second);
 							strong.append(' ');
 						}
+						tmp = "Morph";
+						morphVal = word->second.find(tmp);
+						if (morphVal != word->second.end()){
+							morph.append(morphVal->second);
+							morph.append(' ');
+						}
 					}
 				}
 			}

+			footnote="";
+			footnotes = getEntryAttributes().find("Footnote");
+			if (footnotes != getEntryAttributes().end()) {
+				for (footList = footnotes->second.begin(); footList != footnotes->second.end(); footList++) {
+					SWBuf tmp = "body";
+					footVal = footList->second.find(tmp);
+					if (footVal != footList->second.end()) {
+						footnote.append(footVal->second);
+						footnote.append(' ');
+					}
+				}
+			}
+
+			heading="";
+			for (headings = getEntryAttributes()["Heading"]["Preverse"].begin();
+			     headings != getEntryAttributes()["Heading"]["Preverse"].end();
+			     headings++) {
+				heading.append(headings->second);
+				heading.append(' ');
+			}
+			
+
 			lucene_utf8towcs(wcharBuffer, keyText, MAX_CONV_SIZE); //keyText must be utf8
 //			doc->add( *(new Field("key", wcharBuffer, Field::STORE_YES | Field::INDEX_TOKENIZED)));
 			doc->add( *Field::Text(_T("key"), wcharBuffer ) );
@@ -1149,6 +1194,21 @@
 //printf("setting fields (%s).\ncontent: %s\nlemma: %s\n", (const char *)*key, content, strong.c_str());
 			}

+			if (morph.length() > 0) {
+				lucene_utf8towcs(wcharBuffer, morph, MAX_CONV_SIZE);
+				doc->add( *Field::UnStored(_T("morph"), wcharBuffer) );
+			}
+
+			if (footnote.length() > 0) {
+				lucene_utf8towcs(wcharBuffer, footnote, MAX_CONV_SIZE);
+				doc->add( *Field::UnStored(_T("footnote"), wcharBuffer) );
+			}
+
+			if (heading.length() > 0) {
+				lucene_utf8towcs(wcharBuffer, heading, MAX_CONV_SIZE);
+				doc->add( *Field::UnStored(_T("heading"), wcharBuffer) );
+			}
+
 //printf("setting fields (%s).\n", (const char *)*key);
 //fflush(stdout);
 		}


