[jsword-devel] Lucene

Tue Dec 19 06:07:14 MST 2006

While the design is recursive, it is probably not going to recurse  
except for Raw GenBooks.

In JSword the interface for a Key allows for any Key to have  
children. This would be akin to a book having chapters and chapters  
having verses. However, in the case of a Bible the Key is a flat list.
The storage required by a Key for the whole Bible depends on
what kind of optimization is used for the Key. It might be a:

BitwisePassage with one bit for each verse in the Key. BitwisePassage  
has a constant space requirement.
RangedPassage with very little storage overhead. Each range is stored  
separately. It is slower to iterate over than any of the other  
implementations.
DistinctPassage uses way too much storage, with one Key object per  
verse.
PassageTally keeps a weight for each of the keys it stores. It is
used to prioritize search results.
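
To make the storage trade-off concrete, here is a rough sketch of the
bitwise and ranged ideas. It is not the JSword code: the class name
PassageStorageSketch, both method names, and the verse count of 31102
(the KJV count, assumed here) are for illustration only.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch; none of these names come from JSword.
class PassageStorageSketch {
    // Assumed verse count (KJV versification); an illustration value only.
    static final int VERSES = 31102;

    /** Bitwise style: one bit per verse ordinal, so space is fixed by VERSES. */
    static BitSet bitwise(int[] ordinals) {
        BitSet bits = new BitSet(VERSES + 1);
        for (int ordinal : ordinals) {
            bits.set(ordinal);
        }
        return bits;
    }

    /** Ranged style: collapse sorted ordinals into [start, end] pairs. */
    static List<int[]> ranged(int[] sortedOrdinals) {
        List<int[]> ranges = new ArrayList<>();
        for (int ordinal : sortedOrdinals) {
            int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && ordinal <= last[1] + 1) {
                // Adjacent to (or inside) the previous range: extend it.
                last[1] = Math.max(last[1], ordinal);
            } else {
                // Gap found: start a new range.
                ranges.add(new int[] {ordinal, ordinal});
            }
        }
        return ranges;
    }
}
```

With this sketch a Key for the whole Bible is a single [1, VERSES] pair
in the ranged form, while the bitwise form costs the same few KB no
matter how many verses are set.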

I have found that this generation of the search index is expensive.
But I have found ways to make it faster. The first thing is that
Lucene uses lots of temporary files on disk to build the index.
Depending on what hardware I use, I can index an entire Bible in
anywhere from under 2 minutes to 5 minutes. However, on Windows I
found that it took in excess of 40 minutes. This was with an AMD
2400+. I did two things that got it down to a few minutes. First, I
turned off Microsoft's "fast indexing". It turns out MS tried to index
all of these temporary files. It should not have tried to index any.
Second, I was using a "smart" virus scanner that scanned every file
as it was deposited on the disk or perhaps accessed from the disk. Not
sure which. Turning both of these off gave me an index time of about
4 minutes. Subsequently, I replaced the virus scanner and never turned
"fast indexing" back on.

However, I don't expect that we will build an index on a "small"  
device. Rather, I would imagine we would pre-build it and load it.

With regard to the Job class, we should change it to a Job interface  
and make the current class an implementation of it. Then we can  
create a null implementation of the Job that does nothing in contexts  
where there should be no reporting of progress. Or we can create an  
appropriate implementation for the target device.
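
Roughly, I have something like the following in mind. This is only a
sketch: ProgressJob, NullJob, RecordingJob and JobSketchDemo are
made-up names, not the existing JSword classes, and the real interface
would need more than one method.

```java
// Hypothetical sketch; these names are illustrative, not JSword's.
interface ProgressJob {
    void setProgress(int percent, String statusDesc);
}

/** Does nothing; for contexts where no progress should be reported. */
final class NullJob implements ProgressJob {
    @Override
    public void setProgress(int percent, String statusDesc) {
        // Intentionally empty.
    }
}

/** Remembers the latest progress; a stand-in for a device-specific job. */
final class RecordingJob implements ProgressJob {
    private int lastPercent = -1;
    private String lastStatus = "";

    @Override
    public void setProgress(int percent, String statusDesc) {
        lastPercent = percent;
        lastStatus = statusDesc;
    }

    int getLastPercent() {
        return lastPercent;
    }

    String getLastStatus() {
        return lastStatus;
    }
}

class JobSketchDemo {
    /** Pretends to index each key and reports progress through the job. */
    static int indexAll(ProgressJob job, String[] keys) {
        int indexed = 0;
        for (int i = 0; i < keys.length; i++) {
            indexed++; // the real indexing work would go here
            job.setProgress(95 * (i + 1) / keys.length, keys[i]);
        }
        return indexed;
    }

    public static void main(String[] args) {
        String[] keys = {"Gen.1.1", "Gen.1.2"};
        indexAll(new NullJob(), keys); // silent
        RecordingJob job = new RecordingJob();
        indexAll(job, keys);
        System.out.println(job.getLastPercent() + "% " + job.getLastStatus());
    }
}
```

The indexing method then takes the interface and never needs to know
whether anyone is listening, so there is no need for a mirror method.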

Please note that none of the code following the comment
"// report progress" is needed if we don't report progress.

Also, there are several opportunities for optimization here (e.g. the
number of verses in a Bible does not change as progress is made, so it
need not be recomputed for every verse). Also, the implementation
should be generalized a bit more. As written, it does not allow for
indexing commentaries or dictionaries. It should allow for indexing
all books.
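
As a sketch of both points, progress could be computed from a running
count against a total that is looked up once, which does not depend on
verse ordinals and so would work for any kind of book. ProgressSketch
and its methods are hypothetical names, and a plain List of Strings
stands in for the keys of a book.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not taken from the JSword source.
class ProgressSketch {
    /** Percent complete, scaled to 95 as in the existing code. */
    static int percent(int done, int total) {
        return total == 0 ? 95 : 95 * done / total;
    }

    /** Returns the progress value that would be reported after each key. */
    static int[] progressFor(List<String> keys) {
        final int total = keys.size(); // looked up once, outside the loop
        int[] reported = new int[total];
        int done = 0;
        for (String key : keys) {
            // Indexing of 'key' would happen here.
            reported[done] = percent(done + 1, total);
            done++;
        }
        return reported;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d");
        System.out.println(Arrays.toString(progressFor(keys)));
    }
}
```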

I'll see if I can make those changes.

In Him,
	DM

On Dec 19, 2006, at 12:38 AM, Zhaojun Li wrote:

> Here are the two methods, one original, one mirror.
>
>  /**
>      * Dig down into a Key indexing as we go.
>      */
>     private void newgenerateSearchIndexImpl(List errors, IndexWriter writer, Key key) throws BookException, IOException
>     {
>         int bookNum = 0;
>         int oldBookNum = -1;
>         int percent = 0;
>         String name = ""; //$NON-NLS-1$
>         String text = ""; //$NON-NLS-1$
>         BookData data = null;
>         Key subkey = null;
>         Verse verse = null;
>         Document doc = null;
>         for (Iterator it = key.iterator(); it.hasNext(); )
>         {
>             subkey = (Key) it.next();
>             if ( subkey.canHaveChildren())
>             {
>                 newgenerateSearchIndexImpl( errors, writer, subkey);
>             }
>             else
>             {
>                 data = null;
>                 try
>                 {
>                     data = book.getData(subkey);
>                 }
>                 catch (BookException e)
>                 {
>                     errors.add(subkey);
>                     continue;
>                 }
>
>                 text = data.getVerseText();
>
>                 // Do the actual indexing
>                 if (text != null && text.length() > 0)
>                 {
>                     doc = new Document();
>                     doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
>                     doc.add(new Field(FIELD_BODY, new StringReader(text)));
>                     writer.addDocument(doc);
>                 }
>
>                 // report progress
>                 verse = KeyUtil.getVerse(subkey);
>
>                 try
>                 {
>                     percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
>                     bookNum = verse.getBook();
>                     if (oldBookNum != bookNum)
>                     {
>                         name = BibleInfo.getBookName(bookNum);
>                         oldBookNum = bookNum;
>                     }
>                 }
>                 catch (NoSuchVerseException ex)
>                 {
>                     log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
>                     assert false;
>                     name = subkey.getName();
>                 }
>
>
>             }
>         }
>     }
>
>     /**
>      * Dig down into a Key indexing as we go.
>      */
>     private void generateSearchIndexImpl(Job job, List errors, IndexWriter writer, Key key) throws BookException, IOException
>     {
>         int bookNum = 0;
>         int oldBookNum = -1;
>         int percent = 0;
>         String name = ""; //$NON-NLS-1$
>         String text = ""; //$NON-NLS-1$
>         BookData data = null;
>         Key subkey = null;
>         Verse verse = null;
>         Document doc = null;
>         for (Iterator it = key.iterator(); it.hasNext(); )
>         {
>             subkey = (Key) it.next();
>             if (subkey.canHaveChildren())
>             {
>                 generateSearchIndexImpl(job, errors, writer, subkey);
>             }
>             else
>             {
>                 data = null;
>                 try
>                 {
>                     data = book.getData(subkey);
>                 }
>                 catch (BookException e)
>                 {
>                     errors.add(subkey);
>                     continue;
>                 }
>
>                 text = data.getVerseText();
>
>                 // Do the actual indexing
>                 if (text != null && text.length() > 0)
>                 {
>                     doc = new Document();
>                     doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
>                     doc.add(new Field(FIELD_BODY, new StringReader(text)));
>                     writer.addDocument(doc);
>                 }
>
>                 // report progress
>                 verse = KeyUtil.getVerse(subkey);
>
>                 try
>                 {
>                     percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
>                     bookNum = verse.getBook();
>                     if (oldBookNum != bookNum)
>                     {
>                         name = BibleInfo.getBookName(bookNum);
>                         oldBookNum = bookNum;
>                     }
>                 }
>                 catch (NoSuchVerseException ex)
>                 {
>                     log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
>                     assert false;
>                     name = subkey.getName();
>                 }
>
>                 job.setProgress(percent, Msg.INDEXING.toString(name));
>
>                 // This could take a long time ...
>                 Thread.yield();
>                 if (Thread.currentThread().isInterrupted())
>                 {
>                     break;
>                 }
>             }
>         }
>     }
>
>
> On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
> Hi, Dear all,
>
> I am new to Lucene, so please help.
>
> I need to remove the Job class from the current Lucene
> implementation. What I did was create a mirror method of
> generateSearchIndexImpl by removing every Job class reference. I
> tested it and it works.
>
> However, the speed is not good. In the design, it is a recursive
> call. How can I make this multithreaded? I mean with the usual Thread
> class, not the JSword Job API.
>
> Thanks!
>
> Zhaojun
>
>
>
>
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel
