[jsword-devel] Lucene

Tue Dec 19 06:39:09 MST 2006

Thanks, DM!
Can we do all of the indexing in memory? That way we can avoid generating
temp files. Most PCs have enough memory now.

On 12/19/06, DM Smith <dmsmith555 at yahoo.com> wrote:
>
> While the design is recursive, it is probably not going to recurse except
> for Raw GenBooks.
> In JSword the interface for a Key allows for any Key to have children.
> This would be akin to a book having chapters and chapters having verses.
> However in the case of a Bible the key is a flat list. With regard to the
> storage requirements of a Key to the whole bible, the amount of storage it
> takes is dependent upon what kind of optimization is used for the Key. It
> might be a:
>   * BitwisePassage, with one bit for each verse in the Key. BitwisePassage
>     has a constant space requirement.
>   * RangedPassage, with very little storage overhead. Each range is stored
>     separately. It is slower to iterate over than any of the other
>     implementations.
>   * DistinctPassage, which uses far too much storage, with one Key object
>     per verse.
>   * PassageTally, which keeps a weight for each of the keys it stores. It is
>     used to prioritize search results.
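
The space trade-off between the bitwise and ranged representations described above can be illustrated with a minimal sketch that uses only java.util. It does not use the JSword classes: BitwiseSketch and RangedSketch are hypothetical names, verses are reduced to plain integer ordinals, and the total verse count is an assumption supplied by the caller.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for a bitwise passage: one bit per verse ordinal. */
class BitwiseSketch {
    private final BitSet bits;

    BitwiseSketch(int totalVerses) {
        // Room for totalVerses bits is allocated up front, so adding verses
        // within that range does not grow the structure.
        this.bits = new BitSet(totalVerses);
    }

    void add(int ordinal) { bits.set(ordinal); }

    boolean contains(int ordinal) { return bits.get(ordinal); }

    /** Number of verses currently in the passage. */
    int size() { return bits.cardinality(); }
}

/** Hypothetical stand-in for a ranged passage: a list of [start, end] pairs. */
class RangedSketch {
    private final List<int[]> ranges = new ArrayList<int[]>();

    /** Adds an inclusive range; this sketch assumes ranges do not overlap. */
    void addRange(int start, int end) { ranges.add(new int[] { start, end }); }

    /** Membership needs a scan over the ranges, which is why iteration is slower. */
    boolean contains(int ordinal) {
        for (int[] r : ranges) {
            if (ordinal >= r[0] && ordinal <= r[1]) {
                return true;
            }
        }
        return false;
    }

    /** Number of verses covered by all ranges. */
    int size() {
        int n = 0;
        for (int[] r : ranges) {
            n += r[1] - r[0] + 1;
        }
        return n;
    }

    /** Storage grows with the number of ranges, not the number of verses. */
    int rangeCount() { return ranges.size(); }
}
```

A key covering a whole Bible would be a single range in the ranged form, while the bitwise form costs the same whether it holds one verse or all of them.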
>
> I have found that generating the search index is expensive, but I have
> found ways to make it faster. The first thing to know is that Lucene uses
> lots of temporary files on disk to build the index. Depending on the
> hardware, I can index an entire Bible in anywhere from under 2 minutes to
> 5 minutes. On Windows, however, it took in excess of 40 minutes, and that
> was on an AMD 2400+. I did two things that got it down to a few minutes.
> First, I turned off Microsoft's "fast indexing". It turns out Windows was
> trying to index all of these temporary files, when it should not have
> indexed any of them. Second, I was using a "smart" virus scanner that
> scanned every file as it was written to the disk, or perhaps as it was
> read from the disk; I am not sure which. Turning both of these off gave me
> an indexing time of about 4 minutes. I have since replaced the virus
> scanner and never turned "fast indexing" back on.
>
> However, I don't expect that we will build an index on a "small" device.
> Rather, I would imagine we would pre-build it and load it.
>
> With regard to the Job class, we should change it to a Job interface and
> make the current class an implementation of it. Then we can create a null
> implementation of the Job that does nothing in contexts where there should
> be no reporting of progress. Or we can create an appropriate implementation
> for the target device.
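
The interface-plus-null-implementation idea above might look roughly like the following sketch. The names ProgressJob, NullJob, RecordingJob and SketchIndexer are hypothetical and are not the actual JSword Job API; the "indexing" is reduced to counting non-empty strings so that the example stays self-contained.

```java
/** Hypothetical progress-reporting interface; not the real JSword Job API. */
interface ProgressJob {
    void setProgress(int percent, String statusDescription);

    void done();
}

/** Null implementation: accepts progress updates and discards them. */
class NullJob implements ProgressJob {
    public void setProgress(int percent, String statusDescription) {
        // intentionally empty: no progress reporting in this context
    }

    public void done() {
        // intentionally empty
    }
}

/** Simple implementation that remembers the most recent update. */
class RecordingJob implements ProgressJob {
    private int lastPercent = -1;
    private String lastStatus = "";
    private boolean finished;

    public void setProgress(int percent, String statusDescription) {
        lastPercent = percent;
        lastStatus = statusDescription;
    }

    public void done() { finished = true; }

    int getLastPercent() { return lastPercent; }

    String getLastStatus() { return lastStatus; }

    boolean isFinished() { return finished; }
}

/** The indexing loop depends only on the interface, never on a concrete job. */
class SketchIndexer {
    /** Returns the number of non-empty entries, reporting progress as it goes. */
    static int indexAll(String[] verses, ProgressJob job) {
        int indexed = 0;
        for (int i = 0; i < verses.length; i++) {
            if (verses[i] != null && verses[i].length() > 0) {
                indexed++; // stand-in for adding a document to the index
            }
            job.setProgress(95 * (i + 1) / verses.length, "verse " + (i + 1));
        }
        job.done();
        return indexed;
    }
}
```

With this shape, a caller that wants no reporting passes new NullJob(), and a single indexing method serves both cases, so no mirror method is needed.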
>
> Please note that the code under the comment "// report progress" is not
> needed if we don't report progress.
>
> Also, there are several opportunities for optimization here (e.g. the
> number of verses in a Bible does not change as progress is made). The
> implementation should also be generalized a bit more: as written, it does
> not allow for indexing commentaries or dictionaries. It should allow for
> indexing all books.
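
As one possible reading of the optimization mentioned above, the total can be looked up once before the loop, and progress can be reported only when the percentage actually changes. The sketch below is self-contained and does not call BibleInfo; ProgressSketch and reportedPercents are hypothetical names, and the total is simply a parameter standing in for a value fetched once.

```java
import java.util.ArrayList;
import java.util.List;

class ProgressSketch {
    /**
     * Walks ordinals 1..total and returns the percentages that would be
     * reported if an update is sent only when the value changes.
     * The total is passed in, i.e. it is computed once outside the loop.
     */
    static List<Integer> reportedPercents(int total) {
        List<Integer> reported = new ArrayList<Integer>();
        if (total <= 0) {
            return reported;
        }
        int lastPercent = -1;
        for (int ordinal = 1; ordinal <= total; ordinal++) {
            // Same formula as the quoted code: scale progress to 0..95.
            int percent = 95 * ordinal / total;
            if (percent != lastPercent) {
                reported.add(Integer.valueOf(percent)); // stand-in for a progress call
                lastPercent = percent;
            }
        }
        return reported;
    }
}
```

For a book with many thousands of entries this reduces the progress calls to at most 96, one for each value from 0 to 95.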
>
> I'll see if I can make those changes.
>
> In Him,
> DM
>
>
> On Dec 19, 2006, at 12:38 AM, Zhaojun Li wrote:
>
> Here are the two methods, one original, one mirror.
>
>     /**
>      * Dig down into a Key indexing as we go.
>      */
>     private void newgenerateSearchIndexImpl(List errors, IndexWriter writer, Key key)
>             throws BookException, IOException
>     {
>         int bookNum = 0;
>         int oldBookNum = -1;
>         int percent = 0;
>         String name = ""; //$NON-NLS-1$
>         String text = ""; //$NON-NLS-1$
>         BookData data = null;
>         Key subkey = null;
>         Verse verse = null;
>         Document doc = null;
>         for (Iterator it = key.iterator(); it.hasNext(); )
>         {
>             subkey = (Key) it.next();
>             if (subkey.canHaveChildren())
>             {
>                 newgenerateSearchIndexImpl(errors, writer, subkey);
>             }
>             else
>             {
>                 data = null;
>                 try
>                 {
>                     data = book.getData(subkey);
>                 }
>                 catch (BookException e)
>                 {
>                     errors.add(subkey);
>                     continue;
>                 }
>
>                 text = data.getVerseText();
>
>                 // Do the actual indexing
>                 if (text != null && text.length() > 0)
>                 {
>                     doc = new Document();
>                     doc.add(new Field(FIELD_NAME, subkey.getOsisRef(),
>                                       Field.Store.YES, Field.Index.NO));
>                     doc.add(new Field(FIELD_BODY, new StringReader(text)));
>                     writer.addDocument(doc);
>                 }
>
>                 // report progress
>                 verse = KeyUtil.getVerse(subkey);
>
>                 try
>                 {
>                     percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
>                     bookNum = verse.getBook();
>                     if (oldBookNum != bookNum)
>                     {
>                         name = BibleInfo.getBookName(bookNum);
>                         oldBookNum = bookNum;
>                     }
>                 }
>                 catch (NoSuchVerseException ex)
>                 {
>                     log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
>                     assert false;
>                     name = subkey.getName();
>                 }
>
>
>             }
>         }
>     }
>
>     /**
>      * Dig down into a Key indexing as we go.
>      */
>     private void generateSearchIndexImpl(Job job, List errors, IndexWriter writer, Key key)
>             throws BookException, IOException
>     {
>         int bookNum = 0;
>         int oldBookNum = -1;
>         int percent = 0;
>         String name = ""; //$NON-NLS-1$
>         String text = ""; //$NON-NLS-1$
>         BookData data = null;
>         Key subkey = null;
>         Verse verse = null;
>         Document doc = null;
>         for (Iterator it = key.iterator(); it.hasNext(); )
>         {
>             subkey = (Key) it.next();
>             if (subkey.canHaveChildren())
>             {
>                 generateSearchIndexImpl(job, errors, writer, subkey);
>             }
>             else
>             {
>                 data = null;
>                 try
>                 {
>                     data = book.getData(subkey);
>                 }
>                 catch (BookException e)
>                 {
>                     errors.add(subkey);
>                     continue;
>                 }
>
>                 text = data.getVerseText();
>
>                 // Do the actual indexing
>                 if (text != null && text.length() > 0)
>                 {
>                     doc = new Document();
>                     doc.add(new Field(FIELD_NAME, subkey.getOsisRef(),
>                                       Field.Store.YES, Field.Index.NO));
>                     doc.add(new Field(FIELD_BODY, new StringReader(text)));
>                     writer.addDocument(doc);
>                 }
>
>                 // report progress
>                 verse = KeyUtil.getVerse(subkey);
>
>                 try
>                 {
>                     percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
>                     bookNum = verse.getBook();
>                     if (oldBookNum != bookNum)
>                     {
>                         name = BibleInfo.getBookName(bookNum);
>                         oldBookNum = bookNum;
>                     }
>                 }
>                 catch (NoSuchVerseException ex)
>                 {
>                     log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
>                     assert false;
>                     name = subkey.getName();
>                 }
>
>                 job.setProgress(percent, Msg.INDEXING.toString(name));
>
>                 // This could take a long time ...
>                 Thread.yield();
>                 if (Thread.currentThread().isInterrupted())
>                 {
>                     break;
>                 }
>             }
>         }
>     }
>
>
> On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
> >
> > Hi, Dear all,
> >
> > I am new to Lucene, so please help.
> >
> > I need to remove the job class from the current Lucene implementation.
> > What I did is: create mirror method from generateSearchIndexImpl by removing
> > any Job class reference. I tested it and it works.
> >
> > However, the speed is not good.  In the design, it is a recursive
> > call.   How to do multithreading for this? I mean by usual thread class, not
> > JSWORD Job api.
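
One way to approach the threading question above with plain java.util.concurrent, rather than the JSword Job API, is to flatten the keys into a list, fetch the text on a pool of worker threads, and add the results to the index from a single thread. The sketch below is only an illustration under several assumptions: TextSource and ParallelFetchSketch are hypothetical names, the per-key fetch is assumed to be the expensive and thread-safe part (which has not been verified for book.getData), and appending to a list stands in for writer.addDocument.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for "fetch the text of one key"; not a JSword type. */
interface TextSource {
    String getText(String key) throws Exception;
}

class ParallelFetchSketch {
    /**
     * Fetches the text for every key on a fixed pool of worker threads, then
     * collects the non-empty results in key order on the calling thread.
     * Keys whose fetch fails are added to errors and skipped.
     */
    static List<String> fetchAll(List<String> keys, final TextSource source,
                                 int threads, List<String> errors) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> docs = new ArrayList<String>();
        try {
            // Submit one task per key; the tasks run concurrently on the pool.
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String key : keys) {
                pending.add(pool.submit(new Callable<String>() {
                    public String call() throws Exception {
                        return source.getText(key);
                    }
                }));
            }
            // Consume the results in order from this single thread.
            for (int i = 0; i < keys.size(); i++) {
                try {
                    String text = pending.get(i).get();
                    if (text != null && text.length() > 0) {
                        // Stand-in for building a Document and adding it to the index.
                        docs.add(keys.get(i) + ": " + text);
                    }
                } catch (ExecutionException e) {
                    errors.add(keys.get(i));
                } catch (InterruptedException e) {
                    // Preserve the interrupt and stop, like the original loop.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        } finally {
            pool.shutdown();
        }
        return docs;
    }
}
```

Whether this helps at all depends on where the time really goes. If the bottleneck is disk activity from temporary files, extra threads are unlikely to make much difference.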
> >
> > Thanks!
> >
> > Zhaojun
> >
> >
> >
> >
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel