[jsword-devel] Lucene
Zhaojun Li
lzj369 at gmail.com
Tue Dec 19 06:39:09 MST 2006
Thanks, DM!
Can we do all of the indexing in memory? That way we can avoid generating
temp files. Most PCs have enough memory now.
On 12/19/06, DM Smith <dmsmith555 at yahoo.com> wrote:
>
> While the design is recursive, it is probably not going to recurse except
> for Raw GenBooks.
> In JSword the interface for a Key allows for any Key to have children.
> This would be akin to a book having chapters and chapters having verses.
> However in the case of a Bible the key is a flat list. With regard to the
> storage requirements of a Key to the whole bible, the amount of storage it
> takes is dependent upon what kind of optimization is used for the Key. It
> might be a:
> - BitwisePassage, with one bit for each verse in the Key. BitwisePassage
>   has a constant space requirement.
> - RangedPassage, with very little storage overhead. Each range is stored
>   separately. It is slower to iterate over than any of the other
>   implementations.
> - DistinctPassage, which uses way too much storage, with one Key object
>   per verse.
> - PassageTally, which keeps a weight for each of the keys it stores. It is
>   used to prioritize search results.
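
To make the storage trade-off between the bitwise and ranged forms concrete, here is a minimal sketch using only `java.util.BitSet`. It is not JSword's `BitwisePassage` or `RangedPassage`; the class name, method names, and the verse count of 31102 (a common figure for a KJV-style canon) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration only; not JSword's BitwisePassage or RangedPassage. */
final class PassageStorageSketch {
    /** Assumed verse count; JSword itself asks BibleInfo.versesInBible(). */
    static final int VERSES_IN_BIBLE = 31102;

    /** Bitwise form: one bit per verse ordinal, sized up front regardless of content. */
    static BitSet bitwise(int... ordinals) {
        BitSet bits = new BitSet(VERSES_IN_BIBLE);
        for (int ordinal : ordinals) {
            bits.set(ordinal);
        }
        return bits;
    }

    /** Ranged form: collapse runs of consecutive ordinals into inclusive {start, end} pairs. */
    static List<int[]> ranged(BitSet bits) {
        List<int[]> ranges = new ArrayList<>();
        int start = bits.nextSetBit(0);
        while (start >= 0) {
            int end = bits.nextClearBit(start) - 1;
            ranges.add(new int[] {start, end});
            start = bits.nextSetBit(end + 1);
        }
        return ranges;
    }
}
```

With ordinals 1, 2, 3 and 10, the bitwise form still reserves a bit for every verse, while the ranged form stores only two pairs. Iterating the ranged form means expanding each pair again, which is one plausible reason it is the slower one to walk.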
>
> I have found that generating the search index is expensive, but I have
> also found ways to make it faster. The first thing to know is that Lucene
> uses lots of temporary documents on disk to build the index. Depending on
> the hardware, I can index an entire Bible in anywhere from under 2 minutes
> to 5 minutes. On Windows, however, it took in excess of 40 minutes. This
> was with an AMD 2400+. I did two things that got it down to a few minutes.
> First, I turned off Microsoft's "fast index". It turns out MS tried to
> index all of these temporary documents. It should not have tried to index
> any. Second, I was using a "smart" virus scanner that scanned every
> document as it was deposited on the disk, or perhaps as it was accessed
> from the disk. I am not sure which. Turning both of these off gave me an
> index time of about 4 minutes. Subsequently, I replaced the virus scanner
> and never turned "fast indexing" back on.
>
> However, I don't expect that we will build an index on a "small" device.
> Rather, I would imagine we would pre-build it and load it.
>
> With regard to the Job class, we should change it to a Job interface and
> make the current class an implementation of it. Then we can create a null
> implementation of the Job that does nothing in contexts where there should
> be no reporting of progress. Or we can create an appropriate implementation
> for the target device.
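
A minimal sketch of the interface-plus-null-implementation idea follows. The type names (`ProgressSketch`, `NullProgressSketch`, `RecordingProgressSketch`, `IndexerSketch`) are hypothetical and are not JSword's API; only the shape of `setProgress(int, String)` is taken from the call in the quoted code below, and the indexing loop is a stub.

```java
/** Hypothetical progress interface; the name is illustrative, not JSword's. */
interface ProgressSketch {
    void setProgress(int percent, String statusText);
}

/** Null implementation: accepts progress calls and discards them. */
final class NullProgressSketch implements ProgressSketch {
    @Override
    public void setProgress(int percent, String statusText) {
        // Intentionally empty: nothing is reported in this context.
    }
}

/** Stand-in for a device-specific implementation; it just remembers the last report. */
final class RecordingProgressSketch implements ProgressSketch {
    int lastPercent = -1;
    String lastStatus = "";

    @Override
    public void setProgress(int percent, String statusText) {
        lastPercent = percent;
        lastStatus = statusText;
    }
}

final class IndexerSketch {
    /** Stub indexing loop: the caller decides whether progress goes anywhere. */
    static int indexAll(int itemCount, ProgressSketch progress) {
        int indexed = 0;
        for (int i = 0; i < itemCount; i++) {
            indexed++; // real code would add a document to the index here
            progress.setProgress(95 * indexed / itemCount, "item " + indexed);
        }
        return indexed;
    }
}
```

The point of the design is that the indexing method keeps a single code path: callers that want no reporting pass the null implementation instead of maintaining a second, Job-free copy of the method.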
>
> Please note that none of the code following the comment "// report
> progress" is needed if we don't report progress.
>
> Also, there are several opportunities for optimization here (e.g. the
> number of verses in a Bible does not change as progress is made). The
> implementation should also be generalized a bit more. As written, it does
> not allow for indexing commentaries or dictionaries. It should allow for
> indexing all books.
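
One way the two suggestions could fit together is sketched below: the total is computed once before the walk, and progress is based on a count of leaf keys rather than on Bible verse ordinals, so the same loop would serve a commentary or dictionary. `KeyNodeSketch` is a hypothetical stand-in for a JSword `Key`, and counting leaves up front is an assumption of this sketch, not something the thread commits to.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a JSword Key: a name plus optional children. */
final class KeyNodeSketch {
    final String name;
    final List<KeyNodeSketch> children;

    KeyNodeSketch(String name, KeyNodeSketch... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }

    boolean isLeaf() {
        return children.isEmpty();
    }
}

final class GenericProgressSketch {
    /** Count leaf keys once, before indexing starts. */
    static int countLeaves(KeyNodeSketch key) {
        if (key.isLeaf()) {
            return 1;
        }
        int total = 0;
        for (KeyNodeSketch child : key.children) {
            total += countLeaves(child);
        }
        return total;
    }

    /** Visit every leaf, recording the percentage reached after each one. */
    static void walk(KeyNodeSketch key, int total, int[] done, List<Integer> percents) {
        if (key.isLeaf()) {
            done[0]++; // real code would index the leaf's text here
            percents.add(95 * done[0] / total);
            return;
        }
        for (KeyNodeSketch child : key.children) {
            walk(child, total, done, percents);
        }
    }

    /** The total is hoisted out of the loop: computed once, then passed down. */
    static List<Integer> progressFor(KeyNodeSketch root) {
        int total = countLeaves(root);
        List<Integer> percents = new ArrayList<>();
        walk(root, total, new int[] {0}, percents);
        return percents;
    }
}
```

The extra pass over the key tree costs little compared with fetching and indexing the text, and it removes the dependency on `BibleInfo` from the progress calculation.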
>
> I'll see if I can make those changes.
>
> In Him,
> DM
>
>
> On Dec 19, 2006, at 12:38 AM, Zhaojun Li wrote:
>
> Here are the two methods, one original, one mirror.
>
> /**
>  * Dig down into a Key indexing as we go.
>  */
> private void newgenerateSearchIndexImpl(List errors, IndexWriter writer, Key key)
>     throws BookException, IOException
> {
>     int bookNum = 0;
>     int oldBookNum = -1;
>     int percent = 0;
>     String name = ""; //$NON-NLS-1$
>     String text = ""; //$NON-NLS-1$
>     BookData data = null;
>     Key subkey = null;
>     Verse verse = null;
>     Document doc = null;
>     for (Iterator it = key.iterator(); it.hasNext(); )
>     {
>         subkey = (Key) it.next();
>         if (subkey.canHaveChildren())
>         {
>             newgenerateSearchIndexImpl(errors, writer, subkey);
>         }
>         else
>         {
>             data = null;
>             try
>             {
>                 data = book.getData(subkey);
>             }
>             catch (BookException e)
>             {
>                 errors.add(subkey);
>                 continue;
>             }
>
>             text = data.getVerseText();
>
>             // Do the actual indexing
>             if (text != null && text.length() > 0)
>             {
>                 doc = new Document();
>                 doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
>                 doc.add(new Field(FIELD_BODY, new StringReader(text)));
>                 writer.addDocument(doc);
>             }
>
>             // report progress
>             verse = KeyUtil.getVerse(subkey);
>
>             try
>             {
>                 percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
>                 bookNum = verse.getBook();
>                 if (oldBookNum != bookNum)
>                 {
>                     name = BibleInfo.getBookName(bookNum);
>                     oldBookNum = bookNum;
>                 }
>             }
>             catch (NoSuchVerseException ex)
>             {
>                 log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
>                 assert false;
>                 name = subkey.getName();
>             }
>         }
>     }
> }
>
> /**
>  * Dig down into a Key indexing as we go.
>  */
> private void generateSearchIndexImpl(Job job, List errors, IndexWriter writer, Key key)
>     throws BookException, IOException
> {
>     int bookNum = 0;
>     int oldBookNum = -1;
>     int percent = 0;
>     String name = ""; //$NON-NLS-1$
>     String text = ""; //$NON-NLS-1$
>     BookData data = null;
>     Key subkey = null;
>     Verse verse = null;
>     Document doc = null;
>     for (Iterator it = key.iterator(); it.hasNext(); )
>     {
>         subkey = (Key) it.next();
>         if (subkey.canHaveChildren())
>         {
>             generateSearchIndexImpl(job, errors, writer, subkey);
>         }
>         else
>         {
>             data = null;
>             try
>             {
>                 data = book.getData(subkey);
>             }
>             catch (BookException e)
>             {
>                 errors.add(subkey);
>                 continue;
>             }
>
>             text = data.getVerseText();
>
>             // Do the actual indexing
>             if (text != null && text.length() > 0)
>             {
>                 doc = new Document();
>                 doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
>                 doc.add(new Field(FIELD_BODY, new StringReader(text)));
>                 writer.addDocument(doc);
>             }
>
>             // report progress
>             verse = KeyUtil.getVerse(subkey);
>
>             try
>             {
>                 percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
>                 bookNum = verse.getBook();
>                 if (oldBookNum != bookNum)
>                 {
>                     name = BibleInfo.getBookName(bookNum);
>                     oldBookNum = bookNum;
>                 }
>             }
>             catch (NoSuchVerseException ex)
>             {
>                 log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
>                 assert false;
>                 name = subkey.getName();
>             }
>
>             job.setProgress(percent, Msg.INDEXING.toString(name));
>
>             // This could take a long time ...
>             Thread.yield();
>             if (Thread.currentThread().isInterrupted())
>             {
>                 break;
>             }
>         }
>     }
> }
>
>
> On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
> >
> > Hi, Dear all,
> >
> > I am new to Lucene, so please help.
> >
> > I need to remove the Job class from the current Lucene implementation.
> > What I did was create a mirror of generateSearchIndexImpl with every
> > reference to the Job class removed. I tested it and it works.
> >
> > However, the speed is not good. The design uses a recursive call. How
> > can this be made multithreaded? I mean with the usual Thread class, not
> > the JSword Job API.
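
As a rough sketch of one way to approach the threading question with plain `java.util.concurrent` classes: gather the leaf keys first, then hand the per-key work to a fixed-size thread pool. Everything here is hypothetical. `buildDocument` stands in for fetching the text and building a document, and the concurrent queue stands in for an index writer, on the assumption that the real writer tolerates concurrent `addDocument` calls. Whether this helps at all depends on where the time actually goes.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; no Lucene or JSword classes are used. */
final class ParallelIndexSketch {
    /** Stand-in for the per-key work (fetch the text, build a document). */
    static String buildDocument(String keyName) {
        return "doc:" + keyName;
    }

    /** Run the per-key work on a fixed-size pool and collect the results. */
    static Queue<String> indexAll(List<String> leafKeys, int threads) {
        Queue<String> written = new ConcurrentLinkedQueue<>(); // stand-in for a thread-safe writer
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String key : leafKeys) {
            pool.execute(() -> written.add(buildDocument(key)));
        }
        pool.shutdown(); // no new tasks; already queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("indexing was interrupted", e);
        }
        return written;
    }
}
```

The results arrive in no particular order, which is fine for an index but would matter if anything downstream relied on verse order.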
> >
> > Thanks!
> >
> > Zhaojun
> >
> >
> >
> >
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel
>
>
>