[jsword-devel] Lucene

Zhaojun Li lzj369 at gmail.com
Tue Dec 19 09:58:05 MST 2006


Very good! I tried using a RAMDirectory to build the index and then dumping
it to a file once it is finished in memory. It takes only 20-30 seconds. It
used to take about 40 minutes!

    /**
     * Generate an index to use, telling the job about progress as you go.
     * @throws BookException If we fail to read the index files
     */
    public LuceneIndex(Book book, URL storage, boolean create, String f) throws BookException
    {
        assert create;

        this.book = book;
        File finalPath = null;
        try
        {
            finalPath = NetUtil.getAsFile(storage);
            this.path = finalPath.getCanonicalPath();
        }
        catch (IOException ex)
        {
            throw new BookException(Msg.LUCENE_INIT, ex);
        }

        SimpleAnalyzer analyzer=new SimpleAnalyzer();
        IndexStatus finalStatus = IndexStatus.UNDONE;
        try
        {
            synchronized (CREATING)
            {
                book.setIndexStatus(IndexStatus.CREATING);
                File tempPath = new File(path + '.' + IndexStatus.CREATING.toString());

                // An index is created by opening an IndexWriter with the
                // create argument set to true.

                RAMDirectory ramDir = new RAMDirectory();
                IndexWriter writer = new IndexWriter(ramDir, analyzer, true);
                //addDocs(writer, docsInIndex);

                //IndexWriter writer = new IndexWriter(tempPath.getCanonicalPath(), new SimpleAnalyzer(), true);

                List errors = new ArrayList();
                newgenerateSearchIndexImpl(errors, writer, book.getGlobalKeyList());


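                // Flush and close the in-memory index before merging it to disk.
                writer.optimize();
                writer.close();

                // Dump the finished in-memory index straight to the final path.
                IndexWriter fsWriter = new IndexWriter(path, analyzer, true);
                fsWriter.addIndexes(new Directory[] { ramDir });
                fsWriter.close();

                // Leftover from the on-disk version: nothing is written to
                // tempPath here, so this rename has no effect.
                tempPath.renameTo(finalPath);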


                if (finalPath.exists())
                {
                    finalStatus = IndexStatus.DONE;
                }
                if (errors.size() > 0)
                {
                    StringBuffer buf = new StringBuffer();
                    Iterator iter = errors.iterator();
                    while (iter.hasNext())
                    {
                        buf.append(iter.next());
                        buf.append('\n');
                    }

                }

            }
        }
        catch (IOException ex)
        {

            throw new BookException(Msg.LUCENE_INIT, ex);
        }
        finally
        {
            book.setIndexStatus(finalStatus);

        }
    }

On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
>
> Thanks, DM!
> Can we do the indexing entirely in memory? That way we can avoid generating
> temp files. Most PCs have enough memory now.
>
> On 12/19/06, DM Smith <dmsmith555 at yahoo.com> wrote:
> >
> > While the design is recursive, it is probably not going to recurse
> > except for Raw GenBooks.
> > In JSword the interface for a Key allows for any Key to have children.
> > This would be akin to a book having chapters and chapters having verses.
> > However, in the case of a Bible the key is a flat list. With regard to the
> > storage requirements of a Key to the whole Bible, the amount of storage it
> > takes depends on what kind of optimization is used for the Key. It
> > might be a:
> > - BitwisePassage, with one bit for each verse in the Key. BitwisePassage
> >   has a constant space requirement.
> > - RangedPassage, with very little storage overhead. Each range is stored
> >   separately. It is slower to iterate over than any of the other
> >   implementations.
> > - DistinctPassage, which uses way too much storage, with one Key object
> >   per verse.
> > - PassageTally, which keeps a weight for each of the keys it stores. It is
> >   used to prioritize search results.
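
To illustrate why a bit-per-verse Key has a constant space requirement, here
is a minimal sketch using java.util.BitSet. It is not JSword's BitwisePassage.
The class name VerseBitSet and the verse count of 31,102 (the usual figure for
the KJV) are assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical illustration only; this is not JSword's BitwisePassage.
final class VerseBitSet {
    // Assumption: 31,102 verses, the usual count for the KJV.
    static final int VERSES_IN_BIBLE = 31102;

    // One bit per verse ordinal (1-based). Ordinals are range-checked,
    // so the set never grows beyond this initial size.
    private final BitSet bits = new BitSet(VERSES_IN_BIBLE + 1);

    void add(int ordinal) {
        check(ordinal);
        bits.set(ordinal);
    }

    void addRange(int first, int last) {
        check(first);
        check(last);
        if (first > last) {
            throw new IllegalArgumentException("Empty range: " + first + ".." + last);
        }
        bits.set(first, last + 1); // BitSet ranges are end-exclusive
    }

    boolean contains(int ordinal) {
        check(ordinal);
        return bits.get(ordinal);
    }

    // Number of verses currently in the set.
    int size() {
        return bits.cardinality();
    }

    private static void check(int ordinal) {
        if (ordinal < 1 || ordinal > VERSES_IN_BIBLE) {
            throw new IllegalArgumentException("No such verse ordinal: " + ordinal);
        }
    }
}
```

However many verses are added, the set needs about 31,102 bits (roughly 4 KB),
whereas a structure with one object per verse grows with the size of the Key.
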
> >
> > I have found that generating the search index is expensive, but
> > I have found ways to make it faster. The first thing is that Lucene uses
> > lots of temporary documents on disk to build the index. Depending on what
> > hardware I use, I can index an entire Bible in anywhere from under 2
> > minutes to 5 minutes. However, on Windows I found that it took in excess
> > of 40 minutes. This was with an AMD 2400+. I did two things that got it
> > down to a few minutes. First, I turned off Microsoft's "fast index". It
> > turns out MS tried to index all of these temporary documents. It should
> > not have tried to index any. Second, I was using a "smart" virus scanner
> > that scanned every document as it was deposited on the disk, or perhaps
> > as it was accessed from the disk. I am not sure which.
> > Turning both of these off gave me an index speed of about 4 minutes.
> > Subsequently, I replaced the virus scanner and never turned "fast indexing"
> > back on.
> >
> > However, I don't expect that we will build an index on a "small" device.
> > Rather, I would imagine we would pre-build it and load it.
> >
> > With regard to the Job class, we should change it to a Job interface and
> > make the current class an implementation of it. Then we can create a null
> > implementation of the Job that does nothing in contexts where there should
> > be no reporting of progress. Or we can create an appropriate implementation
> > for the target device.
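
The Job interface idea described above could look something like the
following sketch. The names ProgressJob, NullJob, RecordingJob and
SketchIndexer are hypothetical and are not JSword classes. Only the
setProgress(int, String) call shape is taken from the code in this thread.

```java
// Hypothetical interface; JSword's real Job class has a larger API.
interface ProgressJob {
    void setProgress(int percent, String statusDesc);
}

// Null implementation: accepts progress updates and discards them.
final class NullJob implements ProgressJob {
    public void setProgress(int percent, String statusDesc) {
        // Intentionally empty: no progress reporting in this context.
    }
}

// An implementation that remembers the latest update, standing in for
// whatever reporting a particular target device would need.
final class RecordingJob implements ProgressJob {
    int updates = 0;
    int lastPercent = -1;
    String lastStatus = "";

    public void setProgress(int percent, String statusDesc) {
        updates++;
        lastPercent = percent;
        lastStatus = statusDesc;
    }
}

final class SketchIndexer {
    // The loop only knows about the interface, so callers choose whether
    // progress is reported. Returns the number of non-empty texts seen.
    static int indexAll(String[] texts, ProgressJob job) {
        int indexed = 0;
        for (int i = 0; i < texts.length; i++) {
            if (texts[i] != null && texts[i].length() > 0) {
                indexed++; // a real indexer would add a Lucene Document here
            }
            job.setProgress(95 * (i + 1) / texts.length, "Indexing item " + (i + 1));
        }
        return indexed;
    }
}
```

With this shape a single indexing method can serve both cases: pass a NullJob
where no progress should be shown, or a device-specific implementation where
it should.
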
> >
> > Please note that everything from the comment "// report progress" onward
> > is unneeded if we don't report progress.
> >
> > Also, there are several opportunities for optimization here (e.g. the
> > number of verses in a Bible does not change as progress is made). Also, the
> > implementation should be generalized a bit more. This does not allow for
> > indexing commentaries or dictionaries. It should allow for indexing all
> > books.
> >
> > I'll see if I can make those changes.
> >
> > In Him,
> > DM
> >
> >
> > On Dec 19, 2006, at 12:38 AM, Zhaojun Li wrote:
> >
> > Here are the two methods, one original, one mirror.
> >
> >  /**
> >      * Dig down into a Key indexing as we go.
> >      */
> >     private void newgenerateSearchIndexImpl(List errors, IndexWriter writer, Key key) throws BookException, IOException
> >     {
> >         int bookNum = 0;
> >         int oldBookNum = -1;
> >         int percent = 0;
> >         String name = ""; //$NON-NLS-1$
> >         String text = ""; //$NON-NLS-1$
> >         BookData data = null;
> >         Key subkey = null;
> >         Verse verse = null;
> >         Document doc = null;
> >         for (Iterator it = key.iterator(); it.hasNext(); )
> >         {
> >             subkey = (Key) it.next();
> >             if ( subkey.canHaveChildren())
> >             {
> >                 newgenerateSearchIndexImpl( errors, writer, subkey);
> >             }
> >             else
> >             {
> >                 data = null;
> >                 try
> >                 {
> >                     data = book.getData(subkey);
> >                 }
> >                 catch (BookException e)
> >                 {
> >                     errors.add(subkey);
> >                     continue;
> >                 }
> >
> >                 text = data.getVerseText();
> >
> >                 // Do the actual indexing
> >                 if (text != null && text.length() > 0)
> >                 {
> >                     doc = new Document();
> >                     doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
> >                     doc.add(new Field(FIELD_BODY, new StringReader(text)));
> >                     writer.addDocument(doc);
> >                 }
> >
> >                 // report progress
> >                 verse = KeyUtil.getVerse(subkey);
> >
> >                 try
> >                 {
> >                     percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
> >                     bookNum = verse.getBook();
> >                     if (oldBookNum != bookNum)
> >                     {
> >                         name = BibleInfo.getBookName (bookNum);
> >                         oldBookNum = bookNum;
> >                     }
> >                 }
> >                 catch (NoSuchVerseException ex)
> >                 {
> >                     log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
> >                     assert false;
> >                     name = subkey.getName();
> >                 }
> >
> >
> >             }
> >         }
> >     }
> >
> >     /**
> >      * Dig down into a Key indexing as we go.
> >      */
> >     private void generateSearchIndexImpl(Job job, List errors, IndexWriter writer, Key key) throws BookException, IOException
> >     {
> >         int bookNum = 0;
> >         int oldBookNum = -1;
> >         int percent = 0;
> >         String name = ""; //$NON-NLS-1$
> >         String text = ""; //$NON-NLS-1$
> >         BookData data = null;
> >         Key subkey = null;
> >         Verse verse = null;
> >         Document doc = null;
> >         for (Iterator it = key.iterator(); it.hasNext(); )
> >         {
> >             subkey = (Key) it.next();
> >             if (subkey.canHaveChildren())
> >             {
> >                 generateSearchIndexImpl(job, errors, writer, subkey);
> >             }
> >             else
> >             {
> >                 data = null;
> >                 try
> >                 {
> >                     data = book.getData(subkey);
> >                 }
> >                 catch (BookException e)
> >                 {
> >                     errors.add(subkey);
> >                     continue;
> >                 }
> >
> >                 text = data.getVerseText();
> >
> >                 // Do the actual indexing
> >                 if (text != null && text.length() > 0)
> >                 {
> >                     doc = new Document();
> >                     doc.add(new Field(FIELD_NAME, subkey.getOsisRef(), Field.Store.YES, Field.Index.NO));
> >                     doc.add(new Field(FIELD_BODY, new StringReader(text)));
> >                     writer.addDocument(doc);
> >                 }
> >
> >                 // report progress
> >                 verse = KeyUtil.getVerse(subkey);
> >
> >                 try
> >                 {
> >                     percent = 95 * verse.getOrdinal() / BibleInfo.versesInBible();
> >                     bookNum = verse.getBook();
> >                     if (oldBookNum != bookNum)
> >                     {
> >                         name = BibleInfo.getBookName (bookNum);
> >                         oldBookNum = bookNum;
> >                     }
> >                 }
> >                 catch (NoSuchVerseException ex)
> >                 {
> >                     log.error("Failed to get book name from verse: " + verse, ex); //$NON-NLS-1$
> >                     assert false;
> >                     name = subkey.getName();
> >                 }
> >
> >                 job.setProgress(percent, Msg.INDEXING.toString(name));
> >
> >                 // This could take a long time ...
> >                 Thread.yield();
> >                 if (Thread.currentThread().isInterrupted())
> >                 {
> >                     break;
> >                 }
> >             }
> >         }
> >     }
> >
> >
> >  On 12/19/06, Zhaojun Li <lzj369 at gmail.com> wrote:
> > >
> > > Hi, Dear all,
> > >
> > > I am new to Lucene, so please help.
> > >
> > > I need to remove the Job class from the current Lucene implementation.
> > > What I did was create a mirror of generateSearchIndexImpl with every
> > > Job class reference removed. I tested it and it works.
> > >
> > > However, the speed is not good. In the design, it is a recursive
> > > call. How can I do multithreading for this? I mean with the usual Thread
> > > class, not the JSword Job API.
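
As a rough sketch of the "usual Thread class" approach asked about here, the
work can be wrapped in a Runnable and run on a plain java.lang.Thread, with
the same isInterrupted() check that generateSearchIndexImpl uses. The classes
IndexTask and BackgroundIndexing are hypothetical, and the list of strings
stands in for the verses. No Lucene or JSword types are used, and this only
moves the work off the calling thread; it does not split the recursion across
several threads.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the indexing work.
final class IndexTask implements Runnable {
    private final List<String> texts;
    private final AtomicInteger indexed = new AtomicInteger();

    IndexTask(List<String> texts) {
        this.texts = texts;
    }

    public void run() {
        for (String text : texts) {
            // Cooperative cancellation: stop when the thread is interrupted.
            if (Thread.currentThread().isInterrupted()) {
                return;
            }
            if (text != null && text.length() > 0) {
                indexed.incrementAndGet(); // real code would index the text here
            }
        }
    }

    int getIndexed() {
        return indexed.get();
    }
}

final class BackgroundIndexing {
    // Runs the task on its own thread and waits for it to finish.
    static int runAndWait(List<String> texts) {
        IndexTask task = new IndexTask(texts);
        Thread worker = new Thread(task, "indexer");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            worker.interrupt();                 // ask the worker to stop
            Thread.currentThread().interrupt(); // preserve our own interrupt status
        }
        return task.getIndexed();
    }
}
```

A caller that should not block would keep the Thread reference instead of
joining immediately, and call interrupt() on it to cancel.
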
> > >
> > > Thanks!
> > >
> > > Zhaojun
> > >
> > >
> > >
> > >
> > _______________________________________________
> > jsword-devel mailing list
> > jsword-devel at crosswire.org
> > http://www.crosswire.org/mailman/listinfo/jsword-devel
> >
> >
> >
>