[jsword-devel] Fwd: Search Index Downloading

Troy A. Griffitts scribe at crosswire.org
Tue Oct 12 21:29:18 MST 2004


Thinking about it more and more makes me think this isn't anything we want 
to manage (at least I don't want to manage).  Talking about it with some 
of the c++ guys, I had comments like:

 	Many places in the world don't have the bandwidth to download 6.8
megs of index for the KJV (current clucene index size).  [Much less all
these other indices you may want to add.]

 	Can't they [the jsword guys] nice a process [spawn a thread] to 
index in the background?

____ end of comments ____

I tend to agree.  The size and management headache aren't worth the 5
minute savings for the user.  And if we can get the 5 minutes down to 2,
and run those 2 in the background where they are hardly noticed, I think
our time is better spent on that.

 	-Troy.
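
A minimal Java sketch of the "index in the background" suggestion above.
BackgroundIndexer, IndexJob and the finished flag are hypothetical names
for illustration, not JSword API, and thread priority is only a hint to
the scheduler, so how much this helps will vary by platform.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class BackgroundIndexer {

    /** Hypothetical stand-in for whatever actually builds the index. */
    public interface IndexJob {
        void build() throws Exception;
    }

    /**
     * Runs the job on a minimum-priority daemon thread so the UI stays
     * responsive. The flag is set only if the job finished without error.
     */
    public static Thread start(String bookName, IndexJob job, AtomicBoolean finished) {
        Thread worker = new Thread(() -> {
            try {
                job.build();
                finished.set(true);
            } catch (Exception e) {
                // Leave the flag false so the caller can retry later.
                System.err.println("Indexing " + bookName + " failed: " + e);
            }
        }, "indexer-" + bookName);
        worker.setPriority(Thread.MIN_PRIORITY); // rough Java analogue of "nice"
        worker.setDaemon(true);                  // do not keep the JVM alive
        worker.start();
        return worker;
    }
}
```

A caller would start this right after a module is installed and only
offer indexed search once the flag is true.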




On Tue, 12 Oct 2004, DM Smith wrote:

> The basic issues that I see are:
> 1) As Lucene is upgraded it may invalidate an index built against an earlier 
> version.
> 2) If an upgraded Lucene is backward compatible, we still may want to
> re-index to get more features.
> 3) If a module is upgraded we will need to re-index (as you pointed out.)
> 4) As we create indexes for other features (e.g. transliteration of Greek and 
> Hebrew; removal of accents, breathing, diacriticals, ...) , these will be 
> subject to the same issues.
> 5) Old indexes need to be retained according to some reasonable policy. At 
> any given time we may need to support more than one version of the index.
> 6) An index should not be made visible until it is completed. (e.g. build in 
> an alternate directory and then rename the directory when it is finished)
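A sketch of point 6 in Java, assuming the index lives in one directory per
book under a common root. IndexPublisher, IndexWriterStep and the
".building" suffix are hypothetical names. It also assumes that no
published index exists yet at the final location and that no earlier build
left a ".building" directory behind.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class IndexPublisher {

    /** Hypothetical callback that writes the index files into a directory. */
    public interface IndexWriterStep {
        void writeInto(Path dir) throws IOException;
    }

    /**
     * Builds into a sibling directory, then renames it into place, so a
     * reader never sees a half-written index under the final name.
     */
    public static Path buildThenPublish(Path indexRoot, String bookName,
                                        IndexWriterStep step) throws IOException {
        Path finalDir = indexRoot.resolve(bookName);
        // Same parent as finalDir, hence the same filesystem, so the move
        // below can be a plain rename.
        Path buildDir = Files.createDirectory(indexRoot.resolve(bookName + ".building"));
        step.writeInto(buildDir);
        return Files.move(buildDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

If the build step throws, the ".building" directory is left behind and
would need cleaning up before the next attempt.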
>
> All this seems to point to one thing: versioning of the indexes is
> necessary and will need to be well thought out.
>
> I think we will certainly need to have metadata describing the index. It
> may be possible to use path names to do this.
> It should contain sufficient version information to tie it to a particular 
> version of Lucene, to the versions of Sword and JSword that can use it, and 
> to the particular version of the module.
> If we maintained a checksum for the module, we could probably automate the 
> re-indexing of modules. From the server logs, we can probably figure out a 
> good (idle) time to do it.
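One way the metadata and checksum idea could look in Java. This is a sketch
under assumptions: the class name, the field names and the choice of
SHA-256 are all hypothetical, since the thread only says which versions
should be recorded and that a module checksum would allow automation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class IndexMetadata {
    // Hypothetical fields: what the index was built with and built from.
    public final String luceneVersion;
    public final String swordVersion;
    public final String jswordVersion;
    public final String moduleVersion;
    public final String moduleChecksum;

    public IndexMetadata(String luceneVersion, String swordVersion,
                         String jswordVersion, String moduleVersion,
                         String moduleChecksum) {
        this.luceneVersion = luceneVersion;
        this.swordVersion = swordVersion;
        this.jswordVersion = jswordVersion;
        this.moduleVersion = moduleVersion;
        this.moduleChecksum = moduleChecksum;
    }

    /** SHA-256 of the module file as lowercase hex (reads the whole file). */
    public static String checksum(Path moduleFile)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(moduleFile));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True when any recorded value differs from the current environment. */
    public boolean needsReindex(IndexMetadata current) {
        return !luceneVersion.equals(current.luceneVersion)
            || !swordVersion.equals(current.swordVersion)
            || !jswordVersion.equals(current.jswordVersion)
            || !moduleVersion.equals(current.moduleVersion)
            || !moduleChecksum.equals(current.moduleChecksum);
    }
}
```

Treating any difference as stale is the simplest policy. A real check
might accept a range of compatible Lucene versions instead.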
>
> Troy A. Griffitts wrote:
>
>> Hey guys,
>>     I'd like to do some experiments to see if clucene and Java Lucene
>> indices are binary compatible.
>> 
>>     I also like the idea of a subdirectory under idx for keeping different
>> kinds of indices.  I might suggest even 1 more level under L1, if you are
>> planning for version changes of your index structure.
>> 
>>     e.g. C++ SWORD supports a pluggable index architecture, and we are 
>> hoping to write some cool indexers for morphologically declined searches, 
>> etc.  We could keep pre-generated index sets under different 
>> subdirectories under idx for each plugin.
>> 
>>     On the downside, we release updated modules on a regular basis-- some
>> modules more 'regular' than others.  To keep the indices up to date for
>> each module should not be the module creator's responsibility.  I wouldn't
>> expect our current maintainers to run a number of different indexers every 
>> time they release a new module, unless the process was nearly completely 
>> automated to handle ALL types of indexing.
>> 
>>     Up until this consideration, we have always taken the methodology of 
>> generating anything needed for a plugin on demand on the end user's 
>> system.  Which is always the least maintenance option for us :)
>> 
>>     -Troy.
>> 
>> 
>> On Mon, 11 Oct 2004, Joe Walker wrote:
>> 
>>> Getting Reply and ReplyAll confused again ...
>>> 
>>> ---------- Forwarded message ----------
>>> From: Joe Walker <joseph.walker at gmail.com>
>>> Date: Mon, 11 Oct 2004 08:12:18 +0100
>>> Subject: Re: Search Index Downloading
>>> To: "Troy A. Griffitts" <scribe at crosswire.org>
>>> 
>>> How about we use /pub/sword/raw/idx/L1/[book].zip then?
>>> If Java Lucene indexes and CLucene indexes are compatible then it
>>> won't be proprietary to JSword. If they are not compatible, or if you
>>> want to use different options in creating the index then you can use
>>> /pub/sword/raw/idx/C1/[book].zip or something.
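The layout proposed here, sketched as a small Java helper. IndexPaths and
remotePath are hypothetical names; only the path pattern
/pub/sword/raw/idx/L1/[book].zip comes from the message itself.

```java
public class IndexPaths {

    /**
     * Builds the download path for one book's pre-built index.
     *
     * @param moduleRoot root path of a module site, e.g. "/pub/sword/raw"
     * @param formatId   index format and version tag, e.g. "L1" or "C1"
     * @param book       module name, e.g. "KJV"
     */
    public static String remotePath(String moduleRoot, String formatId, String book) {
        // Tolerate a trailing slash on the root so callers need not care.
        String root = moduleRoot.endsWith("/")
                ? moduleRoot.substring(0, moduleRoot.length() - 1)
                : moduleRoot;
        return root + "/idx/" + formatId + "/" + book + ".zip";
    }
}
```

With this shape a client remembers one root per module site and switches
index formats by changing only the format tag.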
>>> 
>>> Joe.
>>> 
>>> 
>>> 
>>> On Sun, 10 Oct 2004 22:12:00 -0700, Troy A. Griffitts
>>> <scribe at crosswire.org> wrote:
>>> 
>>>> Hey Joe,
>>>>         That's fine.  Let me know if there is anything I need to do
>>>> for you.  Don't we have a /pub/jsword directory for your stuff?  I
>>>> understand what you mean by having the same base directory for modules
>>>> (which would be /pub/sword/raw for our server, so maybe
>>>> /pub/sword/raw/idx), but this isn't a sword module data structure.
>>>> This is jsword's proprietary (in the sense of not publicly sword
>>>> declared) data.  It would be nice to unify a common index format for
>>>> sword modules.
>>>> 
>>>>         Does it really take lucene 5+ minutes to generate?  That's a
>>>> bummer.  You would think it wouldn't take much longer than a single
>>>> non-index search thru the Bible.
>>>> 
>>>>         To belatedly answer your question on sword-devel, I honestly
>>>> have no idea if clucene indices are binary compatible with the java
>>>> lucene counterpart.
>>>> 
>>>>         -Troy.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> Joe Walker wrote:
>>>> 
>>>>> Hi Troy,
>>>>> 
>>>>> I'd like to allow users of Bible Desktop to download search indexes
>>>>> because they take about 5 mins to generate. A search index is between
>>>>> 2 and 3 MB per book so it ought not to take up too much space.
>>>>> 
>>>>> Ideally we would use an FTP directory on crosswire something like:
>>>>> - /pub/sword/search/jsword/L1/[book].zip
>>>>> 
>>>>> It starts /pub/sword so that if the beta modules site (or other
>>>>> download sites) comes online we can just remember one root path per
>>>>> module site. The search/jsword bit would keep our stuff from getting
>>>>> in anyone else's way. L1 is simply a version number so we can update
>>>>> the index format without huge turmoil.
>>>>> 
>>>>> Is that OK?
>>>>> Thanks,
>>>>> 
>>>>> Joe.
>>>> 
>>>> 
>>>> 
>>> _______________________________________________
>>> jsword-devel mailing list
>>> jsword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>> 

