[sword-devel] SWORD trunk / Regression when updating remote sources

Troy A. Griffitts scribe at crosswire.org
Sat Oct 17 07:52:47 EDT 2020


OK Tobias,

Give it a go when you have a chance and let me know.

Troy


On 10/17/20 12:16 PM, Troy A. Griffitts wrote:
> The unTarGZ is also a new method and it looks like it is using the 
> default file handle pool functionality of FileMgr, from looking at 
> your stack trace. Give me about an hour and I'll have a chance to take 
> a look at it. Good news is that it's not having trouble in the 
> CURLFTPTransport. I have the same change queued up for commit for the 
> other 3 transport impls so I will go ahead and commit those, as well. 
> Thank you for working through this with me.
>
> Troy
>
> On October 17, 2020 10:44:18 AM GMT+02:00, Tobias Klein 
> <contact at tklein.info> wrote:
>
>     Dear Troy,
>
>     Thank you so much for the help and all your work on this.
>     Unfortunately the issue is still not resolved for me based on your
>     latest commits.
>
>     I have n threads that all run InstallMgr::refreshRemoteSource. n
>     corresponds to the number of repositories available, so it's
>     currently 10.
>     The operation works until at some point things start hanging
>     again. Sometime it happens after ten consecutive calls of the
>     update function, sometimes already after three times and I also
>     had it hanging after only one attempt.
>     In the calling function the hanging occurs when I join the threads
>     (waiting for them to complete).
>
>     Looking at details in gdb I find this at the point of hanging:
>
>     (gdb) info threads
>       Id   Target Id         Frame
>       102  Thread 0x7f8e3cdd1700 (LWP 46520) "node_sword_cli"
>     0x000056411eacb46c in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>       95   Thread 0x7f8e37fff700 (LWP 46514) "node_sword_cli"
>     0x000056411eacb283 in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>       93   Thread 0x7f8e3ddf6700 (LWP 46511) "node_sword_cli"
>     0x000056411eacb283 in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>       91   Thread 0x7f8e2bfff700 (LWP 46510) "node_sword_cli"
>     0x000056411eacb46c in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>     * 90   Thread 0x7f8e2b7fe700 (LWP 46509) "node_sword_cli"
>     0x000056411eacb296 in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>       1    Thread 0x7f8e3ddf9e00 (LWP 46380) "node_sword_cli"
>     0x00007f8e40d98cd7 in __pthread_clockjoin_ex () from
>     /lib/x86_64-linux-gnu/libpthread.so.0
>
>     And this stacktrace for each individual thread (relevant portion):
>     #0  0x000056411eacb46c in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>     #1  0x000056411eab4b0b in sword::FileDesc::getFd() ()
>     #2  0x000056411eb2fb70 in
>     sword::ZipCompress::unTarGZ(sword::FileDesc*, char const*) ()
>     #3  0x000056411eae4437 in
>     sword::InstallMgr::refreshRemoteSource(sword::InstallSource*) ()
>
>     I'm not sure whether all of these threads here are now hanging or
>     only one of them. It could be just the one that the main function
>     tries to join right now.
>
>     Another observation is that I am getting random output like this
>     during the process (it happens with different conf files, not this
>     one all the time):
>
>     error writing
>     /home/tobi/.sword/installMgr/20120711005000/mods.d/ngu_BL_1987.conf
>     skipping...
>
>     I didn't get these error messages with earlier SVN revisions.
>
>     To be sure I just once more tested with SVN Rev. 3759 and there I
>     consistently get 20 out of 20 attempts successful.
>
>     Best regards,
>     Tobias
>
>     PS: I'm sending this e-mail the second time, didn't seem to come
>     through via mailman the first time (at 9:11 CEST).
>
>     On 10/15/20 8:09 PM, Troy A. Griffitts wrote:
>>
>>     Dear Tobias,
>>
>>     Thank you for all the great information.  This enabled me to
>>     isolate the change which caused the issue.
>>
>>     So, for a bit of background, SWORD has no calls to mark critical
>>     sections which might be problematic for re-entrant usage.  This
>>     has been due to the many implementations of threading across many
>>     different platforms over the years, before C++11.  But, as a
>>     policy to support clients which desire to use SWORD in a
>>     multithreaded manner, we do our best to make this safe by
>>     advising clients to use separate SWMgr instances per thread. 
>>     There are still some shared objects in this scenario, but we do
>>     our best to do all the writing to these shared objects upon
>>     initialization.  We broke this rule in commit 2760, which is what
>>     caused your problem. SWORD have a facility to pool open file
>>     handles, to help OSs which have small open file handle limits. 
>>     This work is done in FileMgr.  Recently, to support Windows
>>     Unicode path names (the commit you found which breaks your
>>     multithreaded use), we rounded up all remaining native file IO
>>     calls and replaced them to used FileMgr for the IO and then
>>     extended FileMgr to handle Windows Unicode paths in a
>>     Windows-specific manner.  One of these changes was in
>>     CURLFTPTransport, which is where you are having the issue.  The
>>     problem is that, where previously this class was directly opening
>>     a FILE to do its writing, commit 2760 changed this to use FileMgr
>>     to open the file, which involved the SWORD-wide file handle pool,
>>     and since we are create a new file, we are always writing to this
>>     shared pool container, which is not threadsafe.  My guess is that
>>     you have two threads trying to update the pool container at
>>     exactly the same time.  Using the file handle pool is usually
>>     safe, because SWMgr "opens" all of its file handles on
>>     initialization (these are not actually opening OS file handles,
>>     but instead updating the file handle pool container with proxy
>>     objects which delay actual OS open to on-demand, but the point is
>>     this instance of shared file handle pool container writing is
>>     done on creation of the SWMgr, afterward, the shared resource
>>     file handle pool is only read and each object in the pool is
>>     owned by only 1 thread if the "each thread must have its own
>>     SWMgr" rule is followed.
>>
>>     Regardless of the details.  I believe I have committed a fix for
>>     you.  In short, I have changed CURLFTPTransport to follow our
>>     rule to avoid writing to shared objects when we might be
>>     re-entrant.  Here we now use FileMgr's methods which isolate OS
>>     implementation, but not FileMgr's file handle pool (as it did not
>>     previously use the pool before this commit).  This should allow
>>     this to still take advantage of the Windows OS-specific
>>     implementation, and also avoid the critical section.  Can you
>>     please try SVN head and let me know if we are back to 20 out of
>>     20 successes?
>>
>>     Thanks again for the very helpful debug log and exact revision
>>     where failure began.
>>
>>     Troy
>>
>>
>>     On 10/13/20 10:08 PM, Tobias Klein wrote:
>>>
>>>     I managed to get a backtrace to a segmentation fault using GDB.
>>>
>>>     It seems like the crash is happening in sword::FileMgr::open( ...
>>>
>>>     The starting point is sword::InstallMgr::refreshRemoteSource as
>>>     I was writing before.
>>>
>>>     Best regards,
>>>     Tobias
>>>
>>>     Program received signal SIGSEGV, Segmentation fault.
>>>     [Switching to Thread 0x7f1af3fff700 (LWP 220833)]
>>>     0x00007f1b027045a4 in sword::FileMgr::open(char const*, int,
>>>     int, bool) () from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     (gdb) backtrace
>>>     #0  0x00007f1b027045a4 in sword::FileMgr::open(char const*, int,
>>>     int, bool) () from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #1  0x00007f1b0276ad7b in sword::(anonymous
>>>     namespace)::my_fwrite(void*, unsigned long, unsigned long, void*) ()
>>>        from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #2  0x00007f1b180626bf in ?? () from
>>>     /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>>     #3  0x00007f1b18074a2b in ?? () from
>>>     /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>>     #4  0x00007f1b1807e2e4 in ?? () from
>>>     /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>>     #5  0x00007f1b1807f6f9 in curl_multi_perform () from
>>>     /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>>     #6  0x00007f1b18075d13 in curl_easy_perform () from
>>>     /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>>     #7  0x00007f1b0276b683 in sword::CURLFTPTransport::getURL(char
>>>     const*, char const*, sword::SWBuf*) () from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #8  0x00007f1b0271d5d2 in
>>>     sword::InstallMgr::remoteCopy(sword::InstallSource*, char
>>>     const*, char const*, bool, char const*) ()
>>>        from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #9  0x00007f1b0271edc7 in
>>>     sword::InstallMgr::refreshRemoteSource(sword::InstallSource*) ()
>>>     from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #10 0x00007f1b026ad734 in
>>>     RepositoryInterface::refreshIndividualRemoteSource(std::__cxx11::basic_string<char,
>>>     std::char_traits<char>, std::allocator<char> >,
>>>     std::function<void (unsigned int)>*) ()
>>>        from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #11 0x00007f1b026b17dd in
>>>     std::thread::_State_impl<std::thread::_Invoker<std::tuple<int
>>>     (RepositoryInterface::*)(std::__cxx11::basic_string<char,
>>>     std::char_traits<char>, std::allocator<char> >,
>>>     std::function<void (unsigned int)>*), RepositoryInterface*,
>>>     std::__cxx11::basic_string<char, std::char_traits<char>,
>>>     std::allocator<char> >, std::function<void (unsigned int)>*> >
>>>     >::_M_run() ()
>>>        from
>>>     /home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>>     #12 0x00007f1b1d622cb4 in ?? () from
>>>     /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>>     #13 0x00007f1b1e20a609 in start_thread () from
>>>     /lib/x86_64-linux-gnu/libpthread.so.0
>>>     #14 0x00007f1b1e131103 in clone () from
>>>     /lib/x86_64-linux-gnu/libc.so.6
>>>
>>>     On 10/13/20 1:07 PM, Tobias Klein wrote:
>>>>
>>>>     Hi Troy,
>>>>
>>>>     I tested more SVN revisions of SWORD trunk (starting from my
>>>>     stable version until I hit the bug) and I can now say that
>>>>
>>>>     SVN Rev. 3759 is the last SVN revision that works without
>>>>     hanging for the below mentioned scenario. (20 out of 20 tests
>>>>     successful)
>>>>
>>>>     SVN Rev. 3760 is the first SVN revision where the hanging
>>>>     occurs. The commit message is "First cut at better isolation of
>>>>     FileIO to FileMgr and providing a WIN32 impl with works with
>>>>     wchar_t".
>>>>
>>>>     Modified files:
>>>>     include/filemgr.h
>>>>     include/swbuf.h
>>>>     lib/bcppmake/libsword.bpr
>>>>     src/mgr/curlftpt.cpp
>>>>     src/mgr/curlhttpt.cpp
>>>>     src/mgr/filemgr.cpp
>>>>     src/mgr/installmgr.cpp
>>>>     src/mgr/swmgr.cpp
>>>>     src/utilfuns/utilstr.cpp
>>>>
>>>>     Maybe this helps to find the root-cause.
>>>>
>>>>     Best regards,
>>>>     Tobias
>>>>
>>>>     On 10/12/20 9:20 PM, Tobias Klein wrote:
>>>>>
>>>>>     I'll see whether I can collect a stack trace. It may take some
>>>>>     time until I have it.
>>>>>
>>>>>     The multi-threaded "remote source refreshing" worked without
>>>>>     issues until recently.
>>>>>
>>>>>     Here is the code of the function that does the actual work in
>>>>>     a thread.
>>>>>     See
>>>>>     https://github.com/tobias-klein/node-sword-interface/blob/787160ccb4b3bab2a762d22f74031c7237edc803/src/sword_backend/repository_interface.cpp#L105.
>>>>>
>>>>>     intRepositoryInterface::refreshIndividualRemoteSource(stringremoteSourceName,
>>>>>     std::function<void(unsignedintprogress)>*progressCallback)
>>>>>     {
>>>>>     //cout << "Refreshing source " << remoteSourceName << endl <<
>>>>>     flush;
>>>>>     InstallSource* source= this->getRemoteSource(remoteSourceName);
>>>>>     intresult= this->_installMgr->refreshRemoteSource(source);
>>>>>     if(result!= 0) {
>>>>>     cerr<<"Failed to refresh source "<<remoteSourceName<<endl<<flush;
>>>>>     }
>>>>>     remoteSourceUpdateMutex.lock();
>>>>>     this->_remoteSourceUpdateCount++;
>>>>>     unsignedinttotalPercent=
>>>>>     (unsignedint)calculateIntPercentage<double>(this->_remoteSourceUpdateCount,
>>>>>     this->_remoteSourceCount);
>>>>>     if(progressCallback!= 0) {
>>>>>     (*progressCallback)(totalPercent);
>>>>>     }
>>>>>     remoteSourceUpdateMutex.unlock();
>>>>>     returnresult;
>>>>>     }
>>>>>
>>>>>     Best regards,
>>>>>     Tobias
>>>>>
>>>>>     On 10/12/20 9:01 PM, Troy A. Griffitts wrote:
>>>>>>     Any luck getting a stack trace on crash?
>>>>>>
>>>>>>     Regarding the "multitheaded mode", I'd have to get a bit more
>>>>>>     information as to exactly how you are sharing SWORD objects
>>>>>>     across your threads. Generally, as a rule, you shouldn't. We
>>>>>>     recommend a separate instance of SWMgr per thread and that
>>>>>>     probably goes for InstallMgr, as well.
>>>>>>
>>>>>>     Troy
>>>>>>
>>>>>>     On October 12, 2020 8:29:31 PM GMT+02:00, Tobias Klein
>>>>>>     <contact at tklein.info> wrote:
>>>>>>
>>>>>>         Hi Troy,
>>>>>>
>>>>>>         I'm using curl on all three platforms.
>>>>>>
>>>>>>         Regarding the timeout configuration I have not changed
>>>>>>         anything yet, to make this configurable in Ezra Project
>>>>>>         is still on my todo list.
>>>>>>
>>>>>>         I just checked on Linux.
>>>>>>         With the old version (May 18th 2020) no hanging or crash
>>>>>>         in 10 out of 10 times.
>>>>>>         WIth the new version (latest trunk / SWORD 1.9 RC3) I get
>>>>>>         1 x crash, 2 x hanging, 7 x working.
>>>>>>
>>>>>>         I'm running the InstallMgr::refreshRemoteSource "in a
>>>>>>         multi-threaded mode".
>>>>>>
>>>>>>         Best regards,
>>>>>>         Tobias
>>>>>>
>>>>>>         On 10/12/20 6:59 PM, Troy A. Griffitts wrote:
>>>>>>>         Hi Tobias,
>>>>>>>
>>>>>>>         What transport library are you building with? ftplib or
>>>>>>>         curl?
>>>>>>>
>>>>>>>         Have you changed the value of our new timeout from the
>>>>>>>         default, I believe we decided on, 10 seconds?
>>>>>>>
>>>>>>>         Troy
>>>>>>>
>>>>>>>         On October 12, 2020 6:46:54 PM GMT+02:00, Tobias Klein
>>>>>>>         <contact at tklein.info> wrote:
>>>>>>>
>>>>>>>             Hi Troy,
>>>>>>>
>>>>>>>             In my latest Ezra Project builds using SWORD trunk I’ve been noticing random „hangs“ and crashes related to "updating remote sources“. I suppose it must be around InstallMgr::refreshRemoteSource.
>>>>>>>
>>>>>>>             This was still rock solid when using SWORD trunk from May 18th 2020, but not so any more with the recent SWORD trunk.
>>>>>>>
>>>>>>>             Unfortunately I cannot pinpoint this more specifically. I just wanted to first share this observation, because it’s worrying me.
>>>>>>>
>>>>>>>             I’ve been noticing this regression both on Windows and macOS. Need to check later whether this also happens on Linux, cannot recall it right now.
>>>>>>>
>>>>>>>             Best regards,
>>>>>>>             Tobias
>>>>>>>             ------------------------------------------------------------------------
>>>>>>>             sword-devel mailing list:sword-devel at crosswire.org
>>>>>>>             http://crosswire.org/mailman/listinfo/sword-devel  <http://crosswire.org/mailman/listinfo/sword-devel>
>>>>>>>             Instructions to unsubscribe/change your settings at above page
>>>>>>>
>>>>>>>
>>>>>>>         -- 
>>>>>>>         Sent from my Android device with K-9 Mail. Please excuse
>>>>>>>         my brevity. 
>>>>>>
>>>>>>
>>>>>>     -- 
>>>>>>     Sent from my Android device with K-9 Mail. Please excuse my
>>>>>>     brevity. 
>>>>>
>>>>>     _______________________________________________
>>>>>     sword-devel mailing list:sword-devel at crosswire.org
>>>>>     http://crosswire.org/mailman/listinfo/sword-devel
>>>>>     Instructions to unsubscribe/change your settings at above page
>>>>
>>>>     _______________________________________________
>>>>     sword-devel mailing list:sword-devel at crosswire.org
>>>>     http://crosswire.org/mailman/listinfo/sword-devel
>>>>     Instructions to unsubscribe/change your settings at above page
>>>
>>>     _______________________________________________
>>>     sword-devel mailing list:sword-devel at crosswire.org
>>>     http://crosswire.org/mailman/listinfo/sword-devel
>>>     Instructions to unsubscribe/change your settings at above page
>>
>>     _______________________________________________
>>     sword-devel mailing list:sword-devel at crosswire.org
>>     http://crosswire.org/mailman/listinfo/sword-devel
>>     Instructions to unsubscribe/change your settings at above page
>
>
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://crosswire.org/pipermail/sword-devel/attachments/20201017/55f17c6c/attachment-0001.html>


More information about the sword-devel mailing list