[sword-devel] SWORD trunk / Regression when updating remote sources
Troy A. Griffitts
scribe at crosswire.org
Sat Oct 17 06:16:04 EDT 2020
The unTarGZ is also a new method and it looks like it is using the default file handle pool functionality of FileMgr, from looking at your stack trace. Give me about an hour and I'll have a chance to take a look at it. Good news is that it's not having trouble in the CURLFTPTransport. I have the same change queued up for commit for the other 3 transport impls so I will go ahead and commit those, as well. Thank you for working through this with me.
Troy
On October 17, 2020 10:44:18 AM GMT+02:00, Tobias Klein <contact at tklein.info> wrote:
>Dear Troy,
>
>Thank you so much for the help and all your work on this. Unfortunately
>
>the issue is still not resolved for me based on your latest commits.
>
>I have n threads that all run InstallMgr::refreshRemoteSource. n
>corresponds to the number of repositories available, so it's currently
>10.
>The operation works until at some point things start hanging again.
>Sometime it happens after ten consecutive calls of the update function,
>
>sometimes already after three times and I also had it hanging after
>only
>one attempt.
>In the calling function the hanging occurs when I join the threads
>(waiting for them to complete).
>
>Looking at details in gdb I find this at the point of hanging:
>
>(gdb) info threads
> Id Target Id Frame
> 102 Thread 0x7f8e3cdd1700 (LWP 46520) "node_sword_cli"
>0x000056411eacb46c in sword::FileMgr::sysOpen(sword::FileDesc*) ()
> 95 Thread 0x7f8e37fff700 (LWP 46514) "node_sword_cli"
>0x000056411eacb283 in sword::FileMgr::sysOpen(sword::FileDesc*) ()
> 93 Thread 0x7f8e3ddf6700 (LWP 46511) "node_sword_cli"
>0x000056411eacb283 in sword::FileMgr::sysOpen(sword::FileDesc*) ()
> 91 Thread 0x7f8e2bfff700 (LWP 46510) "node_sword_cli"
>0x000056411eacb46c in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>* 90 Thread 0x7f8e2b7fe700 (LWP 46509) "node_sword_cli"
>0x000056411eacb296 in sword::FileMgr::sysOpen(sword::FileDesc*) ()
> 1 Thread 0x7f8e3ddf9e00 (LWP 46380) "node_sword_cli"
>0x00007f8e40d98cd7 in __pthread_clockjoin_ex () from
>/lib/x86_64-linux-gnu/libpthread.so.0
>
>And this stacktrace for each individual thread (relevant portion):
>#0 0x000056411eacb46c in sword::FileMgr::sysOpen(sword::FileDesc*) ()
>#1 0x000056411eab4b0b in sword::FileDesc::getFd() ()
>#2 0x000056411eb2fb70 in sword::ZipCompress::unTarGZ(sword::FileDesc*,
>
>char const*) ()
>#3 0x000056411eae4437 in
>sword::InstallMgr::refreshRemoteSource(sword::InstallSource*) ()
>
>I'm not sure whether all of these threads here are now hanging or only
>one of them. It could be just the one that the main function tries to
>join right now.
>
>Another observation is that I am getting random output like this during
>
>the process (it happens with different conf files, not this one all the
>
>time):
>
>error writing
>/home/tobi/.sword/installMgr/20120711005000/mods.d/ngu_BL_1987.conf
>skipping...
>
>I didn't get these error messages with earlier SVN revisions.
>
>To be sure I just once more tested with SVN Rev. 3759 and there I
>consistently get 20 out of 20 attempts successful.
>
>Best regards,
>Tobias
>
>PS: I'm sending this e-mail the second time, didn't seem to come
>through
>via mailman the first time (at 9:11 CEST).
>
>On 10/15/20 8:09 PM, Troy A. Griffitts wrote:
>>
>> Dear Tobias,
>>
>> Thank you for all the great information. This enabled me to isolate
>> the change which caused the issue.
>>
>> So, for a bit of background, SWORD has no calls to mark critical
>> sections which might be problematic for re-entrant usage. This has
>> been due to the many implementations of threading across many
>> different platforms over the years, before C++11. But, as a policy
>to
>> support clients which desire to use SWORD in a multithreaded manner,
>> we do our best to make this safe by advising clients to use separate
>> SWMgr instances per thread. There are still some shared objects in
>> this scenario, but we do our best to do all the writing to these
>> shared objects upon initialization. We broke this rule in commit
>> 2760, which is what caused your problem. SWORD have a facility to
>> pool open file handles, to help OSs which have small open file handle
>
>> limits. This work is done in FileMgr. Recently, to support Windows
>> Unicode path names (the commit you found which breaks your
>> multithreaded use), we rounded up all remaining native file IO calls
>> and replaced them to used FileMgr for the IO and then extended
>FileMgr
>> to handle Windows Unicode paths in a Windows-specific manner. One of
>
>> these changes was in CURLFTPTransport, which is where you are having
>> the issue. The problem is that, where previously this class was
>> directly opening a FILE to do its writing, commit 2760 changed this
>to
>> use FileMgr to open the file, which involved the SWORD-wide file
>> handle pool, and since we are create a new file, we are always
>writing
>> to this shared pool container, which is not threadsafe. My guess is
>> that you have two threads trying to update the pool container at
>> exactly the same time. Using the file handle pool is usually safe,
>> because SWMgr "opens" all of its file handles on initialization
>(these
>> are not actually opening OS file handles, but instead updating the
>> file handle pool container with proxy objects which delay actual OS
>> open to on-demand, but the point is this instance of shared file
>> handle pool container writing is done on creation of the SWMgr,
>> afterward, the shared resource file handle pool is only read and each
>
>> object in the pool is owned by only 1 thread if the "each thread must
>
>> have its own SWMgr" rule is followed.
>>
>> Regardless of the details. I believe I have committed a fix for
>you.
>> In short, I have changed CURLFTPTransport to follow our rule to avoid
>
>> writing to shared objects when we might be re-entrant. Here we now
>> use FileMgr's methods which isolate OS implementation, but not
>> FileMgr's file handle pool (as it did not previously use the pool
>> before this commit). This should allow this to still take advantage
>> of the Windows OS-specific implementation, and also avoid the
>critical
>> section. Can you please try SVN head and let me know if we are back
>> to 20 out of 20 successes?
>>
>> Thanks again for the very helpful debug log and exact revision where
>> failure began.
>>
>> Troy
>>
>>
>> On 10/13/20 10:08 PM, Tobias Klein wrote:
>>>
>>> I managed to get a backtrace to a segmentation fault using GDB.
>>>
>>> It seems like the crash is happening in sword::FileMgr::open( ...
>>>
>>> The starting point is sword::InstallMgr::refreshRemoteSource as I
>was
>>> writing before.
>>>
>>> Best regards,
>>> Tobias
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0x7f1af3fff700 (LWP 220833)]
>>> 0x00007f1b027045a4 in sword::FileMgr::open(char const*, int, int,
>>> bool) () from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> (gdb) backtrace
>>> #0 0x00007f1b027045a4 in sword::FileMgr::open(char const*, int,
>int,
>>> bool) () from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #1 0x00007f1b0276ad7b in sword::(anonymous
>>> namespace)::my_fwrite(void*, unsigned long, unsigned long, void*) ()
>>> from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #2 0x00007f1b180626bf in ?? () from
>>> /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>> #3 0x00007f1b18074a2b in ?? () from
>>> /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>> #4 0x00007f1b1807e2e4 in ?? () from
>>> /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>> #5 0x00007f1b1807f6f9 in curl_multi_perform () from
>>> /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>> #6 0x00007f1b18075d13 in curl_easy_perform () from
>>> /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4
>>> #7 0x00007f1b0276b683 in sword::CURLFTPTransport::getURL(char
>>> const*, char const*, sword::SWBuf*) () from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #8 0x00007f1b0271d5d2 in
>>> sword::InstallMgr::remoteCopy(sword::InstallSource*, char const*,
>>> char const*, bool, char const*) ()
>>> from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #9 0x00007f1b0271edc7 in
>>> sword::InstallMgr::refreshRemoteSource(sword::InstallSource*) ()
>from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #10 0x00007f1b026ad734 in
>>>
>RepositoryInterface::refreshIndividualRemoteSource(std::__cxx11::basic_string<char,
>
>>> std::char_traits<char>, std::allocator<char> >, std::function<void
>>> (unsigned int)>*) ()
>>> from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #11 0x00007f1b026b17dd in
>>> std::thread::_State_impl<std::thread::_Invoker<std::tuple<int
>>> (RepositoryInterface::*)(std::__cxx11::basic_string<char,
>>> std::char_traits<char>, std::allocator<char> >, std::function<void
>>> (unsigned int)>*), RepositoryInterface*,
>>> std::__cxx11::basic_string<char, std::char_traits<char>,
>>> std::allocator<char> >, std::function<void (unsigned int)>*> >
>>> >::_M_run() ()
>>> from
>>>
>/home/tobi/dev/ezra_project/node-sword-interface-git/build/Release/node_sword_interface.node
>>> #12 0x00007f1b1d622cb4 in ?? () from
>>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #13 0x00007f1b1e20a609 in start_thread () from
>>> /lib/x86_64-linux-gnu/libpthread.so.0
>>> #14 0x00007f1b1e131103 in clone () from
>/lib/x86_64-linux-gnu/libc.so.6
>>>
>>> On 10/13/20 1:07 PM, Tobias Klein wrote:
>>>>
>>>> Hi Troy,
>>>>
>>>> I tested more SVN revisions of SWORD trunk (starting from my stable
>
>>>> version until I hit the bug) and I can now say that
>>>>
>>>> SVN Rev. 3759 is the last SVN revision that works without hanging
>>>> for the below mentioned scenario. (20 out of 20 tests successful)
>>>>
>>>> SVN Rev. 3760 is the first SVN revision where the hanging occurs.
>>>> The commit message is "First cut at better isolation of FileIO to
>>>> FileMgr and providing a WIN32 impl with works with wchar_t".
>>>>
>>>> Modified files:
>>>> include/filemgr.h
>>>> include/swbuf.h
>>>> lib/bcppmake/libsword.bpr
>>>> src/mgr/curlftpt.cpp
>>>> src/mgr/curlhttpt.cpp
>>>> src/mgr/filemgr.cpp
>>>> src/mgr/installmgr.cpp
>>>> src/mgr/swmgr.cpp
>>>> src/utilfuns/utilstr.cpp
>>>>
>>>> Maybe this helps to find the root-cause.
>>>>
>>>> Best regards,
>>>> Tobias
>>>>
>>>> On 10/12/20 9:20 PM, Tobias Klein wrote:
>>>>>
>>>>> I'll see whether I can collect a stack trace. It may take some
>time
>>>>> until I have it.
>>>>>
>>>>> The multi-threaded "remote source refreshing" worked without
>issues
>>>>> until recently.
>>>>>
>>>>> Here is the code of the function that does the actual work in a
>thread.
>>>>> See
>>>>>
>https://github.com/tobias-klein/node-sword-interface/blob/787160ccb4b3bab2a762d22f74031c7237edc803/src/sword_backend/repository_interface.cpp#L105.
>>>>>
>>>>>
>intRepositoryInterface::refreshIndividualRemoteSource(stringremoteSourceName,
>
>>>>> std::function<void(unsignedintprogress)>*progressCallback)
>>>>> {
>>>>> //cout << "Refreshing source " << remoteSourceName << endl <<
>flush;
>>>>> InstallSource* source= this->getRemoteSource(remoteSourceName);
>>>>> intresult= this->_installMgr->refreshRemoteSource(source);
>>>>> if(result!= 0) {
>>>>> cerr<<"Failed to refresh source "<<remoteSourceName<<endl<<flush;
>>>>> }
>>>>> remoteSourceUpdateMutex.lock();
>>>>> this->_remoteSourceUpdateCount++;
>>>>> unsignedinttotalPercent=
>>>>>
>(unsignedint)calculateIntPercentage<double>(this->_remoteSourceUpdateCount,
>>>>> this->_remoteSourceCount);
>>>>> if(progressCallback!= 0) {
>>>>> (*progressCallback)(totalPercent);
>>>>> }
>>>>> remoteSourceUpdateMutex.unlock();
>>>>> returnresult;
>>>>> }
>>>>>
>>>>> Best regards,
>>>>> Tobias
>>>>>
>>>>> On 10/12/20 9:01 PM, Troy A. Griffitts wrote:
>>>>>> Any luck getting a stack trace on crash?
>>>>>>
>>>>>> Regarding the "multitheaded mode", I'd have to get a bit more
>>>>>> information as to exactly how you are sharing SWORD objects
>across
>>>>>> your threads. Generally, as a rule, you shouldn't. We recommend a
>
>>>>>> separate instance of SWMgr per thread and that probably goes for
>>>>>> InstallMgr, as well.
>>>>>>
>>>>>> Troy
>>>>>>
>>>>>> On October 12, 2020 8:29:31 PM GMT+02:00, Tobias Klein
>>>>>> <contact at tklein.info> wrote:
>>>>>>
>>>>>> Hi Troy,
>>>>>>
>>>>>> I'm using curl on all three platforms.
>>>>>>
>>>>>> Regarding the timeout configuration I have not changed
>>>>>> anything yet, to make this configurable in Ezra Project is
>>>>>> still on my todo list.
>>>>>>
>>>>>> I just checked on Linux.
>>>>>> With the old version (May 18th 2020) no hanging or crash in
>10
>>>>>> out of 10 times.
>>>>>> WIth the new version (latest trunk / SWORD 1.9 RC3) I get 1 x
>>>>>> crash, 2 x hanging, 7 x working.
>>>>>>
>>>>>> I'm running the InstallMgr::refreshRemoteSource "in a
>>>>>> multi-threaded mode".
>>>>>>
>>>>>> Best regards,
>>>>>> Tobias
>>>>>>
>>>>>> On 10/12/20 6:59 PM, Troy A. Griffitts wrote:
>>>>>>> Hi Tobias,
>>>>>>>
>>>>>>> What transport library are you building with? ftplib or
>curl?
>>>>>>>
>>>>>>> Have you changed the value of our new timeout from the
>>>>>>> default, I believe we decided on, 10 seconds?
>>>>>>>
>>>>>>> Troy
>>>>>>>
>>>>>>> On October 12, 2020 6:46:54 PM GMT+02:00, Tobias Klein
>>>>>>> <contact at tklein.info> wrote:
>>>>>>>
>>>>>>> Hi Troy,
>>>>>>>
>>>>>>> In my latest Ezra Project builds using SWORD trunk I’ve
>been noticing random „hangs“ and crashes related to "updating remote
>sources“. I suppose it must be around InstallMgr::refreshRemoteSource.
>>>>>>>
>>>>>>> This was still rock solid when using SWORD trunk from
>May 18th 2020, but not so any more with the recent SWORD trunk.
>>>>>>>
>>>>>>> Unfortunately I cannot pinpoint this more specifically.
>I just wanted to first share this observation, because it’s worrying
>me.
>>>>>>>
>>>>>>> I’ve been noticing this regression both on Windows and
>macOS. Need to check later whether this also happens on Linux, cannot
>recall it right now.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Tobias
>>>>>>>
>------------------------------------------------------------------------
>>>>>>> sword-devel mailing list:sword-devel at crosswire.org
>>>>>>> http://crosswire.org/mailman/listinfo/sword-devel
>>>>>>> Instructions to unsubscribe/change your settings at
>above page
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sent from my Android device with K-9 Mail. Please excuse my
>>>>>>> brevity.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from my Android device with K-9 Mail. Please excuse my
>brevity.
>>>>>
>>>>> _______________________________________________
>>>>> sword-devel mailing list:sword-devel at crosswire.org
>>>>> http://crosswire.org/mailman/listinfo/sword-devel
>>>>> Instructions to unsubscribe/change your settings at above page
>>>>
>>>> _______________________________________________
>>>> sword-devel mailing list:sword-devel at crosswire.org
>>>> http://crosswire.org/mailman/listinfo/sword-devel
>>>> Instructions to unsubscribe/change your settings at above page
>>>
>>> _______________________________________________
>>> sword-devel mailing list:sword-devel at crosswire.org
>>> http://crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://crosswire.org/pipermail/sword-devel/attachments/20201017/01879ebe/attachment-0001.html>
More information about the sword-devel
mailing list