Context Navigation

← Previous Ticket
Next Ticket →

Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#13571 closed Bug Report - Crash (fixed)

Crash of backend on delete of program being recorded

Reported by:	Klaas de Waal	Owned by:	Klaas de Waal
Priority:	minor	Milestone:	31.1
Component:	MythTV - General	Version:	Master Head
Severity:	medium	Keywords:
Cc:	gigem, mark, spieth	Ticket locked:	no

Description

Over the last two weeks I have now experienced three times a segfault of the mythbackend when deleting a program that is still being recorded. The program is deleted from the frontend on a separate machine. This happens on my "production" living room system and it does happen only incidentally but it is happening only since I am running master in the living room. This is a master of a few weeks old. All three segfaults are in the same piece of code.

Backend identification:

2020-01-26 10:09:44.408326 C [23981/23981] thread_unknown mythcommandlineparser.cpp:2602 (ConfigureLogging) - mythbackend version: master [v31-Pre-1811-g0d3bb87db7] www.mythtv.org
2020-01-26 10:09:44.408371 C [23981/23981] thread_unknown mythcommandlineparser.cpp:2606 (ConfigureLogging) - Qt version: compile: 5.12.5, runtime: 5.12.5
2020-01-26 10:09:44.408411 I [23981/23981] thread_unknown mythcommandlineparser.cpp:2608 (ConfigureLogging) - Fedora 31 (Workstation Edition) (x86_64)
2020-01-26 10:09:44.408415 N [23981/23981] thread_unknown mythcommandlineparser.cpp:2610 (ConfigureLogging) - Enabled verbose msgs:  general channel record
2020-01-26 10:09:44.408524 N [23981/23981] thread_unknown logging.cpp:724 (logStart) - Setting Log Level to LOG_INFO

Last few lines of the backend log:

2020-02-02 21:16:47.721784 I [23981/24016] TVRecEvent tv_rec.cpp:4433 (ClearFlags) - TVRec[19]: ClearFlags(PENDINGACTIONS,) -> RunMainLoop,RingBufferReady, @ tv_rec.cpp:3640
2020-02-02 21:16:47.721844 I [23981/87786] ProcessRequest tv_rec.cpp:4433 (ClearFlags) - TVRec[19]: ClearFlags(CancelNextRecording,KillRec,) -> RunMainLoop,RingBufferReady, @ tv_rec.cpp:721
2020-02-02 21:16:47.723057 I [23981/87786] ProcessRequest tv_rec.cpp:692 (SetRecordingStatus) - TVRec[19]: SetRecordingStatus(Opname->Onbekend) on line 723
2020-02-02 21:16:47.775262 C [23981/23981] CoreContext signalhandling.cpp:299 (handleSignal) - Received Segmentation fault: Code 1, PID 40, UID 0, Value 0x00000000
2020-02-02 21:16:47.841684 I [23981/24025] Scheduler scheduler.cpp:2425 (HandleReschedule) - Scheduled 91 items in 0.1 = 0.00 match + 0.00 check + 0.06 place

This is the backtrace of the latest crash:

(gdb) bt
#0  0x0000000000449be2 in EncoderLink::IsLocal() const (this=0x0) at encoderlink.h:51
#1  0x000000000048c11f in MainServer::DoHandleStopRecording(RecordingInfo&, PlaybackSock*) (this=0x1b26620, recinfo=..., pbs=0x0) at mainserver.cpp:2914
#2  0x000000000048cc55 in MainServer::DoHandleDeleteRecording(RecordingInfo&, PlaybackSock*, bool, bool, bool)
    (this=0x1b26620, recinfo=..., pbs=0x7f4f74042490, forceMetadataDelete=false, lexpirer=false, forgetHistory=false) at mainserver.cpp:3018
#3  0x000000000048c553 in MainServer::HandleDeleteRecording(QString&, QString&, PlaybackSock*, bool, bool)
    (this=0x1b26620, chanid=..., starttime=..., pbs=0x7f4f74042490, forceMetadataDelete=false, forgetHistory=false) at mainserver.cpp:2967
#4  0x0000000000473e43 in MainServer::ProcessRequestWork(MythSocket*) (this=0x1b26620, sock=0x7f50702e4ce0) at mainserver.cpp:641
#5  0x0000000000472a90 in MainServer::ProcessRequest(MythSocket*) (this=0x1b26620, sock=0x7f50702e4ce0) at mainserver.cpp:456
#6  0x00000000004c9d04 in ProcessRequestRunnable::run() (this=0x7f500001f840) at mainserver.cpp:160
#7  0x00007f5087d0c8d7 in MPoolThread::run() (this=0x7f4f74010a30) at mthreadpool.cpp:140
#8  0x00007f5087d09258 in MThreadInternal::run() (this=0x7f4f7044ae10) at mthread.cpp:79
#9  0x00007f5084e8e4e6 in QThreadPrivate::start(void*) (arg=0x7f4f7044ae10) at thread/qthread_unix.cpp:361
#10 0x00007f50863d34e2 in start_thread (arg=<optimized out>) at pthread_create.c:479
#11 0x00007f50849ad693 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The crash is now in mainserver.cpp line 2914 but the previous two crashes were in line 2910.

(gdb) l
2909        TVRec::s_inputsLock.lockForRead();
2910        for (auto iter = m_encoderList->begin(); iter != m_encoderList->end(); ++iter)
2911        {
2912            EncoderLink *elink = *iter;
2913
2914            if (elink->IsLocal() && elink->MatchesRecording(&recinfo))
2915            {
2916                recnum = iter.key();
2917
2918                elink->StopRecording();
(gdb) p iter
$1 = {i = 0x7f502c024d20}
(gdb) p elink
$2 = (EncoderLink *) 0x0
(gdb)

It looks to me that m_encoderList is corrupt but I have not been able to find the cause of this. hence this ticket.

Attachments (8)

20200202-2208_bt_all.log (81.5 KB) - added by Klaas de Waal 5 years ago.: gdb backtrace of all threads.
mythbackend.20200126090944.23981_summary.log (1.6 MB) - added by Klaas de Waal 5 years ago.: Log of mythbackend (summary)
20200325-stop_one_recording.patch (474 bytes) - added by Klaas de Waal 5 years ago.: Terminate loop iteration after deleting one recording
20200404-repeated-access.patch (1.3 KB) - added by Klaas de Waal 5 years ago.: Repeatedly access m_encoderLink causes crash
20200406-all-testcases.patch (9.8 KB) - added by Klaas de Waal 5 years ago.: Testcases with different loop constructs to test re-entrancy of QMap.
20200513-encoderlink-revert.patch (14.4 KB) - added by Klaas de Waal 5 years ago.: Test patch for reverting all foreach loops over m_encoderLink/m_tvLink in mainserver.cpp and scheduler.cpp to iterator loops as done in v30.
20200513-encoderlink-make-const.patch (11.1 KB) - added by David Hampton 5 years ago.: Alternative test patch marking list traversals as constant.
20200516-encoderlink-rc2.patch (10.8 KB) - added by Klaas de Waal 5 years ago.: Replacing Qt foreach loops with C++11 range loops and make all accesses const

Download all attachments as: .zip

Change History (49)

Changed 5 years ago by Klaas de Waal

Attachment:	20200202-2208_bt_all.log added

gdb backtrace of all threads.

Changed 5 years ago by Klaas de Waal

Attachment:	mythbackend.20200126090944.23981_summary.log added

Log of mythbackend (summary)

comment:1 Changed 5 years ago by Klaas de Waal

I have been able to reproduce the crash once on my development system. After starting a new recording there was a new tuner entry created in MainServer::HandleAddChildInput? and this tuner entry is then used for the new recording. Deleting this recording while it was still recording causes the crash. When a new tuner is added a new entry is added to m_encoderList; the crash happens when traversing the m_encoderList. This is most likely not a coincidence but I have not yet found exactly how this can happen

comment:2 Changed 5 years ago by gigem

Are you using a slave backend?

comment:3 Changed 5 years ago by gigem

Klaas, I looked through the code and didn't see any obvious errors. I'll try to see if I can reproduce the problem. Adding inputs should be a rare occurrence. Are you ding anything special that might make the problem more reproduceable?

comment:4 Changed 5 years ago by Klaas de Waal

David, thanks for looking into this.

The living room system has only a master backend. The development system is combined backend/frontend. Most recording extend to 15 minutes or more after the scheduled end time so we often delete a program that is still being recorded, and that is when the crash happens.

To get the crash on my development system I have created a new capture card starting with one tuner and "schedule as group" on so that additional tuner entries are created. However, I tried this a few times and it crashed only once, in the new created tuner entry.

It seems more likely to get a crash when the recording has been running for some time. When a recording is started and immediately deleted nothing happens.

To confuse matters a bit, the crashes on my living room system are not in the latest created tuner (but they are in the traversal of m_encoderList as mentioned).

I am currently thinking along the following lines.

It could be that the m_encoderList is reallocated when a new tuner is added. This has no consequence when it is always traversed from the start but if somewhere a pointer to an entry is stored then that pointer is invalid.
The m_encoderList is extended before the start of the recording (because the new tuner entry is needed) and the m_encoderList is correct until right before the recording is terminated by TVRec[x]. Could TVRec[x] have a pointer somewhere into the m_encoderList and then write something with a now-invalid pointer?
It looks to me that the AutoExpire? is started with a copy of the pointer to the same m_encoderList but in the AutoExpire? there is no locking whatsoever. This does not look to be the problem though as the AutoExpire? does only read and the crashes are not in the AutoExpire?.

comment:5 follow-up: 7 Changed 5 years ago by gigem

I tried to reproduce the problem last night but was not successful. I will keep tyring.

There is some unsafe access somewhere. I didn't see it in my visual audit of mainserver and scheduler but I'll keep looking. Autoexpire should have some locking too (I'll add it) but is unlikely to be the problem.

comment:6 Changed 5 years ago by Klaas de Waal

The last crash of the living room server yesterday evening had no HandleAddChildInput? log messages so no tuner was added. This rules out the reallocation/saved pointer as possible cause.

The living room system makes many recordings and most of them are auto-expired with a limit on the number of episodes. This does never fail. It is the "delete while still being recorded" that triggers the problem.

I have added loads of log messages in my development system but then I could not get it to crash again. This does suggests a race condition.

comment:7 in reply to: 5 Changed 5 years ago by gigem

Replying to gigem:

Autoexpire should have some locking too (I'll add it) but is unlikely to be the problem.

I rechecked AutoExpire? and it already has locking. It's just at a higher level and was missed.

Changed 5 years ago by Klaas de Waal

Attachment:	20200325-stop_one_recording.patch added

Terminate loop iteration after deleting one recording

comment:8 Changed 5 years ago by Klaas de Waal

Cc:	gigem mark spieth added

After having a few more crashes, revisiting the code around the failing bits in mythserver.cpp:2916 raises the question why the loop continues after the recording is stopped. Because the recording is also deleted immediately after it is stopped, could it be that the recinfo it points to is deleted while still looping towards the end of the loop and accessing the already deleted recinfo?

The attached patch terminates the loop after a recording has been found and stopped.

This patch asssumes that there are not two recordings associated with one recinfo. This sounds obvious but I have not checked it.

comment:9 Changed 5 years ago by gigem

The recinfo can't be deleted because it's a local variable created in HandleDeleteRecording?() and passed down. There should only be one recording. As I recall, though, some of the code in that call chain uses fuzzy matching. It's all in dire need of being converted to use recordedid exclusively instead of chanid/starttime.

In short, your patch shouldn't hurt but I don't think it will explain/fix the underlying problem.

comment:10 Changed 5 years ago by Klaas de Waal

Milestone:	needs_triage → 32.0
Owner:	set to Klaas de Waal
Status:	new → assigned

comment:11 Changed 5 years ago by Klaas de Waal

The crash can now be reproduced easily by accessing the m_encoderList repeatedly right after StopRecording? in line mainserver.cpp:2920 as shown in the following code fragment:

            elink->StopRecording();

            // Repeated access of m_encoderList after StopRecording causes crash
            for (int i=0; i<100000; i++)
            {
                int j = 0;
                for (auto iter2 = m_encoderList->begin(); iter2 != m_encoderList->end(); ++iter2)
                {
                    j++;
                    EncoderLink *elink2 = *iter2;
                    if (elink2->IsLocal())
                    {
                        (void) iter2.key();
                    }
                }
            }

The backtrace is usually similar to that in the initial post but incidentally the crash is in the QMap code, as shown here:

Thread 50 "PT3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffe9f7fe700 (LWP 2029089)]
0x00007ffff20caa43 in QMapNodeBase::nextNode (this=0x8c3df0) at tools/qmap.cpp:58
58	        while (n->left)
(gdb) bt
#0  0x00007ffff20caa43 in QMapNodeBase::nextNode() const (this=0x8c3df0) at tools/qmap.cpp:58
#1  0x0000000000446a66 in QMapNodeBase::nextNode() (this=0x8c3df0) at /usr/include/qt5/QtCore/qmap.h:92
#2  0x000000000044b102 in QMapNode<int, EncoderLink*>::nextNode() (this=0x8c3df0) at /usr/include/qt5/QtCore/qmap.h:120
#3  0x00000000004c6859 in QMap<int, EncoderLink*>::iterator::operator++() (this=0x7ffe9f7fd0b8) at /usr/include/qt5/QtCore/qmap.h:430
#4  0x00000000004875fa in MainServer::DoHandleStopRecording(RecordingInfo&, PlaybackSock*) (this=0x714690, recinfo=..., pbs=0x0) at mainserver.cpp:2926
#5  0x00000000004880c6 in MainServer::DoHandleDeleteRecording(RecordingInfo&, PlaybackSock*, bool, bool, bool)
    (this=0x714690, recinfo=..., pbs=0x7ffe98005440, forceMetadataDelete=false, lexpirer=false, forgetHistory=false) at mainserver.cpp:3038
#6  0x00000000004879f3 in MainServer::HandleDeleteRecording(QString&, QString&, PlaybackSock*, bool, bool)
    (this=0x714690, chanid=..., starttime=..., pbs=0x7ffe98005440, forceMetadataDelete=false, forgetHistory=false) at mainserver.cpp:2987
#7  0x000000000046fda7 in MainServer::ProcessRequestWork(MythSocket*) (this=0x714690, sock=0xa7a200) at mainserver.cpp:641
#8  0x000000000046ea30 in MainServer::ProcessRequest(MythSocket*) (this=0x714690, sock=0xa7a200) at mainserver.cpp:456
#9  0x00000000004c38d6 in ProcessRequestRunnable::run() (this=0x7ffea400b830) at mainserver.cpp:160
#10 0x00007ffff5412909 in MPoolThread::run() (this=0x7ffe940036a0) at mthreadpool.cpp:140
#11 0x00007ffff540f284 in MThreadInternal::run() (this=0x70de00) at mthread.cpp:79
#12 0x00007ffff2067d96 in QThreadPrivate::start(void*) (arg=0x70de00) at thread/qthread_unix.cpp:360
#13 0x00007ffff35d04e2 in start_thread () at /lib64/libpthread.so.0
#14 0x00007ffff1b836d3 in clone () at /lib64/libc.so.6
(gdb) l
53	const QMapNodeBase *QMapNodeBase::nextNode() const
54	{
55	    const QMapNodeBase *n = this;
56	    if (n->right) {
57	        n = n->right;
58	        while (n->left)
59	            n = n->left;
60	    } else {
61	        const QMapNodeBase *y = n->parent();
62	        while (y && n == y->right) {
(gdb)

It can take a few thousand loops across the m_encoderLink before the crash happens:

(gdb) fr 4
#4  0x00000000004875fa in MainServer::DoHandleStopRecording (this=0x714690, recinfo=..., pbs=0x0) at mainserver.cpp:2926
2926	                for (auto iter2 = m_encoderList->begin(); iter2 != m_encoderList->end(); ++iter2)
(gdb) l
2921	
2922	            // Repeated access of m_encoderList after stopRecording causes crash
2923	            for (int i=0; i<100000; i++)
2924	            {
2925	                int j = 0;
2926	                for (auto iter2 = m_encoderList->begin(); iter2 != m_encoderList->end(); ++iter2)
2927	                {
2928	                    j++;
2929	                    EncoderLink *elink2 = *iter2;
2930	                    if (elink2->IsLocal())
(gdb) p i
$13 = 6000
(gdb) p j
$14 = 14

The crash can also happen when stopping a recording. Deleting is not essential. The crash can also happen when creating and deleting recordings via the Services API with the script vcr.py from Bill Meek but then still a large number of deletes are needed. With the GUI it takes now only one or two deletes to produce the crash in the backend.

Having a large m_encoderList, so having a lot of tuners and a lot of multirec virtual tuners, and making more recordings at the same time definitely increases the chance of crashing to, on my system, close to certain.

The critical point is accessing the m_encoderList after the StopRecording? is done which is what the patch is doing repeatedly. Therefore the patch 20200325_stop_one_recording.patch does actually work because then the m_encoderList is not accessed anymore after the StopRecording?.

A patch with the "repeated access" code is attached.

Changed 5 years ago by Klaas de Waal

Attachment:	20200404-repeated-access.patch added

Repeatedly access m_encoderLink causes crash

comment:12 Changed 5 years ago by mspieth

Note that you cant do this without having the standard locking around the loop, otherwise its an invalid test.

comment:13 Changed 5 years ago by Klaas de Waal

The code is inside the following lock:

    TVRec::s_inputsLock.lockForRead();
    ...
    TVRec::s_inputsLock.unlock();

The m_encoderList (tvList) contains values for all entries in the capturecard table that are connected to a videosource. The m_encoderList is only extended when a new virtual tuner is added and in that code the lockForWrite is used. That is not happening here, the number of elements in the QMap is constant.

With the lockForRead it is possible to have multiple readers so the QMap can be traversed by different threads simultaneously. My understanding of the QMap is that this is allowed, QMap being documented as re-entrant.

comment:14 Changed 5 years ago by gigem

Klaas, have you tried repeatedly iterating through the list before callsing StopRecording?()? Also have you tried running under valgrind?

comment:15 Changed 5 years ago by Klaas de Waal

Running the iteration test code before the StopRecording? goes OK, it does not crash. I have used valgrind a few times but it has not come up with anything.

When I change the iteration test code to use a Java-style iterator instead of an STL-style iterator, as described in https://doc.qt.io/qt-5/qmap.html, then the iteration test code, as shown below, runs OK and does not cause a crash.

            for (int i=0; i<100000; i++)
            {
                QMapIterator<int, EncoderLink *> z(*m_encoderList);
                while (z.hasNext())
                {
                    z.next();
                    EncoderLink *elink2 = z.value();
                    if (elink2->IsLocal())
                    {
                        (void) z.key();
                    }
                }
            }

I have also ran tests with comparing all key-value pairs of the m_encoderList with the values just after the m_encoderList had been created. With a small delay after the StopRecording? the list never changes as is expected. Without the small delay the content as read back is sometimes wrong.

It is in principle possible that StopRecording? causes writes to the m_encoderLink of values that are identical to what they already are. In that case the end result is identical, but maybe the QMap is then sorted again. However if that would be the case then I expect that the Java-style iterator should also not work.

My tentative conclusion is that the QMap is not re-entrant when using STL-style iteration.

Whether this depends on compilers, compiler versions, Qt versions or even for-loop coding style I do not know. As mentioned in the post, I had this crash not with v30 but only when I changed to v31/master. For reference, software and compiler versions used are here below.

2020-04-05 22:38:25.299816 C  mythbackend version: master [v32-Pre-234-g9b5d6b957f-dirty] www.mythtv.org
2020-04-05 22:38:25.299828 C  Qt version: compile: 5.13.2, runtime: 5.13.2
2020-04-05 22:38:25.299851 I  Fedora 31 (Thirty One) (x86_64)

[klaas@modu mythtv]$ gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.2.1 20190827 (Red Hat 9.2.1-1) (GCC) 
[klaas@modu mythtv]$

To summarize, there are now three ways to fix this:

Terminate the loop after a match is found. The patch for this is already attached.
Add a small delay, e.g. 100ms, after the StopRecording?. Then traversing the m_encoderList continue to the end (although not useful at all)
Recode the loop to use the Java-style iterator.

P.S. Having gone through all the code that uses m_encoderLink and m_tvList I found a few places where the code can be improved, e.g. to avoid accidental map element creation when using a non-existing key. I did actually tentatively fix them all but this did not make any difference.

comment:16 Changed 5 years ago by gigem

I'd personally rather not "fix" the problem in any of those ways yet until we have a better idea of why it's failing. Have you tried explicitly declaring the iterator as QMap<int, EncoderLink? *>::Iterator instead of auto? That's probably grasping at straws but is a difference from v30 and we have had at least one issue with auto so far.

comment:17 Changed 5 years ago by gigem

I'm also curious if using foreach to iterate through the map shows the same problem.

comment:18 Changed 5 years ago by Klaas de Waal

Extensive testing of many different ways of coding the loop gives the following results:

loop with v31 code fails (as described before)
loop with v31 code BUT with constBegin and constEnd is OK
loop with v30 code fails
loop with v30 code BUT with const_interator, constBegin and constEnd is OK
loop with Java-style iterator is OK
loop with foreach over all keys is OK

This shows that the problem is not caused by the v31 "for auto" loop coding style versus the v30 loop coding style. The underlying problem appears to be that the QMap is only re-entrant when using const. This is not a limitation that I can find the Qt documentation and it is likely a bug.

If and how this should be reported back to Qt is another issue; for that it is probably needed to create a small standalone application that reproduces the problem.

The proper solution looks to be to use constBegin and constEnd, so to change the code from this:

    for (auto iter = m_encoderList->begin(); iter != m_encoderList->end(); ++iter)
    {
        EncoderLink *elink = *iter;

to using constBegin and constEnd like this:

    for (auto iter = m_encoderList->constBegin(); iter != m_encoderList->constEnd(); ++iter)
    {
        EncoderLink *elink = *iter;

It is probably a good idea to change other loops of m_encoderList (and m_tvList etc) in the same way.

A patch that contains all the testcases is attached for reference.

Changed 5 years ago by Klaas de Waal

Attachment:	20200406-all-testcases.patch added

Testcases with different loop constructs to test re-entrancy of QMap.

comment:19 Changed 5 years ago by gigem

Excellent findings, Klaas! I read somewhere along the lines of container traversal being thread safe when done in read-only mode. Read-only can be interpreted different ways besides just using const iterators.

I agree with your suggestion to switch nearly all of our uses to const iterators. We don't allow deleting map entries and after initialization, we only add entries in MainServer::HandleAddChildInput?() which uses exclusive, write locking, so we should be good.

comment:20 Changed 5 years ago by David Hampton

For most classes, that should be as simple as changing the calls to begin/end to call cbegin/cend instead.

comment:21 Changed 5 years ago by Klaas de Waal

Milestone:	32.0 → 31.1

In the context of this ticket I intend to check the iterators for tvList, m_encoderList and the other pointers to tvList and to commit a fix as soon as possible.

It is probably a good idea to go over all of the MythTV code and check all the QMap iterators and change them to const where appropriate.

It could be that also the use of other classes such as QList does need to be changed.

comment:22 Changed 5 years ago by Klaas de Waal <kdewaal@…>

Resolution:	→ fixed
Status:	assigned → closed

In 4192aab4d/mythtv:
Error: Processor CommitTicketReference failed
GIT backend not available

comment:23 Changed 5 years ago by Klaas de Waal <kdewaal@…>

In a465f1b03/mythtv:
Error: Processor CommitTicketReference failed
GIT backend not available

comment:24 Changed 5 years ago by Klaas de Waal

Milestone:	31.1 → 32.0
Resolution:	fixed
Status:	closed → new

Reopening this ticket as it looks now that the problem has not been solved. Again a segfault when accessing the EncoderLink? list when a recording is stopped, but now in the scheduler.

2020-04-17 17:09:46.803220 I  Scheduled 143 items in 0.6 = 0.47 match + 0.03 check + 0.10 place
2020-04-17 17:10:00.013295 I  TVRec[1]: Changing from RecordingOnly to None
2020-04-17 17:10:00.478830 N  Finished Recording: Container: MPEG2-TS Video Codec: h264 (1920x1088 A/R: 3 25fps) Audio Codec: ac3
2020-04-17 17:10:00.500988 I  Finished recording Chasing Classic Cars: channel 10017
2020-04-17 17:10:00.501715 I  Updating status for "Chasing Classic Cars" on cardid [1] (Opname => Opgenomen)
Handling Segmentation fault
Segmentation fault (core dumped)
[klaas@zolder mythtv-master]$

(gdb) bt
#0  0x00000000004c2756 in EncoderLink::IsAsleep() const (this=0x55) at encoderlink.h:57
#1  0x00000000004f478d in Scheduler::HandleWakeSlave(RecordingInfo&, int) (this=0x19b1af0, ri=..., prerollseconds=180) at scheduler.cpp:2567
#2  0x00000000004f0dcd in Scheduler::run() (this=0x19b1af0) at scheduler.cpp:2171
#3  0x00007f4dab82a284 in MThreadInternal::run() (this=0x1947000) at mthread.cpp:79
#4  0x00007f4da847cd96 in QThreadPrivate::start(void*) () at /lib64/libQt5Core.so.5
#5  0x00007f4da99e54e2 in start_thread () at /lib64/libpthread.so.0
#6  0x00007f4da7f996a3 in clone () at /lib64/libc.so.6

comment:25 Changed 5 years ago by Klaas de Waal

The latest crash as reported in comment:24 shows that the problem which was thought to be solved by using const everywhere in iterating the m_encoderList/m_tvList was not really solved.

Comparing scheduler.cpp and mainserver.cpp of v30 with v31 shows that for/iterator loops over m_encoderList/m_tvList have been replaced by foreach loops in v31.

For testing, I have reverted all foreach loops in scheduler.cpp and mainserver.cpp to the for/iterator loops from v30.

The result is that the test code in comment:11 does not crash anymore. This does not prove that this also solves the crash reported in comment:24 but given that the problem appeared when moving from v30 to v31 it is not impossible.

comment:26 Changed 5 years ago by gigem

Klaas, I was about to report that I was pretty sure I was getting crashes with v30 too. However, I just checked my backups and it looks like my crashes probably did start around the time I switched my production backend to v31.

comment:27 Changed 5 years ago by mspieth

I have not had a BE crash like this since updating master on 10 April which was post a465f1b03. Much better than before.

comment:28 Changed 5 years ago by David Hampton

I started looking at this yesterday, and m_encoderList is a pointer to tvList as defined in backendcontect.cpp. The Scheduler, AutoExpire, HttpStatus, and MainServer classes all hold pointers to this same structure (under different names). There are also some direct references to tvList from main_helper.cpp, but those all appear to be in startup code.

Running the test code from comment:11, I can cause crashes regularly. What's happening is that while running the list, one the calls to end() is causing the map to detach.(1) That invalidates the iterator, leaving it pointing at the old copy of the map and not at the new copy of the map. Once that happens the iterator will never equal end() and is guaranteed to fall off the end of the map. Using constBegin() and constEnd() will prevent this problem as they never detach the map.

What's unclear is where the code is "copying" the map and causing the reference count to increase. All of the pointers held by the various structures all point to the same tvList variable. Unless a copy of that variable is made, the reference counter should never increase. Just iterating over the map shouldn't bump the reference count and cause the test code to detach. What's needed is is a call to the constructor, or 'operator=', or something like that.

There are other threads referencing this list while the test code is running. I see Scheduler::AddNewRecords being called, and there may be other references that I'm not seeing.

(1) It makes no sense to me that end() should ever detach. That seems guaranteed to leave you with an invalid iterator. Only begin() should detach.

comment:29 Changed 5 years ago by Klaas de Waal

David, I am happy that you are looking into this. Meanwhile I did some more tests on this and I found two critical places in the code.

In scheduler.cpp in AddNewRecords?, around line 4500, the foreach 10 lines from the start. In mainserver.cpp in UpdateSystemdStatus?, after comment line "Count active recordings",the foreach there.

When both these two foreach loops are replaced by v30 iterator loops there is no crash. All other foreach loops do not matter.

Changed 5 years ago by Klaas de Waal

Attachment:	20200513-encoderlink-revert.patch added

Test patch for reverting all foreach loops over m_encoderLink/m_tvLink in mainserver.cpp and scheduler.cpp to iterator loops as done in v30.

comment:30 Changed 5 years ago by gigem

What if we changed m_encoderList to const? Would that make foreach work correctly? There are only two places where m_encoderList needs to be changed and making it const would help catch any other places that attempt to modify it.

Note that I'm still concerned about other containers that are iterated through in multiple threads. The scheduler's m_recList is one such list. Are all of those cases susceptible to the same problem?

Changed 5 years ago by David Hampton

Attachment:	20200513-encoderlink-make-const.patch added

Alternative test patch marking list traversals as constant.

comment:31 Changed 5 years ago by Klaas de Waal

This does change my mental model of the Qt containers.

https://doc.qt.io/qt-5/implicit-sharing.html

https://doc.qt.io/qt-5/containers.html#implicit-sharing-iterator-problem

Looks like Qt is not fit for mission-critical applications....

comment:32 Changed 5 years ago by Klaas de Waal

A quick short test in which

        foreach (auto elink, *m_encoderList)

is replaced with

       for (auto elink : *m_encoderList)

looks like it is also fixing the problem.

More testing needed, but could it be that just "foreach", which is not even C++ but a Qt macro, can be the culprit?

To be continued...

comment:33 Changed 5 years ago by gigem

foreach is Qt. Unless there is a good reason to use foreach, I say replace its use with the standard, C++ form of container iteration.

Changed 5 years ago by Klaas de Waal

Attachment:	20200516-encoderlink-rc2.patch added

Replacing Qt foreach loops with C++11 range loops and make all accesses const

comment:34 Changed 5 years ago by Klaas de Waal

Added patch 20200516-encoderlink-rc2.patch which is intended to be the final patch for this problem. This patch can be applied to today's master. This patch is largely identical to the patch 20200513-encoderlink-make-const.patch.

Although replacing the foreach loop with for loop does already fix the problem it is a good idea to make container access explicitly readonly by using qAsConst.

N.B. The Qt function qAsConst is used instead of the C++ std::as_const because the std::as_const is introduced in C++17 and the fix should also be applied to v31 which is C++11 based.

If final testing does not show any problems and if no further suggestions for improvements are received then the intention is to put this in master, and a few weeks later in v31-fixes.

This patch replaces a number of foreach statements introduced in commit 16bd3ad9849bd393c69571d0d0f967062fd32b78 on 29 Dec 2019 with the goal of modernizing control loops. However, contrary to what is described in the commit message, the foreach does actually cause the problems that the C++ iterators do not, instead of the other way around.

It might be a good idea to revisit all places in the code where foreach has recently been introduced to check if there are more possible problems and revert the modernizations done. However, that is clearly out of scope for this ticket.

N.B. There are 646 occurrences of the keyword foreach by last count.

N.B. More interesting reading, dated 2016: https://www.kdab.com/goodbye-q_foreach/

comment:35 Changed 5 years ago by David Hampton

Hi Klaus. The big difference that I see between the two patches is that your patch uses:

for (auto var : qAsConst(blah))

where mine uses:

for (auto * var : qAsConst(blah))

Using your patch the compiler will still correctly determine that the resulting auto variable should be a pointer, but explicitly stating that fact makes the code is easier to read.

Other than that the two patches basically look identical.

I am planning to revisit/replace all usage of the Q_FOREACH macro (and its alias foreach) in the core.

comment:36 Changed 5 years ago by Klaas de Waal <kdewaal@…>

Resolution:	→ fixed
Status:	new → closed

In 8e28726793/mythtv:
Error: Processor CommitTicketReference failed
GIT backend not available

comment:37 Changed 5 years ago by Stuart Auchterlonie

Klaas,

Does this need a cherry-pick back to fixes/31 ??

Regards Stuart

comment:38 Changed 5 years ago by Klaas de Waal

Hi Stuart,

Yes definitely, the idea was to give it some time to verify that it does actually work as intended and then put in also in fixes/31. My plan is to put it in fixes/31 on Sunday if I have not heard anything by then. Can do it earlier though if that is appreciated, I personally think it is OK now.

Groetjes, Klaas.

comment:39 Changed 5 years ago by gigem

I had another crash with v31 last night.

comment:40 Changed 5 years ago by Klaas de Waal

Milestone:	32.0 → 31.1

Now also in fixes/31.

comment:41 Changed 5 years ago by benjsc

May I commend all who worked on this ticket on some exceptional debugging. Not only have you solved the problem/fixed the bug, but the details in this ticket have provided great education for anyone reading it - well done!

Note: See TracTickets for help on using tickets.

Download in other formats: