Opened 2 months ago

Last modified 8 days ago

#13571 new Bug Report - Crash

Crash of backend on delete of program being recorded

Reported by: Klaas de Waal Owned by:
Priority: minor Milestone: needs_triage
Component: MythTV - General Version: Master Head
Severity: medium Keywords:
Cc: gigem, mark, spieth Ticket locked: no

Description

Over the last two weeks I have experienced three segfaults of the mythbackend when deleting a program that is still being recorded. The program is deleted from a frontend on a separate machine. This happens on my "production" living room system; it happens only occasionally, but only since I started running master there. The master build is a few weeks old. All three segfaults are in the same piece of code.

Backend identification:

2020-01-26 10:09:44.408326 C [23981/23981] thread_unknown mythcommandlineparser.cpp:2602 (ConfigureLogging) - mythbackend version: master [v31-Pre-1811-g0d3bb87db7] www.mythtv.org
2020-01-26 10:09:44.408371 C [23981/23981] thread_unknown mythcommandlineparser.cpp:2606 (ConfigureLogging) - Qt version: compile: 5.12.5, runtime: 5.12.5
2020-01-26 10:09:44.408411 I [23981/23981] thread_unknown mythcommandlineparser.cpp:2608 (ConfigureLogging) - Fedora 31 (Workstation Edition) (x86_64)
2020-01-26 10:09:44.408415 N [23981/23981] thread_unknown mythcommandlineparser.cpp:2610 (ConfigureLogging) - Enabled verbose msgs:  general channel record
2020-01-26 10:09:44.408524 N [23981/23981] thread_unknown logging.cpp:724 (logStart) - Setting Log Level to LOG_INFO

Last few lines of the backend log:

2020-02-02 21:16:47.721784 I [23981/24016] TVRecEvent tv_rec.cpp:4433 (ClearFlags) - TVRec[19]: ClearFlags(PENDINGACTIONS,) -> RunMainLoop,RingBufferReady, @ tv_rec.cpp:3640
2020-02-02 21:16:47.721844 I [23981/87786] ProcessRequest tv_rec.cpp:4433 (ClearFlags) - TVRec[19]: ClearFlags(CancelNextRecording,KillRec,) -> RunMainLoop,RingBufferReady, @ tv_rec.cpp:721
2020-02-02 21:16:47.723057 I [23981/87786] ProcessRequest tv_rec.cpp:692 (SetRecordingStatus) - TVRec[19]: SetRecordingStatus(Opname->Onbekend) on line 723
2020-02-02 21:16:47.775262 C [23981/23981] CoreContext signalhandling.cpp:299 (handleSignal) - Received Segmentation fault: Code 1, PID 40, UID 0, Value 0x00000000
2020-02-02 21:16:47.841684 I [23981/24025] Scheduler scheduler.cpp:2425 (HandleReschedule) - Scheduled 91 items in 0.1 = 0.00 match + 0.00 check + 0.06 place

This is the backtrace of the latest crash:

(gdb) bt
#0  0x0000000000449be2 in EncoderLink::IsLocal() const (this=0x0) at encoderlink.h:51
#1  0x000000000048c11f in MainServer::DoHandleStopRecording(RecordingInfo&, PlaybackSock*) (this=0x1b26620, recinfo=..., pbs=0x0) at mainserver.cpp:2914
#2  0x000000000048cc55 in MainServer::DoHandleDeleteRecording(RecordingInfo&, PlaybackSock*, bool, bool, bool)
    (this=0x1b26620, recinfo=..., pbs=0x7f4f74042490, forceMetadataDelete=false, lexpirer=false, forgetHistory=false) at mainserver.cpp:3018
#3  0x000000000048c553 in MainServer::HandleDeleteRecording(QString&, QString&, PlaybackSock*, bool, bool)
    (this=0x1b26620, chanid=..., starttime=..., pbs=0x7f4f74042490, forceMetadataDelete=false, forgetHistory=false) at mainserver.cpp:2967
#4  0x0000000000473e43 in MainServer::ProcessRequestWork(MythSocket*) (this=0x1b26620, sock=0x7f50702e4ce0) at mainserver.cpp:641
#5  0x0000000000472a90 in MainServer::ProcessRequest(MythSocket*) (this=0x1b26620, sock=0x7f50702e4ce0) at mainserver.cpp:456
#6  0x00000000004c9d04 in ProcessRequestRunnable::run() (this=0x7f500001f840) at mainserver.cpp:160
#7  0x00007f5087d0c8d7 in MPoolThread::run() (this=0x7f4f74010a30) at mthreadpool.cpp:140
#8  0x00007f5087d09258 in MThreadInternal::run() (this=0x7f4f7044ae10) at mthread.cpp:79
#9  0x00007f5084e8e4e6 in QThreadPrivate::start(void*) (arg=0x7f4f7044ae10) at thread/qthread_unix.cpp:361
#10 0x00007f50863d34e2 in start_thread (arg=<optimized out>) at pthread_create.c:479
#11 0x00007f50849ad693 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The crash is now at mainserver.cpp line 2914, but the previous two crashes were at line 2910.

(gdb) l
2909        TVRec::s_inputsLock.lockForRead();
2910        for (auto iter = m_encoderList->begin(); iter != m_encoderList->end(); ++iter)
2911        {
2912            EncoderLink *elink = *iter;
2913
2914            if (elink->IsLocal() && elink->MatchesRecording(&recinfo))
2915            {
2916                recnum = iter.key();
2917
2918                elink->StopRecording();
(gdb) p iter
$1 = {i = 0x7f502c024d20}
(gdb) p elink
$2 = (EncoderLink *) 0x0
(gdb)

It looks to me like m_encoderList is corrupt, but I have not been able to find the cause of this; hence this ticket.
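
For reference, here is how I read the failing loop from the gdb listing above, together with the kind of defensive check that would avoid the immediate segfault. This is my sketch, not the actual source, and a null entry in the list would still point to corruption elsewhere:

// Sketch of the loop in MainServer::DoHandleStopRecording() as shown in the
// gdb listing above; not the literal MythTV source.
TVRec::s_inputsLock.lockForRead();
for (auto iter = m_encoderList->begin(); iter != m_encoderList->end(); ++iter)
{
    EncoderLink *elink = *iter;

    // Assumption: skipping null entries prevents this particular segfault,
    // but a null pointer in the list still means something corrupted it.
    if (elink == nullptr)
        continue;

    if (elink->IsLocal() && elink->MatchesRecording(&recinfo))
    {
        recnum = iter.key();
        elink->StopRecording();
        // ...
    }
}
TVRec::s_inputsLock.unlock();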

Attachments (3)

20200202-2208_bt_all.log (81.5 KB) - added by Klaas de Waal 2 months ago.
gdb backtrace of all threads.
mythbackend.20200126090944.23981_summary.log (1.6 MB) - added by Klaas de Waal 2 months ago.
Log of mythbackend (summary)
20200325-stop_one_recording.patch (474 bytes) - added by Klaas de Waal 8 days ago.
Terminate loop iteration after deleting one recording


Change History (12)

Changed 2 months ago by Klaas de Waal

Attachment: 20200202-2208_bt_all.log added

gdb backtrace of all threads.

Changed 2 months ago by Klaas de Waal

Attachment: mythbackend.20200126090944.23981_summary.log added

Log of mythbackend (summary)

comment:1 Changed 2 months ago by Klaas de Waal

I have been able to reproduce the crash once on my development system. After starting a new recording, a new tuner entry was created in MainServer::HandleAddChildInput and that tuner entry was then used for the new recording. Deleting this recording while it was still recording caused the crash. When a new tuner is added, a new entry is added to m_encoderList; the crash happens when traversing m_encoderList. This is most likely not a coincidence, but I have not yet found exactly how it can happen.
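
To illustrate what I suspect (a standalone example, not the MythTV code): one thread inserting into a QMap while another thread iterates it, without a common lock, is undefined behaviour and can yield exactly this kind of bogus pointer:

// Hypothetical, self-contained illustration of the suspected hazard.
#include <QMap>

struct EncoderLink { bool IsLocal() const { return true; } };

QMap<int, EncoderLink*> encoderList;

void addChildInput(int cardid)
{
    // Writer thread: mutates the map.
    encoderList.insert(cardid, new EncoderLink);
}

void stopRecording()
{
    // Reader thread: if the map is mutated concurrently, the iterator can
    // land on a half-initialised or stale node and dereference garbage.
    for (auto it = encoderList.begin(); it != encoderList.end(); ++it)
        (*it)->IsLocal();
}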

comment:2 Changed 2 months ago by gigem

Are you using a slave backend?

comment:3 Changed 2 months ago by gigem

Klaas, I looked through the code and didn't see any obvious errors. I'll try to see if I can reproduce the problem. Adding inputs should be a rare occurrence. Are you doing anything special that might make the problem more reproducible?

comment:4 Changed 8 weeks ago by Klaas de Waal

David, thanks for looking into this.

The living room system has only a master backend. The development system is a combined backend/frontend. Most recordings extend to 15 minutes or more beyond the scheduled end time, so we often delete a program that is still being recorded, and that is when the crash happens.

To get the crash on my development system I created a new capture card starting with one tuner and "schedule as group" enabled, so that additional tuner entries are created. However, I tried this a few times and it crashed only once, in the newly created tuner entry.

A crash seems more likely when the recording has been running for some time; when a recording is started and immediately deleted, nothing happens.

To confuse matters a bit, the crashes on my living room system are not in the most recently created tuner (but they are in the traversal of m_encoderList, as mentioned).

I am currently thinking along the following lines.

  • It could be that m_encoderList is reallocated when a new tuner is added. This has no consequence as long as the list is always traversed from the start, but if a pointer to an entry is stored somewhere, that pointer becomes invalid.
  • m_encoderList is extended before the start of the recording (because the new tuner entry is needed) and m_encoderList is correct until right before the recording is terminated by TVRec[x]. Could TVRec[x] hold a pointer into m_encoderList and then write through a now-invalid pointer?
  • It looks to me like the AutoExpire is started with a copy of the pointer to the same m_encoderList, but in AutoExpire there is no locking whatsoever. This does not look to be the problem, though, as AutoExpire only reads and the crashes are not in AutoExpire. (The locking pattern I have in mind is sketched below.)
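
If it turns out that a writer and a reader really do touch the list without a common lock, the usual fix would look something like the sketch below. I have not verified whether MainServer::HandleAddChildInput takes the write side of s_inputsLock; the container type is assumed from the backtrace and the helper names are hypothetical:

// Sketch of the locking pattern only; hypothetical helpers, assumed types.
#include <QMap>
#include <QReadWriteLock>

class EncoderLink;

QReadWriteLock s_inputsLock;
QMap<int, EncoderLink*> *m_encoderList = nullptr;   // assumed type, per the backtrace

void addEncoder(int cardid, EncoderLink *link)      // hypothetical writer
{
    QWriteLocker locker(&s_inputsLock);             // writers take the write lock
    m_encoderList->insert(cardid, link);
}

void traverseEncoders()                             // hypothetical reader
{
    QReadLocker locker(&s_inputsLock);              // readers take the read lock
    for (auto it = m_encoderList->begin(); it != m_encoderList->end(); ++it)
    {
        // safe to dereference *it while the read lock is held
    }
}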

comment:5 Changed 8 weeks ago by gigem

I tried to reproduce the problem last night but was not successful. I will keep trying.

There is some unsafe access somewhere. I didn't see it in my visual audit of mainserver and scheduler but I'll keep looking. Autoexpire should have some locking too (I'll add it) but is unlikely to be the problem.

comment:6 Changed 8 weeks ago by Klaas de Waal

The last crash of the living room server yesterday evening had no HandleAddChildInput log messages, so no tuner was added. This rules out reallocation with a saved pointer as a possible cause.

The living room system makes many recordings and most of them are auto-expired with a limit on the number of episodes. This never fails. It is the "delete while still being recorded" that triggers the problem.

I have added loads of log messages on my development system, but then I could not get it to crash again. This suggests a race condition.

comment:7 in reply to:  5 Changed 8 weeks ago by gigem

Replying to gigem:

Autoexpire should have some locking too (I'll add it) but is unlikely to be the problem.

I rechecked AutoExpire and it already has locking. It's just at a higher level and was missed.

Changed 8 days ago by Klaas de Waal

Terminate loop iteration after deleting one recording

comment:8 Changed 8 days ago by Klaas de Waal

Cc: gigem mark spieth added

After having a few more crashes, revisiting the code around the failing bit at mainserver.cpp:2916 raises the question of why the loop continues after the recording is stopped. Because the recording is also deleted immediately after it is stopped, could it be that the recinfo it points to is deleted while the loop is still running, so that the remaining iterations access the already deleted recinfo?

The attached patch terminates the loop after a recording has been found and stopped.

This patch assumes that there are never two recordings associated with one recinfo. This sounds obvious, but I have not checked it.
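
Roughly, the patch makes the loop stop after the first match, along these lines (a sketch based on the gdb listing above, not the literal patch contents):

// Sketch of the patch's effect on the loop shown in the gdb listing; not the
// literal contents of 20200325-stop_one_recording.patch.
for (auto iter = m_encoderList->begin(); iter != m_encoderList->end(); ++iter)
{
    EncoderLink *elink = *iter;

    if (elink->IsLocal() && elink->MatchesRecording(&recinfo))
    {
        recnum = iter.key();
        elink->StopRecording();
        // ...
        break;   // stop iterating once the matching recording has been stopped
    }
}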

comment:9 Changed 8 days ago by gigem

The recinfo can't be deleted because it's a local variable created in HandleDeleteRecording() and passed down. There should only be one recording. As I recall, though, some of the code in that call chain uses fuzzy matching. It's all in dire need of being converted to use recordedid exclusively instead of chanid/starttime.

In short, your patch shouldn't hurt but I don't think it will explain/fix the underlying problem.
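
For what it's worth, the lifetime argument follows directly from the backtrace: the RecordingInfo is a stack object created in frame #3 and passed down by reference, so it stays alive for the whole call chain. Schematically (hypothetical names and signatures, not the actual code):

// Standalone schematic of the lifetime argument; hypothetical, not MythTV code.
struct RecordingInfo { /* ... */ };

static void DoHandleStopRecording(RecordingInfo &recinfo)
{
    (void)recinfo;   // uses the caller's object by reference; still alive here
}

static void DoHandleDeleteRecording(RecordingInfo &recinfo)
{
    DoHandleStopRecording(recinfo);
}

static void HandleDeleteRecording()
{
    RecordingInfo recinfo;              // stack-allocated local
    DoHandleDeleteRecording(recinfo);   // passed down by reference
}   // recinfo is destroyed only here, after the whole call chain has returned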
