Opened 9 years ago

Closed 9 years ago

#9016 closed defect (Fixed)

use of VERBOSE in myth_system child code can wedge the child

Reported by: Tomi Orava <tomi.orava@…> Owned by: beirdo
Priority: critical Milestone: 0.24
Component: MythTV - General Version: Master Head
Severity: medium Keywords:
Cc: Ticket locked: no

Description

The current trunk has a some sort of locking problem that is somehow related to mythsystem.

The following backtrace can always be seen (or something similar):

(gdb) info thr

  • 1 Thread 0x7f805b7de710 (LWP 6878) lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97

(gdb) bt #0 lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97 #1 0x00007f807a53844b in _L_lock_2076 () at tzset.c:138 #2 0x00007f807a538331 in tzset () at tzset.c:590 #3 0x00007f807b055156 in QDateTime::currentDateTime() () from /usr/lib/libQtCore.so.4 #4 0x00007f8081ac07e5 in myth_system_fork (command=..., result=@0x7f805b7dd66c) at mythsystem.cpp:373 #5 0x00007f8081abfe2a in myth_system (command=..., flags=3, timeout=0) at mythsystem.cpp:251 #6 0x00007f8083af78fb in PreviewGenerator::Run (this=0x7f8048762a80) at previewgenerator.cpp:244 #7 0x00007f8083af9702 in PreviewGenerator::run (this=0x7f8048762a80) at previewgenerator.cpp:324 #8 0x00007f807b047775 in ?? () from /usr/lib/libQtCore.so.4 #9 0x00007f807cd8a9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #10 0x00007f807a5836fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #11 0x0000000000000000 in ?? ()

Unfortunately, I did not have a chance to retrieve a full backtrace of all the threads as of now. However, from past experiences (2-3 past weeks) it looks like if I break/hack the following change:

Index: libs/libmythdb/mythsystem.cpp
===================================================================
--- libs/libmythdb/mythsystem.cpp       (revision 26296)
+++ libs/libmythdb/mythsystem.cpp       (working copy)
@@ -353,6 +353,14 @@
 {
     QString LOC_ERR = QString("myth_system('%1'): Error: ").arg(command);
     VERBOSE(VB_IMPORTANT, QString("Launching: %1") .arg(command));
+
+
+    /* In case we forked WHILE it was locked */
+    bool unlocked = verbose_mutex.tryLock();
+    verbose_mutex.unlock();
+    if( !unlocked )
+        VERBOSE(VB_IMPORTANT, "Cleared parent's verbose lock");
+
     pid_t child = fork();
 
     if (child < 0)
@@ -366,11 +374,6 @@
     else if (child == 0)
     {
         /* Child */
-        /* In case we forked WHILE it was locked */
-        bool unlocked = verbose_mutex.tryLock();
-        verbose_mutex.unlock();
-        if( !unlocked )
-            VERBOSE(VB_IMPORTANT, "Cleared parent's verbose lock");
 
         /* Close all open file descriptors except stdout/stderr */
         for (int i = sysconf(_SC_OPEN_MAX) - 1; i > 2; i--)

The hanging problem disappears. Of course the above patch is not a proper fix, but somehow it is able to fix the problem for me.

The following log entries might also be helpful:

010-09-26 21:00:41.327 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Taking too long to be allowed to read..
2010-09-26 21:00:42.337 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Taking too long to be allowed to read..
2010-09-26 21:00:42.735 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Taking too long to be allowed to read..
2010-09-26 21:00:42.744 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Peek() requested 2048 bytes, but only returning 752
2010-09-26 21:00:42.745 Player(0), Error: OpenFile(): Could not read first 2048 bytes of '/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg'
2010-09-26 21:00:43.755 ~MythContext waiting for threads to exit.
2010-09-26 21:00:44.269 PID 7049: exited: status=0, result=0
2010-09-26 21:00:44.320 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000  > /dev/null
2010-09-26 21:00:44.463 PID 7063: launched
2010-09-26 21:00:44.732 MainServer::ANN Monitor
2010-09-26 21:00:44.777 adding: alderan.ncircle.nullnet.fi as a client (events: 0)
2010-09-26 21:00:44.778 MainServer::ANN Monitor
2010-09-26 21:00:44.779 adding: alderan.ncircle.nullnet.fi as a client (events: 1)
2010-09-26 21:00:44.819 PID 7063: exited: status=65280, result=255
2010-09-26 21:00:44.838 Preview Error: Encountered problems running '/usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000  > /dev/null'
2010-09-26 21:00:44.845 commflag: Commercial Detection Finished: "Pirates of the Caribbean: Maailman laidalla (K13)" recorded from channel 1004 at 2010-09-26T21:00:00 (0 commercial break(s))
2010-09-26 21:01:03.156 Reschedule requested for id -1.
2010-09-26 21:00:44.269 PID 7049: exited: status=0, result=0
2010-09-26 21:00:44.320 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000  > /dev/null
2010-09-26 21:00:44.463 PID 7063: launched
2010-09-26 21:00:44.732 MainServer::ANN Monitor
2010-09-26 21:00:44.777 adding: alderan.ncircle.nullnet.fi as a client (events: 0)
2010-09-26 21:00:44.778 MainServer::ANN Monitor
2010-09-26 21:00:44.779 adding: alderan.ncircle.nullnet.fi as a client (events: 1)
2010-09-26 21:00:44.819 PID 7063: exited: status=65280, result=255
2010-09-26 21:00:44.838 Preview Error: Encountered problems running '/usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000  > /dev/null'
2010-09-26 21:00:44.845 commflag: Commercial Detection Finished: "Pirates of the Caribbean: Maailman laidalla (K13)" recorded from channel 1004 at 2010-09-26T21:00:00 (0 commercial break(s))
2010-09-26 21:01:03.156 Reschedule requested for id -1.
2010-09-26 21:01:06.482 Scheduled 401 items in 3.3 = 0.15 match + 3.12 place
2010-09-26 21:01:06.504 scheduler: Scheduled items: Scheduled 401 items in 3.3 = 0.15 match + 3.12 place
2010-09-26 21:03:12.166 AutoExpire: CalcParams(): Max required Free Space: 22.0 GB w/freq: 14 min
2010-09-26 21:03:12.225 Expiring 1122 MBytes for 1001 at Fri Sep 24 20:30:00 2010 => Tv-uutiset ja s<E4><E4>.  Too many episodes, we only want to keep 2.
2010-09-26 21:03:12.230 autoexpire: Expired program: Expiring 1122 MBytes for 1001 at Fri Sep 24 20:30:00 2010 => Tv-uutiset ja sää.  Too many episodes, we only want to keep 2.
2010-09-26 21:03:12.253 Reschedule requested for id 0.
2010-09-26 21:03:12.774 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000  > /dev/null
2010-09-26 21:03:12.788 PID 7584: launched
2010-09-26 21:03:13.067 MainServer::ANN Monitor
2010-09-26 21:03:13.077 adding: alderan.ncircle.nullnet.fi as a client (events: 0)
2010-09-26 21:03:13.078 MainServer::ANN Monitor
2010-09-26 21:03:13.079 adding: alderan.ncircle.nullnet.fi as a client (events: 1)
2010-09-26 21:03:13.117 PID 7584: exited: status=65280, result=255
2010-09-26 21:03:13.120 Preview Error: Encountered problems running '/usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000  > /dev/null'
2010-09-26 21:03:15.562 Scheduled 401 items in 3.3 = 0.00 match + 3.25 place
2010-09-26 21:03:15.580 scheduler: Scheduled items: Scheduled 401 items in 3.3 = 0.00 match + 3.25 place
2010-09-26 21:03:23.297 Reschedule requested for id 0.
2010-09-26 21:03:26.399 Scheduled 401 items in 3.0 = 0.05 match + 2.97 place
2010-09-26 21:03:26.416 scheduler: Scheduled items: Scheduled 401 items in 3.0 = 0.05 match + 2.97 place
2010-09-26 21:04:11.543 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:08:35.007 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:12:58.524 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:17:12.527 AutoExpire: CalcParams(): Max required Free Space: 22.0 GB w/freq: 14 min
2010-09-26 21:17:13.264 UPnpMedia: BuildMediaMap VIDEO scan starting in :/net/alderan/movies/:
2010-09-26 21:17:13.573 UPnpMedia: BuildMediaMap Done. Found 331 objects
2010-09-26 21:17:21.890 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:21:45.410 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:26:08.814 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:30:32.385 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:31:12.904 AutoExpire: CalcParams(): Max required Free Space: 22.0 GB w/freq: 14 min
2010-09-26 21:34:55.756 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy
2010-09-26 21:36:22.281 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 4350 --starttime 20100921233500  > /dev/null
2010-09-26 21:36:22.303 PID 13439: launched
2010-09-26 21:36:23.729 MainServer::ANN Monitor

MythTV Version : 26474M MythTV Branch : trunk Network Protocol : 62 Library API : 0.23.20100917-1 QT Version : 4.6.2 Options compiled in:

linux debug using_alsa using_jack using_oss using_pulse using_pulseoutput using_backend using_bindings_perl using_bindings_python using_directfb using_dvb using_frontend using_hdhomerun using_hdpvr using_iptv using_joystick_menu using_libfftw3 using_lirc using_mheg using_opengl_video using_opengl_vsync using_qtdbus using_qtwebkit using_v4l using_x11 using_xrandr using_xv using_xvmc using_xvmc_vld using_xvmcw using_bindings_perl using_bindings_python using_mythtranscode using_opengl using_vdpau using_ffmpeg_threads using_live using_mheg

Attachments (2)

0001-Redo-the-verbose_lock-in-myth_system.patch (1.5 KB) - added by beirdo 9 years ago.
0001-Remove-VERBOSE-from-myth_system-child.patch (3.2 KB) - added by beirdo 9 years ago.

Download all attachments as: .zip

Change History (13)

comment:1 Changed 9 years ago by beirdo

Resolution: Duplicate
Status: newclosed

Duplicate of what has already been reported in #8952. Once again, your proposed fix is not the right fix.

comment:2 Changed 9 years ago by David Madsen <david.madsen@…>

The past few days I have noticed a few lingering mythbackend processes that have been forked off from the main process, but are locked on the same mutex within libc. The problem doesn't appear to be with the verbose_mutex itsef, but has to do with the tzset_lock within the tzset function of glibc:

void
__tzset (void)
{
  __libc_lock_lock (tzset_lock);

  tzset_internal (1, 1);

  if (!__use_tzfile)
    {
      /* Set `tzname'.  */
      __tzname[0] = (char *) tz_rules[0].name;
      __tzname[1] = (char *) tz_rules[1].name;
    }

  __libc_lock_unlock (tzset_lock);
}

This tzset function is called by QTDateTime::currentDateTime which is used in the VERBOSE macro. What looks to be happening is a thread is writing out a verbose message when a context switch occurs while that thread has locked the tzset_lock. Another thread in mythbackend then proceeds to fork the mythbackend process to eventually exec it's command, but prior to that, we execute this in myth_system_fork:

        /* Child */
        /* In case we forked WHILE it was locked */
        bool unlocked = verbose_mutex.tryLock();
        verbose_mutex.unlock();
        if( !unlocked )
            VERBOSE(VB_IMPORTANT, "Cleared parent's verbose lock");

        /* Close all open file descriptors except stdout/stderr */
        for (int i = sysconf(_SC_OPEN_MAX) - 1; i > 2; i--)
            close(i);

This forcibly unlocks the verbose mutex which the parent process had locked (within the tzset task call). We then proceed to execute a new VERBOSE statement which will again call tzset in the child process. As the process was copied with the tzset_lock locked form the parent, the child will be deadlocked at this point.

Since the child process has not yet been able to close it's open file descriptors, the parent process can later end up with the reported errors since it appears the DVB devices are still open and in use by the child process.

This is the limited backtrace I was able to generate.. not everything was left with debug symbols enabled so it isn't great but at least shows the trail through currentDateTime. Since this process was also just recently forked and hasn't yet execed it's command there isn't much code the hung process could be executing.

#0  0xb8062424 in __kernel_vsyscall ()
No symbol table info available.
#1  0xb4bd3fa3 in __lll_lock_wait_private () from /lib/libc.so.6
No symbol table info available.
#2  0xb4b8337c in _L_lock_1528 () from /lib/libc.so.6
No symbol table info available.
#3  0xb4b832af in tzset () from /lib/libc.so.6
No symbol table info available.
#4  0xb4ee3954 in QDateTime::currentDateTime() () from /usr/lib/qt4/libQtCore.so.4
No symbol table info available.
#5  0x96bcfe90 in ?? ()
No symbol table info available.
#6  0x00000000 in ?? ()
No symbol table info available.

strace output:

Process 5257 attached - interrupt to quit
futex(0xb4c3da4c, FUTEX_WAIT_PRIVATE, 2, NULL

Relevant mythbackend log info:

2010-10-06 19:01:01.046 MPEGRec(/dev/v4l/video2): StartRecording finishing up
2010-10-06 19:01:01.056 MPEGRec(/dev/v4l/video2): StopEncoding
2010-10-06 19:01:02.732 MPEGRec(/dev/v4l/video2): Encoding stopped
2010-10-06 19:01:02.791 TVRec(14): ClearFlags(RecorderRunning,) -> RunMainLoop,RingBufferReady,
2010-10-06 19:01:02.792 RecBase(14:/dev/v4l/video2): SetRecording(0x0)
2010-10-06 19:01:02.860 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 4723 --starttime 20101006175900  > /dev/null
2010-10-06 19:01:02.891 TVRec(14): FinishedRecording(MythBusters) in recgroup: Default
2010-10-06 19:01:02.896 PID 5257: launched
2010-10-06 19:01:02.899 Finished recording MythBusters "Hair of the Dog": channel 4723
2010-10-06 19:01:02.903 TVRec(14): ClearFlags(RecorderRunning,) -> RunMainLoop,RingBufferReady,
2010-10-06 19:01:02.926 TVRec(14): Tearing down RingBuffer
2010-10-06 19:01:02.932 TVRec(14): ClearFlags(PENDINGACTIONS,) -> RunMainLoop,RingBufferReady,

lsof showing the open file descriptors in the child process, showing that it has not yet executed the loop to close all of it's open handles:

mythbacke  5257    mythtv   18u      CHR      171,0          0t0       1966 /dev/raw1394
mythbacke  5257    mythtv   19u      CHR      212,3          0t0       2054 /dev/dvb/adapter0/frontend0
mythbacke  5257    mythtv   68r      CHR      212,5          0t0       2049 /dev/dvb/adapter0/dvr0
mythbacke  5257    mythtv   69u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   70u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   71u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   72u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   73u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   78u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   79u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0
mythbacke  5257    mythtv   80u      CHR      212,4          0t0       2050 /dev/dvb/adapter0/demux0

Please let me know if I can provide any more info.

It appears it may not be safe to use the VERBOSE macro in the child process unless there is a way to forcibly unlock the tzset_lock within glibc in the child as well.

comment:3 Changed 9 years ago by David Madsen <david.madsen@…>

Resolution: Duplicate
Status: closednew

comment:4 Changed 9 years ago by David Madsen <david.madsen@…>

After some additional thought, it seems like acquiring the verbose_mutex prior to calling fork, and then releasing that mutex in both the parent and child after the fork may also fix the problem.

comment:5 Changed 9 years ago by beirdo

Owner: set to beirdo
Status: newaccepted

We will look into doing lock-free logging in the child later, but I will definitely look at the tz_lock for now. Thanks.

comment:6 Changed 9 years ago by beirdo

Please try the attached patch.

Changed 9 years ago by beirdo

comment:7 Changed 9 years ago by David Madsen <david.madsen@…>

I've got the patch in place. The deadlocked child processes are somewhat rare, but I have seen maybe one or two per week. I'll let this run with the patch in place and report back after it's had some time to run.

comment:8 Changed 9 years ago by beirdo

On second thought, I do not expect that to work any better. I am simply going to remove the VERBOSE use in the child, replacing it for now with cerr << message.

New patch attached.

Changed 9 years ago by beirdo

comment:9 Changed 9 years ago by beirdo

Milestone: unknown0.24
Priority: minorcritical
Summary: mythbackend hangs due to all DVB-devices becoming busy & unavailableuse of VERBOSE in myth_system child code can wedge the child

comment:10 Changed 9 years ago by anonymous

Yeah you're right. Since the Qt date/time functions can be independently called outside of the verbose_mutex it doesn't eliminate the potential problem of the tzset_lock mutex being copied in a locked state from the parent.

I'll update with the latest patch and let it run for the next week or so. I expect it should eliminate the problem I saw entirely.

comment:11 Changed 9 years ago by beirdo

Resolution: Fixed
Status: acceptedclosed

(In [26701]) Remove all use of VERBOSE in the myth_system child process between fork() and execl(). At times, when we fork, the verbose_mutex is locked, or the tzset_lock in the glibc tzset() is locked. This causes the child to deadlock if VERBOSE is used. Instead, for the remaining 3 error prints, I have put in just cerr << message constructs so at least something is shown as the child breaks.

Fixes #9016

Note: See TracTickets for help on using tickets.