Opened 14 years ago
Closed 14 years ago
#9016 closed defect (Fixed)
use of VERBOSE in myth_system child code can wedge the child
Reported by: | Owned by: | beirdo | |
---|---|---|---|
Priority: | critical | Milestone: | 0.24 |
Component: | MythTV - General | Version: | Master Head |
Severity: | medium | Keywords: | |
Cc: | Ticket locked: | no |
Description
The current trunk has a some sort of locking problem that is somehow related to mythsystem.
The following backtrace can always be seen (or something similar):
(gdb) info thr
- 1 Thread 0x7f805b7de710 (LWP 6878) lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
(gdb) bt #0 lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97 #1 0x00007f807a53844b in _L_lock_2076 () at tzset.c:138 #2 0x00007f807a538331 in tzset () at tzset.c:590 #3 0x00007f807b055156 in QDateTime::currentDateTime() () from /usr/lib/libQtCore.so.4 #4 0x00007f8081ac07e5 in myth_system_fork (command=..., result=@0x7f805b7dd66c) at mythsystem.cpp:373 #5 0x00007f8081abfe2a in myth_system (command=..., flags=3, timeout=0) at mythsystem.cpp:251 #6 0x00007f8083af78fb in PreviewGenerator::Run (this=0x7f8048762a80) at previewgenerator.cpp:244 #7 0x00007f8083af9702 in PreviewGenerator::run (this=0x7f8048762a80) at previewgenerator.cpp:324 #8 0x00007f807b047775 in ?? () from /usr/lib/libQtCore.so.4 #9 0x00007f807cd8a9ca in start_thread (arg=<value optimized out>) at pthread_create.c:300 #10 0x00007f807a5836fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #11 0x0000000000000000 in ?? ()
Unfortunately, I did not have a chance to retrieve a full backtrace of all the threads as of now. However, from past experiences (2-3 past weeks) it looks like if I break/hack the following change:
Index: libs/libmythdb/mythsystem.cpp =================================================================== --- libs/libmythdb/mythsystem.cpp (revision 26296) +++ libs/libmythdb/mythsystem.cpp (working copy) @@ -353,6 +353,14 @@ { QString LOC_ERR = QString("myth_system('%1'): Error: ").arg(command); VERBOSE(VB_IMPORTANT, QString("Launching: %1") .arg(command)); + + + /* In case we forked WHILE it was locked */ + bool unlocked = verbose_mutex.tryLock(); + verbose_mutex.unlock(); + if( !unlocked ) + VERBOSE(VB_IMPORTANT, "Cleared parent's verbose lock"); + pid_t child = fork(); if (child < 0) @@ -366,11 +374,6 @@ else if (child == 0) { /* Child */ - /* In case we forked WHILE it was locked */ - bool unlocked = verbose_mutex.tryLock(); - verbose_mutex.unlock(); - if( !unlocked ) - VERBOSE(VB_IMPORTANT, "Cleared parent's verbose lock"); /* Close all open file descriptors except stdout/stderr */ for (int i = sysconf(_SC_OPEN_MAX) - 1; i > 2; i--)
The hanging problem disappears. Of course the above patch is not a proper fix, but somehow it is able to fix the problem for me.
The following log entries might also be helpful:
010-09-26 21:00:41.327 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Taking too long to be allowed to read.. 2010-09-26 21:00:42.337 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Taking too long to be allowed to read.. 2010-09-26 21:00:42.735 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Taking too long to be allowed to read.. 2010-09-26 21:00:42.744 RingBuf(/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg) Warning: Peek() requested 2048 bytes, but only returning 752 2010-09-26 21:00:42.745 Player(0), Error: OpenFile(): Could not read first 2048 bytes of '/net/alderan/mythtv/disk-1/recordings/1004_20100926210000.mpg' 2010-09-26 21:00:43.755 ~MythContext waiting for threads to exit. 2010-09-26 21:00:44.269 PID 7049: exited: status=0, result=0 2010-09-26 21:00:44.320 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000 > /dev/null 2010-09-26 21:00:44.463 PID 7063: launched 2010-09-26 21:00:44.732 MainServer::ANN Monitor 2010-09-26 21:00:44.777 adding: alderan.ncircle.nullnet.fi as a client (events: 0) 2010-09-26 21:00:44.778 MainServer::ANN Monitor 2010-09-26 21:00:44.779 adding: alderan.ncircle.nullnet.fi as a client (events: 1) 2010-09-26 21:00:44.819 PID 7063: exited: status=65280, result=255 2010-09-26 21:00:44.838 Preview Error: Encountered problems running '/usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000 > /dev/null' 2010-09-26 21:00:44.845 commflag: Commercial Detection Finished: "Pirates of the Caribbean: Maailman laidalla (K13)" recorded from channel 1004 at 2010-09-26T21:00:00 (0 commercial break(s)) 2010-09-26 21:01:03.156 Reschedule requested for id -1. 2010-09-26 21:00:44.269 PID 7049: exited: status=0, result=0 2010-09-26 21:00:44.320 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000 > /dev/null 2010-09-26 21:00:44.463 PID 7063: launched 2010-09-26 21:00:44.732 MainServer::ANN Monitor 2010-09-26 21:00:44.777 adding: alderan.ncircle.nullnet.fi as a client (events: 0) 2010-09-26 21:00:44.778 MainServer::ANN Monitor 2010-09-26 21:00:44.779 adding: alderan.ncircle.nullnet.fi as a client (events: 1) 2010-09-26 21:00:44.819 PID 7063: exited: status=65280, result=255 2010-09-26 21:00:44.838 Preview Error: Encountered problems running '/usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000 > /dev/null' 2010-09-26 21:00:44.845 commflag: Commercial Detection Finished: "Pirates of the Caribbean: Maailman laidalla (K13)" recorded from channel 1004 at 2010-09-26T21:00:00 (0 commercial break(s)) 2010-09-26 21:01:03.156 Reschedule requested for id -1. 2010-09-26 21:01:06.482 Scheduled 401 items in 3.3 = 0.15 match + 3.12 place 2010-09-26 21:01:06.504 scheduler: Scheduled items: Scheduled 401 items in 3.3 = 0.15 match + 3.12 place 2010-09-26 21:03:12.166 AutoExpire: CalcParams(): Max required Free Space: 22.0 GB w/freq: 14 min 2010-09-26 21:03:12.225 Expiring 1122 MBytes for 1001 at Fri Sep 24 20:30:00 2010 => Tv-uutiset ja s<E4><E4>. Too many episodes, we only want to keep 2. 2010-09-26 21:03:12.230 autoexpire: Expired program: Expiring 1122 MBytes for 1001 at Fri Sep 24 20:30:00 2010 => Tv-uutiset ja sää. Too many episodes, we only want to keep 2. 2010-09-26 21:03:12.253 Reschedule requested for id 0. 2010-09-26 21:03:12.774 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000 > /dev/null 2010-09-26 21:03:12.788 PID 7584: launched 2010-09-26 21:03:13.067 MainServer::ANN Monitor 2010-09-26 21:03:13.077 adding: alderan.ncircle.nullnet.fi as a client (events: 0) 2010-09-26 21:03:13.078 MainServer::ANN Monitor 2010-09-26 21:03:13.079 adding: alderan.ncircle.nullnet.fi as a client (events: 1) 2010-09-26 21:03:13.117 PID 7584: exited: status=65280, result=255 2010-09-26 21:03:13.120 Preview Error: Encountered problems running '/usr/bin/mythpreviewgen --size 0x0 --chanid 1004 --starttime 20100926210000 > /dev/null' 2010-09-26 21:03:15.562 Scheduled 401 items in 3.3 = 0.00 match + 3.25 place 2010-09-26 21:03:15.580 scheduler: Scheduled items: Scheduled 401 items in 3.3 = 0.00 match + 3.25 place 2010-09-26 21:03:23.297 Reschedule requested for id 0. 2010-09-26 21:03:26.399 Scheduled 401 items in 3.0 = 0.05 match + 2.97 place 2010-09-26 21:03:26.416 scheduler: Scheduled items: Scheduled 401 items in 3.0 = 0.05 match + 2.97 place 2010-09-26 21:04:11.543 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:08:35.007 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:12:58.524 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:17:12.527 AutoExpire: CalcParams(): Max required Free Space: 22.0 GB w/freq: 14 min 2010-09-26 21:17:13.264 UPnpMedia: BuildMediaMap VIDEO scan starting in :/net/alderan/movies/: 2010-09-26 21:17:13.573 UPnpMedia: BuildMediaMap Done. Found 331 objects 2010-09-26 21:17:21.890 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:21:45.410 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:26:08.814 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:30:32.385 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:31:12.904 AutoExpire: CalcParams(): Max required Free Space: 22.0 GB w/freq: 14 min 2010-09-26 21:34:55.756 DVBSH(/dev/dvb/adapter2/frontend0): Failed to open DVR device /dev/dvb/adapter2/dvr0 : Device or resource busy 2010-09-26 21:36:22.281 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 4350 --starttime 20100921233500 > /dev/null 2010-09-26 21:36:22.303 PID 13439: launched 2010-09-26 21:36:23.729 MainServer::ANN Monitor
MythTV Version : 26474M MythTV Branch : trunk Network Protocol : 62 Library API : 0.23.20100917-1 QT Version : 4.6.2 Options compiled in:
linux debug using_alsa using_jack using_oss using_pulse using_pulseoutput using_backend using_bindings_perl using_bindings_python using_directfb using_dvb using_frontend using_hdhomerun using_hdpvr using_iptv using_joystick_menu using_libfftw3 using_lirc using_mheg using_opengl_video using_opengl_vsync using_qtdbus using_qtwebkit using_v4l using_x11 using_xrandr using_xv using_xvmc using_xvmc_vld using_xvmcw using_bindings_perl using_bindings_python using_mythtranscode using_opengl using_vdpau using_ffmpeg_threads using_live using_mheg
Attachments (2)
Change History (13)
comment:1 Changed 14 years ago by
Resolution: | → Duplicate |
---|---|
Status: | new → closed |
comment:2 Changed 14 years ago by
The past few days I have noticed a few lingering mythbackend processes that have been forked off from the main process, but are locked on the same mutex within libc. The problem doesn't appear to be with the verbose_mutex itsef, but has to do with the tzset_lock within the tzset function of glibc:
void __tzset (void) { __libc_lock_lock (tzset_lock); tzset_internal (1, 1); if (!__use_tzfile) { /* Set `tzname'. */ __tzname[0] = (char *) tz_rules[0].name; __tzname[1] = (char *) tz_rules[1].name; } __libc_lock_unlock (tzset_lock); }
This tzset function is called by QTDateTime::currentDateTime which is used in the VERBOSE macro. What looks to be happening is a thread is writing out a verbose message when a context switch occurs while that thread has locked the tzset_lock. Another thread in mythbackend then proceeds to fork the mythbackend process to eventually exec it's command, but prior to that, we execute this in myth_system_fork:
/* Child */ /* In case we forked WHILE it was locked */ bool unlocked = verbose_mutex.tryLock(); verbose_mutex.unlock(); if( !unlocked ) VERBOSE(VB_IMPORTANT, "Cleared parent's verbose lock"); /* Close all open file descriptors except stdout/stderr */ for (int i = sysconf(_SC_OPEN_MAX) - 1; i > 2; i--) close(i);
This forcibly unlocks the verbose mutex which the parent process had locked (within the tzset task call). We then proceed to execute a new VERBOSE statement which will again call tzset in the child process. As the process was copied with the tzset_lock locked form the parent, the child will be deadlocked at this point.
Since the child process has not yet been able to close it's open file descriptors, the parent process can later end up with the reported errors since it appears the DVB devices are still open and in use by the child process.
This is the limited backtrace I was able to generate.. not everything was left with debug symbols enabled so it isn't great but at least shows the trail through currentDateTime. Since this process was also just recently forked and hasn't yet execed it's command there isn't much code the hung process could be executing.
#0 0xb8062424 in __kernel_vsyscall () No symbol table info available. #1 0xb4bd3fa3 in __lll_lock_wait_private () from /lib/libc.so.6 No symbol table info available. #2 0xb4b8337c in _L_lock_1528 () from /lib/libc.so.6 No symbol table info available. #3 0xb4b832af in tzset () from /lib/libc.so.6 No symbol table info available. #4 0xb4ee3954 in QDateTime::currentDateTime() () from /usr/lib/qt4/libQtCore.so.4 No symbol table info available. #5 0x96bcfe90 in ?? () No symbol table info available. #6 0x00000000 in ?? () No symbol table info available.
strace output:
Process 5257 attached - interrupt to quit futex(0xb4c3da4c, FUTEX_WAIT_PRIVATE, 2, NULL
Relevant mythbackend log info:
2010-10-06 19:01:01.046 MPEGRec(/dev/v4l/video2): StartRecording finishing up 2010-10-06 19:01:01.056 MPEGRec(/dev/v4l/video2): StopEncoding 2010-10-06 19:01:02.732 MPEGRec(/dev/v4l/video2): Encoding stopped 2010-10-06 19:01:02.791 TVRec(14): ClearFlags(RecorderRunning,) -> RunMainLoop,RingBufferReady, 2010-10-06 19:01:02.792 RecBase(14:/dev/v4l/video2): SetRecording(0x0) 2010-10-06 19:01:02.860 Launching: /usr/bin/mythpreviewgen --size 0x0 --chanid 4723 --starttime 20101006175900 > /dev/null 2010-10-06 19:01:02.891 TVRec(14): FinishedRecording(MythBusters) in recgroup: Default 2010-10-06 19:01:02.896 PID 5257: launched 2010-10-06 19:01:02.899 Finished recording MythBusters "Hair of the Dog": channel 4723 2010-10-06 19:01:02.903 TVRec(14): ClearFlags(RecorderRunning,) -> RunMainLoop,RingBufferReady, 2010-10-06 19:01:02.926 TVRec(14): Tearing down RingBuffer 2010-10-06 19:01:02.932 TVRec(14): ClearFlags(PENDINGACTIONS,) -> RunMainLoop,RingBufferReady,
lsof showing the open file descriptors in the child process, showing that it has not yet executed the loop to close all of it's open handles:
mythbacke 5257 mythtv 18u CHR 171,0 0t0 1966 /dev/raw1394 mythbacke 5257 mythtv 19u CHR 212,3 0t0 2054 /dev/dvb/adapter0/frontend0 mythbacke 5257 mythtv 68r CHR 212,5 0t0 2049 /dev/dvb/adapter0/dvr0 mythbacke 5257 mythtv 69u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 70u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 71u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 72u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 73u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 78u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 79u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0 mythbacke 5257 mythtv 80u CHR 212,4 0t0 2050 /dev/dvb/adapter0/demux0
Please let me know if I can provide any more info.
It appears it may not be safe to use the VERBOSE macro in the child process unless there is a way to forcibly unlock the tzset_lock within glibc in the child as well.
comment:3 Changed 14 years ago by
Resolution: | Duplicate |
---|---|
Status: | closed → new |
comment:4 Changed 14 years ago by
After some additional thought, it seems like acquiring the verbose_mutex prior to calling fork, and then releasing that mutex in both the parent and child after the fork may also fix the problem.
comment:5 Changed 14 years ago by
Owner: | set to beirdo |
---|---|
Status: | new → accepted |
We will look into doing lock-free logging in the child later, but I will definitely look at the tz_lock for now. Thanks.
Changed 14 years ago by
Attachment: | 0001-Redo-the-verbose_lock-in-myth_system.patch added |
---|
comment:7 Changed 14 years ago by
I've got the patch in place. The deadlocked child processes are somewhat rare, but I have seen maybe one or two per week. I'll let this run with the patch in place and report back after it's had some time to run.
comment:8 Changed 14 years ago by
On second thought, I do not expect that to work any better. I am simply going to remove the VERBOSE use in the child, replacing it for now with cerr << message.
New patch attached.
Changed 14 years ago by
Attachment: | 0001-Remove-VERBOSE-from-myth_system-child.patch added |
---|
comment:9 Changed 14 years ago by
Milestone: | unknown → 0.24 |
---|---|
Priority: | minor → critical |
Summary: | mythbackend hangs due to all DVB-devices becoming busy & unavailable → use of VERBOSE in myth_system child code can wedge the child |
comment:10 Changed 14 years ago by
Yeah you're right. Since the Qt date/time functions can be independently called outside of the verbose_mutex it doesn't eliminate the potential problem of the tzset_lock mutex being copied in a locked state from the parent.
I'll update with the latest patch and let it run for the next week or so. I expect it should eliminate the problem I saw entirely.
comment:11 Changed 14 years ago by
Resolution: | → Fixed |
---|---|
Status: | accepted → closed |
(In [26701]) Remove all use of VERBOSE in the myth_system child process between fork() and execl(). At times, when we fork, the verbose_mutex is locked, or the tzset_lock in the glibc tzset() is locked. This causes the child to deadlock if VERBOSE is used. Instead, for the remaining 3 error prints, I have put in just cerr << message constructs so at least something is shown as the child breaks.
Fixes #9016
Duplicate of what has already been reported in #8952. Once again, your proposed fix is not the right fix.