id,summary,reporter,owner,description,type,status,priority,milestone,component,version,severity,resolution,keywords,cc,mlocked
1882,Fix broken signal handling in housekeeper.c,mrsam@…,Isaac Richards,"A series of events led me to scrutinize the code in housekeeper.cpp that runs mythfilldatabase, and the interaction between this code and the myth_system() code in libmyth/util.cpp.

The series of events in question was a fairly reproducible instance of my job queue getting wedged. After about a day of running myth 0.19's mythbackend, some mythcommflag or transcode job would inevitably get stuck in the ""Running"" state. This happened every time. A closer examination showed that the job apparently completed successfully, but jobqueue never got updated, so the job stayed wedged in the ""Running"" state, and any new jobs just kept piling up in the job queue, in the ""Queued"" state.

An even closer examination revealed that the thread responsible for invoking myth_system(""/usr/bin/mythcommflag [args]""), etcetera, was in no-man's land. Attaching a debugger to that thread revealed an interesting call trace. Unfortunately, I did not save the actual gdb backtrace, but the tail end of the stack was, in essence:

1. util.cpp:859 # the call to waitpid() in myth_system()
2. A few stack frames in glibc's waitpid() function
3. [signal handler]
4. housekeeper.cpp:26 # The call to wait(0) in reapChild() in housekeeper.cpp
5. The wait() syscall
6. <---- You are here

And here, I was stuck. So, how did I get from myth_system() to housekeeper.cpp? Well, the answer can be found in the linuxthreads FAQ.

http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#J

============================================================================
J.3: I've installed a handler on a signal. Which thread executes the handler when the signal is received?

[ ... ] If the signal is sent to a particular thread using pthread_kill(), then that thread executes the handler.

[ ...
] If the signal is sent via kill() or the tty interface (e.g. by pressing ctrl-C), then the POSIX specs say that the handler is executed by any thread in the process that does not currently block the signal. In other terms, POSIX considers that the signal is sent to the process (the collection of all threads) as a whole, and any thread that is not blocking this signal can then handle it.
============================================================================

So, here's what I believe is happening.

A. The housekeeper thread calls ""signal(SIGCHLD, &reapChild)"", then forks off and runs mythfilldatabase. The intent is for reapChild() to set ""HouseKeeper_filldb_running = false;"" when mythfilldatabase exits. Housekeeper's main loop uses this flag to check whether mythfilldatabase is already running in the background, and that factors into some decisions. housekeeper does not wait() for mythfilldatabase to end; it just sets the signal handler, forks it off, and lets the SIGCHLD handler reset the ""HouseKeeper_filldb_running"" flag when it's done. But the signal handler remains set forever. So:

B. At some point later, myth_system() forks off and runs mythcommflag, or transcode.

C. The child process does its thing, and exits.

D. The signal handler, reapChild(), that was meant to be used when mythfilldatabase exits, now gets invoked to handle the SIGCHLD that gets generated when mythcommflag/transcode exits.

It looks like what's happening is that when a child process gets forked, the new process is considered to be a child of the entire process, not just the thread that forked it, so when the child process finishes, the resulting SIGCHLD is *NOT* sent via the pthread_kill() mechanism. It gets posted to the entire process, and, as per the linuxthreads FAQ, any thread can pick it up and run with it. Now, which thread actually gets it is not really germane. Whichever thread gets chosen, it's going to invoke reapChild(), which is the real problem.
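
To illustrate step D, here is a minimal sketch, not the actual MythTV code. The names demo_filldb_running, reap_child_demo and unrelated_child_clears_flag are made up for this example, and the child simply exits instead of running a real program. It shows that a SIGCHLD handler installed with signal() fires for any child of the process, so it clears the flag even though the child that exited has nothing to do with mythfilldatabase.

```cpp
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for HouseKeeper_filldb_running.
static volatile sig_atomic_t demo_filldb_running = 0;

// Hypothetical stand-in for reapChild(): it has no idea which child exited.
static void reap_child_demo(int) {
    demo_filldb_running = 0;
}

// Returns true if the exit of an unrelated child cleared the flag.
bool unrelated_child_clears_flag() {
    signal(SIGCHLD, reap_child_demo);  // process-wide handler
    demo_filldb_running = 1;           // pretend mythfilldatabase just started

    pid_t pid = fork();                // unrelated child, as myth_system() would create
    if (pid < 0) {
        signal(SIGCHLD, SIG_DFL);
        return false;
    }
    if (pid == 0) {
        _exit(0);                      // the child exits right away
    }

    int status = 0;
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
        // interrupted by the handler, so retry
    }

    // Allow a moment for the pending SIGCHLD to be delivered.
    for (int i = 0; i < 1000 && demo_filldb_running; ++i) {
        usleep(1000);
    }

    signal(SIGCHLD, SIG_DFL);          // restore the default disposition
    return demo_filldb_running == 0;
}
```

Calling unrelated_child_clears_flag() is expected to return true: the flag is reset although the imaginary mythfilldatabase never finished.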
The fix is NOT to clear the signal handler when mythfilldatabase is done. Consider this sequence of events:

1. myth_system() forks off and runs mythcommflag.
2. housekeeper.cpp decides it's time to run mythfilldatabase.
3. housekeeper.cpp sets ""HouseKeeper_filldb_running = true"", sets up a SIGCHLD handler, and forks off mythfilldatabase.
4. mythcommflag terminates, and generates a SIGCHLD.
5. Since there's a signal handler set for SIGCHLD, it gets invoked.
6. reapChild() sets ""HouseKeeper_filldb_running = false"", while mythfilldatabase continues to chug along, doing its business.
7. Boom.

So, I think there's a problem here that needs fixin'. Maybe it's already fixed in SVN -- I'm looking at the 0.19 release -- in which case this message can be ignored.

I believe that 99.9% of the time this does not actually end up causing anything bad -- but this is definitely an issue that needs fixing nevertheless -- and there are other reasons why, in my case, I can reproduce this fairly consistently. Why it's happening to me is another story, but, to make a long story short, the positive side of being able to reproduce this badness is that I can also fix it. Which I did.

I've patched this, and I've been testing the fix for almost a week now, and it completely eliminates the problem. The underlying rule is: you should either use a signal handler for all child processes, or for none. You can't reliably use a signal handler for some child processes, but then rely on wait/waitpid for others. You just can't do it.

My fix is, essentially: ditch the signal handler, create a new detached thread that fork()s mythfilldatabase, waitpid()s for it, clears the HouseKeeper_filldb_running flag, and goes away. Nice and tidy.

As I mentioned, for various reasons I can reliably reproduce the problems caused by improper signal handling. After applying the following patch, no problems have been observed. This patch is against the mythtv-0.19 release.
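
For illustration only (this is not the attached patch), the approach can be sketched roughly as follows. The names filldb_running and start_filldb_in_background are assumptions made for this example, std::thread stands in for whatever thread API the real code uses, and the child just exits where the real code would exec mythfilldatabase.

```cpp
#include <atomic>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <thread>
#include <unistd.h>

// Hypothetical stand-in for HouseKeeper_filldb_running.
static std::atomic<bool> filldb_running{false};

// Starts the background job unless one is already running.
// Returns true if a new job was started.
bool start_filldb_in_background() {
    bool expected = false;
    if (!filldb_running.compare_exchange_strong(expected, true)) {
        return false;                  // a previous run is still in progress
    }

    std::thread([] {
        pid_t pid = fork();
        if (pid == 0) {
            _exit(0);                  // real code would exec the program here
        }
        if (pid > 0) {
            int status = 0;
            // Wait for this specific child only; no SIGCHLD handler involved.
            while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {
                // interrupted, so retry
            }
        }
        // Reached after the child was reaped, or if fork() failed.
        filldb_running = false;
    }).detach();

    return true;
}
```

Because the thread waits for its own pid, children started elsewhere (for example by myth_system()) are never reaped here, and their exit never touches the flag.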
",patch,closed,minor,unknown,mythtv,0.19,medium,fixed,,,0