Opened 13 years ago
Closed 12 years ago
Last modified 12 years ago
#10188 closed Bug Report - Hang/Deadlock (Invalid)
CENTOS6: Severe socket connection
Reported by: | Owned by: | danielk | |
---|---|---|---|
Priority: | trivial | Milestone: | unknown |
Component: | MythTV - General | Version: | Master Head |
Severity: | low | Keywords: | Deadlock, CENTOS |
Cc: | Ticket locked: | no |
Description
Insatlled CENTOS 6 and compiled Myth .24 fixes. After failure to get it working i tried it agin with trunk.
Backend starts up no problem Front end starts up, seemingly no problem Tried to scan videos, received error that backend does not exist
after reviewing logs i found it flooded with
2011-11-27 14:31:13.626067 C Protocol version check failure.
tried to telnet in and preform auth manualy. it failed. Restartee backend, and tried manual auth. worked Tried to rescan for videos then manual auth. failed again
Crash seems to also happen on intialy FE startup asking for the BE for the time zone.
Re-created machine in VMware, same results
Attachments (7)
Change History (31)
Changed 13 years ago by
Attachment: | Frontend.txt added |
---|
Changed 13 years ago by
Attachment: | Backend.txt added |
---|
comment:1 Changed 13 years ago by
comment:2 Changed 13 years ago by
It looks like the mainserver is waiting for the masterFreeSpaceListUpdater to be set to NULL but it already should have been. Can you try replacing:
FreeSpaceUpdater *masterFreeSpaceListUpdater;
with
FreeSpaceUpdater * volatile masterFreeSpaceListUpdater;
That should force that loop to look for the variable in memory.
comment:3 Changed 13 years ago by
Refs #10188. Make masterFreeSpaceListUpdater variable volatile.
This probably won't fix the problem reported, but it will correct a problem on shutdown if the compiler optimizes agressively.
Branch: master Changeset: f470287ee233e966eb91aa0ac6137e6a458dca6b
comment:4 Changed 13 years ago by
I've recompiled mythtv yesterday morning (double checked that I had the latest git changes including the volatile patch). And restarted the BE. I could connect with the FE on my notebook without issues (but that was also working before). But after 24 hours (and 2 recordings were started yesterday evening), this morning I could not connect to the BE anymore. But it does look like it's a bit different then before.
The FE is not complaining about a the mythtv protocol, but it appears it does not get any reponse to the QUERY_RECORDINGS Unsorted
On the FE I see:
2011-11-30 08:28:16.491286 I MythCoreContext: Connecting to backend server: 10.84.100.60:6543 (try 1 of 1) 2011-11-30 08:28:16.491735 I MythSocket(7f3c24015a40:-1): attempting connect() to (10.84.100.60:6543) 2011-11-30 08:28:16.491746 I MSocketDevice::connect: setting Protocol to IPv4 2011-11-30 08:28:16.491799 I MSocketDevice::connect: attempting to create new socket 2011-11-30 08:28:16.492364 I MythSocket(7f3c24015a40:51): write -> 51 30 MYTH_PROTO_VERSION 69 63835135 2011-11-30 08:28:16.502992 I MythSocket(7f3c24015a40:51): read <- 51 13 ACCEPT[]:[]69 2011-11-30 08:28:16.503007 I Using protocol version 69 2011-11-30 08:28:16.503016 I MythSocket(7f3c24015a40:51): write -> 51 44 ANN Monitor titanium.intern.exa-omicron.nl 0 2011-11-30 08:28:16.503445 I MythSocket(7f3c24015a40:51): read <- 51 2 OK 2011-11-30 08:28:16.503665 I MythSocket(7f3c2400bb50:-1): attempting connect() to (10.84.100.60:6543) 2011-11-30 08:28:16.503670 I MSocketDevice::connect: setting Protocol to IPv4 2011-11-30 08:28:16.503672 I MSocketDevice::connect: attempting to create new socket 2011-11-30 08:28:16.504006 I MythSocket(7f3c2400bb50:62): write -> 62 44 ANN Monitor titanium.intern.exa-omicron.nl 1 2011-11-30 08:28:16.504552 I MythSocket(7f3c2400bb50:62): read <- 62 2 OK 2011-11-30 08:28:16.504599 I MythSocket(7f3c24015a40:51): write -> 51 25 QUERY_RECORDINGS Unsorted 2011-11-30 08:28:46.505376 E MythSocket(7f3c24015a40:51): readStringList: Error, timed out after 30000 ms. 2011-11-30 08:28:46.505451 C Reconnection to backend server failed
On the BE I see:
2011-11-30 08:28:16.501037 I MainServer::ANN Monitor 2011-11-30 08:28:16.501040 I adding: titanium.intern.exa-omicron.nl as a client (events: 0) 2011-11-30 08:28:16.501060 I MythSocket(dfe400:122): write -> 122 2 OK 2011-11-30 08:28:16.502052 I MythSocket(e8cf20:125): read <- 125 44 ANN Monitor titanium.intern.exa-omicron.nl 1 2011-11-30 08:28:16.502066 I MainServer::ANN Monitor 2011-11-30 08:28:16.502074 I adding: titanium.intern.exa-omicron.nl as a client (events: 1) 2011-11-30 08:28:16.502083 I MythEvent: SYSTEM_EVENT CLIENT_CONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl 2011-11-30 08:28:16.502103 I MythSocket(e8cf20:125): write -> 125 2 OK 2011-11-30 08:28:16.502154 I MythSocket(f80e70:113): write -> 113 129 BACKEND_MESSAGE[]:[]SYSTEM_EVENT CLIENT_CONNECTED HOSTNAME titan... 2011-11-30 08:28:16.502617 I MythSocket(dfe400:122): read <- 122 25 QUERY_RECORDINGS Unsorted 2011-11-30 08:28:31.446002 I Cardid 1: max bitrate 162597 KB/min 2011-11-30 08:28:31.446013 I Maximal bitrate of busy encoders is 162597 KB/min 2011-11-30 08:28:46.445915 I Cardid 1: max bitrate 162597 KB/min 2011-11-30 08:28:46.445925 I Maximal bitrate of busy encoders is 162597 KB/min 2011-11-30 08:28:46.515033 I MythSocket(e79f50:126): read <- 126 30 MYTH_PROTO_VERSION 69 63835135 2011-11-30 08:28:46.515050 I MythSocket(e79f50:126): write -> 126 13 ACCEPT[]:[]69 2011-11-30 08:28:46.515604 I MythSocketThread(sock 0xf80e70:113): socket closed 2011-11-30 08:28:46.515626 I MythEvent: SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl 2011-11-30 08:28:46.515642 I MythEvent: LOCAL_SLAVE_BACKEND_ENCODERS_OFFLINE 2011-11-30 08:28:46.515711 I MythSocket(e8cf20:125): write -> 125 132 BACKEND_MESSAGE[]:[]SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME ti... 2011-11-30 08:28:46.515748 I MythSocket(e79f50:126): read <- 126 44 ANN Monitor titanium.intern.exa-omicron.nl 0 2011-11-30 08:28:46.515766 I MainServer::ANN Monitor 2011-11-30 08:28:46.515772 I adding: titanium.intern.exa-omicron.nl as a client (events: 0) 2011-11-30 08:28:46.515785 I MythSocket(e79f50:126): write -> 126 2 OK 2011-11-30 08:28:46.516441 I MythSocket(e79f50:126): read <- 126 21 QUERY_FREE_SPACE_LIST 2011-11-30 08:28:46.516490 I MythSocket(e79f50:126): write -> 126 243 orion.intern.robmir.nl[]:[]orion:/data0/mythtv/livetv,orion:/dat... 2011-11-30 08:28:46.517158 I Cardid 1: max bitrate 162597 KB/min 2011-11-30 08:28:46.517165 I Maximal bitrate of busy encoders is 162597 KB/min 2011-11-30 08:28:54.222175 I MythSocket(e79f50:126): read <- 126 58 QUERY_IS_ACTIVE_BACKEND[]:[]titanium.intern.exa-omicron.nl 2011-11-30 08:28:54.222199 I MythSocket(e79f50:126): write -> 126 5 FALSE 2011-11-30 08:28:57.386540 I MythSocketThread(sock 0xe79f50:126): socket closed 2011-11-30 08:28:57.386582 I MythEvent: SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl 2011-11-30 08:28:57.386602 I MythEvent: LOCAL_SLAVE_BACKEND_ENCODERS_OFFLINE 2011-11-30 08:28:57.386688 I MythSocket(e8cf20:125): write -> 125 132 BACKEND_MESSAGE[]:[]SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME ti... 2011-11-30 08:28:57.572864 I MythSocketThread(sock 0xe8cf20:125): socket closed 2011-11-30 08:28:57.572893 I MythEvent: SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl 2011-11-30 08:28:57.572916 I MythEvent: LOCAL_SLAVE_BACKEND_ENCODERS_OFFLINE
comment:5 Changed 13 years ago by
updated head, compiled
same problem
2011-12-03 10:30:43.866256 I MythCoreContext?: Connecting to backend server: 192.168.40.200:6543 (try 1 of 1) 2011-12-03 10:30:50.867906 E MythSocket?(1fc0ef0:68): readStringList: Error, timed out after 7000 ms. 2011-12-03 10:30:50.867971 C Protocol version check failure.
The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.
2011-12-03 10:30:50.868013 C Unable to determine master backend time zone settings. If those settings differ from local settings, some functionality will fail. 2011-12-03 10:30:58.635042 I MythCoreContext?: Connecting to backend server: 192.168.40.200:6543 (try 1 of 1) 2011-12-03 10:31:05.638299 E MythSocket?(200a230:68): readStringList: Error, timed out after 7000 ms. 2011-12-03 10:31:05.638364 C Protocol version check failure.
The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.
comment:6 Changed 13 years ago by
updated HEAD (Dec 19, 2011), same problem occurs after hours running, multiple good recordings, watching multiple recordings on (seperate) frontend. Backend keeps recording while problem emerges, but won't start new recordings.
top -H -p 24827
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24832 root 20 0 1098m 169m 9m R 99 1.4 89:55.25 mythbackend
28830 root 20 0 1098m 169m 9m S 7 1.4 5:02.70 mythbackend
...
So what I see is that when this problem occurs, a single thread is blocking a core. Sorry, don't know how to identify this thread.
Ubuntu 10.10 - P55-UD3L - i7 870 - 12G RAM - 4x DVB-C
comment:7 Changed 13 years ago by
Same here, although the process on my BE is only taking 4% CPU load (IPTV recording). The deadlock also appears, when no FE is ever connected to the BE, but it does looks like it is somehow related to recordings that keep running. The logfile of the BE tells me the recording is stopped, but the mpg file on the disk keeps growing.
I connected to the BE process with gdb. See backtrace below:
(gdb) bt #0 0x00007f54a8cb0123 in poll () from /lib64/libc.so.6 #1 0x00007f54a4693655 in ?? () from /lib64/libglib-2.0.so.0 #2 0x00007f54a4693a3a in g_main_context_iteration () from /lib64/libglib-2.0.so.0 #3 0x00007f54a98802d3 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4 #4 0x00007f54a9859442 in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4 #5 0x00007f54a985970c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4 #6 0x00007f54a985b9f9 in QCoreApplication::exec() () from /usr/lib64/libQtCore.so.4 #7 0x000000000050fca0 in run_backend (cmdline=<value optimized out>) at main_helpers.cpp:653 #8 0x000000000044f97a in main (argc=3, argv=<value optimized out>) at main.cpp:127
comment:8 Changed 13 years ago by
Just updated Fixes, updated OS to centos 6.2 and re-compiled.
No Joy, same problem. After MythFRONTEND starts up i try to navigate left or right in the menu soon after i get.
2011-12-27 06:16:51.178 Found mainmenu.xml for theme 'Terra' 2011-12-27 06:16:51.436 MythCoreContext?: Connecting to backend server: 127.0.0.1:6543 (try 1 of 1) 2011-12-27 06:16:58.460 MythSocket?(1a70030:33): readStringList: Error, timed out after
7000 ms. 2011-12-27 06:16:58.460 Protocol version check failure. The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.
I will produce step by step directions how to replicate this problem. Seems my problem occurs like CLOCKWORK rightaway ( no need to wait at all )
also i noticed in PS i see
[root@MythFE database]# ps aux | grep myth root 24228 0.3 2.8 1044252 28856 pts/0 Sl 06:16 0:00 mythbackend root 24278 1.1 6.2 1064556 63940 pts/0 Sl 06:16 0:01 mythfrontend root 24306 0.0 0.0 0 0 pts/0 Z 06:16 0:00 [mythfrontend] <defunct> root 24308 0.0 0.0 0 0 pts/0 Z 06:16 0:00 [mythfrontend] <defunct> root 24310 0.0 0.0 0 0 pts/0 Z 06:16 0:00 [mythfrontend] <defunct> root 24312 0.0 0.0 0 0 pts/0 Z 06:16 0:00 [mythfrontend] <defunct> root 24325 0.0 0.0 103292 864 pts/0 S+ 06:19 0:00 grep myth
comment:9 Changed 13 years ago by
How to reproduce the problem:
I used VMWARE Player to install a self-contained myth installation and the deadlock happenes every time.
To re-produce this problem I installed CENTOS 6.2 from the netinstall disk in MINMAL mode, then followed the steps on compiling and installed packages from Compile Mythtv from Source on Centos 6.
After compiling i used the next mythTV Basic Config Test to setup myth and got the dead lock right away after starting mythfrontend.
Any ideas???
comment:10 Changed 13 years ago by
I have uploaded the VMWARE machine after defrag and clone. It can be found at the bottom of this post.
comment:11 Changed 13 years ago by
I've done a little debugging on this. It seems MythServer::incomingConnection() is not being called. However netstat shows sockets in CLOSE_WAIT (with 39 bytes in the receive buffer) so something must be calling listen() correctly. On the face of it that would appear to suggest a problem with QTcpServer. I'll try and find/write a trivial test program for QTcpServer.
comment:12 Changed 13 years ago by
An update on this: I tried Qt's threadedfortuneserver example (& fortuneclient) which uses an override of incomingConnection, which works on CentOS 6.2. I also tried removing the incomingConnection override in mythbackend/server and hooking into newConnection (instead of newConnect). This "works" in that the signal is delivered correctly when the FE attempts to connect, but I can't think of a workaround that avoids the override.
comment:13 Changed 13 years ago by
I wrote a little script/program to see what the backend is watching at any time which might help somebody work out what's going on. Shortly after restarting the backend, I get something like this:
22280: poll fds=[{fd=3, events=1, revents=0} {fd=9, events=1, revents=0} {fd=38, events=1, revents=0} {fd=3, events=1, revents=0}] nfds=0x4 22285: poll fds=[{fd=7, events=1, revents=0}] nfds=0x1 22288: select nfds=0xf readfds=[12 13 14] writefds=0x0 exceptfds=0x0 22444: poll fds=[{fd=26, events=1, revents=0}] nfds=0x1 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME mythbacke 22280 root 3r FIFO 0,8 0t0 117605 pipe mythbacke 22280 root 7r FIFO 0,8 0t0 117615 pipe mythbacke 22280 root 9u IPv4 117616 0t0 TCP *:lds-dump (LISTEN) mythbacke 22280 root 12u IPv4 117619 0t0 UDP *:apc-6549 mythbacke 22280 root 13u IPv4 117620 0t0 UDP 239.255.255.250:ssdp mythbacke 22280 root 14u IPv4 117621 0t0 UDP 255.255.255.255:ssdp mythbacke 22280 root 26r FIFO 0,8 0t0 118008 pipe mythbacke 22280 root 38u IPv4 118014 0t0 TCP *:lds-distrib (LISTEN)
so we can see that the main thread is watching some pipe (twice) and the two listening TCP ports.
A little later, we get this:
22285: poll fds=[{fd=7, events=1, revents=0}] nfds=0x1 22288: select nfds=0xf readfds=[12 13 14] writefds=0x0 exceptfds=0x0 22444: poll fds=[{fd=26, events=1, revents=0}] nfds=0x1 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME mythbacke 22280 root 7r FIFO 0,8 0t0 117615 pipe mythbacke 22280 root 12u IPv4 117619 0t0 UDP *:apc-6549 mythbacke 22280 root 13u IPv4 117620 0t0 UDP 239.255.255.250:ssdp mythbacke 22280 root 14u IPv4 117621 0t0 UDP 255.255.255.255:ssdp mythbacke 22280 root 26r FIFO 0,8 0t0 118008 pipe
and strace tells us the main thread is blocked reading the pipe that it was watching:
# strace -p 22280 Process 22280 attached - interrupt to quit read(3,
It seems this read never returns.
comment:14 Changed 13 years ago by
yurko, the log text in the gdb.txt file shows this backend was unable to bind to the mythtv port. This usually indicates there is already a mythbackend process running. The backtrace doesn't look to be of a backend with any deadlocked threads.
J.Ali Harlow, the problem description you give doesn't match the backtrace already attached to this ticket. Please open a separate ticket with the information you've added here plus a backtrace showing the read() in question.
comment:15 Changed 13 years ago by
I had an incident too
backtrace attached. threads 36 and 35 and 20 look interesting.
Im not sure as I dont have too much logging on but I suspect scheduler deadlock somehow. happened on recording end I think. no diagnosis yet.
reasonably current build 20120108. still have core.
note if you get a deadlock use gcore to dump current state.
comment:16 Changed 13 years ago by
The odd part is i swiched over to neweset Fedora and pritty much the same setup works fine! Its only when compiled on centos does this problem occure.
Changed 13 years ago by
Attachment: | backend_deadlock_backtrace.txt added |
---|
Backtrace of BE in deadlock situation (triggered by manually ending a recording)
comment:18 Changed 13 years ago by
Just updated git to latest version master [v0.25pre-4071-g664a880] I started the BE, everything ok, then connected with a FE, all still ok. Then I started a recording. Everything still ok. Then I manually stopped the recording.
The recording keeps running in the background. And the FE gives back the following error:
2012-01-16 10:12:37.708916 I Connected to database 'mythconverg' at host: 10.84.100.60 2012-01-16 10:13:51.980575 E MythSocket(7f0408013d30:51): readStringList: Error, timed out after 30000 ms. 2012-01-16 10:13:51.980669 N Connection to backend server lost
I ran gcore to create a core file from the BE, and ran a "thread apply all bt full" output is attached ( backend_deadlock_backtrace.txt ) I think it has the about the same output as the last attached backtrace, but that trace ended with a "not enough registers or memory" line, so I don't know if that trace is complete.
This situation also happens when a scheduled recording stops.
comment:19 Changed 13 years ago by
Keywords: | Deadlock CENTOS added; Dead Lock removed |
---|---|
Priority: | major → trivial |
Severity: | medium → low |
Summary: | Deadlocks constantly → CENTOS6: Severe socket connection |
Has anyone here tested with the RHEL edition that CENTOS6 is based on?
I lowered the priority because CENTOS is just not a likely candidate for running MythTV. This is most likely a distro problem, but I have not yet close this as an 'upstream problem' because I don't think we really understand the problem well enough yet.
comment:20 Changed 13 years ago by
Kenni has reported on the dev list that he hasn't seen this with 0.24-fixes under Scientific Linux 6. I have a laptop with the same mix that I have used both independently and as a frontend without these problems. My SL6 is up to date; the fixes build is the most recent from ATrpms, dating from Nov 2011.
This post is to get the info attached to the ticket.
comment:21 Changed 13 years ago by
Owner: | set to danielk |
---|---|
Status: | new → assigned |
comment:23 Changed 12 years ago by
Resolution: | → Invalid |
---|---|
Status: | assigned → closed |
No response to request for information.
comment:24 Changed 12 years ago by
I'm not sure if I had the exact same problem but I got the dreaded:
Protocol version check failure. The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.
On a fresh install of CentOS 6.3 with both fixes/0.26 and fixes/0.25. Mine was slightly different in that I never got a single successful connection.
I was really hoping to go with a more LTS solution like CentOS but I don't have time for it to be down anymore so I'm installing Fedora 17.
I think I have the same problem. see my mail to the mailinglist from http://www.mythtv.org/pipermail/mythtv-users/2011-September/320944.html