Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#10188 closed Bug Report - Hang/Deadlock (Invalid)

CENTOS6: Severe socket connection

Reported by: yurko@… Owned by: danielk
Priority: trivial Milestone: unknown
Component: MythTV - General Version: Master Head
Severity: low Keywords: Deadlock, CENTOS
Cc: Ticket locked: no

Description

Insatlled CENTOS 6 and compiled Myth .24 fixes. After failure to get it working i tried it agin with trunk.

Backend starts up no problem Front end starts up, seemingly no problem Tried to scan videos, received error that backend does not exist

after reviewing logs i found it flooded with

2011-11-27 14:31:13.626067 C Protocol version check failure.

tried to telnet in and preform auth manualy. it failed. Restartee backend, and tried manual auth. worked Tried to rescan for videos then manual auth. failed again

Crash seems to also happen on intialy FE startup asking for the BE for the time zone.

Re-created machine in VMware, same results

Attachments (7)

Frontend.txt (16.2 KB) - added by anonymous 8 years ago.
Backend.txt (3.8 KB) - added by anonymous 8 years ago.
GDB.txt (17.4 KB) - added by anonymous 8 years ago.
myth-poll (2.0 KB) - added by J. Ali Harlow <ali@…> 8 years ago.
Script to see what mythbackend is watching
read_mem.c (2.5 KB) - added by J. Ali Harlow <ali@…> 8 years ago.
Helper program for myth-poll
deadlock.backtrace.bz2 (9.1 KB) - added by Mark Spieth 8 years ago.
deadlock backtrace
backend_deadlock_backtrace.txt (65.2 KB) - added by Robert Verspuy <robert@…> 8 years ago.
Backtrace of BE in deadlock situation (triggered by manually ending a recording)

Download all attachments as: .zip

Change History (31)

Changed 8 years ago by anonymous

Attachment: Frontend.txt added

Changed 8 years ago by anonymous

Attachment: Backend.txt added

Changed 8 years ago by anonymous

Attachment: GDB.txt added

comment:1 Changed 8 years ago by Robert Verspuy <robert@…>

I think I have the same problem. see my mail to the mailinglist from http://www.mythtv.org/pipermail/mythtv-users/2011-September/320944.html

comment:2 Changed 8 years ago by danielk

It looks like the mainserver is waiting for the masterFreeSpaceListUpdater to be set to NULL but it already should have been. Can you try replacing:

    FreeSpaceUpdater *masterFreeSpaceListUpdater;

with

    FreeSpaceUpdater * volatile masterFreeSpaceListUpdater;

That should force that loop to look for the variable in memory.

comment:3 Changed 8 years ago by Github

Refs #10188. Make masterFreeSpaceListUpdater variable volatile.

This probably won't fix the problem reported, but it will correct a problem on shutdown if the compiler optimizes agressively.

Branch: master Changeset: f470287ee233e966eb91aa0ac6137e6a458dca6b

comment:4 Changed 8 years ago by Robert Verspuy <robert@…>

I've recompiled mythtv yesterday morning (double checked that I had the latest git changes including the volatile patch). And restarted the BE. I could connect with the FE on my notebook without issues (but that was also working before). But after 24 hours (and 2 recordings were started yesterday evening), this morning I could not connect to the BE anymore. But it does look like it's a bit different then before.

The FE is not complaining about a the mythtv protocol, but it appears it does not get any reponse to the QUERY_RECORDINGS Unsorted

On the FE I see:

2011-11-30 08:28:16.491286 I  MythCoreContext: Connecting to backend server: 10.84.100.60:6543 (try 1 of 1)
2011-11-30 08:28:16.491735 I  MythSocket(7f3c24015a40:-1): attempting connect() to (10.84.100.60:6543)
2011-11-30 08:28:16.491746 I  MSocketDevice::connect: setting Protocol to IPv4
2011-11-30 08:28:16.491799 I  MSocketDevice::connect: attempting to create new socket
2011-11-30 08:28:16.492364 I  MythSocket(7f3c24015a40:51): write -> 51 30      MYTH_PROTO_VERSION 69 63835135
2011-11-30 08:28:16.502992 I  MythSocket(7f3c24015a40:51): read  <- 51 13      ACCEPT[]:[]69
2011-11-30 08:28:16.503007 I  Using protocol version 69
2011-11-30 08:28:16.503016 I  MythSocket(7f3c24015a40:51): write -> 51 44      ANN Monitor titanium.intern.exa-omicron.nl 0
2011-11-30 08:28:16.503445 I  MythSocket(7f3c24015a40:51): read  <- 51 2       OK
2011-11-30 08:28:16.503665 I  MythSocket(7f3c2400bb50:-1): attempting connect() to (10.84.100.60:6543)
2011-11-30 08:28:16.503670 I  MSocketDevice::connect: setting Protocol to IPv4
2011-11-30 08:28:16.503672 I  MSocketDevice::connect: attempting to create new socket
2011-11-30 08:28:16.504006 I  MythSocket(7f3c2400bb50:62): write -> 62 44      ANN Monitor titanium.intern.exa-omicron.nl 1
2011-11-30 08:28:16.504552 I  MythSocket(7f3c2400bb50:62): read  <- 62 2       OK
2011-11-30 08:28:16.504599 I  MythSocket(7f3c24015a40:51): write -> 51 25      QUERY_RECORDINGS Unsorted
2011-11-30 08:28:46.505376 E  MythSocket(7f3c24015a40:51): readStringList: Error, timed out after 30000 ms.
2011-11-30 08:28:46.505451 C  Reconnection to backend server failed

On the BE I see:

2011-11-30 08:28:16.501037 I  MainServer::ANN Monitor
2011-11-30 08:28:16.501040 I  adding: titanium.intern.exa-omicron.nl as a client (events: 0)
2011-11-30 08:28:16.501060 I  MythSocket(dfe400:122): write -> 122 2       OK
2011-11-30 08:28:16.502052 I  MythSocket(e8cf20:125): read  <- 125 44      ANN Monitor titanium.intern.exa-omicron.nl 1
2011-11-30 08:28:16.502066 I  MainServer::ANN Monitor
2011-11-30 08:28:16.502074 I  adding: titanium.intern.exa-omicron.nl as a client (events: 1)
2011-11-30 08:28:16.502083 I  MythEvent: SYSTEM_EVENT CLIENT_CONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl
2011-11-30 08:28:16.502103 I  MythSocket(e8cf20:125): write -> 125 2       OK
2011-11-30 08:28:16.502154 I  MythSocket(f80e70:113): write -> 113 129     BACKEND_MESSAGE[]:[]SYSTEM_EVENT CLIENT_CONNECTED HOSTNAME titan...
2011-11-30 08:28:16.502617 I  MythSocket(dfe400:122): read  <- 122 25      QUERY_RECORDINGS Unsorted
2011-11-30 08:28:31.446002 I  Cardid 1: max bitrate 162597 KB/min
2011-11-30 08:28:31.446013 I  Maximal bitrate of busy encoders is 162597 KB/min
2011-11-30 08:28:46.445915 I  Cardid 1: max bitrate 162597 KB/min
2011-11-30 08:28:46.445925 I  Maximal bitrate of busy encoders is 162597 KB/min
2011-11-30 08:28:46.515033 I  MythSocket(e79f50:126): read  <- 126 30      MYTH_PROTO_VERSION 69 63835135
2011-11-30 08:28:46.515050 I  MythSocket(e79f50:126): write -> 126 13      ACCEPT[]:[]69
2011-11-30 08:28:46.515604 I  MythSocketThread(sock 0xf80e70:113): socket closed
2011-11-30 08:28:46.515626 I  MythEvent: SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl
2011-11-30 08:28:46.515642 I  MythEvent: LOCAL_SLAVE_BACKEND_ENCODERS_OFFLINE
2011-11-30 08:28:46.515711 I  MythSocket(e8cf20:125): write -> 125 132     BACKEND_MESSAGE[]:[]SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME ti...
2011-11-30 08:28:46.515748 I  MythSocket(e79f50:126): read  <- 126 44      ANN Monitor titanium.intern.exa-omicron.nl 0
2011-11-30 08:28:46.515766 I  MainServer::ANN Monitor
2011-11-30 08:28:46.515772 I  adding: titanium.intern.exa-omicron.nl as a client (events: 0)
2011-11-30 08:28:46.515785 I  MythSocket(e79f50:126): write -> 126 2       OK
2011-11-30 08:28:46.516441 I  MythSocket(e79f50:126): read  <- 126 21      QUERY_FREE_SPACE_LIST
2011-11-30 08:28:46.516490 I  MythSocket(e79f50:126): write -> 126 243     orion.intern.robmir.nl[]:[]orion:/data0/mythtv/livetv,orion:/dat...
2011-11-30 08:28:46.517158 I  Cardid 1: max bitrate 162597 KB/min
2011-11-30 08:28:46.517165 I  Maximal bitrate of busy encoders is 162597 KB/min
2011-11-30 08:28:54.222175 I  MythSocket(e79f50:126): read  <- 126 58      QUERY_IS_ACTIVE_BACKEND[]:[]titanium.intern.exa-omicron.nl
2011-11-30 08:28:54.222199 I  MythSocket(e79f50:126): write -> 126 5       FALSE
2011-11-30 08:28:57.386540 I  MythSocketThread(sock 0xe79f50:126): socket closed
2011-11-30 08:28:57.386582 I  MythEvent: SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl
2011-11-30 08:28:57.386602 I  MythEvent: LOCAL_SLAVE_BACKEND_ENCODERS_OFFLINE
2011-11-30 08:28:57.386688 I  MythSocket(e8cf20:125): write -> 125 132     BACKEND_MESSAGE[]:[]SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME ti...
2011-11-30 08:28:57.572864 I  MythSocketThread(sock 0xe8cf20:125): socket closed
2011-11-30 08:28:57.572893 I  MythEvent: SYSTEM_EVENT CLIENT_DISCONNECTED HOSTNAME titanium.intern.exa-omicron.nl SENDER orion.intern.robmir.nl
2011-11-30 08:28:57.572916 I  MythEvent: LOCAL_SLAVE_BACKEND_ENCODERS_OFFLINE

comment:5 Changed 8 years ago by yurko@…

updated head, compiled

same problem

2011-12-03 10:30:43.866256 I MythCoreContext?: Connecting to backend server: 192.168.40.200:6543 (try 1 of 1) 2011-12-03 10:30:50.867906 E MythSocket?(1fc0ef0:68): readStringList: Error, timed out after 7000 ms. 2011-12-03 10:30:50.867971 C Protocol version check failure.

The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.

2011-12-03 10:30:50.868013 C Unable to determine master backend time zone settings. If those settings differ from local settings, some functionality will fail. 2011-12-03 10:30:58.635042 I MythCoreContext?: Connecting to backend server: 192.168.40.200:6543 (try 1 of 1) 2011-12-03 10:31:05.638299 E MythSocket?(200a230:68): readStringList: Error, timed out after 7000 ms. 2011-12-03 10:31:05.638364 C Protocol version check failure.

The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.

comment:6 Changed 8 years ago by macarno@…

updated HEAD (Dec 19, 2011), same problem occurs after hours running, multiple good recordings, watching multiple recordings on (seperate) frontend. Backend keeps recording while problem emerges, but won't start new recordings.

top -H -p 24827

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24832 root 20 0 1098m 169m 9m R 99 1.4 89:55.25 mythbackend
28830 root 20 0 1098m 169m 9m S 7 1.4 5:02.70 mythbackend
...

So what I see is that when this problem occurs, a single thread is blocking a core. Sorry, don't know how to identify this thread.

Ubuntu 10.10 - P55-UD3L - i7 870 - 12G RAM - 4x DVB-C

comment:7 Changed 8 years ago by Robert Verspuy <robert@…>

Same here, although the process on my BE is only taking 4% CPU load (IPTV recording). The deadlock also appears, when no FE is ever connected to the BE, but it does looks like it is somehow related to recordings that keep running. The logfile of the BE tells me the recording is stopped, but the mpg file on the disk keeps growing.

I connected to the BE process with gdb. See backtrace below:

(gdb) bt
#0  0x00007f54a8cb0123 in poll () from /lib64/libc.so.6
#1  0x00007f54a4693655 in ?? () from /lib64/libglib-2.0.so.0
#2  0x00007f54a4693a3a in g_main_context_iteration () from /lib64/libglib-2.0.so.0
#3  0x00007f54a98802d3 in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4
#4  0x00007f54a9859442 in QEventLoop::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4
#5  0x00007f54a985970c in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () from /usr/lib64/libQtCore.so.4
#6  0x00007f54a985b9f9 in QCoreApplication::exec() () from /usr/lib64/libQtCore.so.4
#7  0x000000000050fca0 in run_backend (cmdline=<value optimized out>) at main_helpers.cpp:653
#8  0x000000000044f97a in main (argc=3, argv=<value optimized out>) at main.cpp:127

comment:8 Changed 8 years ago by yurko@…

Just updated Fixes, updated OS to centos 6.2 and re-compiled.

No Joy, same problem. After MythFRONTEND starts up i try to navigate left or right in the menu soon after i get.

2011-12-27 06:16:51.178 Found mainmenu.xml for theme 'Terra' 2011-12-27 06:16:51.436 MythCoreContext?: Connecting to backend server: 127.0.0.1:6543 (try 1 of 1) 2011-12-27 06:16:58.460 MythSocket?(1a70030:33): readStringList: Error, timed out after

7000 ms.
2011-12-27 06:16:58.460 Protocol version check failure.
                        The response to MYTH_PROTO_VERSION was empty.
                        This happens when the backend is too busy to respond,
                        or has deadlocked in due to bugs or hardware failure.

I will produce step by step directions how to replicate this problem. Seems my problem occurs like CLOCKWORK rightaway ( no need to wait at all )

also i noticed in PS i see

[root@MythFE database]# ps aux | grep myth
root     24228  0.3  2.8 1044252 28856 pts/0   Sl   06:16   0:00 mythbackend
root     24278  1.1  6.2 1064556 63940 pts/0   Sl   06:16   0:01 mythfrontend
root     24306  0.0  0.0      0     0 pts/0    Z    06:16   0:00 [mythfrontend] <defunct>
root     24308  0.0  0.0      0     0 pts/0    Z    06:16   0:00 [mythfrontend] <defunct>
root     24310  0.0  0.0      0     0 pts/0    Z    06:16   0:00 [mythfrontend] <defunct>
root     24312  0.0  0.0      0     0 pts/0    Z    06:16   0:00 [mythfrontend] <defunct>
root     24325  0.0  0.0 103292   864 pts/0    S+   06:19   0:00 grep myth


comment:9 Changed 8 years ago by yurko@…

How to reproduce the problem:

I used VMWARE Player to install a self-contained myth installation and the deadlock happenes every time.

To re-produce this problem I installed CENTOS 6.2 from the netinstall disk in MINMAL mode, then followed the steps on compiling and installed packages from Compile Mythtv from Source on Centos 6.

After compiling i used the next mythTV Basic Config Test to setup myth and got the dead lock right away after starting mythfrontend.

Any ideas???

comment:10 Changed 8 years ago by darkdrgn2k@…

I have uploaded the VMWARE machine after defrag and clone. It can be found at the bottom of this post.

comment:11 Changed 8 years ago by J. Ali Harlow <ali@…>

I've done a little debugging on this. It seems MythServer::incomingConnection() is not being called. However netstat shows sockets in CLOSE_WAIT (with 39 bytes in the receive buffer) so something must be calling listen() correctly. On the face of it that would appear to suggest a problem with QTcpServer. I'll try and find/write a trivial test program for QTcpServer.

comment:12 Changed 8 years ago by J. Ali Harlow <ali@…>

An update on this: I tried Qt's threadedfortuneserver example (& fortuneclient) which uses an override of incomingConnection, which works on CentOS 6.2. I also tried removing the incomingConnection override in mythbackend/server and hooking into newConnection (instead of newConnect). This "works" in that the signal is delivered correctly when the FE attempts to connect, but I can't think of a workaround that avoids the override.

comment:13 Changed 8 years ago by J. Ali Harlow <ali@…>

I wrote a little script/program to see what the backend is watching at any time which might help somebody work out what's going on. Shortly after restarting the backend, I get something like this:

22280: poll fds=[{fd=3, events=1, revents=0} {fd=9, events=1, revents=0} {fd=38, events=1, revents=0} {fd=3, events=1, revents=0}] nfds=0x4
22285: poll fds=[{fd=7, events=1, revents=0}] nfds=0x1
22288: select nfds=0xf readfds=[12 13 14] writefds=0x0 exceptfds=0x0
22444: poll fds=[{fd=26, events=1, revents=0}] nfds=0x1
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
mythbacke 22280 root    3r  FIFO    0,8      0t0 117605 pipe
mythbacke 22280 root    7r  FIFO    0,8      0t0 117615 pipe
mythbacke 22280 root    9u  IPv4 117616      0t0    TCP *:lds-dump (LISTEN)
mythbacke 22280 root   12u  IPv4 117619      0t0    UDP *:apc-6549 
mythbacke 22280 root   13u  IPv4 117620      0t0    UDP 239.255.255.250:ssdp 
mythbacke 22280 root   14u  IPv4 117621      0t0    UDP 255.255.255.255:ssdp 
mythbacke 22280 root   26r  FIFO    0,8      0t0 118008 pipe
mythbacke 22280 root   38u  IPv4 118014      0t0    TCP *:lds-distrib (LISTEN)

so we can see that the main thread is watching some pipe (twice) and the two listening TCP ports.

A little later, we get this:

22285: poll fds=[{fd=7, events=1, revents=0}] nfds=0x1
22288: select nfds=0xf readfds=[12 13 14] writefds=0x0 exceptfds=0x0
22444: poll fds=[{fd=26, events=1, revents=0}] nfds=0x1
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
mythbacke 22280 root    7r  FIFO    0,8      0t0 117615 pipe
mythbacke 22280 root   12u  IPv4 117619      0t0    UDP *:apc-6549 
mythbacke 22280 root   13u  IPv4 117620      0t0    UDP 239.255.255.250:ssdp 
mythbacke 22280 root   14u  IPv4 117621      0t0    UDP 255.255.255.255:ssdp 
mythbacke 22280 root   26r  FIFO    0,8      0t0 118008 pipe

and strace tells us the main thread is blocked reading the pipe that it was watching:

# strace -p 22280
Process 22280 attached - interrupt to quit
read(3, 

It seems this read never returns.

Changed 8 years ago by J. Ali Harlow <ali@…>

Attachment: myth-poll added

Script to see what mythbackend is watching

Changed 8 years ago by J. Ali Harlow <ali@…>

Attachment: read_mem.c added

Helper program for myth-poll

comment:14 Changed 8 years ago by danielk

yurko, the log text in the gdb.txt file shows this backend was unable to bind to the mythtv port. This usually indicates there is already a mythbackend process running. The backtrace doesn't look to be of a backend with any deadlocked threads.

J.Ali Harlow, the problem description you give doesn't match the backtrace already attached to this ticket. Please open a separate ticket with the information you've added here plus a backtrace showing the read() in question.

Changed 8 years ago by Mark Spieth

Attachment: deadlock.backtrace.bz2 added

deadlock backtrace

comment:15 Changed 8 years ago by Mark Spieth

I had an incident too

backtrace attached. threads 36 and 35 and 20 look interesting.

Im not sure as I dont have too much logging on but I suspect scheduler deadlock somehow. happened on recording end I think. no diagnosis yet.

reasonably current build 20120108. still have core.

note if you get a deadlock use gcore to dump current state.

Last edited 8 years ago by Mark Spieth (previous) (diff)

comment:16 Changed 8 years ago by yurko@…

The odd part is i swiched over to neweset Fedora and pritty much the same setup works fine! Its only when compiled on centos does this problem occure.

comment:17 Changed 8 years ago by J. Ali Harlow <ali@…>

@danielk: Thanks, ticket #10265 created.

Changed 8 years ago by Robert Verspuy <robert@…>

Backtrace of BE in deadlock situation (triggered by manually ending a recording)

comment:18 Changed 8 years ago by Robert Verspuy <robert@…>

Just updated git to latest version master [v0.25pre-4071-g664a880] I started the BE, everything ok, then connected with a FE, all still ok. Then I started a recording. Everything still ok. Then I manually stopped the recording.

The recording keeps running in the background. And the FE gives back the following error:

2012-01-16 10:12:37.708916 I  Connected to database 'mythconverg' at host: 10.84.100.60
2012-01-16 10:13:51.980575 E  MythSocket(7f0408013d30:51): readStringList: Error, timed out after 30000 ms.
2012-01-16 10:13:51.980669 N  Connection to backend server lost

I ran gcore to create a core file from the BE, and ran a "thread apply all bt full" output is attached ( backend_deadlock_backtrace.txt ) I think it has the about the same output as the last attached backtrace, but that trace ended with a "not enough registers or memory" line, so I don't know if that trace is complete.

This situation also happens when a scheduled recording stops.

comment:19 Changed 8 years ago by danielk

Keywords: Deadlock CENTOS added; Dead Lock removed
Priority: majortrivial
Severity: mediumlow
Summary: Deadlocks constantlyCENTOS6: Severe socket connection

Has anyone here tested with the RHEL edition that CENTOS6 is based on?

I lowered the priority because CENTOS is just not a likely candidate for running MythTV. This is most likely a distro problem, but I have not yet close this as an 'upstream problem' because I don't think we really understand the problem well enough yet.

comment:20 Changed 8 years ago by J.Pilk@…

Kenni has reported on the dev list that he hasn't seen this with 0.24-fixes under Scientific Linux 6. I have a laptop with the same mix that I have used both independently and as a frontend without these problems. My SL6 is up to date; the fixes build is the most recent from ATrpms, dating from Nov 2011.

This post is to get the info attached to the ticket.

comment:21 Changed 7 years ago by beirdo

Owner: set to danielk
Status: newassigned

comment:22 Changed 7 years ago by danielk

Is this still a problem with in the devel/mythsocket branch?

comment:23 Changed 7 years ago by danielk

Resolution: Invalid
Status: assignedclosed

No response to request for information.

comment:24 Changed 7 years ago by hobbes1069@…

I'm not sure if I had the exact same problem but I got the dreaded:

Protocol version check failure. The response to MYTH_PROTO_VERSION was empty. This happens when the backend is too busy to respond, or has deadlocked in due to bugs or hardware failure.

On a fresh install of CentOS 6.3 with both fixes/0.26 and fixes/0.25. Mine was slightly different in that I never got a single successful connection.

I was really hoping to go with a more LTS solution like CentOS but I don't have time for it to be down anymore so I'm installing Fedora 17.

Note: See TracTickets for help on using tickets.