Opened 13 years ago

Closed 13 years ago

#9885 closed Bug Report - General (Fixed)

Deadlock on slave backend disconnect

Reported by: Ian Dall <ian@…>
Owned by: danielk
Priority: major
Milestone: 0.25
Component: MythTV - General
Version: Master Head
Severity: medium
Keywords:
Cc:
Ticket locked: no

Description

I have a setup with a master BE, 2 slave BEs and 1 - 3 frontends. I am running code compiled from git (v0.25pre-2145-gf199a84-dirty).

The symptom is that nothing works: one slave is dead and the master backend is deadlocked. FEs don't work, and accessing the status port with a browser times out.

The slave death is accompanied by kernel syslog messages like:

kernel: [371375.689820] mythbackend: page allocation failure. order:0, mode:0x20

Maybe the slave death is due to a kernel bug, BUT the master should not deadlock!

The attached backtrace shows that master Thread 16 is trying, in SlaveDisconnected, to acquire schedLock while schedLock is already held by Scheduler::run further up the same stack.
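For readers who want the pattern in isolation, here is a minimal sketch of the call chain described above. It is not MythTV code: `OwnedMutex`, `ToyScheduler`, `LockOrDetect` and `RunPass` are hypothetical names invented for this illustration, and the lock records its owner so that the re-entry is reported rather than hanging as it does in the real backend.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a non-recursive lock. It records the owning
// thread so that re-entry can be reported instead of blocking forever.
class OwnedMutex {
public:
    // Returns false if the calling thread already holds the lock.
    // A plain non-recursive mutex would hang at this point.
    bool LockOrDetect() {
        if (owner_.load() == std::this_thread::get_id())
            return false;
        mutex_.lock();
        owner_.store(std::this_thread::get_id());
        return true;
    }

    void Unlock() {
        owner_.store(std::thread::id());
        mutex_.unlock();
    }

private:
    std::mutex mutex_;
    std::atomic<std::thread::id> owner_{std::thread::id()};
};

// Toy model of the call chain in the backtrace; not MythTV code.
struct ToyScheduler {
    OwnedMutex schedLock;
    bool reentryDetected = false;

    void SlaveDisconnected() {
        if (!schedLock.LockOrDetect()) {
            reentryDetected = true;  // the real backend would deadlock here
            return;
        }
        // ... update scheduler state for the lost slave ...
        schedLock.Unlock();
    }

    // One pass of the scheduler loop, holding schedLock throughout.
    void RunPass(bool slaveDiesDuringPass) {
        schedLock.LockOrDetect();
        if (slaveDiesDuringPass)
            SlaveDisconnected();  // same thread, lock already held
        schedLock.Unlock();
    }
};
```

Calling `RunPass(true)` sets `reentryDetected`, while calling `SlaveDisconnected()` from outside the scheduler pass does not, which matches the observation that the hang only appears when the disconnect is noticed from inside the scheduler thread.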

I saw exactly the same problem with an older version: 0.24-7.fc14 (464fa28373) but went to git head in the hope that this had been fixed :-(

I notice other deadlock tickets (esp #9745), but none seem to have quite the same description, and the version I am running has the #9745 fix included.

Attachments (5)

hex.backtrace (20.3 KB) - added by Ian Dall <ian@…> 13 years ago.
Backtrace all threads
hex.log.gz (25.1 KB) - added by Ian Dall <ian@…> 13 years ago.
Master backend log.
slavedisconnect.patch (3.7 KB) - added by Jonatan <mythtv@…> 13 years ago.
hex-a41e965.backtrace (27.6 KB) - added by Ian Dall <ian@…> 13 years ago.
Backtrace all threads as of commit a41e965
hex-a41e965.log.gz (201.9 KB) - added by Ian Dall <ian@…> 13 years ago.
Master backend log as of commit a41e965


Change History (12)

Changed 13 years ago by Ian Dall <ian@…>

Attachment: hex.backtrace added

Backtrace all threads

Changed 13 years ago by Ian Dall <ian@…>

Attachment: hex.log.gz added

Master backend log.

Changed 13 years ago by Jonatan <mythtv@…>

Attachment: slavedisconnect.patch added

comment:1 Changed 13 years ago by Jonatan <mythtv@…>

I have also seen a similar deadlock a few times on 0.24. I have been running the backend with the attached patch for a while now without any problems.

comment:2 Changed 13 years ago by Ian Dall <ian@…>

I haven't tried Jonatan's patch yet but will soon (I missed it until now).

Without the patch, the issue is still there as of version: v0.25pre-2563-ga41e965

See thread 21 in the attached backtrace.

Changed 13 years ago by Ian Dall <ian@…>

Attachment: hex-a41e965.backtrace added

Backtrace all threads as of commit a41e965

Changed 13 years ago by Ian Dall <ian@…>

Attachment: hex-a41e965.log.gz added

Master backend log as of commit a41e965

comment:3 Changed 13 years ago by Ian Dall <ian@…>

I have been running with Jonatan's patch for a week now and it seems to fix the problem. I deliberately tried to provoke it by killing and restarting the backend many times and never saw the deadlock.

Can this patch be applied?

comment:4 Changed 13 years ago by danielk

Ian, Jonatan's patch is more of a debugging patch; it just disables a bit of code and can't be applied as is. But if it fixed the problem for you, it does show that you are both experiencing the same deadlock and that the same fix will help both of you.

comment:5 Changed 13 years ago by Github

Refs #9885. Fixes deadlock when a slave backend disconnect is first seen from within the Scheduler thread. Patch by Ian Dall.

Keeping ticket open since this should be backported to 0.24-fixes.

Branch: master Changeset: 1fae22a8bc56a5474375332f7799b0ee91bb6244
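As an illustration of how this class of deadlock is commonly avoided, the sketch below defers the disconnect handling instead of taking the scheduler lock from the notification path. This is an assumption about one possible approach and is not taken from the changeset above; `DeferringScheduler`, `NotifySlaveDisconnected`, `RunPass` and the slave id values are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch: disconnect notifications are queued under a small,
// separate lock and processed later by the pass that already owns schedLock_.
class DeferringScheduler {
public:
    // Safe to call from any thread, including from inside RunPass():
    // it only takes pendingLock_, never schedLock_.
    void NotifySlaveDisconnected(int slaveId) {
        std::lock_guard<std::mutex> guard(pendingLock_);
        pending_.push_back(slaveId);
    }

    // One pass of the scheduler loop, holding schedLock_ throughout.
    void RunPass(bool slaveDiesDuringPass) {
        std::lock_guard<std::mutex> guard(schedLock_);
        if (slaveDiesDuringPass)
            NotifySlaveDisconnected(7);  // 7 is an arbitrary example id
        DrainPendingLocked();
    }

    std::size_t HandledCount() {
        std::lock_guard<std::mutex> guard(schedLock_);
        return handled_.size();
    }

private:
    // Caller must hold schedLock_. Lock order is always schedLock_ first,
    // then pendingLock_, so the two locks cannot deadlock each other.
    void DrainPendingLocked() {
        std::vector<int> batch;
        {
            std::lock_guard<std::mutex> guard(pendingLock_);
            batch.swap(pending_);
        }
        for (int id : batch)
            handled_.push_back(id);  // stand-in for the real cleanup work
    }

    std::mutex schedLock_;
    std::mutex pendingLock_;
    std::vector<int> pending_;
    std::vector<int> handled_;
};
```

The trade-off in this design is latency: a disconnect reported from another thread is not acted on until the next scheduler pass, in exchange for never re-acquiring the scheduler lock on a thread that already holds it.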

comment:6 Changed 13 years ago by danielk

Milestone: unknown → 0.25
Owner: set to danielk
Status: new → assigned

comment:7 Changed 13 years ago by danielk

Resolution: Fixed
Status: assigned → closed

Fixed in [3a5f78862a5be39e02ee549551d59e9c40fa575d]

Fixes #9885. Fixes deadlock when a slave backend disconnect is first seen from within the Scheduler thread. Patch by Ian Dall.

This has been running in master for 4 weeks without reports of regression.
