Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#10771 closed Bug Report - Hang/Deadlock (fixed)

BE deadlocks after backporting #10541

Reported by: warpme@… Owned by: danielk
Priority: major Milestone: 0.25.1
Component: MythTV - EIT Version: 0.25-fixes
Severity: medium Keywords: EIT
Cc: Ticket locked: no

Description

Hi, I had hopes backporting #10541 to 0.25-fixes solves issue from #10016. Currently as workaround #10016 I forced active EIT to work on night window where probability that livetv meets EIT scanner is almost zero (#10076). Yesterday I build v0.25-123-gd3a5b0a to give run for #10541 and turn-on continuous EIT active scan on my system (configured only on 2nd tuner).

I had following scenario:

-start BE.

-make sure EIT active scanner starts (configured on 2nd tuner).

-in meantime there was some recordings on 1st tuner.

-there was recording on 2nd tuner. EIT scanner stops.

-after end of recording on 2nd tuner EIT scanner NOT resumed (bad).

-after 1h of idle I manually scheduled recording on 2nd tuner.

-since start of recording on 2nd tuner BE is deadlocked.

BE trace attached.

Attachments (8)

BE-trace.txt (23.6 KB) - added by warpme@… 12 years ago.
BE trace
trace_vanilla_d3a5b0a.txt (28.5 KB) - added by warpme@… 12 years ago.
trace from vanilla d3a5b0a
mythbackend.20120529182901.24284.log (75.6 KB) - added by warpme@… 12 years ago.
be log from vanilla d3a5b0a
10771-v1.patch (4.1 KB) - added by danielk 12 years ago.
Possible fix -- patch against master
trace_10771patchv1.txt (20.1 KB) - added by warpme@… 12 years ago.
trace with 10771_patch_v1
mythbackend.20120529224205.8292.log (13.0 KB) - added by warpme@… 12 years ago.
BE log with 10771_patch_v1
trace_20120706 (29.3 KB) - added by warpme@… 12 years ago.
Trace for scenario2
be.log (65.5 KB) - added by warpme@… 12 years ago.
BE.log for scenario2

Download all attachments as: .zip

Change History (40)

Changed 12 years ago by warpme@…

Attachment: BE-trace.txt added

BE trace

comment:1 Changed 12 years ago by danielk

Milestone: unknown0.25.1
Owner: changed from Stuart Auchterlonie to danielk
Priority: minormajor
Status: newassigned

comment:2 Changed 12 years ago by Daniel Kristjansson <danielk@…>

In a9edae22461ff7ef6749abaed21603fc8bf98cb7/mythtv:

Refs #10771. Revert "Fixes #10541. Fix EIT scanning deadlock."

This appears to be _causing_ a deadlock for warpme.

This reverts commit 6dce025ee1ac69e5a78c2e270b06aeff852765e7.

comment:3 Changed 12 years ago by danielk

Warpme, are you running with some other patches? If so can you reproduce with [d3a5b0a] without the patches?

The backtrace says TVRec::run() was sitting on line 1535, but that function ends on line 1418... The backtrace for Thread 22 shows similar anomalies.

comment:4 Changed 12 years ago by danielk

Status: assignedinfoneeded

Another question, did you modify the live tv order from the default?

comment:5 Changed 12 years ago by warpme@…

Daniel, thx for quick replay. Indeed I applied patch from #10076, but 3 weeks ago - when You introduce fix for #10541 in master - I do tests without #10076 by back-porting #10541 to 0.25-fixes. Results were exactly the same, but I decided not to bother You at that time. Regarding line numbers mismatch - sure I'll reproduce on vanilla [d3a5b0a] today evening. BR

comment:6 Changed 12 years ago by warpme@…

Forgot to answer about livetv order: yes, it is changed. I have 3 tuners with 4 virt. each. Recording order is: 1,2,3,4,5,6,7,8,9,10,11,12. LiveTV is: 12,8,4,11,7,3,10,6,2,9,5,1

comment:7 Changed 12 years ago by warpme@…

Daniel, I managed to get trace+log on vanilla [d3a5b0a]. Scenario was following:

1.start BE, wait till active EIT starts

2.created manual schedulle on tuner doing EIT

3.FE rec. list shows that recording starts but reported size was 0 during 3 min

4.killed ongoing recording, wait 1min

5.schedulled again new recording on this tuner

6.when recording should start - BE deadlocks

Observing correlation between BE & FE I see that recorder starts but not writing. It looks like it is blocked by previously working EIT scanner on this tuner. Scheduling another recording on this blocked tuner deadlocks BE. Log + trace attached.

Changed 12 years ago by warpme@…

Attachment: trace_vanilla_d3a5b0a.txt added

trace from vanilla d3a5b0a

Changed 12 years ago by warpme@…

be log from vanilla d3a5b0a

comment:8 Changed 12 years ago by danielk

Status: infoneededassigned

Thanks.

Threads 27 and 28 are the main issue here. StopActiveScan?() in 28 is being run from within the event loop, but thread 27 was already in SetChannel?() and is waiting for the event loop to get to the end of the loop. StopActiveScan?() is waiting for SetChannel?() to complete, hence the deadlock.

Threads 2, 4, 9, 12, 23 and 24 are also blocked on waiting for thread 9 streamhandler.cpp:159, but I'm treating this as a red herring for now.

Changed 12 years ago by danielk

Attachment: 10771-v1.patch added

Possible fix -- patch against master

comment:9 Changed 12 years ago by danielk

Status: assignedinfoneeded

warpme, I've attached a patch which should address the deadlock if I diagnosed the cause correctly. The patch is against master but it will apply on top of vanilla [d3a5b0a].

Please test and report back if it works or doesn't.

Please attach another backtrace if it doesn't.

Thanks

comment:10 Changed 12 years ago by warpme@…

Daniel, I applied patch over [d3a5b0a]. It looks like now active EIT scanner is not starting at all. I tried launch livetv for moment - maybe it will unblock scanner (it uses the same tuner). No go. BE log +trace log attached.

Changed 12 years ago by warpme@…

Attachment: trace_10771patchv1.txt added

trace with 10771_patch_v1

Changed 12 years ago by warpme@…

BE log with 10771_patch_v1

comment:11 Changed 12 years ago by mayfields@…

Adding a comment here to indicate that I am also seeing similar symptoms with 0.25-fixes. Occasional deadlocks experienced whilst switching to LiveTV on a tuner. After a1d9015924480bb8a733452e294fb72ebbfce94a (from #10541) was applied to 0.25-fixes deadlocks occurred multiple times requiring the backend to be forcefully killed as it did not respond correctly to SIGTERM. Backend still running (housecleaning running, responds to mythbackend --printsched), but not responding to requests from frontends, built-in web server not responding. No trace available as I compiled the latest 0.25-fixes with the commit reverted.

comment:12 Changed 12 years ago by warpme@…

@mayfields,

In my case only effective solution is workaround like described in ticket description. If You will apply patch from #10076, set EITCrawIdleStart to 3900. By this EIT active scanner will work from 1:05AM+60s*tuner_number and end 4h later. Assuming You are not watching livetv between 1AM-5AM - You shouldn't have deadlocks (oh well You can have sporadic deadlocks when in livetv You are changing channels exactly when recording starts. I had such 1 per month or two). -br

comment:13 in reply to:  12 Changed 12 years ago by mayfields@…

After following the above advice from warpme@... to restrict EIT scanning to specified hours where there are no recordings being made I have had no issues in the past 7 days.

comment:14 Changed 12 years ago by stuartm

Can you please test the following patch? http://pastebin.ca/2162527

comment:15 Changed 12 years ago by warpme@…

Stuart, Is this patch for 0.25-fixes ?. I can't apply it cleanly over 0.25-fixes. Backporting manually show some different code in scheduler.cpp hunks.

comment:16 Changed 12 years ago by stuartm

The patch is against master, I'll create a version against -fixes but I'll wait a couple of days to see that there are no regressions before I go to the trouble.

comment:17 Changed 12 years ago by mayfields@…

I modified the patch for 0.25-fixes. Should this be applied along with the patch from 10541 (which was later reverted), or without?

comment:18 Changed 12 years ago by stuartm

Without any other patch

comment:19 Changed 12 years ago by Stuart Morgan <smorgan@…>

In 788bc510ede427e43e77e8aaf53d8cf047d89b46/mythtv:

Protect the reschedule queue with a seperate lock to fix a backend deadlock.

Fixes a deadlock in mythbackend caused when a reschedule request occurs
immedately prior to a recording starting.

The recorder thread locks the scheduler when starting a recording. The pending
reschedule request handled in the event loop waits for the scheduler lock to be
released blocking the event loop. The backend is now stopped from handling any
further events or protocol requests, meanwhile the recorder is waiting on
events to be processed before it will release the lock resulting in an
irrecoverable deadlock (or hang) and the failure of the recording.

This deadlock was more likely to be triggered for users of EIT because it will
ask for a reschedule every 5 minutes in addition to any reschedules triggered
through other means.

Refs #10771

comment:20 Changed 12 years ago by warpme@…

Stuart, I'm really exited that reschedule blockiness was finally resolved. Believe me - those 10-20sec deaf periods i.e during LiveTV zapping are really annoying. I hope You still have plans to create this patch also for 0.25-fixes. Again - thx for working on this issue !

comment:21 Changed 12 years ago by stuartm

This fixes one cause of that particular issue, but I'm aware of others which are a little harder to fix.

I'll backport this change to 0.25-fixes when I'm satisfied that it doesn't introduce any new bugs which means a week or more of testing in master.

comment:22 Changed 12 years ago by warpme@…

Oh, sure. perfectly clear. If You need 0.25 tester before releasing it officially I'm ready to apply it on my prod. system. PM me and I'll get You feedback how it works on 0.25-fixes. br

comment:23 Changed 12 years ago by Stuart Morgan <smorgan@…>

In 5eb97790f3c36f9d2524275e545189cafda9d48a/mythtv:

Protect the reschedule queue with a seperate lock to fix a backend deadlock.

Fixes a deadlock in mythbackend caused when a reschedule request occurs
immedately prior to a recording starting.

The recorder thread locks the scheduler when starting a recording. The pending
reschedule request handled in the event loop waits for the scheduler lock to be
released blocking the event loop. The backend is now stopped from handling any
further events or protocol requests, meanwhile the recorder is waiting on
events to be processed before it will release the lock resulting in an
irrecoverable deadlock (or hang) and the failure of the recording.

This deadlock was more likely to be triggered for users of EIT because it will
ask for a reschedule every 5 minutes in addition to any reschedules triggered
through other means.

Refs #10771

Backported from master 788bc510ede427e43e77e8aaf53d8cf047d89b46

comment:24 Changed 12 years ago by Stuart Morgan <smorgan@…>

In 48ac7cfb51e82c3ddbaab489810b560fb8271531/mythtv:

Revert "Protect the reschedule queue with a seperate lock to fix a backend deadlock."

This reverts commit 5eb97790f3c36f9d2524275e545189cafda9d48a.

While this seems to work well in master, it appears to make the
situation worse in 0.25 and I really don't have the time to figure out
why. Anyone experiencing the pre-existing deadlock will need to
upgrade to 0.26 when it's released. Refs #10771

comment:25 Changed 12 years ago by Stuart Morgan <smorgan@…>

In 11d7795503b869aeb09b66d55f0430fcf504f924/mythtv:

Protect the reschedule queue with a seperate lock to fix a backend deadlock.

Fixes a deadlock in mythbackend caused when a reschedule request occurs
immedately prior to a recording starting.

The recorder thread locks the scheduler when starting a recording. The pending
reschedule request handled in the event loop waits for the scheduler lock to be
released blocking the event loop. The backend is now stopped from handling any
further events or protocol requests, meanwhile the recorder is waiting on
events to be processed before it will release the lock resulting in an
irrecoverable deadlock (or hang) and the failure of the recording.

This deadlock was more likely to be triggered for users of EIT because it will
ask for a reschedule every 5 minutes in addition to any reschedules triggered
through other means.

Refs #10771

comment:26 Changed 12 years ago by beirdo

See also #10702

comment:27 Changed 12 years ago by beirdo

See also #10876

comment:28 Changed 12 years ago by warpme@…

Stuart, #11d7795503h nicely improves user expierience as with this patch - when zapping LiveTV - I don't have deaf to Ch+/Ch- periods around full hours (when scheduler reschedules). Also "Opening video buffers failed too many times" when used zaps LiveTV around reschedule almost gone. Unfortunately it causes frequent BE deadlocks. So far I can describe 2 scenarios:

  1. I have applied #10076 with EIT scan window for 1:05AM-5:01AM. Scenario for deadlock is following:

-Start BE, wait whole night to EIT scan

-Launch LiveTV after night. It gets TLMSC but after 5-10sec of black screen going back to MainMenu? with "Opening video buffers failed too many times".

-Launching LiveTV again gives Popup "Connection to BE was lost" and BE stops to communicate with all FE (any action on FE gives 5-10sec stale and goes to MainMenu?)

  1. Turning-off EIT scan at time window (original way of gathering EIT), I can trigger deadlock in following way:

-Start BE and wait till EIT active scanner starts

-Start LiveTV few times

-After 1..3 LiveTV starts, I get "Opening video buffers failed too many times" and return to MainMenu?

-Now BE is in deadlock state where asking for LiveTV or recordings Playback gives immediate back to MainMenu?.

For this scenario I'm attaching BE.log & backtrace.

Generally - after 10 months of living with #10016 I think root cause is related to how active EIT scanner is stopped when recorder starts on given tuner. Doing some tests on scenario 2 shows me that deadlock occurs when given tuner is just grabbing EIT and grabber is preempted by LiveTV recorder.

If I'm reading code correctly - looking on eitscanner.cpp:83, main scanner loop can be ended in quite rare execution places. Lest imagine: if LiveTV recorder just tuning to new channel and eit scanner is just doing any action between eitscanner.cpp:86...151 - we have resource access conflict as tuner just grabbing EIT and in the same time is tuning to new channel. I believe issue described in #9480 comment #7 is exactly result of these conflicts (EIT is on Cyfra mplex while LiveTV tunes to Polsat).

I think we should look how exactly EIT scanner is tear down when recorder starts....

Changed 12 years ago by warpme@…

Attachment: trace_20120706 added

Trace for scenario2

Changed 12 years ago by warpme@…

Attachment: be.log added

BE.log for scenario2

comment:29 Changed 12 years ago by stuartm

The bt provided doesn't show a deadlock, at least not obviously. It also doesn't involve the scheduler. That commit does mean we're not blocking for so long under certain conditions, which might mean it's exposing a race condition elsewhere but it seems to be a new problem. Can you open a new ticket and maybe get a couple of new backtraces for comparison?

comment:30 Changed 12 years ago by warpme@…

Stuart, Now I revert 11d7795503h and will wait 3-4 days to see how 11d7795503h is related to problem described in comment#28. Regarding separate ticket - I already done it (some time ago). It is under #10016 - but currently first LiveTV launch causes deadlock while in past it was only LiveTV failure. I still think ideas expressed by me in comment#28 might be valid for #10016.

comment:31 Changed 12 years ago by David Engel <dengel@…>

Resolution: fixed
Status: infoneededclosed

In ca273324961b95e300890ed40e833249b5460bd8/mythtv:

Hopefully remove more scheduler/recorder deadlocks.

Change the scheduler to unlock schedLock when calling the recorder
synchronously. If the recorder happens to be calling the scheduler at
the same time, this allows the recorder's call to grab the lock as
needed.

Fixes #10647, #10770 and #10771 .

comment:32 Changed 12 years ago by David Engel <dengel@…>

In 4386e90e25ed40948bb208e8a325a8b072411e88/mythtv:

Hopefully remove more scheduler/recorder deadlocks.

Change the scheduler to unlock schedLock when calling the recorder
synchronously. If the recorder happens to be calling the scheduler at
the same time, this allows the recorder's call to grab the lock as
needed.

Refs #10647, #10770 and #10771 .
(cherry picked from commit ca273324961b95e300890ed40e833249b5460bd8)

Conflicts:

mythtv/programs/mythbackend/scheduler.cpp

Note: See TracTickets for help on using tickets.