Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#13072 closed Bug Report - General (fixed)

Schedule busy looping with sched_sleep of 0

Reported by: lucylangthorne55@… Owned by: gigem
Priority: minor Milestone: 29.0
Component: MythTV - Scheduling Version: 0.28.0
Severity: medium Keywords:
Cc: Ticket locked: no

Description

Setup: 0.28.0+fixes.20170715.eef6a480b0-0ubuntu3 with master, one slave, mythweb frontend.

Several times a day the scheduler runs 100% even though disks and mysql are idle.

It appears that sched_sleep becomes 0 while a recording is about to start (around line 2040). This causes continuous scheduling: since it remains at zero, the scheduler does not sleep and re-loops hundreds of times a second.

Strangely, s2n only appears to decrease every few seconds, whereas I would expect it to decrease every second.

The busy looping appears to be responsible for an overall unresponsive system (can't start frontend), possibly due to a lock held.

I suspect altering sched_sleep from max(...,0) to max(...,1000) (around line 2019) to ensure a minimum sleep of a second would help.
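A minimal sketch of that suggested change (the helper name is hypothetical; `sched_sleep` and the line references come from the report, and this is only an illustration of the clamp, not the actual MythTV code):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper modelling the suggested change: clamp the computed
// sleep to at least 1000 ms instead of 0, so a zero-length sleep can
// never make the scheduler loop spin.
static int64_t ClampSchedSleep(int64_t msecsToNextEvent)
{
    // Before: std::max(msecsToNextEvent, int64_t(0)) -- can yield 0
    // After:  never sleep for less than one second
    return std::max<int64_t>(msecsToNextEvent, 1000);
}
```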

Attachments (4)

schedulebusyloop.log.gz (335.2 KB) - added by anonymous 2 years ago.
Log file with extra debug
schedulebusyloop2.log.gz (8.1 KB) - added by lucylangthorne55@… 2 years ago.
Busy loop with extra logging
schedulebusyloop2.patch (9.6 KB) - added by lucylangthorne55@… 2 years ago.
Patch for extra logging used in schedulebusyloop2.log.gz
13072-testfix1.patch (866 bytes) - added by gigem 2 years ago.


Change History (24)

Changed 2 years ago by anonymous

Attachment: schedulebusyloop.log.gz added

Log file with extra debug

comment:1 Changed 2 years ago by gigem

Status: new → infoneeded_new

I'm confused about a couple of things.

First, have you changed the log messages and also added your own? Your reported version matches the current git, but the messages in your log file don't.

Second, sched_sleep appears to be 0. That means something needs to be done now. What is it, and why isn't it getting done? You might have to add more debugging to show why nextWakeTime and sched_sleep don't increase.

comment:2 Changed 2 years ago by lucylangthorne55@…

Thanks for getting back.

Yes I had added some debug to try to determine the problem, sorry I hadn't made it clear.

I was logging secs2next etc. at line 2031, just above the if (sched_sleep) check, since there was no logging to show why it was looping.

I ended up disabling the slave for the moment.

Since the scheduler run time in my example log file is 122 s, maybe what happens is this: normally my reschedules are fast (<3 s), but sometimes a reschedule takes a very long time (60 s+, sometimes several minutes), possibly because of the number of rules I have that match against description, or because of slave connects/disconnects.

So, perhaps for most people the scheduler run time is around 2 s, and any busy loop is very short, since we usually sleep until very near the next programme (within a couple of seconds of its start). For me, however, the scheduler sometimes estimates its run time at three minutes, so it busy loops for those three minutes, which then causes knock-on effects.

Looking at the code more, schedRunTime is always the absolute maximum rather than a "recent" maximum. That suggests that if a schedule takes three minutes at any point in the day, we assume all future schedules will also take three minutes. We then go around the loop, decide we can't reschedule because it would take too long and nothing is due to start within 30 s, and busy loop for 2.5 minutes doing the same checks.
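A toy model of that loop condition (all names are hypothetical, times in seconds): once schedRunTime has been inflated by a single slow pass, every pass where the next start falls inside that window computes a zero sleep.

```cpp
// Toy model: a reschedule is deferred when it might not finish before
// the next recording starts, but nothing else schedules a wake, so the
// computed sleep is zero and the loop spins until the recording begins.
static int ComputeSleepSecs(int secsToNextStart, int maxSchedRunTime)
{
    if (secsToNextStart < maxSchedRunTime)
        return 0;  // defer reschedule; zero sleep -> busy loop
    return secsToNextStart - maxSchedRunTime;
}
```

For example, with the 122 s next-start and an inflated three-minute schedRunTime from the log above, the sleep comes out as zero on every pass.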

So all we need is for a reschedule to happen within three minutes of the start of the next programme for us to enter the busy loop; e.g., I think OTA EIT causes a reschedule every few minutes.

I've never managed to track down why a reschedule sometimes takes a long time; the same rule queries (even on a separate drive) are sometimes immediate and sometimes take a couple of seconds. I figured it might be due to concurrent OTA EIT updates.

comment:3 Changed 2 years ago by lucylangthorne55@…

Looking at the code a bit more I noticed that the first schedule (which will be a slower full schedule) is excluded from the max schedRunTime, hence schedRunTime is typically for fast "schedule single item" queries.

But I think that when a slave connects, the master does a full reschedule (which is time consuming) and this is included in the schedRunTime max. Once that maximum is set, it is used for all future loops.

This could explain why this issue occurs more frequently when run with a slave.

comment:4 Changed 2 years ago by gigem

I apologize again for taking so long to reply.

The reason a reschedule is needed can cause drastic differences in how long it takes. The first pass and any time the guide data changes take the longest. The rest tend to be pretty quick. I have a long term goal to split the scheduler into two threads so that long schedules don't impact the starting of recordings. That isn't going to happen any time soon, though.

Back to your immediate problem. I'm still stumped as to why sched_sleep stays at 0. It, along with nextWakeTime, should increase as the scheduler processes the upcoming recordings. For reasons that aren't clear yet, nextWakeTime isn't advancing. Since you're adept at adding your own debugging, would you mind adding some debugging wherever nextWakeTime gets updated? If you'd rather I do it, I will try to get to it in the next day or two.

comment:5 Changed 2 years ago by lucylangthorne55@…

Not a problem. It will take a couple of days before I have time to properly debug since the extra logging fills my disk so I need to find a good time when no shows are scheduled.

comment:6 Changed 2 years ago by gigem

Okay. One thing you can do to reduce the logging is to use LOG_INFO for the stuff you're interested in so you don't have to run with --log-level debug.

comment:7 Changed 2 years ago by lucylangthorne55@…

I've added some debug but hadn't added enough. And now it's not failing at all, perhaps I don't have enough EIT events/programmes starting to trigger it.

One thing that happened: after looping to call HandleRecordings, nextStartTime was unchanged, which means sched_sleep was zero since nextStartTime was in the past on the next doRun loop. But I didn't have enough debug to see what was happening, and it didn't result in the long busy loops I have been getting.

I've added some debug around there to see how nextStartTime is being modified, and I'm just waiting for it to happen again.

It looks like it just happened and looped for 30 seconds. I'll double check and see what was happening at the time. Looks like it was just prior to a programme start.

Changed 2 years ago by lucylangthorne55@…

Attachment: schedulebusyloop2.log.gz added

Busy loop with extra logging

Changed 2 years ago by lucylangthorne55@…

Attachment: schedulebusyloop2.patch added

Patch for extra logging used in schedulebusyloop2.log.gz

comment:8 Changed 2 years ago by lucylangthorne55@…

I've added the logfile from above that shows the busy loop. All timestamps are GMT to make it easier. The patch shows where the logging is occurring.

I've not done any analysis yet, but I'll take a look at the code and log a little later to see if I can help with diagnosis. Initially the problem seems to be around the HandleRecording loop, and maybe occurs when a new programme is about to start.

comment:9 Changed 2 years ago by gigem

Thanks. I'll try to do some analysis later today.

comment:10 Changed 2 years ago by lucylangthorne55@…

Patch is against 7a1ab5001cb0f9576a913f012fcda773590d6c95 (v29) whereas original issue was against 0.28.

(I would've mentioned it earlier but I was blocked for an hour as spam for too many posts).

Looking at the log, it suggests that since we woke at 14:28:56 with a next start time of 14:30, HandleRecording doesn't update nextWakeTime, since it always takes the min and 14:28:56 is less than "14:30 minus the 60-second preroll" for the >0 case.

This means we'll always sleep for zero seconds and busy loop.
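The min() behaviour described above can be sketched as follows (names are hypothetical, times in seconds since an arbitrary epoch): nextWakeTime is only ever lowered, so once it falls behind "now" the computed sleep clamps to zero on every subsequent pass.

```cpp
#include <algorithm>

// nextWakeTime is only ever lowered toward (recStart - preroll), so a
// stale wake time already in the past is never raised again.
static int UpdateNextWakeTime(int nextWakeTime, int recStart, int prerollSecs)
{
    return std::min(nextWakeTime, recStart - prerollSecs);
}

// A wake time at or before "now" produces a zero sleep -> busy loop.
static int SleepMsecs(int now, int nextWakeTime)
{
    return std::max((nextWakeTime - now) * 1000, 0);
}
```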

I'll add a bit more debug and schedule a few more programmes later to see what else is going on.

I don't know if the logic of the reclist allows us to unlock/sched_yield/relock the mutex each time around the doRun loop, to give the frontend/mythweb a chance of grabbing the mutex so they don't get blocked for minutes at a time.

comment:11 Changed 2 years ago by gigem

Okay, I know what's going on. It's related to the use of slave backends. Basically, we are scheduled to check if slaves can be put to sleep, but we are prevented from doing that because of the "too close to next recording" check. We then continue wanting to sleep until it's time to check slaves again, even though that time has already passed.

The patch I'm about to attach will probably work around the problem. It's untested, so please verify that it helps. It will always sleep for one second in the problem case when there is nothing else to do. In the meantime, I'll try to come up with a better solution.
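A sketch of that workaround (all names hypothetical; this only illustrates the idea described above, not the attached patch itself):

```cpp
#include <algorithm>

// Hypothetical sketch: when the time to check sleeping slaves has already
// passed but the "too close to next recording" guard blocks the check,
// request a one-second sleep instead of a wake time in the past.
static int SecsUntilWake(int now, int checkSlavesTime, bool tooCloseToNextRecording)
{
    int delta = checkSlavesTime - now;
    if (delta <= 0 && tooCloseToNextRecording)
        return 1;  // sleep one second, then re-evaluate
    return std::max(delta, 0);
}
```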

Changed 2 years ago by gigem

Attachment: 13072-testfix1.patch added

comment:12 Changed 2 years ago by gigem

Also, the testfix1 patch is against your already patched version.

comment:13 Changed 2 years ago by lucylangthorne55@…

Thanks for the quick patch. I've applied it but won't be able to test it properly until tomorrow.

comment:14 Changed 2 years ago by lucylangthorne55@…

I now have logfiles that show it entering both code paths and numItersBeforeSleep now doesn't exceed 4 (previously it was as high as 24726). So looks like it works. Thanks.

However, my slave seems to disconnect every time a scheduled programme is about to start, with "Error, timed out after 7000ms". That must be a coincidence, since your patch shouldn't affect that logic; if anything, your patch should make it better since locks are no longer held. The programme still gets recorded, so it's probably just that I'm paying more attention to the logfile and this problem is unrelated to this bug.

I've a few shows scheduled to record so will re-run the master without the debugger and will know in a few hours.

comment:15 Changed 2 years ago by gigem

Glad that worked. I'll make the real patch later today. Yes, the time out sounds like an unrelated issue. Keep an eye on it and please open a new ticket if it causes any problems.

comment:16 Changed 2 years ago by David Engel <dengel@…>

Resolution: fixed
Status: infoneeded_new → closed

In 871f9e82b95c3848978a5e032a40c58906b24565/mythtv:

Avoid busy looping in some cases in the scheduler when the time to
check on slaves that can be put to sleep has already passed. Thanks
to lucylangthorne55 for helping to debug this issue.

Fixes #13072

comment:17 Changed 2 years ago by David Engel <dengel@…>

In 18c6f49975b28e36fe2a976edba5b77bfa701303/mythtv:

Avoid busy looping in some cases in the scheduler when the time to
check on slaves that can be put to sleep has already passed. Thanks
to lucylangthorne55 for helping to debug this issue.

Cherry picked from master commit 871f9e82b95c3848978a5e032a40c58906b24565.

Refs #13072

comment:18 Changed 2 years ago by David Engel <dengel@…>

In 3b18a09ad152ff4323effaaad45e5845e7d2d042/mythtv:

Avoid busy looping in some cases in the scheduler when the time to
check on slaves that can be put to sleep has already passed. Thanks
to lucylangthorne55 for helping to debug this issue.

Cherry picked from master commit 871f9e82b95c3848978a5e032a40c58906b24565.

Refs #13072

comment:19 Changed 2 years ago by gigem

Milestone: unknown → 29.0

comment:20 Changed 2 years ago by lucylangthorne55@…

Thanks for the quick commit. Yes, there's something odd about the slave timeout, but it's only happened once the last few hours. If I see some pattern I'll raise a ticket.
