Opened 10 years ago

Closed 9 years ago

#7847 closed defect (fixed)

After resume from long S3 sleep, scheduling via EPG isn't refreshed on EPG

Reported by: warpme@…
Owned by: cpinkham
Priority: minor
Milestone: 0.24
Component: MythTV - User Interface Library
Version: Master Head
Severity: medium
Keywords:
Cc:
Ticket locked: no

Description

I have an issue that is quite annoying and is related to an FE using S3 sleep/resume. Symptom: after scheduling a recording from the EPG, the EPG does not refresh to show it.

Scenario to reproduce:

1. Start the system and launch the FE.
2. Put the system into S3 sleep for longer than approx. 15 min.
3. Resume the system from S3 sleep.
4. Go to the EPG.
5. Select a show and press Enter.
6. Choose the desired schedule options and press Save.
7. After returning to the EPG, the show's state is not refreshed.

EPG refresh works OK if the sleep is shorter than approx. 15 min, so it looks like sleeping for longer than approx. 15 min somehow breaks FE-BE communication.

This issue is quite annoying and strongly decreases WAF, as scheduling via the EPG gives no feedback to the user (and my wife is complaining that Myth is not recording her favorite shows...)

My system is a MiniMyth-based FE with 0.22-fixes SVN 23069 + ticket 7836.

br

Attachments (8)

be.log.zip (490.8 KB) - added by warpme@… 10 years ago.
BE logs
fe.log.zip (177.0 KB) - added by warpme@… 10 years ago.
FE log
fe.log.2.zip (118.3 KB) - added by warpme@… 10 years ago.
FE log file after 8h (when issue appeared)
be.log (6.1 KB) - added by warpme@… 10 years ago.
BE log with 7836/7839/8024 and changeset 23397
fe.log (3.3 KB) - added by warpme@… 10 years ago.
BE log with 7836/7839/8024 and changeset 23397
mythbackend.log.bz2 (101.9 KB) - added by warpme@… 9 years ago.
FE/BE logs
mythfrontend-fe-test.log.bz2 (84.0 KB) - added by warpme@… 9 years ago.
FE/BE logs
7847_reconnect_event_socket.diff (478 bytes) - added by cpinkham 9 years ago.
Reconnect the event socket when we lose the main BE connection

Change History (27)

comment:1 Changed 10 years ago by stuartm

Milestone: 0.22 → 0.23
Status: new → infoneeded_new

We need frontend and backend logs.

Changed 10 years ago by warpme@…

Attachment: be.log.zip added

BE logs

Changed 10 years ago by warpme@…

Attachment: fe.log.zip added

FE log

Changed 10 years ago by warpme@…

Attachment: fe.log.2.zip added

FE log file after 8h (when issue appeared)

comment:2 Changed 10 years ago by warpme@…

Hi, FE/BE logs (verbose all) are attached. They were generated with this scenario:

-clear all logs

-start BE

-start FE

-wait 20 min

-on the FE, go to the schedule and schedule "record once". The first refresh was OK; the second was not (fe.log.zip).

-FE was suspended for 8 h

-after 8 h, resume the FE, go to the EPG and schedule a show. No refresh on the EPG (fe.log.2.zip).

br

comment:3 Changed 10 years ago by anonymous

This will happen any time the BE uses the event socket while the FE is asleep; the length of sleep time is irrelevant. E.g. if one FE updates the schedule while another is sleeping, the BE tries and fails to update the sleeping FE, resulting in the BE closing its side of the event socket. When the FE wakes up, it doesn't know that the BE has closed the connection, so the FE forever listens on the dead socket.

[23397] will help, and with some interval tuning, it could be made to work reasonably well. But as Mark alluded, an application-level heartbeat on the event socket from the FE would be a more complete way to address this issue.
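
A minimal sketch of the application-level heartbeat idea, written in Python for brevity. The class name `HeartbeatMonitor`, the interval and the missed-ping limit are hypothetical and are not taken from the MythTV source. The assumption is that the FE sends a small ping on the event socket every `interval` seconds, records every reply, and reconnects once `is_dead()` turns true.

```python
import time


class HeartbeatMonitor:
    """Tracks replies to periodic pings sent on an otherwise idle socket."""

    def __init__(self, interval=30.0, max_missed=3, clock=time.monotonic):
        self.interval = interval      # seconds between pings (hypothetical value)
        self.max_missed = max_missed  # unanswered pings tolerated before giving up
        self._clock = clock           # injectable so the logic can be tested
        self._last_reply = clock()

    def reply_received(self):
        """Call whenever the peer answers a ping (or sends any event)."""
        self._last_reply = self._clock()

    def is_dead(self):
        """True once the peer has been silent for max_missed intervals."""
        silent_for = self._clock() - self._last_reply
        return silent_for >= self.interval * self.max_missed
```

Because the FE is the side that sends, a dead socket shows up either as a send error or as missing replies, instead of the FE listening forever.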

comment:4 Changed 10 years ago by warpme@…

Hi,

Thx for quick response.

I applied this patch for testing purposes. Tested on 0.22-fixes 23426 + ticket 7836

Results:

After resume from sleep, with a sleep period longer than keep-alive-time + (keep-alive-interval * keep-alive-probes), the user sees the dialog "Backend connection lost".

By this I would say changeset 23397 is a serious regression for setups with separate FE and BE where the user is using S3 on the FE.

I think this is the expected result, as keep-alive closes all TCP connections between BE and FE when the sleep time is longer than keep-alive-time + (keep-alive-interval * keep-alive-probes).
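
The threshold in that expression can be worked out as in the sketch below. The numbers are the stock Linux kernel defaults (`tcp_keepalive_time`, `tcp_keepalive_intvl`, `tcp_keepalive_probes`) and are used purely for illustration; the values that [23397] actually sets are not stated in this ticket, and the function name is hypothetical.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Worst-case time for TCP keep-alive to declare an idle peer dead."""
    return idle + interval * probes


# Stock Linux defaults, for illustration only (not the values from [23397]).
LINUX_DEFAULTS = {"idle": 7200, "interval": 75, "probes": 9}

print(keepalive_detection_seconds(**LINUX_DEFAULTS))  # 7875 seconds
```

Any sleep longer than this figure would leave the connection closed on the side that stayed awake.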

From this perspective, 23397 is a serious regression, as it causes FE-BE connection loss on sleeping FEs. In my case the user sees the "Backend connection lost" popup. If the user then still tries to select a recording to watch, the FE crashes (this crash is not a result of this patch; I also get it, for example, when I restart the BE while the FE sleeps).

Anyway, I think the scenario with separate BE and FE, where FEs use S3 for sleep, needs rethinking.
I would consider, for example, using TCP for all connections initiated from FE to BE, and switching the long, persistent connections initiated from BE to FE from TCP to UDP. As UDP is stateless, maybe it will help solve the problem?

The TCP keep-alive solution is definitely a bad idea IMHO and should be reverted from trunk.

Regarding the heartbeat approach: I think it will not be so easy, as the BE will need a way to discover that an FE is in the sleeping state. Without it, the BE will falsely treat sleeping FEs as dead FEs.

Probably a simpler approach would be to keep all FE->BE connections always on and change the persistent BE->FE connections to a stateless type (UDP).

br

comment:5 in reply to:  4 ; Changed 10 years ago by Jeff Lu <jll544@…>

Replying to warpme@…:

By this I would say changeset 23397 is a serious regression for setups with separate FE and BE where the user is using S3 on the FE.

[23397] is not a regression; it just exposes two existing bugs:

  1. Backend reconnection fails inside SendReceiveStringList - I have submitted patch #8024 for this issue.
  2. Segfault related to the popup window - someone better versed in the UI will need to check this one.

comment:6 in reply to:  5 Changed 10 years ago by warpme@…

Replying to Jeff Lu <jll544@…>:

Replying to warpme@…:

By this I would say changeset 23397 is a serious regression for setups with separate FE and BE where the user is using S3 on the FE.

[23397] is not a regression; it just exposes two existing bugs:

I have a question here. How will it behave once the mentioned bugs are fixed?

Today, in the scenario with a sleeping FE and a restarted BE (this scenario is a good simulation of keep-alive closing all BE-FE connections), after FE resume the user sees the following:

-resume FE

-after a few seconds the main menu appears

-user selects Watch Recordings

-FE shows the select group pop-up

-user selects a particular group

-FE halts for 20-30 sec

-FE pops up the dialog "Backend Connection Lost"

Can we assume that the 20-30 sec wait is also a result of the SendReceiveStringList bug, and that once this bug is fixed a lost FE-BE connection will be re-established immediately?

If yes, then this is progress. If not, 23397 is for me a step back, as by design it will cause a 20-30 sec delay after FE resume.

  1. Backend reconnection fails inside SendReceiveStringList - I have submitted patch #8024 for this issue.
  2. Segfault related to the popup window - someone better versed in the UI will need to check this one.

Should I file a separate bug report for it?

br

comment:7 Changed 10 years ago by warpme@…

I applied ticket 8024 together with changeset 23397 and tickets 7836 & 7839.

Indeed, with ticket 8024 there is no "Backend Connection Lost" dialog, and the 20-30 sec delay after resume is now only a few seconds. Unfortunately, EPG auto-refresh still isn't working.

I'm attaching short logs from the BE/FE.

br

Changed 10 years ago by warpme@…

Attachment: be.log added

BE log with 7836/7839/8024 and changeset 23397

Changed 10 years ago by warpme@…

Attachment: fe.log added

BE log with 7836/7839/8024 and changeset 23397

comment:8 Changed 10 years ago by paulh

Status: infoneeded_new → new

comment:9 Changed 10 years ago by robertm

Milestone: 0.23 → 0.24

comment:10 Changed 9 years ago by robertm

Component: MythTV - General → MythTV - User Interface Library
Owner: changed from Isaac Richards to stuartm
Status: new → assigned

comment:11 Changed 9 years ago by stuartm

Owner: changed from stuartm to paulh

comment:12 Changed 9 years ago by paulh

Owner: paulh deleted
Status: assigned → new

This doesn't look specific to the program guide; it is a problem with our socket communication, which is held together with string at the best of times. Throwing it back into the pool for someone more knowledgeable about that area of the code to look at.

comment:13 Changed 9 years ago by stuartm

Owner: set to cpinkham
Status: new → assigned

comment:14 Changed 9 years ago by cpinkham

Status: assigned → infoneeded

If you can reproduce this issue, please run your frontend and backend with "-v network,extra,socket" and paste the relevant portions of the logs so I can see what is going on over the wire when the issue occurs.

comment:15 Changed 9 years ago by warpme@…

Chris,

Thanks for keeping an eye on this ticket.

Please find the FE/BE logs attached.

Scenario:

1. start backend

2. boot FE

3. sleep FE

4. wait a few hours

5. resume FE

6. enter EPG

7. schedule recording (no refresh)

8. exit EPG

9. enter EPG again. The schedule from step 7 is visible

System is:

BE: Arch, myth trunk 26331

FE: MiniMyth derivative, myth trunk 26375

br

Changed 9 years ago by warpme@…

Attachment: mythbackend.log.bz2 added

FE/BE logs

Changed 9 years ago by warpme@…

Attachment: mythfrontend-fe-test.log.bz2 added

FE/BE logs

Changed 9 years ago by cpinkham

Attachment: 7847_reconnect_event_socket.diff added

Reconnect the event socket when we lose the main BE connection

comment:16 Changed 9 years ago by cpinkham

Can you try the attached 7847_reconnect_event_socket.diff patch and attempt to reproduce the issue again? We aren't currently reconnecting the event socket if it dies, since we never try to send to it. The attached patch will reconnect the event socket if the main command socket has to be reconnected. In the logs, you can see the message "Connection to backend server lost" when we detect that the command socket needs to be reconnected, but the only connection after that is the command socket; the event socket is never reconnected, so you don't receive any more events after unsuspending. This means the FE misses the scheduler's event signalling that the schedule has changed and needs to be reloaded, so the FE never reloads the new list until you exit and re-enter the screen.
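
The approach can be sketched roughly as follows, in Python for brevity. This is not the attached patch: the class `BackendConnection`, its methods and the two connect callables are hypothetical names used only to illustrate reconnecting the event socket whenever the command socket has to be reconnected.

```python
class BackendConnection:
    """Holds a command socket and an event socket to the backend."""

    def __init__(self, connect_command, connect_event):
        # connect_* are callables returning a connected socket-like object.
        self._connect_command = connect_command
        self._connect_event = connect_event
        self.command = connect_command()
        self.event = connect_event()

    def send_command(self, data):
        """Send on the command socket, reconnecting both sockets on failure."""
        try:
            self.command.sendall(data)
        except OSError:
            # The command socket is dead, so the event socket is suspect too.
            self._reconnect_all()
            self.command.sendall(data)

    def _reconnect_all(self):
        # The event socket is never written to, so its failure goes unnoticed.
        # Replace it together with the command socket.
        for sock in (self.command, self.event):
            try:
                sock.close()
            except OSError:
                pass
        self.command = self._connect_command()
        self.event = self._connect_event()
```

Replacing both sockets together is the simple option; detecting a dead event socket on its own would avoid unnecessary reconnects but needs more logic.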

comment:17 Changed 9 years ago by cpinkham

Version: 0.22-fixes → Trunk Head

comment:18 Changed 9 years ago by warpme@…

Chris,

7847 seems to help. We can close ticket as resolved. Thx

comment:19 Changed 9 years ago by cpinkham

Resolution: fixed
Status: infoneeded → closed

(In [26433]) If we lose the backend command socket, such as when we are suspended for a long period of time, also close the event socket before reconnecting to the master backend. Since we don't ever send to the event socket, we weren't detecting it as closed. We should be able to detect this and handle it in a better manner and only reconnect the event socket when needed, but this option is safer closer to a release.

Fixes #7847.
