Opened 13 years ago

Closed 12 years ago

#9792 closed Bug Report - Hang/Deadlock (Fixed)

Deadlocked scheduler

Reported by: warped <warpme@…> Owned by:
Priority: minor Milestone: 0.25
Component: MythTV - General Version: Master Head
Severity: medium Keywords:
Cc: Ticket locked: no

Description

Hi,

Today I have quite unpleasant system state (not seen ever before) where:

-at 9:25PM I discovered be missed 2 recordings (6:30PM and 7:00PM)

-be log since 5:35PM shows only auto-expire entries

-mythweb at 9:25PM in upcoming recordings page shows 7:00PM rec. as still ongoing (show starts at 7PM and should end 7:30PM)

-mythweb status shows all tuners idle

-schedulled recording at 9:30PM not started (mythweb upcoming recordings page shows 7:00PM rec. as still ongoing)

-I can start LiveTV and mythweb status shows correctly shows tuner

It looks like scheduler is deadlocked and no any new recordings can be started
BE backtrace attached.

I'm not sure is this problem related to 9773 - but symptoms are different as in this case for all FE backends seems to be fully functional.

Attachments (11)

trace3.txt (23.4 KB) - added by warped <warpme@…> 13 years ago.
BE trace
trace4.txt (22.2 KB) - added by warpme@… 13 years ago.
Another trace with the same symphtoms
trace5.txt (20.9 KB) - added by warped <warpme@…> 13 years ago.
trace of deadlock with tuned DB
tuned_db.txt (2.7 KB) - added by warped <warpme@…> 13 years ago.
mysqltuner.pl output on tuned DB
db_on_defaults.txt (2.7 KB) - added by warped <warpme@…> 13 years ago.
mysqltuner.pl output on defaults DB
be.log.zip (5.0 KB) - added by warpme@… 13 years ago.
be log from 29/06 deadlock
trce.log.zip (2.3 KB) - added by warpme@… 13 years ago.
be trace from 29/06 deadlock
trace_dd2d7a14.txt (21.4 KB) - added by warpme@… 13 years ago.
trace after dd2d7a14
db_status_dd2d7a14.txt (4.3 KB) - added by warpme@… 13 years ago.
DB stat after dd2d7a14
db_proceslist_dd2d7a14.txt (1.7 KB) - added by warpme@… 13 years ago.
DB proc.lst after dd2d7a14
backend-debug_dd2d7a14.zip (57.3 KB) - added by warpme@… 13 years ago.
BE log after dd2d7a14

Download all attachments as: .zip

Change History (31)

Changed 13 years ago by warped <warpme@…>

Attachment: trace3.txt added

BE trace

comment:1 Changed 13 years ago by danielk

Duplicate of [9773]. Again the problem is not returning from a DB query.

comment:2 Changed 13 years ago by warpme@…

Daniel, Thx. It looks like I have to do something with DB - right ? Just for sanity checking - I'm attaching another trace with exactly the same symphtoms. It will be really great if You find minute ot two and look at it - just for confirmation....

Also may You advice me what will be best to do. Sorry bothering You with such question but You are person best knowing context of this issue.

Which option is best to choose:

1.backup, clean, restore tables

2.clean DB and go with virgin tables

3.keep tables and downgrade engine to older version

4.<maybe You have any great idea>

thx

Changed 13 years ago by warpme@…

Attachment: trace4.txt added

Another trace with the same symphtoms

comment:3 Changed 13 years ago by warpme@…

Opps, Daniel, sorry. I miss Your post on 9773 !

comment:4 Changed 13 years ago by warped <warpme@…>

Well,
Week ago I do backup,delete,create,restore DB. I'm not sure also move I to default mysql settings at that time or earlier. Anyway - since week I was on my.cfg from arch package).
Week of stressing (45-50 rec. per day; alot of TV zapping with LiveTV) - all was OK. Today I was ready to put comment on this ticket that issue resolved.
3h ago I compiled latest HEAD (g62d7084), install and also decide to tune DB by mysqltuner.pl script. Following changes are added into my.conf:
changed table_open_cache 64->192
added query_cache_size = 12M
added query_cache_type = 1
added join_buffer_size = 384k
added tmp_table_size = 24M
added max_heap_table_size = 24M
added thread_cache_size = 8
Tuning gives mysqltuner.pl output like attached file. (also output on default settings is attached). After tuning DB tuning i.e. sched time was reduced from 83sec to 23sec. Refreshing rec. UI is really faster with tuned DB.
1h I was really sad to discover deadlock again. Trace attached.
Hmm - first what comes to mind is tuning impact. Is it possible that this is DB bug ?

Changed 13 years ago by warped <warpme@…>

Attachment: trace5.txt added

trace of deadlock with tuned DB

Changed 13 years ago by warped <warpme@…>

Attachment: tuned_db.txt added

mysqltuner.pl output on tuned DB

Changed 13 years ago by warped <warpme@…>

Attachment: db_on_defaults.txt added

mysqltuner.pl output on defaults DB

comment:5 Changed 13 years ago by danielk

If you look at "Thread 20 (Thread 0xaaba4b70 (LWP 1532)):" it still looks like it is deadlocked in the mysql libraries.

What Linux distro and release are you running? (I'm assuming you are up to date with fixes)

Also from the timing it sounds like running mysqltuner.pl was the immediate cause of the deadlock. This may be a false positive, but had you run that script on the previous database that was deadlocking?

comment:6 Changed 13 years ago by warped <warpme@…>

Daniel,
I'm running on ArchLinux? (in fact I'm using Arch as base for my own distro). In context of fixes Arch is rolling release distro - so sys is usually more fresh than other distros. Average delay between upstream and arch packagers is week or two.
Last pkg sync between my distro and arch pkg base was at 20110513 - so I would say OS is latest greatest.
As I understand deadlock is in mysql libraries (so not in myth code). Is this mean issue is in OS ?. If so - shame on Arch (other than that - it is really great distro for appliances).
I started to drill in DB optimization and currently I'm experimenting with query_caching strategies.
Now I'm testing "Do not cache results in or retrieve results from the query cache". 24h so far OK
"Cache all cacheable query results except for those that begin with SELECT SQL_NO_CACHE"
and "Cache results only for cacheable queries that begin with SELECT SQL_CACHE" both gave deadlocks within hours.

In past I wasn't use mysqltuner.pl - but I had tuned DB settings from some posts in users-mailing list. I just checked - my tunings were touching caching strategy. It was set to "Cache all cacheable query results except for those that begin with SELECT SQL_NO_CACHE". If current tests with "Do not cache results in or retrieve results from the query cache" will finalize OK - it will probably mean that for some reason MySql? 5.1.11, 5.1.12 & 5.5.12 (only those I tested) can't use caching with MythTV on Archlinux....
Daniel - btw thx for Your time helping to deal with issue probably not related to myth !

comment:7 Changed 13 years ago by warpme@…

Hi,

After week of heavy stress tests I'm happy to report that if I optimize mysql query caching with:

"Do not cache results in or retrieve results from the query cache"

system works stable. Test scenarios, results I will post on relevant to ticket9773 thread on dev-list as I think this ticket and 9773 have common root cause related to mysql running on ArchLinux?.
Situation where mysql query caching with:

"Cache all cacheable query results except for those that begin with SELECT SQL_NO_CACHE" and
"Cache results only for cacheable queries that begin with SELECT SQL_CACHE"

are giving deadlocks is still strange for me as issue is with 3 diferent mysql versions and 4 full system upgrades within quite long periods of time (so almost all key OS components are changed). I think both tickets can be closed with status invalid.
Big thx for Your help !

comment:8 Changed 13 years ago by Ian Dall <ian@…>

There are mysql bug reports concerning query_cache_type and query_cache_size. Apparently, query_cache_size = 0 is the default but is not enough to turn query caching off in some versions of mysql.

In any case, this turns out to not work for me. I set query_cache_type = 0, and I still get lock ups.

I have a master BE and two slave BE's. This is fairly new for me. Until recently, I had a single BE (most often with single FE on the same machine). I did not see these backend lock ups when I only had one!

comment:9 Changed 13 years ago by warped <warpme@…>

Ian,

Thx for this post. Your insights intrigued me so I do quick research on bugs.mysql and it seems that query_cache has still unresolved problems. Many ppl are suggesting turn-off this feature and what is more important - query_cache_size=0 is not enough to completely turn it off as query_cache_type is defaulting to 1 - and being non-zero still acquires internal DB spin-locks. It is highly recommended to set both (query_cache_size and query_cache_type) to 0. I do this and will observe system stability. Thx again and good luck !

comment:10 Changed 13 years ago by warpme@…

Hi,

Since my last post into this ticket I do following:

-build fresh mysql 5.1.57 (most recent from 5.1 mainline) & rebuild all dependent pkg (qt, perl/python, etc)

-reinstall DB from full scratch (also with deleting DB dir). Also I disable query_cache as described in my previous post.

-restore myth BD via mythconverg_restore.pl

-turn off all other apps using mysql. Now there is only mythtv.

It was 3 weeks ago. Since that time I had 2 times situation when:

-myth is still recording via already defined recording rules.

-no any new rule can be added (mythweb, UI; I can add it normally but rule simply is not appearing in recording rules).

-after few hours of such state some events causes unblocking scheduler and all missed records are queued and triggered to record at once. This produces series of 0B recordings. I however still can't add new rules into recording rules list.

-be restart helps.

From usage point of view - scheduler is programmed by me & wife. I entered sometimes ago many custom advanced rules. My wife is using myth exactly like VCR - she is going into EPG and manually program their shows usually with "record once". She is doing this really quickly and usually 5-10 shows during programming session.

History of last deadlock: wife ended programing 3 shows approx at 11PM. Next myth finished recording at 11:05PM. Next be log for 6h entries are only about eutoexpiring. BE misses 6 recordings in this period. At 6:50AM there is DB related entry log. Exactly after this BE kicks all queued delayed recordings to record. All them are 0B.

I'm attaching BE debug log and trace from BE when it was in "no accepting new rules" state. There is NO any entry in any system log saying anything about mysql error, restart, whatever.

Hope log/trace helps. (I don't want to mention about v.hard time with WAF. This is bad luck. Lastly I spent considerable time bisecting FE issues but in reality wife sees her missed recordings. Oh...)

br

Changed 13 years ago by warpme@…

Attachment: be.log.zip added

be log from 29/06 deadlock

Changed 13 years ago by warpme@…

Attachment: trce.log.zip added

be trace from 29/06 deadlock

comment:11 Changed 13 years ago by stuartm

@ Ian - What version of mysql is being used and which distribution? I'd like to see if there are any common factors with other database related deadlocks.

comment:12 Changed 13 years ago by Ian Dall <ian@…>

I am using Fedora 14, mysql server version 5.1.56.

comment:13 Changed 13 years ago by beirdo

Type: Bug Report - GeneralBug Report - Hang/Deadlock

comment:14 Changed 13 years ago by Github

Enable automatic reconnection for MySQL connections.

After 155138a127 changed the minimum supported Qt version to 4.6, we now have support for MYSQL_OPT_RECONNECT in QSqlDatabase::setConnectOptions().

Note that before MySQL 5, reconnecting connections were the default in the MySQL C API (so, were always enabled for the Qt-MySQL driver). Unfortunately, depending on how the application is written, reconnections may cause issues (since reconnections have effects on transactions, table locks, temporary tables, session and user variables, prepared statements, and such), so MySQL 5.0.3 changed the default to disable automatic reconnections, and MySQL 5.0.13 added MYSQL_OPT_RECONNECT to make reconnection an application-controllable option. Since the Qt-MySQL driver didn't have support for specifying MYSQL_OPT_RECONNECT, it was impossible to get reconnecting connections with MySQL 5.0.3+ in Qt applications. Finally in Qt4.6, Qt devs added support for specifying MYSQL_OPT_RECONNECT, so we can re-enable automatic reconnect.

Thanks to David Edwards for the reminder to set this option.

Refs #9704, #9773, #9792.

Branch: master Changeset: dd2d7a14d10c55f8bc8d12e944adb3b3c9ac8cea

comment:15 Changed 13 years ago by warpme@…

Hi, I rebuild BE/FE to dd2d7a14. Unfortunately I had deadlock again. Trace, BE log, DB stats/proc_list attached

Changed 13 years ago by warpme@…

Attachment: trace_dd2d7a14.txt added

trace after dd2d7a14

Changed 13 years ago by warpme@…

Attachment: db_status_dd2d7a14.txt added

DB stat after dd2d7a14

Changed 13 years ago by warpme@…

Attachment: db_proceslist_dd2d7a14.txt added

DB proc.lst after dd2d7a14

Changed 13 years ago by warpme@…

Attachment: backend-debug_dd2d7a14.zip added

BE log after dd2d7a14

comment:16 Changed 13 years ago by Github

Fix SQL reconnection logic. Refs #9704. Refs #9773. Refs #9792.

What this does is get rid of the old KickDatabase?() logic and replaces it with a simple reconnect if the DB socket is disconnected, and also gets rid of a recent re-enablement of the automatic reconnects in [dd2d7a14d1] and importantly it extends the reconnection logic in [528324d8ba] so that it resends the prepare() and rebinds the bindings when a reconnect happens in an exec().

This also makes the QSqlQuery parent of MSqlQuery private and adds some wrappers so we can prevent MSqlQuery users from accessing methods that may invalidate some assumption in MSqlQuery + it adds debugging wrappers for first(), last(), previous(), and seek() like we already had for next().

A little bit of history is in order.

In Qt3 the SQL driver automatically reconnected whenever a connection was lost. But the first query after this auto-reconnect always failed. So in 2003 mdz added KickDatabase?() which would recover disconnect and reconnect an automatically all was good with the world.

In Qt4 the QSqlQuery::prepare() started to actually do prepared statements instead of emulating them. This caused a small problem in now it was possible that a disconnect happened between the call to prepare() and exec() and then the exec() would fail. But odds of this were still pretty low for most uses of the database in MythTV. But at some point trolltech also got rid of the automatic reconnections which caused us trouble in Qt3. To fix this we added reconnection logic in [528324d8ba] and this is where the real trouble began. While this reconnection logic really does exactly the same thing that the auto-reconnection logic did in Qt3, the fact that prepare() doesn't just emulate prepared statements in Qt4 means that it's now that when we reconnect to the database in exec() the prepared statement has not been sent to the database.

Note: This should be back-ported to 0.24-fixes after a reasonable burn-in time in master.

Branch: master Changeset: 4dfcdb8dd0c80d096ed95d5eccd489c692efa90d

comment:17 Changed 13 years ago by Github

Few bits missing in [4dfcdb8dd]. Refs #9704. Refs #9773. Refs #9792.

Branch: master Changeset: 1fb0a77fc17089b5cd2bf36ef07998bf68461261

comment:18 Changed 13 years ago by danielk

Note: until all the db reconnection issues are resolved and this ticket is possibly fixed, I will only be referencing the #9704 on reconnection related commits.

comment:19 Changed 12 years ago by warpme@…

Hi, After Daniel's MySql? reconnect commits I didn't observe this issue within 4 weeks on 24/7 working production system. From my perspective we can close this ticket with status resolved.

comment:20 Changed 12 years ago by stuartm

Milestone: unknown0.25
Resolution: Fixed
Status: newclosed
Note: See TracTickets for help on using tickets.