Opened 18 years ago

Closed 18 years ago

#551 closed defect (invalid)

Scheduler consuming all CPU cycles and disk bandwidth (via mysql).

Reported by: anonymous
Owned by: bjm
Priority: major
Milestone: 0.19
Component: mythtv
Version: head
Severity: medium
Keywords: blocking scheduler
Cc:
Ticket locked: no

Description (last modified by danielk)

SVN for the last week has had major issues with blocking in file writing. On a test system that has not had any changes except SVN versions of Myth, we've seen blocking very frequently. The problem typically comes up when recording two HD shows, or viewing one and recording another. This had worked fine with SVN from three weeks ago, including recording 4 HD streams while viewing another. When failures occurred, we tried increasing the ring buffer size to 4 times the default, without any help. The system is not out of memory either. top shows mythbackend running at 60%-100% CPU usage, which I've never seen before. The system load (via the w command or top) has been between 6 and 14 at times, when nothing except system processes was running. Stopping the backend returns the system to a normal load. DMA has been checked on the drives, so that isn't the cause (and nothing has changed since it worked fine).

I think the system load jumping so high may be a sign of a threading problem for file writing?

We've disabled the EIT parser, since that is where I've noticed the most experimental changes lately, but it had no effect.

The end result is video files with major corruption. Over a 5 minute span, only 2-5 second clips every 5-30 seconds are saved.

Attachments (2)

basic.txt (74.1 KB) - added by danielk 18 years ago.
time to schedule normally
aftermythfilldb.txt (37.2 KB) - added by danielk 18 years ago.
time to reschedule after clearing program table and running mythfilldatabase


Change History (46)

comment:1 Changed 18 years ago by anonymous

Some Debug (-v record)

2005-11-01 08:44:46.322 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:44:48.633 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:44:48.637 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:44:48.637 DVB#0 Rec: PID 49 _continuity_count 2 cc 8
2005-11-01 08:44:48.637 DVB#0 Rec: PID 52 _continuity_count 1 cc 14
2005-11-01 08:44:51.277 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:44:58.168 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:44:58.176 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:44:58.176 DVB#0 Rec: PID 49 _continuity_count 5 cc 6
2005-11-01 08:44:58.177 DVB#0 Rec: PID 52 _continuity_count 5 cc 6
2005-11-01 08:45:00.609 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:45:09.966 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:45:09.971 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:45:09.971 DVB#0 Rec: PID 49 _continuity_count 5 cc 4
2005-11-01 08:45:09.972 DVB#0 Rec: PID 52 _continuity_count 7 cc 13
2005-11-01 08:45:13.187 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:45:19.970 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:45:19.978 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:45:19.978 DVB#0 Rec: PID 52 _continuity_count 13 cc 1

comment:2 Changed 18 years ago by danielk

Description: modified (diff)
Owner: changed from Isaac Richards to danielk
Status: new → assigned

I couldn't reproduce this initially, but I can now..

I believe the only change was that I upgraded from Linux 2.6.12 to 2.6.13...

I'll take a look at this tomorrow.

comment:3 Changed 18 years ago by anonymous

I'm running 2.6.12, Athlon XP 2500. Memory is fine, no swapping, etc. Disk usage looks fine. I've swapped usable space between drives (using LVM) and it doesn't matter which drive(s) it's writing to. I don't notice throughput to the disk change either, and it's blocking when between 300k and 2.3MB/s are being written to the disk, which any disk can sustain. I've even played with DMA settings and read/write caching, and none of it has an effect, so I'm fairly certain I've ruled out hardware as the problem. Nothing has been changed except Myth versions. I'll sometimes see mythbackend drop to 92% CPU usage, but the system load still sits around 15, which is extremely high. Network transfers (if going on) and everything else seem to stutter/take a hit while the problem is going on. It really feels like something is capturing the IO and holding it: write syncing, threads, reads, or something. Maybe it's a problem with the file reader? .. Just some ideas.

comment:4 Changed 18 years ago by danielk

I also added a couple of SATA drives. I'm not actually using them yet so I didn't think it was relevant, but there is a message on the mythtv-users list, "HDTV digital static with DVB", where someone claims that the Si 3112 SATA drivers can cause this type of problem. I still think it is a red herring, but I'm using this chipset. I'm going to try reverting to the non-SATA drivers for my chipset as a first step.

Valgrind is reporting a memory problem with the write() in safe_write, but if that really were the problem I would expect my PVR-250 recordings to be messed up as well.

comment:5 Changed 18 years ago by anonymous

I'm using ATA, Seagate and Maxtor drives. I have 4 drives on the motherboard and 2 more on a PCI IDE controller. One other thing I've noticed: the blocking runs for 2-10 minutes, then it's suddenly fine, sometimes for the rest of the recording, and sometimes the problems come back after a few minutes. I can try running valgrind and see what goes on.

Other things I've noticed: sometimes if I kill the backend when it's having problems and start things up again, the blocking starts 1-2 seconds after it begins recording (2-4 seconds after starting the backend). I have the buffer set to 96k (max) and I doubt it fills up that fast...

comment:6 Changed 18 years ago by Adam Egger

I got 14,000 small files on my system on 11/05, with 4,000 new DB entries. I haven't seen this behavior since then. My system: DVB-T/Germany, Debian sid (32-bit), a very small load usually, IDE Samsung HDD, kernel 2.6.13.

comment:7 Changed 18 years ago by danielk

(In [7802]) References #551. Small cleanup + documentation of TFW.

comment:8 Changed 18 years ago by danielk

I'm getting the glitches, but I'm actually not getting these warning messages. I've made the warnings a little more informative in [7802]. Can you try this and attach the resulting log to the ticket?

(You'll have to make a scheduled recording, rather than using LiveTV, right now.)

comment:9 Changed 18 years ago by danielk

Hmm, I don't believe my glitches are coming from the ThreadedFileWriter. I added some instrumentation to DVBRecorder::ReadFromDMX() so that it prints out any time this takes longer than 2 ms. What I got was mostly boring.

2005-11-08 17:47:45.489 ReadFromDMX took 3 ms
2005-11-08 17:47:45.500 ReadFromDMX took 4 ms
2005-11-08 17:47:45.509 ReadFromDMX took 3 ms
2005-11-08 17:47:45.519 ReadFromDMX took 3 ms
2005-11-08 17:47:45.528 HandleKeyframe()
2005-11-08 17:47:45.528 SavePositionMap()
2005-11-08 17:47:45.529 ReadFromDMX took 3 ms
2005-11-08 17:47:45.539 ReadFromDMX took 4 ms

All this would indicate that we should probably double the default buffer sizes.

But once in a while you get something like this:

2005-11-08 17:47:46.045 ReadFromDMX took 3 ms
2005-11-08 17:47:46.052 HandleKeyframe()
2005-11-08 17:47:46.052 SavePositionMap()
2005-11-08 17:47:46.081 find keyframes took 29 ms
2005-11-08 17:47:46.099 ReadFromDMX took 47 ms
2005-11-08 17:47:46.105 ReadFromDMX took 4 ms
2005-11-08 17:47:46.114 ReadFromDMX took 4 ms

So it looks like we are blocking on database saves, not disk writes. And those delays are way bigger than the DVBRecorder's 8KB buffers can absorb; they only allow ReadFromDMX() to take up to 3.4 ms without the chance of an overflow.
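For reference, that 3.4 ms figure is roughly what you would expect, assuming a ~19.4 Mbps ATSC transport stream (the bitrate is an assumption here, not something measured on this system):

 19.4e6 bits/s / 8 ≈ 2.4 MB/s of transport stream data
 8 KB buffer / 2.4 MB/s ≈ 3.4 ms to fill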

The HDTVRecorder is not affected by this because it grows its buffer when something like this happens.

comment:10 Changed 18 years ago by danielk

Priority: major → critical

I believe most of the glitches I'm seeing have nothing to do with the recording.

I am getting glitches due to the position map DB save, but these happen much less frequently; except in the first few seconds of the recording. Most of the glitches I'm getting are on playback, and these are due to avformatdecoder not keeping the buffers full.

Here is a typical buffer line with NTSC:

UUUUAUUUUUUUUUUUUUUUUUUUUUUUUUL

And with ATSC 1920x1080:

AAAAAAAAAAAAAAAUAALAAAULAAAAAAA

As you can see, most of the buffers are "U"navailable in the NTSC case and "A"vailable in the ATSC case. We want most of the buffers to be full, and hence "U"navailable, during normal playback. I'm also noticing that the frontend is now consuming 80-95% of my CPU, where before it was in the 70s.

I'm doing some player benchmarking now to see where all the CPU is going. This still happens with libmpeg2, so ffmpeg is not the prime suspect.

comment:11 Changed 18 years ago by anonymous

I ran the mythbackend with all debug on, I'll link to a 500k file that contains a chunk of debug just before until just after the writes blocked.

linuxis.us/blocking

comment:12 Changed 18 years ago by danielk

Hmm, well it is definitely a TFW blocking issue that you are experiencing. Can you add this line to TFW at line 40 (just before the while loop):

 VERBOSE(VB_IMPORTANT, LOC + "safe_write(size: "<<sz<<")");

Also I fixed the verbose macro for the IOBOUND message.

Can you also give me the output of "hdparm /dev/diskdevice" and "hdparm --direct -t /dev/diskdevice" for the recording volume?

For reference, this is what I get for the read test:

"Timing O_DIRECT disk reads: 186 MB in 3.03 seconds = 61.35 MB/sec"

I believe what I'm experiencing is mostly a RingBuffer issue...

comment:13 Changed 18 years ago by danielk

(In [7819]) References #551.

This makes dvbrecorder less sensitive to delays in writing to disk or DB.

This just sets the DVB buffer sizes large enough so that a delay of up to 500 ms between reads is ok.
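As a rough sanity check on "large enough", using the same ~2.4 MB/s stream rate assumed above for a full ATSC mux: 500 ms of data is about 2.4 MB/s x 0.5 s, or roughly 1.2 MB, so the DVB buffers need to be at least on that order.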

comment:14 Changed 18 years ago by anonymous

Here are a few blocking events. This happened very shortly after starting the recording, so I'm not 100% sure this is the typical blocking I see; I'll post another dump when that comes up. This has your extra printout and this morning's SVN, version [7830].

file: linuxis.us/dump3.blocking : the blocking IO is about 3/4 of the way down; ~1k lines.

I'll only post one set of hdparm output, since all the settings are identical on each drive (I've checked this a dozen times, so I know they're synced).

/dev/hdg:

Timing O_DIRECT disk reads: 82 MB in 3.07 seconds = 58.73 MB/sec

/dev/hdg:

 multcount    = 16 (on)
 IO_support   = 1 (32-bit)
 unmaskirq    = 1 (on)
 using_dma    = 1 (on)
 keepsettings = 0 (off)
 readonly     = 0 (off)
 readahead    = 256 (on)
 geometry     = 24792/255/63, sectors = 398297088, start = 0

/dev/hdg:

 Model=Maxtor 6Y200P0, FwRev=XXXXXX, SerialNo=XXXXXX
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
 BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=16
 CurCHS=4047/16/255, CurSects=16511760, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: (null)

 * signifies the current active mode

comment:15 Changed 18 years ago by anonymous

Okay, it's obvious I fudged one number; I have my drives in UDMA mode 2 right now. I set it that way the other day to see if it helped. I'd have to reboot to reset the UDMA mode to its default of udma6, and I haven't wanted to do that right now. I normally get about 60-70MB/s read, which is why I put that number in. I can reboot and let you ssh into the box if you want proof it's normally like that. ;)

comment:16 Changed 18 years ago by Dibblah

What's the filesystem here?

Looks like the delete that happens here:

2005-11-10 08:37:37.790 About to unlink/delete file: /media/myth/recordings1052_20051109070000.mpg
2005-11-10 08:37:37.914 TFW: safe_write(size: 265456)
2005-11-10 08:37:38.073 TFW: safe_write(size: 264328)
2005-11-10 08:37:38.232 TFW: safe_write(size: 265268)
2005-11-10 08:37:38.390 TFW: safe_write(size: 265644)
2005-11-10 08:37:38.549 TFW: safe_write(size: 264892)
2005-11-10 08:37:38.709 TFW: safe_write(size: 265644)
2005-11-10 08:37:38.867 TFW: safe_write(size: 265456)
2005-11-10 08:37:39.969 TFW, Error: Write() -- IOBOUND begin cnt(188) free(11)
2005-11-10 08:37:40.202 TFW: safe_write(size: 91736)
2005-11-10 08:37:40.202 TFW: safe_write(size: 432552)
2005-11-10 08:37:40.204 TFW: safe_write(size: 524288)
2005-11-10 08:37:40.206 TFW: safe_write(size: 524288)
2005-11-10 08:37:40.208 TFW, Error: Write() -- IOBOUND end
2005-11-10 08:37:40.208 DVB#0 Rec: Warning - TS packet write took 239 ms!
2005-11-10 08:37:40.236 DVB#0 Rec: Warning - ReadFromDMX took 268 ms

Throws everything off...

comment:17 Changed 18 years ago by anonymous

ReiserFS.

comment:18 Changed 18 years ago by danielk

Hmm, I'm curious about the filesystem too, which version of ReiserFS?

XFS and JFS used to be the only Linux filesystems that handled large file deletion well, but I don't know anything about how ReiserFS v4 handles deletion.

But from the debugging it is fairly obvious that if the ringbuffer released the buffer lock between some of the writes, and responded more quickly to a low-buffer situation, things would work (at least now that [7819] makes the DVB buffers a reasonable size). I'm going to test some of the patches in my queue today, then I'll have a look at the ringbuffer problems again. My guess is that at some point the buffer was made larger, and then your system stopped being able to drain it quickly enough.
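Roughly, the fix being described would look something like this minimal sketch (the names and structure are assumptions for illustration, not the actual RingBuffer/ThreadedFileWriter code):

 // Sketch: do the slow write() syscall with the buffer lock released, then
 // signal the blocked writer as soon as space has been freed, so the
 // recorder's Write() never stalls for the whole flush.
 #include <unistd.h>
 #include <qmutex.h>
 #include <qwaitcondition.h>

 void FlushChunk(QMutex &buflock, QWaitCondition &bufferDrained,
                 int fd, const char *chunk, size_t len)
 {
     // the chunk was already claimed from the ring buffer under buflock
     ::write(fd, chunk, len);       // disk I/O runs without the lock held

     buflock.lock();
     // account for the freed space here, then wake any blocked Write()
     bufferDrained.wakeAll();
     buflock.unlock();
 }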

comment:19 Changed 18 years ago by anonymous

ReiserFS 3.6.19. I have run XFS and JFS in the past, and at times I still received the blocking errors. I've been running ReiserFS for about 2 months and things worked fine until the middle of October. I think this file delete brought out /a/ blocking problem, but I don't believe it's the main blocking error we're trying to track down. This was a very short block; the true problem is when the blocking goes on for 2-5 minutes and there was no deleting beforehand. I'll set a show to record, get it to block, then put the dump online.

For fun, here is what happens if I run dd if=/dev/zero of=/file_on_lvm (see linuxis.us/dump4.blocking). I've added comments to the dump where the dd started and ended. You'll notice this is much different from the other dumps, because it practically stops Myth in its tracks. Myth freezes for 15 seconds or so, then does a little work, then waits again (note: the CPU load is about 10%). In the problem we're specifically trying to solve, Myth is constantly running, never freezes/waits, constantly spits out IO blocking messages while processing data, and mythbackend maxes the CPU at 100%. I'll attach a dump of this as soon as it happens again... (within a couple of hours)

comment:20 Changed 18 years ago by anonymous

Two more debug logs.

linuxis.us/dump5.blocking has a couple IO blocking errors, minor failure...

linuxis.us/dump6.blocking is a full-blown blocking state that doesn't correct itself for minutes. The system load at this point is over 12, with mythbackend taking up 99% of the CPU and no other tasks going on.

comment:21 Changed 18 years ago by danielk

Changeset [7844] affects this: it increases the write buffer size to 4 MB for DVB recordings.

comment:22 Changed 18 years ago by danielk

(In [7846]) References #551. Documentation, plus a little tweaking in DiskLoop().

I added code to lower the write size in DiskLoop() if TFW::Write() is already blocking. The reduced size should be small enough that Write() unblocks soon enough for packets not to get dropped, but large enough that the next full-size disk write completes before Write() blocks again.

I also added bufferHasData.wakeAll() in Write() when it gets blocked, so that DiskLoop() doesn't just sit around waiting for more data when a buffer-full situation has been reached; that waiting was adding up to 100 ms to the block time.
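The effect on DiskLoop() is roughly captured by this sketch (the constants and the helper name are illustrative, not the values from the actual [7846] change):

 // Sketch: while Write() is blocked on a full buffer, flush smaller chunks
 // so it unblocks before packets are dropped; otherwise use the larger,
 // more efficient write size.
 #include <algorithm>

 static const unsigned int TFW_DEF_WRITE_SIZE = 1024 * 1024; // illustrative
 static const unsigned int TFW_BLK_WRITE_SIZE = 64 * 1024;   // illustrative

 unsigned int NextWriteSize(unsigned int bytesBuffered, bool writeIsBlocked)
 {
     unsigned int want = writeIsBlocked ? TFW_BLK_WRITE_SIZE
                                        : TFW_DEF_WRITE_SIZE;
     return std::min(want, bytesBuffered); // never write more than is queued
 }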

comment:23 Changed 18 years ago by danielk

anon, can you give this latest code a try?

You may need to tweak some values.

1. The TFW_BLK_WRITE_SIZE. See line 26 of ThreadedFileWriter.cpp.
2. The TFW buffer size. See line 774 of tv_rec.cpp.

Try increasing the first until you don't get back-to-back blocking events. Try increasing the second until the blocking events become infrequent.

The values I picked may work, but they are just educated guesses; actual values from a malfunctioning system would probably work better.

comment:24 Changed 18 years ago by anonymous

I've been playing with numbers for hours. Basically I can't get anything to work. I've put the buffers up to 128MB and it still eventually fills them and IOBOUNDs on writing.

One thing I notice, though, is that the CPU load hits 100%, and if the buffer is 32-64MB it stays at 100% usage for 10-20 seconds before I start to get IO blocking. I'm wondering if something else is going on and the blocking on writing is a result, not a cause. Maybe if I can find what's maxing the CPU and that's fixed, all these problems will go away.

Is there a specific way Myth should be profiled? I'll gladly profile it if I can get instructions on how to do that for Myth. I think if we can find what's killing the CPU, we can find out what's causing these problems.

One side note: with the buffer set high I do start seeing "taking a long time to flush" messages, which are coming from the TFW.

comment:25 Changed 18 years ago by danielk

Profiling MythTV backend:

 ./configure --compile-type=profile [+ normal options]
 make clean; make
 su
 make install
 opcontrol --deinit ; opcontrol --init
 # assuming Intel P4, the next line is correct; for other CPUs you
 # may need to count something other than INSTR_RETIRED
 opcontrol --image=/usr/local/bin/mythbackend -e INSTR_RETIRED:6000:0x0f:0:1
 /etc/init.d/mythbackend restart
 # now start a recording and wait for bad behaviour, then run the following
 opcontrol --reset ; opcontrol --start
 # now wait a minute or two then run
 opcontrol --stop ; opcontrol --dump ; opreport --merge tgid -l --image-path=/usr/local/bin > report.txt
 emacs report.txt &

This assumes you have oprofile installed; I'm using version 0.9.1. On my machine mythbackend only uses a few percent of the CPU, and the highest percentage is used by malloc.

comment:26 Changed 18 years ago by anonymous

I'm running an Athlon XP 2500, and I couldn't find a -e option that worked, so I ran without that command line option. I'm not sure what to make of the results, though (whether this is normal/fine or not).

CPU: Athlon, speed 2088.63 MHz (estimated)

Just as a reminder, when blocking is going on, mythbackend is consuming 99%-100% of the CPU.

Run 1: 15 seconds while blocking was going on.

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000

samples % symbol name

2 22.2222 QValueList<QString>::detachInternal()

2 22.2222 QValueListPrivate<QString>::QValueListPrivate()

2 22.2222 QValueListPrivate<QString>::insert(QValueListIterator<QString>, QString const&)

1 11.1111 MainServer::qt_invoke(int, QUObject*)

1 11.1111 QValueListPrivate<QString>::~QValueListPrivate()

1 11.1111 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::_M_really_sync(unsigned int, unsigned int)

<EOF> .. so in 15 seconds myth only made these 6 calls it appears...

Run2: ~5 minute run with blocking happening during the run, but not the entire time.

samples % symbol name

110 36.6667 Scheduler::RunScheduler?()

40 13.3333 QValueList<QString>::detachInternal()

31 10.3333 QValueListPrivate<QString>::insert(QValueListIterator<QString>, QString const&)

27 9.0000 QValueListPrivate<QString>::QValueListPrivate(QValueListPrivate<QString> const&)

26 8.6667 QValueListPrivate<QString>::~QValueListPrivate()

<SNIP>

Run 3: I left this running for a couple hours, blocking occured during this run, but not the entire time.

samples % symbol name

1132 28.4637 Scheduler::FindNextConflict?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*, std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>&)

620 15.5896 Scheduler::AddNewRecords?()

240 6.0347 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::merge<bool (*)(ProgramInfo?*, ProgramInfo?*)>(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, bool (*)(ProgramInfo?*, ProgramInfo?*))

175 4.4003 Scheduler::MarkShowingsList?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*)

131 3.2939 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::sort<bool (*)(ProgramInfo?*, ProgramInfo?*)>(bool (*)(ProgramInfo?*, ProgramInfo?*))

<SNIP>

comment:27 Changed 18 years ago by danielk

Hmmm, this is very strange; it seems to be spending all its time rescheduling, not writing to disk.

Can you try disabling EIT parsing, to see if that makes the problem go away?

Also when you rerun OProfile try this

 opcontrol --image=/usr/local/bin/mythbackend -e CPU_CLK_UNHALTED:10000:0x00:1:1
 /etc/init.d/mythbackend restart

It lowers the cycles between samples from 100,000 to 10,000, which should give you more meaningful results more quickly. If opcontrol doesn't allow such a low value, it should tell you what the minimum is.

comment:28 Changed 18 years ago by anonymous

As far as I know, EIT is disabled. I used to use the disable-EIT configure option, but I believe it's now disabled by default? (Is this correct?) What (if any) other ways should I use to verify EIT is disabled? I don't recall a setup option for this (but then I do remember a commit about one a week or so ago?).

I have about a year's worth of recording information on this particular system. Maybe I need to clean a DB table?

comment:29 Changed 18 years ago by anonymous

This was a 4 minute run with the new opcontrol sample count. Blocking was going on for all but the last 15 seconds. Prior to this run I cleared the recordedmarkup table (88MB MYD and 58MB MYI files), as well as the people (3MB) and credits (2MB) tables.

Is there a way to set when the scheduler updates? Is there a DB table that may be causing problems? What are the safe tables to reset without losing any critical information? It would be nice if there were a way to flush information that isn't critical to missed recordings, so those of us with years of data could clean things up without starting a new DB, or maybe a way to export recording schedules, music, videos, etc.

<COMPLETE PROFILE DUMP>

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10000

samples % symbol name

786 16.0835 Scheduler::FindNextConflict?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*, std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>&)

547 11.1930 Scheduler::AddNewRecords?()

377 7.7143 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::merge<bool (*)(ProgramInfo?*, ProgramInfo?*)>(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, bool (*)(ProgramInfo?*, ProgramInfo?*))

286 5.8523 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::_M_stringbuf_init(std::_Ios_Openmode)

225 4.6041 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::str() const

200 4.0925 comp_priority(ProgramInfo?*, ProgramInfo?*)

188 3.8469 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::overflow(int)

187 3.8265 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::sort<bool (*)(ProgramInfo?*, ProgramInfo?*)>(bool (*)(ProgramInfo?*, ProgramInfo?*))

157 3.2126 anonymous symbol from section .plt

155 3.1717 Scheduler::MarkShowingsList?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*)

144 2.9466 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::_M_really_sync(unsigned int, unsigned int)

131 2.6806 comp_overlap(ProgramInfo?*, ProgramInfo?*)

115 2.3532 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::find(int const&) const

113 2.3123 char* std::string::_S_construct<char*>(char*, char*, std::allocator<char> const&, std::forward_iterator_tag)

112 2.2918 comp_redundant(ProgramInfo?*, ProgramInfo?*)

109 2.2304 AutoExpire::FillDBOrdered(int)

101 2.0667 Scheduler::RunScheduler?()

97 1.9849 QValueListPrivate<QString>::~QValueListPrivate()

85 1.7393 QValueListPrivate<QString>::QValueListPrivate()

67 1.3710 Scheduler::SchedNewRecords?()

57 1.1664 Scheduler::BuildListMaps?()

56 1.1459 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<char*>(char*, char*,std::allocator<char> const&)

51 1.0436 QMapPrivate<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::find(QString const&) const

49 1.0027 Scheduler::PruneOverlaps?()

45 0.9208 QMapPrivate<int, bool>::find(int const&) const

44 0.9003 QValueList<QString>::detachInternal()

43 0.8799 QValueListPrivate<QString>::at(unsigned int) const

41 0.8390 QValueListPrivate<QString>::insert(QValueListIterator<QString>, QString const&)

36 0.7366 QValueListPrivate<QString>::QValueListPrivate(QValueListPrivate<QString> const&)

34 0.6957 Scheduler::PruneRedundants?()

29 0.5934 QMapPrivate<RecordingType?, int>::find(RecordingType? const&) const

24 0.4911 QMap<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::operator[](int const&)

24 0.4911 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::_M_create_node(ProgramInfo?* const&)

23 0.4706 std::_Rb_tree<QString, QString, std::_Identity<QString>, std::less<QString>, std::allocator<QString> >::find(QString const&) const

18 0.3683 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::insert(std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>, ProgramInfo?* const&)

12 0.2455 AutoExpire::UpdateDontExpireSet?()

7 0.1432 AutoExpire::Sleep(int)

6 0.1228 AutoExpire::IsInDontExpireSet?(QString, QDateTime)

6 0.1228 QMap<RecordingType?, int>::operator[](RecordingType? const&)

6 0.1228 QMapPrivate<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insertSingle(QString const&)

6 0.1228 QMapPrivate<int, bool>::clear(QMapNode<int, bool>*)

6 0.1228 QMapPrivate<int, bool>::insert(QMapNodeBase*, QMapNodeBase*, int const&)

5 0.1023 std::_List_base<ProgramInfo?*, std::allocator<ProgramInfo?*> >::clear()

4 0.0818 QMapPrivate<int, bool>::insertSingle(int const&)

4 0.0818 QMapPrivate<int, jobqueueentry>::clear(QMapNode<int, jobqueueentry>*)

4 0.0818 comp_recstart(ProgramInfo?*, ProgramInfo?*)

4 0.0818 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::erase(std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>)

3 0.0614 HouseKeeper::wantToRun(QString const&, int, int, int)

3 0.0614 QMap<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::operator[](QString const&)

3 0.0614 QMapPrivate<QString, QString>::find(QString const&) const

3 0.0614 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::clear(QMapNode<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >*)

3 0.0614 Scheduler::PruneOldRecords?()

3 0.0614 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::operator=(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > const&)

2 0.0409 AutoExpire::RunExpirer?()

2 0.0409 HouseKeeper::RunHouseKeeping?()

2 0.0409 MainServer::customEvent(QCustomEvent*)

2 0.0409 MainServer::qt_invoke(int, QUObject*)

2 0.0409 MythEvent::~MythEvent?()

2 0.0409 PlaybackSock::UpRef?()

2 0.0409 QMap<RecordingType?, int>::insert(RecordingType? const&, int const&, bool)

2 0.0409 QMapIterator<QString, int>::inc()

2 0.0409 QMapPrivate<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::clear(QMapNode<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >*)

2 0.0409 QString::~QString()

2 0.0409 jobqueueentry::jobqueueentry()

1 0.0205 AutoExpire::ExpireEpisodesOverMax?()

1 0.0205 AutoExpire::SendDeleteMessages?(unsigned int, unsigned int, bool)

1 0.0205 EncoderLink::MatchesRecording?(ProgramInfo? const*)

1 0.0205 MainServer::DoHandleDeleteRecording?(ProgramInfo?*, PlaybackSock?*, bool)

1 0.0205 MainServer::deferredDeleteSlot()

1 0.0205 MainServer::staticMetaObject()

1 0.0205 QMap<QString, QString>::operator[](QString const&)

1 0.0205 QMap<QString, int>::operator[](QString const&)

1 0.0205 QMap<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insert(QString const&, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > const&, bool)

1 0.0205 QMapPrivate<QString, int>::clear()

1 0.0205 QMapPrivate<int, jobqueueentry>::clear()

1 0.0205 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insert(QMapNodeBase*, QMapNodeBase*, int const&)

1 0.0205 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insertSingle(int const&)

1 0.0205 Scheduler::FillEncoderFreeSpaceCache?()

1 0.0205 Scheduler::FillRecordList?()

1 0.0205 Scheduler::MoveHigherRecords?()

1 0.0205 Scheduler::PrintRec?(ProgramInfo?*, char const*)

1 0.0205 std::_Rb_tree<QString, QString, std::_Identity<QString>, std::less<QString>, std::allocator<QString> >::_M_create_node(QString const&)

1 0.0205 std::_Rb_tree<QString, QString, std::_Identity<QString>, std::less<QString>, std::allocator<QString> >::_M_erase(std::_Rb_tree_node<QString>*)

1 0.0205 std::_Rb_tree_rebalance(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*&)

comment:30 Changed 18 years ago by anonymous

I don't know much about the scheduler, but can you compile MythTV without dvb-eit support and see if that makes the problem go away?

BTW, it looks like ReiserFS v3 is 1500% to 5000% worse than JFS at file deletion.

http://fsbench.netnation.com/new_hardware/2.6.0-test8/ide/bonnie.html

I guess we really should point this out in the documentation, given these kinds of disparities. ReiserFS v4 is only about 100% slower than JFS, but even that is significantly worse than any other commonly used filesystem.

comment:31 Changed 18 years ago by anonymous

I recompiled last night with --disable-dvb-eit. I still get the blocking issues. When I did have EIT on, sometimes I would get "Got data from card after 500 ms" warnings, but not always. A few times the system was blocking while EIT scanning was going on; sometimes EIT ran and everything was fine.

I went back through 2 weeks of logs and went line by line through about 30 (random) blocking sessions. Every last IOBOUND session stopped when the scheduler reported it had finished. I had never noticed this before. I don't think that expiring/deleting a file is the issue; it was the reschedule that was run because a file was auto-expired.

comment:32 Changed 18 years ago by danielk

Keywords: scheduler added; threadedfilewriter removed
Owner: changed from danielk to gigem
Status: assigned → new
Summary: blocking in ThreadedFileWriter → Scheduler consuming all CPU cycles and disk bandwidth (via mysql)

I'm reassigning this to gigem; this is looking like a scheduler problem now.

It looks like the scheduler is consuming a lot of CPU, but the scheduler runs at a lower thread priority, so I don't think its CPU use is actually causing the problem.

I think the reason it blocks the ThreadedFileWriter is probably that it uses the mysql server, which both uses the filesystem and runs at the same priority as the file writer.

comment:33 Changed 18 years ago by jonese@…

Could this be the same issue I (and others) had with mythfilldatabase causing "IOBOUND - blocking in ThreadedFileWriter::Write()" errors during a recording? See #348

comment:34 Changed 18 years ago by anonymous

I'm 100% sure the scheduler is causing the problems. After marking 9 channels that the DVB scan found as not visible via MythWeb, I've gone over 24 hours, including recording 3 shows at the same time, with 5 total IOBOUND/blocking messages (previously this was several hundred to thousands). I had 5 channels in my listing that were USDTV channels and normally encrypted, 2 others were simulcast SD versions, and the other two were channels that had gone off the air. The 5 USDTV and 2 off-air channels did not have any guide information.

Over the next few weeks I'll play with enabling various channels to see if I can narrow the problem down to specifics.

comment:35 Changed 18 years ago by danielk

Brandon, can you attach the output of this (just until the initial scheduling is done), with and without the problem channels?

 mythbackend -v schedule | grep -B1 "\-\-"

I think the problem might be Scheduler::AddNewRecords(void); it is a complicated query with joins on a bunch of non-indexed columns, and it takes over 500 ms on my somewhat speedy machine. My guess is that NULLs in the joined columns might slow things down even more.

Also, can you try running mysql niced with the problem channels and tell me if that makes any difference?

comment:36 Changed 18 years ago by danielk

(In [7911]) References #551.

Implements a suggestion by gigem to break up the AddNewRecords query by 'record' rows. This doesn't make the total query take any less time, but it allows us to throttle back a little on how quickly we demand the data from the DB. As suspected, the problem is that mysql is running at the same priority as the recording thread, so if it runs all out for 20-30 seconds at a go, the recorder doesn't get a chance to write enough data to disk.

This adds a 50 ms sleep between each record row, but this can be changed via the 'sleepTime' variable in AddNewRecords(). I don't know yet if it is big enough or bigger than it needs to be, so try increasing it if you still have problems, and lowering it if you do not.
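In outline, the throttling amounts to something like this sketch (the helper functions are assumptions for illustration, not the actual [7911] code):

 // Sketch: run the expensive scheduling query one 'record' rule at a time
 // and sleep between rules, so mysqld doesn't monopolize the disk and CPU
 // in one long burst while a recording is writing to the same filesystem.
 #include <unistd.h>
 #include <qvaluelist.h>

 QValueList<int> GetRecordRuleIDs(void);   // assumed helper: one ID per 'record' row
 void AddNewRecordsForRule(int recordid);  // assumed helper: per-rule query

 void AddNewRecordsThrottled(int sleepTime /* ms */)
 {
     QValueList<int> ruleIDs = GetRecordRuleIDs();
     for (QValueList<int>::iterator it = ruleIDs.begin();
          it != ruleIDs.end(); ++it)
     {
         AddNewRecordsForRule(*it);        // one smaller query per rule
         usleep(sleepTime * 1000);         // let the recorder catch up
     }
 }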

comment:37 Changed 18 years ago by Samuel@…

Has anyone tried an EXPLAIN on that query to see what indexes it actually uses? All it would take is an improper index being used, or a full scan of a table, to cause such lockups.

comment:38 Changed 18 years ago by bjm

Could someone who is having this problem please try the following test.

Run "mythbackend -v schedule". Start a test recording or recordings under the conditions when you see this problem (two local recordings on the master or whatever). From another shell, run "mythbackend --resched" and note the time that it ran. Look in the backend log and find the secheduler run at that time. Post the output starting from:

2005-11-18 00:56:56.443 Reschedule requested for id -1.
Query 0: /record.recordid = 

up through about 50-100 lines later where it says "Scheduling:"

2005-11-18 00:56:57.080 Sort by priority...
2005-11-18 00:56:57.081 BuildListMaps...
2005-11-18 00:56:57.081 SchedNewRecords...
2005-11-18 00:56:57.081 Scheduling:

comment:39 Changed 18 years ago by danielk

Samuel,

EXPLAIN was the first thing I tried. The only things that aren't indexed are on small tables. For kicks I tried adding indexes for them and it made no difference. Also, ANALYZE on the tables made no noticeable difference.

Changed 18 years ago by danielk

Attachment: basic.txt added

time to schedule normally

Changed 18 years ago by danielk

Attachment: aftermythfilldb.txt added

time to reschedule after clearing program table and running mythfilldatabase

comment:40 Changed 18 years ago by danielk

BJM, my reschedules are not so bad as to destroy my recordings, perhaps because the DB uses a separate disk. But I've attached two examples that do spend an extremely long time in a single query nonetheless. One is just a basic run on a DB with cruft from EIT scans; the other is with a pristine 'program' table after clearing it out completely and getting all new data via DataDirect.

As you can see, it takes between 30 and 60 times the time it really should in either case, but AddNewRecords is twice as slow on a non-pristine DB. There were approximately the same number of rows (34,000 vs 36,000) in the program table.

I also tried wacky values for the title, subtitle, and programid columns. Duplicates in the title sped up the query quite a bit, while duplicates in the others increased the query time, but only by a few milliseconds.

comment:41 Changed 18 years ago by danielk

(In [7946]) References #551.

This makes the eitscanner a little smarter about handling multiple EIT scanning cards. It adds randomness to when scanning starts and when channel changes happen; this should make it much less likely that two or more cards act in lockstep when it comes to changing channels and calling for reschedules.

This also implements rate limiting across all eitscanners on the same backend on the calls to reschedule.

This also adds randomness to which channel the scanners start at. This just assures that when multiple scanners first start up, two or more of them aren't scanning the exact same channels. This doesn't address Brandon's problem; it only makes the scan faster right after you first start the backend if you have multiple cards. After a while the scanners would be out of lockstep anyway; this just speeds up the process.

comment:42 Changed 18 years ago by danielk

Priority: critical → major

I'm lowering the priority because I've gotten a report from Brandon Beattie that this no longer appears to be causing glitches with ATSC recordings.

The longish reschedule queries (1750 ms) could still cause problems for other recorders such as ivtv and bttv based recorders.

Two potential workarounds are:

  • Place the DB (usually in /var) on a different hard drive than the recordings
  • Run the mysql db niced

comment:43 Changed 18 years ago by bjm

Owner: changed from gigem to bjm

No scheduler output from a run that took over 20 seconds has been provided. The only examples posted were 1.1 and ~2.5 sec, which proves nothing. Further, there has been no information given about the filesystem layout, in particular whether the /var partition holding the active database files is on the same physical disk where two HD files are being written (or read and written) at the same time.

If this were the case, the disk heads would be thrashing and could not keep up with the queued requests. Coding around a near-failure situation is not a solution. For anyone having this problem, move /var to a different physical disk than the one holding the video recording directory.

Do not reopen this ticket without providing:

The scheduler output as described above for a schedule run of more than 20 seconds.

The scheduler output from the same system when there are no recordings or playback in progress.

The output of:

cat /etc/fstab

df /var/lib/mysql/mythconverg

df /your/video/dir

comment:44 Changed 18 years ago by anonymous

Resolution: invalid
Status: new → closed