Opened 18 years ago
Closed 18 years ago
#551 closed defect (invalid)
Scheduler consuming all CPU cycles and disk bandwidth (via mysql).
Reported by: | anonymous | Owned by: | bjm |
---|---|---|---|
Priority: | major | Milestone: | 0.19 |
Component: | mythtv | Version: | head |
Severity: | medium | Keywords: | blocking scheduler |
Cc: | Ticket locked: | no |
Description
SVN for the last week has had some major issues with blocking in file writing. On a test system that has had no changes except SVN versions of Myth, we've seen blocking very frequently. The problem typically comes up when recording two HD shows, or when viewing one and recording another. This had worked fine, including recording 4 HD streams while viewing another, using SVN from three weeks ago. When failures occurred, we tried increasing the ring buffer size to 4 times the default, without any improvement. The system is not out of memory either. top shows mythbackend running at 60%-100% CPU usage, which I've never seen before. The system load (via the w command or top) has at times been between 6 and 14 when nothing except system processes was running. Stopping the backend returns the system to a normal load. DMA has been checked on the drives, so it's not that (and nothing has changed since it worked fine).
I think the system load jumping so high may be a sign of a threading problem for file writing?
We've disabled the EIT parser, since that is where I've noticed the most experimental changes lately, but it had no effect.
The end result is video files with major corruption: over a 5 minute span, only 2-5 second clips every 5-30 seconds are saved.
Attachments (2)
Change History (46)
comment:1 Changed 18 years ago by
comment:2 Changed 18 years ago by
Description: | modified (diff) |
---|---|
Owner: | changed from Isaac Richards to danielk |
Status: | new → assigned |
I couldn't reproduce this initially, but I can now..
I believe the only change was that I upgraded from Linux 2.6.12 to 2.6.13...
I'll take a look at this tomorrow.
comment:3 Changed 18 years ago by
I'm running 2.6.12, Athlon XP 2500. Memory is fine, no swapping, etc. Disk usage looks fine. I've swapped usable space between drives (using LVM) and it doesn't matter which drive(s) it's writing to. I don't notice throughput to the disk change either, and it's blocking when between 300KB/s and 2.3MB/s are being written to the disk, which any disk can sustain. I've even played with DMA settings and read/write caching, and none of it has an effect, so I'm fairly certain I've ruled out hardware as the problem. Nothing has been changed except Myth versions. I'll sometimes see the CPU usage drop to 92%, but the system load still sits around 15, which is extremely high. Network transfers (if going on) and everything else seem to stutter/take a hit when the problem is happening. It really feels like something is capturing the I/O and holding it: write syncing, threads, or reads, or something. Maybe it's a problem with the file reader? .. Just some ideas.
comment:4 Changed 18 years ago by
I also added a couple of SATA drives. I'm not actually using them yet, so I didn't think it was relevant, but there is a message on the mythtv-users list, "HDTV digital static with DVB", where someone claims that the Si 3112 SATA drivers can cause this type of problem. I still think it is a red herring, but I'm using this chipset. As a first step I'm going to try reverting to the non-SATA drivers for my chipset.
Valgrind is reporting a memory problem with the write() on safe_write, but if that really was the problem I would expect my PVR-250 recordings to be messed up as well.
comment:5 Changed 18 years ago by
I'm using ATA, Seagate and Maxtor drives. I have 4 drives on the motherboard and 2 more on a PCI IDE controller. One other thing: I've noticed the blocking runs from 2-10 minutes, then it's suddenly fine, sometimes for the rest of the recording; sometimes the problems come back after a few minutes. I can try running valgrind and see what goes on.
Other things I've noticed: sometimes if I kill the backend when it's having problems and start things up again, the blocking starts 1-2 seconds after it begins recording (2-4 seconds after starting the backend). I have the buffer set to 96k (max) and I doubt it fills up that fast...
comment:6 Changed 18 years ago by
I got 14,000 small files on my system on 11/05, with 4,000 new DB entries. I haven't seen this behavior since then. My system: DVB-T/Germany, Debian sid (32-bit), usually a very small load, IDE Samsung HDD, kernel 2.6.13.
comment:7 Changed 18 years ago by
comment:8 Changed 18 years ago by
I'm getting the glitches but I'm actually not getting these warning messages. I've made the warnings a little more informative in [7802]. Can you try this and attach the resulting log to the ticket?
(You'll have to make a scheduled recording, rather than using LiveTV, right now.)
comment:9 Changed 18 years ago by
Hmm, I don't believe my glitches are coming from the threadedfilewriter. I added some instrumentation on DVBRecorder::ReadFromDMX(), so that it prints out anytime this takes longer than 2 ms. What I got was mostly boring.
2005-11-08 17:47:45.489 ReadFromDMX took 3 ms
2005-11-08 17:47:45.500 ReadFromDMX took 4 ms
2005-11-08 17:47:45.509 ReadFromDMX took 3 ms
2005-11-08 17:47:45.519 ReadFromDMX took 3 ms
2005-11-08 17:47:45.528 HandleKeyframe()
2005-11-08 17:47:45.528 SavePositionMap()
2005-11-08 17:47:45.529 ReadFromDMX took 3 ms
2005-11-08 17:47:45.539 ReadFromDMX took 4 ms
All this would indicate that we should probably double the default buffer sizes.
But once in a while you get something like this:
2005-11-08 17:47:46.045 ReadFromDMX took 3 ms
2005-11-08 17:47:46.052 HandleKeyframe()
2005-11-08 17:47:46.052 SavePositionMap()
2005-11-08 17:47:46.081 find keyframes took 29 ms
2005-11-08 17:47:46.099 ReadFromDMX took 47 ms
2005-11-08 17:47:46.105 ReadFromDMX took 4 ms
2005-11-08 17:47:46.114 ReadFromDMX took 4 ms

So it looks like we are blocking on database saves, not disk writes. And 47 ms is way bigger than the DVBRecorder's 8 KB buffers allow; they give ReadFromDMX() at most about 3.4 ms without the chance of an overflow. The HDTVRecorder is not affected by this because it grows its buffer when something like this happens.
comment:10 Changed 18 years ago by
Priority: | major → critical |
---|
I believe most of the glitches I'm seeing have nothing to do with the recording.
I am getting glitches due to the position map DB save, but these happen much less frequently, except in the first few seconds of the recording. Most of the glitches I'm getting are on playback, and those are due to avformatdecoder not keeping the buffers full.
Here is a typical buffer line with NTSC:

UUUUAUUUUUUUUUUUUUUUUUUUUUUUUUL

And with ATSC 1920x1080:

AAAAAAAAAAAAAAAUAALAAAULAAAAAAA

As you can see, most of the buffers are "U"navailable in the NTSC case, and "A"vailable in the ATSC case. We want most of the buffers to be full, and hence "U"navailable, during normal playback. I'm also noticing that the frontend is now consuming 80-95% of my CPU, where before it was in the 70's.
I'm doing some player benchmarking now to see where all the CPU is going. This still happens with libmpeg2, so ffmpeg is not the prime suspect.
comment:11 Changed 18 years ago by
I ran mythbackend with all debugging on; I'll link to a 500k file that contains a chunk of the debug output from just before until just after the writes blocked.
linuxis.us/blocking
comment:12 Changed 18 years ago by
Hmm, well it is definitely a TFW blocking issue that you are experiencing. Can you add this line to TFW at line 40 (just before the while): VERBOSE(VB_IMPORTANT, LOC + "safe_write(size: "<<sz<<")");
Also I fixed the verbose macro for the IOBOUND message.
Can you also give me the output of "hdparm /dev/diskdevice" and "hdparm --direct -t /dev/diskdevice" for the recording volume.
For reference, this is what I get for the read test:
"Timing O_DIRECT disk reads: 186 MB in 3.03 seconds = 61.35 MB/sec"
I believe what I'm experiencing is mostly a RingBuffer issue...
comment:13 Changed 18 years ago by
comment:14 Changed 18 years ago by
Here are a few blocking events. This happened very shortly after starting the recording, so I'm not 100% sure this is the typical blocking I see; I'll post another dump when that comes up. This has your extra printout and this morning's SVN, version 7830.
file: linuxis.us/dump3.blocking : the blocking IO is 3/4 of the way down; about 1k lines.
I'll only post one set of hdparm outputs, since all the settings are identical on each drive (I've checked this a dozen times, so I know they're synced).
/dev/hdg:
Timing O_DIRECT disk reads: 82 MB in 3.07 seconds = 58.73 MB/sec
/dev/hdg:
 multcount    = 16 (on)
 IO_support   = 1 (32-bit)
 unmaskirq    = 1 (on)
 using_dma    = 1 (on)
 keepsettings = 0 (off)
 readonly     = 0 (off)
 readahead    = 256 (on)
 geometry     = 24792/255/63, sectors = 398297088, start = 0
/dev/hdg:
 Model=Maxtor 6Y200P0, FwRev=XXXXXX, SerialNo=XXXXXX
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=57
 BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=16
 CurCHS=4047/16/255, CurSects=16511760, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5 udma6
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: (null):
* signifies the current active mode
comment:15 Changed 18 years ago by
Okay, it's obvious I fudged one number: I have my drives in UDMA mode 2 right now. I set it this way the other day to see if it helped. I'd prefer to reboot to reset the UDMA mode to its default, udma6, but I haven't wanted to do that right now. I normally get about 60-70MB/s read, which is why I put that number in. I can reboot and let you ssh into the box if you want proof it's normally like that. ;)
comment:16 Changed 18 years ago by
What's the filesystem here?
Looks like it's the delete that happens:

2005-11-10 08:37:37.790 About to unlink/delete file: /media/myth/recordings1052_20051109070000.mpg
2005-11-10 08:37:37.914 TFW: safe_write(size: 265456)
2005-11-10 08:37:38.073 TFW: safe_write(size: 264328)
2005-11-10 08:37:38.232 TFW: safe_write(size: 265268)
2005-11-10 08:37:38.390 TFW: safe_write(size: 265644)
2005-11-10 08:37:38.549 TFW: safe_write(size: 264892)
2005-11-10 08:37:38.709 TFW: safe_write(size: 265644)
2005-11-10 08:37:38.867 TFW: safe_write(size: 265456)
2005-11-10 08:37:39.969 TFW, Error: Write() -- IOBOUND begin cnt(188) free(11)
2005-11-10 08:37:40.202 TFW: safe_write(size: 91736)
2005-11-10 08:37:40.202 TFW: safe_write(size: 432552)
2005-11-10 08:37:40.204 TFW: safe_write(size: 524288)
2005-11-10 08:37:40.206 TFW: safe_write(size: 524288)
2005-11-10 08:37:40.208 TFW, Error: Write() -- IOBOUND end
2005-11-10 08:37:40.208 DVB#0 Rec: Warning - TS packet write took 239 ms!
2005-11-10 08:37:40.236 DVB#0 Rec: Warning - ReadFromDMX took 268 ms
Throws everything off...
comment:18 Changed 18 years ago by
Hmm, I'm curious about the filesystem too, which version of ReiserFS?
XFS and JFS used to be the only Linux filesystems that handled large file deletion well, but I don't know anything about how ReiserFS v4 handles deletion.
But from the debugging it is fairly obvious that if the ringbuffer released the buffer lock between some of the writes, and responded more quickly to a low-buffer situation, things would work (at least now that [7819] makes the DVB buffers a reasonable size). I'm going to test some of the patches in my queue today, then I'll have a look at the ringbuffer problems again. My guess is that at some point the buffer was made larger, and then your system stopped being able to drain it quickly enough.
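The locking change hypothesised here can be sketched as follows (all names are illustrative, not MythTV's actual RingBuffer code): hold the buffer lock only to choose a chunk to flush, drop it for the duration of the slow write() syscall, and re-acquire it to advance the read pointer.

```cpp
#include <cstddef>
#include <mutex>

// Illustrative sketch only; not MythTV's actual RingBuffer implementation.
struct SketchRingBuffer
{
    std::mutex lock;      // protects head/tail
    size_t     head = 0;  // producer (recorder) position
    size_t     tail = 0;  // consumer (disk writer) position

    size_t PendingBytes() const { return head - tail; }

    void DiskLoopIteration()
    {
        size_t chunk;
        {
            std::lock_guard<std::mutex> guard(lock);
            chunk = PendingBytes();      // decide what to flush...
        }                                // ...then DROP the lock

        // write(fd, buf + tail, chunk);  // the slow syscall runs unlocked,
                                          // so Write() can keep appending

        {
            std::lock_guard<std::mutex> guard(lock);
            tail += chunk;               // re-acquire only to advance the tail
        }
    }
};
```

Holding the lock across the write() is what makes the producer stall for the full duration of a slow disk operation; releasing it between the two critical sections is the fix being suggested.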
comment:19 Changed 18 years ago by
ReiserFS 3.6.19. I have run XFS and JFS in the past, and at times I still received the blocking errors. I've been running ReiserFS for about 2 months and things worked fine until the middle of October. I think this deleted file brought out /a/ blocking problem, but I don't believe it's the main blocking error we're trying to track down. This was a very short block; the true problem is when it goes on for 2-5 minutes of blocking with no deleting beforehand. I'll set a show to record, get it to block, then put the dump online.
For fun, here is what happens if I run dd if=/dev/zero of=/file_on_lvm
(see linuxis.us/dump4.blocking). I've added comments to the dump where the dd started and ended. You'll notice this is much different than the other dumps, because it practically stops Myth in its place. Myth freezes for 15 seconds or so, then does a little work, then waits again (note: the CPU load is about 10%). In the problem we're specifically trying to solve, Myth is constantly running, never freezes/waits, constantly spits out IO blocking messages while processing data, and mythbackend maxes the CPU at 100%. I'll attach a dump of this as soon as it happens again... (within a couple hours)
comment:20 Changed 18 years ago by
Two more debug logs.
linuxis.us/dump5.blocking has a couple IO blocking errors, minor failure...
linuxis.us/dump6.blocking is a full blown/doesn't correct for minutes blocking state. The system load at this point is over 12, mythbackend taking up 99% of the CPU, no other tasks going on.
comment:21 Changed 18 years ago by
Changeset [7844] affects this, it increases the write buffer size to 4 MB for DVB recordings.
comment:22 Changed 18 years ago by
(In [7846]) References #551. Documentation, plus a little tweaking in DiskLoop().
I added code to lower the write size in DiskLoop() if TFW::Write() is already blocking. This should be small enough that Write() unblocks soon enough that packets don't get dropped, but large enough that the next full-size disk write completes before Write() blocks again.
I also added bufferHasData.wakeAll() in Write() when it gets blocked, so that DiskLoop() doesn't just sit around waiting for more data when a buffer-full situation has been reached; that waiting was adding up to 100 ms to the block time.
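The write-size throttling described above might look roughly like this. A minimal sketch with assumed constants; the real TFW_BLK_WRITE_SIZE lives in ThreadedFileWriter.cpp, and the values below are illustrative, not the committed numbers:

```cpp
#include <algorithm>
#include <cstddef>

// Assumed values for illustration only.
const size_t TFW_DEF_WRITE_SIZE = 512 * 1024;  // normal write granularity
const size_t TFW_BLK_WRITE_SIZE = 64 * 1024;   // smaller chunks while Write() is blocked

// When the producer is blocked in Write(), drain in smaller chunks so the
// buffer lock is released more often and Write() unblocks sooner.
size_t NextWriteSize(size_t bytes_buffered, bool writer_blocked)
{
    size_t max_sz = writer_blocked ? TFW_BLK_WRITE_SIZE : TFW_DEF_WRITE_SIZE;
    return std::min(bytes_buffered, max_sz);
}
```

The trade-off is exactly the one the comment describes: chunks small enough to unblock the producer quickly, but large enough that disk writes stay efficient.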
comment:23 Changed 18 years ago by
anon, can you give this latest code a try?
You may need to tweak some values.
1/ TFW_BLK_WRITE_SIZE. See line 26 of ThreadedFileWriter.cpp
2/ The TFW buffer size. See line 774 of tv_rec.cpp
Try increasing the first until you don't get back to back blocking events. Try increasing the second until the blocking events become infrequent.
The values I picked may work, but they are just educated guesses, actual values from a malfunctioning system would probably work better.
comment:24 Changed 18 years ago by
I've been playing with numbers for hours. Basically I can't get anything to work. I've put buffers up to 128MB and it still eventually fills them and IOBOUNDs on writing.
One thing I notice, though, is that the CPU load hits 100%, and if the buffer is 32-64MB, it's at 100% usage for 10-20 seconds before I start to get IO blocking. I'm wondering if something else is going on and the blocking on writing is a result, not a cause. Maybe if I can find what's maxing the CPU and that's fixed, all these problems will go away.
Is there a specific way Myth should be profiled? I'll gladly profile it if I can get instructions on how to do that for Myth. I think if we can find what's killing the CPU, we can find out what's causing these problems.
One side note: with the buffer set high I do start seeing "taking a long time to flush", which is coming from the TFW.
comment:25 Changed 18 years ago by
Profiling MythTV backend:
./configure --compile-type=profile [+ normal options]
make clean; make
su
make install
opcontrol --deinit ; opcontrol --init
# assuming Intel P4, the next line is correct; for other CPUs you
# may need to count something other than INSTR_RETIRED
opcontrol --image=/usr/local/bin/mythbackend -e INSTR_RETIRED:6000:0x0f:0:1
/etc/init.d/mythbackend restart
# now start a recording and wait for bad behaviour, then run the following
opcontrol --reset ; opcontrol --start
# now wait a minute or two then run
opcontrol --stop ; opcontrol --dump ; opreport --merge tgid -l --image-path=/usr/local/bin > report.txt
emacs report.txt &
This assumes you have oprofile installed, I'm using version 0.9.1. On my machine mythbackend only uses a few % of CPU and the highest percentage is used by malloc.
comment:26 Changed 18 years ago by
I'm running an Athlon XP 2500, and I couldn't find a -e option that worked, so I ran without that command-line option. I'm not sure what to make of the results, though (whether this is normal/fine or not).
CPU: Athlon, speed 2088.63 MHz (estimated)
Just as a reminder, when blocking is going on, mythbackend is consuming 99%-100% of the CPU.
Run 1: 15 seconds while blocking was going on.
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000

samples % symbol name
2 22.2222 QValueList<QString>::detachInternal()
2 22.2222 QValueListPrivate<QString>::QValueListPrivate()
2 22.2222 QValueListPrivate<QString>::insert(QValueListIterator<QString>, QString const&)
1 11.1111 MainServer::qt_invoke(int, QUObject*)
1 11.1111 QValueListPrivate<QString>::~QValueListPrivate()
1 11.1111 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::_M_really_sync(unsigned int, unsigned int)
<EOF> .. so in 15 seconds myth only made these 6 calls it appears...
Run2: ~5 minute run with blocking happening during the run, but not the entire time.
samples % symbol name
110 36.6667 Scheduler::RunScheduler?()
40 13.3333 QValueList<QString>::detachInternal()
31 10.3333 QValueListPrivate<QString>::insert(QValueListIterator<QString>, QString const&)
27 9.0000 QValueListPrivate<QString>::QValueListPrivate(QValueListPrivate<QString> const&)
26 8.6667 QValueListPrivate<QString>::~QValueListPrivate()
<SNIP>
Run 3: I left this running for a couple hours, blocking occured during this run, but not the entire time.
samples % symbol name
1132 28.4637 Scheduler::FindNextConflict?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*, std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>&)
620 15.5896 Scheduler::AddNewRecords?()
240 6.0347 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::merge<bool (*)(ProgramInfo?*, ProgramInfo?*)>(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, bool (*)(ProgramInfo?*, ProgramInfo?*))
175 4.4003 Scheduler::MarkShowingsList?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*)
131 3.2939 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::sort<bool (*)(ProgramInfo?*, ProgramInfo?*)>(bool (*)(ProgramInfo?*, ProgramInfo?*))
<SNIP>
comment:27 Changed 18 years ago by
Hmmm, this is very strange; it seems to be spending all its time rescheduling, not writing to disk.
Can you try disabling EIT parsing, to see if that makes the problem go away?
Also when you rerun OProfile try this
opcontrol --image=/usr/local/bin/mythbackend -e CPU_CLK_UNHALTED:10000:0x00:1:1 /etc/init.d/mythbackend restart
It lowers the cycles between samples from 100,000 to 10,000 which should give you more meaningful results more quickly. If opcontrol doesn't allow such a low value, it should tell you what the minimum is.
comment:28 Changed 18 years ago by
As far as I know, EIT is disabled. I used to run the disable-EIT configure option, but I believe it's now disabled by default? (Is this correct?) What (if any) other ways are there to verify EIT is disabled? I don't recall a setup option for this (but then I do remember a commit about one a week or so ago?).
I have about a year's worth of recording information on this particular system. Maybe I need to clean a DB table?
comment:29 Changed 18 years ago by
This was a 4-minute run with the sample rate increased. Blocking was going on for all but the last 15 seconds. Prior to this run I cleared the recordedmarkup table (88MB MYD and 58MB MYI files), and also the people (3MB) and credits (2MB) tables.
Is there a way to set when the scheduler updates? Is there a DB table that may be causing problems? What are the safe tables to reset without losing any critical information? It would be nice if there were a "flush information that's not critical to recordings" option, so those of us with year(s) of information could clean things up without starting a new DB. Or maybe an "export recording schedules, music, videos, etc."...
<COMPLETE PROFILE DUMP>

Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10000
samples % symbol name
786 16.0835 Scheduler::FindNextConflict?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*, std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>&)
547 11.1930 Scheduler::AddNewRecords?()
377 7.7143 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::merge<bool (*)(ProgramInfo?*, ProgramInfo?*)>(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, bool (*)(ProgramInfo?*, ProgramInfo?*))
286 5.8523 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::_M_stringbuf_init(std::_Ios_Openmode)
225 4.6041 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::str() const
200 4.0925 comp_priority(ProgramInfo?*, ProgramInfo?*)
188 3.8469 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::overflow(int)
187 3.8265 void std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::sort<bool (*)(ProgramInfo?*, ProgramInfo?*)>(bool (*)(ProgramInfo?*, ProgramInfo?*))
157 3.2126 anonymous symbol from section .plt
155 3.1717 Scheduler::MarkShowingsList?(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >&, ProgramInfo?*)
144 2.9466 std::basic_stringbuf<char, std::char_traits<char>, std::allocator<char> >::_M_really_sync(unsigned int, unsigned int)
131 2.6806 comp_overlap(ProgramInfo?*, ProgramInfo?*)
115 2.3532 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::find(int const&) const
113 2.3123 char* std::string::_S_construct<char*>(char*, char*, std::allocator<char> const&, std::forward_iterator_tag)
112 2.2918 comp_redundant(ProgramInfo?*, ProgramInfo?*)
109 2.2304 AutoExpire::FillDBOrdered(int)
101 2.0667 Scheduler::RunScheduler?()
97 1.9849 QValueListPrivate<QString>::~QValueListPrivate()
85 1.7393 QValueListPrivate<QString>::QValueListPrivate()
67 1.3710 Scheduler::SchedNewRecords?()
57 1.1664 Scheduler::BuildListMaps?()
56 1.1459 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string<char*>(char*, char*,std::allocator<char> const&)
51 1.0436 QMapPrivate<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::find(QString const&) const
49 1.0027 Scheduler::PruneOverlaps?()
45 0.9208 QMapPrivate<int, bool>::find(int const&) const
44 0.9003 QValueList<QString>::detachInternal()
43 0.8799 QValueListPrivate<QString>::at(unsigned int) const
41 0.8390 QValueListPrivate<QString>::insert(QValueListIterator<QString>, QString const&)
36 0.7366 QValueListPrivate<QString>::QValueListPrivate(QValueListPrivate<QString> const&)
34 0.6957 Scheduler::PruneRedundants?()
29 0.5934 QMapPrivate<RecordingType?, int>::find(RecordingType? const&) const
24 0.4911 QMap<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::operator[](int const&)
24 0.4911 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::_M_create_node(ProgramInfo?* const&)
23 0.4706 std::_Rb_tree<QString, QString, std::_Identity<QString>, std::less<QString>, std::allocator<QString> >::find(QString const&) const
18 0.3683 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::insert(std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>, ProgramInfo?* const&)
12 0.2455 AutoExpire::UpdateDontExpireSet?()
7 0.1432 AutoExpire::Sleep(int)
6 0.1228 AutoExpire::IsInDontExpireSet?(QString, QDateTime)
6 0.1228 QMap<RecordingType?, int>::operator[](RecordingType? const&)
6 0.1228 QMapPrivate<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insertSingle(QString const&)
6 0.1228 QMapPrivate<int, bool>::clear(QMapNode<int, bool>*)
6 0.1228 QMapPrivate<int, bool>::insert(QMapNodeBase*, QMapNodeBase*, int const&)
5 0.1023 std::_List_base<ProgramInfo?*, std::allocator<ProgramInfo?*> >::clear()
4 0.0818 QMapPrivate<int, bool>::insertSingle(int const&)
4 0.0818 QMapPrivate<int, jobqueueentry>::clear(QMapNode<int, jobqueueentry>*)
4 0.0818 comp_recstart(ProgramInfo?*, ProgramInfo?*)
4 0.0818 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::erase(std::_List_iterator<ProgramInfo?*, ProgramInfo?*&, ProgramInfo?>)
3 0.0614 HouseKeeper::wantToRun(QString const&, int, int, int)
3 0.0614 QMap<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::operator[](QString const&)
3 0.0614 QMapPrivate<QString, QString>::find(QString const&) const
3 0.0614 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::clear(QMapNode<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >*)
3 0.0614 Scheduler::PruneOldRecords?()
3 0.0614 std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> >::operator=(std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > const&)
2 0.0409 AutoExpire::RunExpirer?()
2 0.0409 HouseKeeper::RunHouseKeeping?()
2 0.0409 MainServer::customEvent(QCustomEvent*)
2 0.0409 MainServer::qt_invoke(int, QUObject*)
2 0.0409 MythEvent::~MythEvent?()
2 0.0409 PlaybackSock::UpRef?()
2 0.0409 QMap<RecordingType?, int>::insert(RecordingType? const&, int const&, bool)
2 0.0409 QMapIterator<QString, int>::inc()
2 0.0409 QMapPrivate<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::clear(QMapNode<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >*)
2 0.0409 QString::~QString()
2 0.0409 jobqueueentry::jobqueueentry()
1 0.0205 AutoExpire::ExpireEpisodesOverMax?()
1 0.0205 AutoExpire::SendDeleteMessages?(unsigned int, unsigned int, bool)
1 0.0205 EncoderLink::MatchesRecording?(ProgramInfo? const*)
1 0.0205 MainServer::DoHandleDeleteRecording?(ProgramInfo?*, PlaybackSock?*, bool)
1 0.0205 MainServer::deferredDeleteSlot()
1 0.0205 MainServer::staticMetaObject()
1 0.0205 QMap<QString, QString>::operator[](QString const&)
1 0.0205 QMap<QString, int>::operator[](QString const&)
1 0.0205 QMap<QString, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insert(QString const&, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > const&, bool)
1 0.0205 QMapPrivate<QString, int>::clear()
1 0.0205 QMapPrivate<int, jobqueueentry>::clear()
1 0.0205 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insert(QMapNodeBase*, QMapNodeBase*, int const&)
1 0.0205 QMapPrivate<int, std::list<ProgramInfo?*, std::allocator<ProgramInfo?*> > >::insertSingle(int const&)
1 0.0205 Scheduler::FillEncoderFreeSpaceCache?()
1 0.0205 Scheduler::FillRecordList?()
1 0.0205 Scheduler::MoveHigherRecords?()
1 0.0205 Scheduler::PrintRec?(ProgramInfo?*, char const*)
1 0.0205 std::_Rb_tree<QString, QString, std::_Identity<QString>, std::less<QString>, std::allocator<QString> >::_M_create_node(QString const&)
1 0.0205 std::_Rb_tree<QString, QString, std::_Identity<QString>, std::less<QString>, std::allocator<QString> >::_M_erase(std::_Rb_tree_node<QString>*)
1 0.0205 std::_Rb_tree_rebalance(std::_Rb_tree_node_base*, std::_Rb_tree_node_base*&)
comment:30 Changed 18 years ago by
I don't know much about the scheduler, but can you compile MythTV without dvb-eit support and see if that makes the problem go away?
BTW, it looks like ReiserFS v3 is 15 to 50 times worse than JFS at file deletion.
http://fsbench.netnation.com/new_hardware/2.6.0-test8/ide/bonnie.html
I guess we really should point this out in the documentation, given disparities of this size. ReiserFS v4 is only about twice as slow as JFS, but even that is significantly worse than any other commonly used filesystem.
comment:31 Changed 18 years ago by
I recompiled last night with --disable-dvb-eit. I still get the blocking issues. When I did have EIT on, I would sometimes get "Got data from card" (>500), but not always. A few times the system was blocking while EIT scanning was going on; sometimes EIT ran and everything was fine.
I went back through 2 weeks of logs and went line by line through about 30 (random) blocking sessions. Every last IOBOUND session stopped when the scheduler reported it had finished. I had never noticed this before. I don't think that expiring/deleting a file is the issue; it was the reschedule that was triggered when a file was auto-expired.
comment:32 Changed 18 years ago by
Keywords: | scheduler added; threadedfilewriter removed |
---|---|
Owner: | changed from danielk to gigem |
Status: | assigned → new |
Summary: | blocking in ThreadedFileWriter → Scheduler consuming all CPU cycles and disk bandwidth (via mysql). |
I'm reassigning this to gigem, this is looking like a scheduler problem now.
It looks like the scheduler is consuming a lot of CPU, but the scheduler runs at a lower threading priority, so I don't think its CPU use is actually causing the problem.
I think the reason it blocks the ThreadedFileWriter is probably because it uses the mysql server, which both uses the filesystem and runs at the same priority as the file writer.
comment:33 Changed 18 years ago by
Could this be the same issue I (and others) had with mythfilldatabase causing "IOBOUND - blocking in ThreadedFileWriter::Write()" errors during a recording? See #348
comment:34 Changed 18 years ago by
I'm 100% sure the scheduler is causing the problems. Since marking 9 channels that the DVB scan found as not visible (via MythWeb), I've gone over 24 hours, including recording 3 shows at the same time, with 5 total IOBOUND/blocking messages (prior to this it was several hundred to thousands). I had 5 channels in my listing that were USDTV channels and normally encrypted. 2 others were simulcast SD versions, and the other two were channels that had gone off the air. The 5 USDTV and 2 off-air channels did not have any guide information.
Over the next few weeks I'll play with enabling various channels to see if I can narrow the problem down to specifics.
comment:35 Changed 18 years ago by
Brandon, can you attach the output of this (just until the initial scheduling is done):
mythbackend -v schedule | grep -B1 "\-\-"
with and without the problem channels?
I think the problem might be Scheduler::AddNewRecords(void); it is a complicated query with joins on a bunch of non-indexed columns, and it takes over 500 ms on my somewhat speedy machine. My guess is that NULLs in the joined columns might slow things down even more.
Also, can you try running mysql niced with the problem channels and tell me if that makes any difference?
comment:36 Changed 18 years ago by
Implements a suggestion by gigem to break up the AddNewRecords query by 'record' rows. This doesn't make the total query take any less time, but allows us to throttle back a little on how quickly we demand the data from the DB. As suspected, the problem is that mysql runs at the same priority as the recording thread, so if it is running all-out for 20-30 seconds at a go, the recorder doesn't get a chance to write enough data to disk.
This adds a 50 ms sleep between each record row, but that can be changed via the 'sleepTime' variable in AddNewRecords(). I don't know yet if this is big enough, or bigger than it needs to be, so try increasing it if you still have problems and lowering it if you do not.
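The throttling described in this comment might look roughly like this. Function and parameter names are illustrative; only the 'sleepTime' idea and the 50 ms default come from the comment itself:

```cpp
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Run one small query per 'record' row instead of one giant join, sleeping
// between rows so mysql never monopolises the disk for 20-30 seconds straight.
// Returns the number of queries issued (handy for testing the sketch).
int AddNewRecordsThrottled(const std::vector<int> &recordids,
                           int sleepTimeMs,
                           const std::function<void(int)> &runQuery)
{
    int queries = 0;
    for (int recordid : recordids)
    {
        runQuery(recordid);  // hypothetical per-record query callback
        ++queries;
        // Yield the disk/DB to the recorder thread between queries.
        std::this_thread::sleep_for(std::chrono::milliseconds(sleepTimeMs));
    }
    return queries;
}
```

The total work is unchanged, but the recorder thread gets a window to flush between each per-record query, which is exactly the trade-off the changeset describes.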
comment:37 Changed 18 years ago by
Has anyone tried an EXPLAIN on that query to see what indexes it actually uses? All it would take is an improper index, or a full scan of a table, to cause lockups like these.
comment:38 Changed 18 years ago by
Could someone who is having this problem please try the following test.
Run "mythbackend -v schedule". Start a test recording or recordings under the conditions where you see this problem (two local recordings on the master, or whatever). From another shell, run "mythbackend --resched" and note the time that it ran. Look in the backend log and find the scheduler run at that time. Post the output starting from:
2005-11-18 00:56:56.443 Reschedule requested for id -1. Query 0: /record.recordid =

up through about 50-100 lines later where it says "Scheduling:":

2005-11-18 00:56:57.080 Sort by priority...
2005-11-18 00:56:57.081 BuildListMaps...
2005-11-18 00:56:57.081 SchedNewRecords...
2005-11-18 00:56:57.081 Scheduling:
comment:39 Changed 18 years ago by
Samuel,
EXPLAIN was the first thing I tried. The only things that aren't indexed are on small tables. For kicks I tried adding indexes for them and it made no difference. Also, ANALYZE on the tables made no noticeable difference.
Changed 18 years ago by
Attachment: | aftermythfilldb.txt added |
---|
time to reschedule after clearing program table and running mythfilldatabase
comment:40 Changed 18 years ago by
BJM, my reschedules are not so bad as to destroy my recordings, perhaps because the DB uses a separate disk. But I've attached two examples that do spend an extremely long time in a single query nonetheless. One is just a basic run on a DB with cruft from EIT scans; the other is with a pristine 'program' table after clearing it out completely and getting all new data via datadirect.
As you can see, it takes between 30 and 60 times as long as it really should in either case, but AddNewRecords? is twice as slow on a non-pristine DB. There were approximately the same number of rows (34,000 vs. 36,000) in the program table.
I also tried wacky values for the title, subtitle, and programid columns. The duplicates in the title sped up the query quite a bit, while duplicates in the others increased the query time, but only by a few milliseconds.
comment:41 Changed 18 years ago by
This makes the eitscanner a little smarter about handling multiple EIT scanning cards. It adds randomness to when scanning starts and when channel changes happen; this should make it much less likely that two or more cards act in lockstep when it comes to changing channels and calling for reschedules.
This also implements rate limiting across all eitscanners on the same backend on the calls to reschedule.
This also adds randomness to which channel the scanners start at. This just ensures that when multiple scanners first start up, two or more of them aren't scanning exactly the same channels. This doesn't address Brandon's problem, but only makes the scan faster right after you first start the backend if you have multiple cards. After a while the scanners would fall out of lockstep anyway; this just speeds the process.
comment:42 Changed 18 years ago by
Priority: | critical → major |
---|
I'm lowering the priority because I've gotten a report from Brandon Beattie that this no longer appears to cause glitches with ATSC recordings.
The longish reschedule queries (1750 ms) could still cause problems for other recorders such as ivtv and bttv based recorders.
Two potential workarounds are:
- Place the DB (usually in /var) on a different hard drive than the recordings
- Run the mysql db niced
comment:43 Changed 18 years ago by
Owner: | changed from gigem to bjm |
---|
No scheduler output from a run that took over 20 seconds has been provided. The only examples posted were 1.1 and ~2.5 sec, which prove nothing. Further, there has been no information given about the filesystem layout, in particular whether the /var partition holding the active database files is on the same physical disk where two HD files are being written (or read and written) at the same time.
If this were the case, the disk heads would be thrashing and could not keep up with the queued requests. Coding around a near-failure situation is not a solution. For anyone having this problem, move /var to a different physical disk than that of the video recording directory.
Do not reopen this ticket without providing:
The scheduler output as described above for a schedule run of more than 20 seconds.
The scheduler output from the same system when there are no recordings or playback in progress.
The output of:
cat /etc/fstab
df /var/lib/mysql/mythconverg
df /your/video/dir
comment:44 Changed 18 years ago by
Resolution: | → invalid |
---|---|
Status: | new → closed |
Some Debug (-v record)
2005-11-01 08:44:46.322 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:44:48.633 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:44:48.637 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:44:48.637 DVB#0 Rec: PID 49 _continuity_count 2 cc 8
2005-11-01 08:44:48.637 DVB#0 Rec: PID 52 _continuity_count 1 cc 14
2005-11-01 08:44:51.277 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:44:58.168 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:44:58.176 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:44:58.176 DVB#0 Rec: PID 49 _continuity_count 5 cc 6
2005-11-01 08:44:58.177 DVB#0 Rec: PID 52 _continuity_count 5 cc 6
2005-11-01 08:45:00.609 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:45:09.966 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:45:09.971 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:45:09.971 DVB#0 Rec: PID 49 _continuity_count 5 cc 4
2005-11-01 08:45:09.972 DVB#0 Rec: PID 52 _continuity_count 7 cc 13
2005-11-01 08:45:13.187 IOBOUND - blocking in ThreadedFileWriter::Write()
2005-11-01 08:45:19.970 DVB#0 Rec: DVB Buffer overflow error detected on read
2005-11-01 08:45:19.978 DVB#0 Rec: WARNING - Got data from card after 500 ms.
2005-11-01 08:45:19.978 DVB#0 Rec: PID 52 _continuity_count 13 cc 1