Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#1076 closed patch (fixed)

mythbackend dies when writing big files (4 GB) - FAT32 filesystem

Reported by: buzz@… Owned by: danielk
Priority: minor Milestone: 0.20
Component: mythtv Version: head
Severity: medium Keywords:
Cc: Ticket locked: no

Description

Buzz Says: (I'm making the -dev conversation into a ticket, as it's verifiably a bug when the backend seg faults) Scenario: Backend saves files to FAT32 partition. Backend tries to exceed 4GB (or there abouts) while unattended. Backend dies with error "File size limit exceeded" emitted by OS.

Backend's last message prior to dying was: "TFW: safe_swite() funky usleep" (message comes from ThreadedFileWriter?.cpp )


Buzz says: Wouldn't it be reasonable if it barfed the recording partially/entirely WITHOUT crashing mythbackend, rather than crashing entirely as it does now.


Isaac says: Backtrace?:I'm not going to want to add code specifically to handle fat32, but certainly dying is bad.


Buzz says: Last 4 lines of 'mythbackend -v all' followed by backtrace: 2006-01-19 16:35:20.575 MSqlQuery: INSERT INTO recordedmarkup (chanid, starttime, mark, type, offset) VALUES ( '1007' , '2006-01-19T15:15:00' , '119754' , 9 , '4289249140' ); 2006-01-19 16:35:20.576 MSqlQuery: INSERT INTO recordedmarkup (chanid, starttime, mark, type, offset) VALUES ( '1007' , '2006-01-19T15:15:00' , '119772' , 9 , '4289895672' ); 2006-01-19 16:35:20.577 MSqlQuery: INSERT INTO recordedmarkup (chanid, starttime, mark, type, offset) VALUES ( '1007' , '2006-01-19T15:15:00' , '119790' , 9 , '4290548972' ); 2006-01-19 16:35:25.630 TFW: safe_write(): funky usleep

Program received signal SIGXFSZ, File size limit exceeded. [Switching to Thread -1336443984 (LWP 2872)] 0x00b8e402 in kernel_vsyscall () (gdb) (gdb) bt #0 0x00b8e402 in kernel_vsyscall () #1 0x007540bb in write_nocancel () from /lib/libpthread.so.0 #2 0x00e93412 in safe_write (fd=20, data=0xaee70d90, sz=12920)

at ThreadedFileWriter?.cpp:57

#3 0x00e950a1 in ThreadedFileWriter::DiskLoop? (this=0x89b69d0)

at ThreadedFileWriter?.cpp:367

#4 0x00e951a5 in ThreadedFileWriter::boot_writer (wotsit=0x89b69d0)

at ThreadedFileWriter?.cpp:93

#5 0x0074fb80 in start_thread () from /lib/libpthread.so.0 #6 0x02d969ce in clone () from /lib/libc.so.6


Buzz says: *) OS is sending a SIGXFSZ to backend, backend is taking default action which "coredump and exit". Solution:

  • capture SIGXFSZ, handle it gracefully.

Buzz says: Hi All. I'm working on a solution to this thread and have got the following steps working in my code (diff attached - catch_and_handle_SIGXFSZ_diff.txt):

1) OS sends SIGXFSZ to mythbackend 2) backend captures said signal, squirrels it into a global called "LastSignal?", so anyone who wants to can look for it. (yes, I know globals are bad, but signal handlers are worse.) 3) ThreadedFileWriter?.cpp has an existing function called "safe_write" that I've modified so that it checks for the signal(in the global) before trying to write to any file. 4) safe_write: if a SIGXFSZ signal was received it "aborts" the in-progress write (then-and-there, without flushing memory buffers to disk or anything), returning an error. 5) safe_write is called from inside ThreadedFileWriter::DiskLoop?. The return value of safe_write is tested in DiskLoop?, and it causes both threads (write and sync threads of ThreadedFileWriter?) to be torn down, and the ThreadedFileWriter? to enter a state of "write error". 6) next time the caller (RingBuffer?.cpp) trys to call tfw->Write(...) it fails, returning -1 up to the calling function (which is in RingBuffer?.cpp - Write), and the tfw is torn down, closing the open file handle, and cleaning up. 7) RingBuffer?.cpp already had the capability to return -1 or other errors, so it's been tweaked to look at the return status of the tfw->Write call too, and pass the error up if it occurs.

...now, I'm not sure where to take it from here.

The signal is definitely being captured, and it's being passed all the way back up to the RingBuffer?, so I know that's working but.... Nothing else (backend and/or frontend) seems to recognise that the recording failed. Should I go that far, or just barf the error message to the log, and leave it at that?

IE: How do I make everything else recognise that the recording of this file has aborted/failed?

Am I doing the right thing here... Or is there an easier way?


Mark Weaver says: Just ignore the signal - write will return EFBIG and the recording should follow the usual failure path. You should be able to test it with ulimit -f, that will allow you to generate SIGXFSZ with smaller files.


Buzz says: The problem is that as it exists now in CVS, ThreadedFileWriter?.cpp has no "usual failure path" from the 'write' command (in safe_write). safe_write returns a uint to indicate how much was written, and '0' is a legitimate amount to write, not an error case. I've changed the relevant places to allow it to return negative (failure),and pass the failure back-up the calling chain to RingBuffer? where it emits an error to the log.

Both backend and frontend both still seem oblivious to the error condition that occurs when RingBuffer?->Write() return -1 during a record.

Other suggestions?


Attachments (2)

catch_and_handle_SIGXFSZ_diff.txt (8.9 KB) - added by buzz@… 14 years ago.
1076.patch (3.7 KB) - added by danielk 13 years ago.
Simple fix

Download all attachments as: .zip

Change History (12)

Changed 14 years ago by buzz@…

comment:1 Changed 14 years ago by Stuart Auchterlonie

There is a reason you can't make the file larger than 4Gb and that is because you are using FAT32.

See http://www.microsoft.com/resources/documentation/Windows/XP/all/reskit/en%2Dus/?url=/resources/documentation/Windows/XP/all/reskit/en-us/prkc_fil_tdrn.asp

comment:2 Changed 14 years ago by kkuphal

The OP is aware of the reason for the 4GB limit. The issue is that the backend should not die when it cannot write to a file. It should at least end the recording gracefully or at best, begin recording a new file. A dying backend is never a good thing.

comment:3 Changed 14 years ago by buzz <buzz@…>

No applicable comment has yet been made on my patch. Applying the attached patch DOES fix the backend crash, and makes it instead emit an error to the log. Please consider applying this patch to SVN. The scheduler/recorder are not yet made aware of the failure, but that is for another time, if someone else wants to take it up?

comment:4 Changed 14 years ago by Isaac Richards

Milestone: 0.20

comment:5 Changed 14 years ago by cpinkham

(In [8819]) References #1076. Attempt to filter the 'garbage' being sent by some telnet clients during the initial connection by only allowing characters we know are valid for input.

Robert, if this fixes it for you, can you go ahead and close the ticker, othewise I'll take another look tonight.

comment:6 Changed 14 years ago by buzz <buzz@…>

I think you referenced the wrong ticket number in [8819] I think you meant: References #1176

comment:7 Changed 13 years ago by danielk

Owner: changed from Isaac Richards to danielk
Version: head

Changed 13 years ago by danielk

Attachment: 1076.patch added

Simple fix

comment:8 Changed 13 years ago by danielk

Type: defectpatch

comment:9 Changed 13 years ago by danielk

Resolution: fixed
Status: newclosed

(In [10633]) Fixes #1076. This just makes MythTV stop trying to write to a file once it has it a file size resource limit.

The signal is ignored rather than doing a coredump, and a useful error message is printed out once for each file which hits it's resource limit. This can be triggered when the mythbackend user has a ulimit set, if the c library is configured for 32 bit file pointers, or if the filesystem does not support "large" files.

comment:10 Changed 13 years ago by danielk

(In [10634]) Refs #1076. Just moves a string initialization, Chris Pinkham pointed out that it may inadvertently trigger a malloc/free where it is now.

Note: See TracTickets for help on using tickets.