Opened 9 years ago

Closed 4 years ago

Last modified 4 years ago

#12602 closed Bug Report - General (Duplicate)

lossless transcode that fills the disk loses the tail of the recording

Reported by: Timothy D Witham <twitham@…> Owned by: JYA
Priority: minor Milestone: 32.0
Component: MythTV - Mythtranscode Version: 0.27.5
Severity: medium Keywords: disk full transcode truncate
Cc: Ticket locked: no

Description

Backup your test recording first! Or use a worthless recording.

Fill your disk so that free space exists but is less than the size of the large test recording you are about to transcode. Now queue a lossless transcode to permanently remove a cutlist. When disk gets completely full during the transcode, this error is not detected, and the rename of .tmp file is performed anyway, overwriting the original with the failed file. Of course this permanently loses the end of the recording. I see no evidence of the full disk or related error in mythtranscode.log, but it does clearly log the final rename.

I see in some of the code where the write is carefully checked for failure but the close is not:

 1163     int ret = write(fh, data, size);
 1164     if (ret < 0)
 1165         LOG(VB_GENERAL, LOG_ERR, QString("write failed %1").arg(filename) +
 1166                 ENO);
 1167     close(fh);

The problem could be that the write silently succeeds and the ENOSPC error would actually be reported by the close instead. From write(2) and close(2):

NOTES
     A successful return from write() does not make any guarantee
     that  data has  been committed  to disk.   In fact,  on some
     buggy implementations, it does not even guarantee that space
     has successfully  been reserved for the data.

NOTES
     Not checking  the return  value of close()  is a  common but
     nevertheless  serious   programming  error.   It   is  quite
     possible that  errors on  a previous write(2)  operation are
     first  reported  at the  final  close().   Not checking  the
     return value when  closing the file may lead  to silent loss
     of data.

If this is indeed the root cause, then the fix is easy: check the return value of the close just like you did the write previous to it and return an unsuccessful exit that will prevent the rename. I see this in at least mpeg2fix.cpp but it could be a good idea anywhere that writes are done to disk.

Of course my workaround is to check disk headroom before doing a transcode, or to maintain headroom larger than largest file I will transcode. But this requires more headroom than maximum recording rate requires. I think auto-expire happens only during recording, so it is not making more space during a transcode. Transcode of 4 hours of FHD could need 20GB or more of free space.

Attachments (1)

mythtranscode.patch (4.7 KB) - added by Timothy D Witham <twitham@…> 9 years ago.
abort failed .mpg transcode upon full disk

Download all attachments as: .zip

Change History (17)

comment:1 Changed 9 years ago by JYA

and how would checking that close succeeded help in any shape or form: result would still be crap

comment:2 Changed 9 years ago by Timothy D Witham <twitham@…>

Sorry, my assumption is that close is more likely to fail on full disk, even though the write just succeeded. I have seen this happen in completely unrelated code in other projects. Close may cause the OS to actually flush pending data to disk while write just buffers it into memory successfully.

Then, if close() fails, my expectation would be that instead of rename .mpg.tmp over the original .mpg, mythtv would instead log the error and remove the truncated .mpg.tmp transcode.

If you are right that close also succeeds, then we could get even more serious with fsync(2). Again from close(2):

	A successful close does not guarantee that the data has been
	successfully saved to disk, as the kernel defers writes.  It
	is not  common for a file  system to flush  the buffers when
	the stream is closed.  If you  need to be sure that the data
	is physically  stored use fsync(2).  (It will  depend on the
	disk hardware at this point.)

But that might be overkill if close already detected the full disk. I'd first check close and see if its return is correct. If fsync is needed, hopefully it breaks out with an error instead of blocking which would be even worse. I haven't yet tried any of this since I'm using pre-built mythbuntu and haven't yet built it from source.

Thanks for your quick reply and consideration!

comment:3 Changed 9 years ago by Timothy D Witham <twitham@…>

BTW: I think the code I found above is not actually doing the writing in this case. I have not yet found where mythtranscode is actually appending to the new file. But surely it is a similar issue that doesn't detect the full disk. I'll keep looking for the right code location.

comment:4 Changed 9 years ago by Timothy D Witham <twitham@…>

Looks like mythtv code is at a higher abstraction layer that is not actually doing the open/write/close calls at all. transcode.cpp calls avfw->CloseFile?(); which in avformatwriter.cpp does an avio_close without checking its return value: https://code.mythtv.org/doxygen/avformatwriter_8cpp_source.html#l00211

avio_close is over in ffmpeg's libav*: https://ffmpeg.org/doxygen/trunk/aviobuf_8c_source.html#l00943

This calls an avio_flush which hopefully blocks until the data is written, then avio_close returns ffurl_close which returns h->prot->url_close: https://ffmpeg.org/doxygen/trunk/avio_8c_source.html#l00390

I still haven't tracked down the actual close() in libav, but so far all calls are passing the return value back up the chain, until the return of avio_close is ignored in mythtv's avformatwriter.cpp.

So my bug fix idea should still work only if libav detects the ENOSPC error and passes it back up to avio_close where it can be detected and handled better by mythtv code (log the error and remove truncated new file instead of rename over original).

comment:5 Changed 9 years ago by Timothy D Witham <twitham@…>

To complete the reverse engineering, I think the actual open/write/close is here: https://ffmpeg.org/doxygen/trunk/libavformat_2file_8c_source.html

  246 static int file_close(URLContext *h)
  247 {
  248     FileContext *c = h->priv_data;
  249     return close(c->fd);
  250 }

So yes, the actual return code from the OS should be available from avio_close back in avformatwriter.cpp. Of course I could be wrong, but I think it is worth a try to log the return code of avio_close and see what value it has when disk has filled during transcode.

comment:6 Changed 9 years ago by Timothy D Witham <twitham@…>

Sorry, my attempt to read the source code was still wrong. But my theory of the root cause was right.

Thanks to your excellent Wiki documentation, I was able to build mythtranscode from fixes/0.27 from github. Then after adding a few debugging lines, I discovered the writes are actually in programs/mythtranscode/replex/multiplex.c, at least for my .mpg recordings. The return value of write and close in there are discarded. Sure enough, write detects the full disk right away:

	if (write(mx->fd_out, outbuf, written) < 0)
	  LOG(VB_GENERAL, LOG_ERR, "twitham: failed to write %d bytes: %d", written, errno);

Jan 24 13:18:06 mythtv mythtranscode: mythtranscode[24118]: E MPEG2Replex replex/multiplex.c:444 (writeout_ext) twitham: failed to write 2048 bytes: 28
Jan 24 13:18:06 mythtv mythtranscode: mythtranscode[24118]: E MPEG2Replex replex/multiplex.c:299 (writeout_video) twitham: failed to write 2048 bytes: 28
Jan 24 13:18:06  mythtranscode: last message repeated 55 times

#define	ENOSPC		28	/* No space left on device */

The write and close is currently in functions that return void. So now I need to figure out how to improve this so the failure can be seen by the calling code and cancel the file rename and preserve the original seek table and cutlist, and probably unlink the failed .tmp file.

Changed 9 years ago by Timothy D Witham <twitham@…>

Attachment: mythtranscode.patch added

abort failed .mpg transcode upon full disk

comment:7 Changed 9 years ago by Timothy D Witham <twitham@…>

Attached patch fixes the bug for my .mpg recordings. Experts can review this, but it works for me. You can remove my username if it goes production. I added it so I could easily see the new errors I logged upon write failures:

Jan 28 01:52:36 mythtv mythtranscode: mythtranscode[19202]: I MPEG2Replex replex/multiplex.c:840 (init_multiplex) Mux rate: 24.98 Mbit/s
Jan 28 01:52:49 mythtv mythtranscode: mythtranscode[19202]: I CoreContext mpeg2fix.cpp:1796 (InsertFrame) Inserting 2 I-Frames after #8 
Jan 28 01:52:49 mythtv mythtranscode: mythtranscode[19202]: I CoreContext mpeg2fix.cpp:1796 (InsertFrame) Inserting 1 I-Frames after #27 
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 1 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 2 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 3 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 4 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 5 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 6 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 7 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 8 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 9 write failed: No space left on device
Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 10 write failed: No space left on device
Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex mpeg2fix.cpp:503 (WaitBuffers) twitham: thread finished with 91246 write errors
Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: E CoreContext mpeg2fix.cpp:2578 (Start) twitham: joined thread failed with 91246 write errors
Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: E CoreContext main.cpp:724 (main) Transcoding /var/lib/mythtv1/recordings/1241_20160128000000.mpg failed
Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: N CoreContext main.cpp:1090 (CompleteJob) Deleting /var/lib/mythtv1/recordings/1241_20160128000000.mpg.tmp
Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: N CoreContext main.cpp:749 (transUnlink) Requesting delete for file 'myth://Default@mythtv/1241_20160128000000.mpg.tmp'.
Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: I CoreContext mythcontext.cpp:1194 (~MythContext) Waiting for threads to exit.

comment:8 Changed 9 years ago by Timothy D Witham <twitham@…>

Explanation of the patch:

  • multiplex.h:
    • the mx multiplex structure gets a new "int error", the error count.
    • write_out_packs and finish_mpg now return int, the error count (so 0 = success)
      • currently unused: callers test the error count instead
  • multiplex.c:
    • if write() fails, increment error count
    • log only first 10 failures so we don't flood syslog and cause it to drop messages
    • if close() fails, increment error count
  • mpeg2fix.cpp:
    • copy error count to static variable and pass it to pthread_exit
    • capture error count in pthread_join and log/fail if non-zero
  • TODO:
    • check all writes similarly, not just writeout_video
    • fix replex.c to exit non-zero if write failure

In mpeg2fix.cpp there is also a commented section where I tried to exit on first write failure. This just deadlocked the other thread. It might be possible to cancel the other thread if you wish to abort at first write but I didn't attempt this. As-is, we just continue to fail and return the failure count when the reader is done. I monitored with top and the size didn't grow so I assume the data is just being dropped rather than filling memory, which is a good thing.

comment:9 Changed 7 years ago by Timothy D Witham <twitham@…>

I put this bug fix in the form of a pull request:

fixes/0.27: https://github.com/MythTV/mythtv/pull/163

master: https://github.com/MythTV/mythtv/pull/164

I compiled and tested only on my 0.27 build. But the patch to master was clean so it should work there as well.

comment:10 Changed 4 years ago by twitham1

Bug is still fixed for me 4 years later. But not for anyone else since pull request 164 is not merged.

comment:11 Changed 4 years ago by Stuart Auchterlonie

Milestone: unknownneeds_triage

comment:12 Changed 4 years ago by twitham1

Tweaked PR 164 again to apply clean to master.

comment:13 in reply to:  12 Changed 4 years ago by Stuart Auchterlonie

Replying to twitham1:

Tweaked PR 164 again to apply clean to master.

Could you open an issue on github and reference this ticket please?

We have moved over to github for issue tracking.

Many thanks Stuart

comment:14 Changed 4 years ago by twitham1

comment:15 Changed 4 years ago by Stuart Auchterlonie

Milestone: needs_triage32.0
Resolution: Duplicate
Status: newclosed

comment:16 Changed 4 years ago by twitham1

This is fixed in version 32 since pull request 164 was merged. Thanks!

Note: See TracTickets for help on using tickets.