Opened 9 years ago
Closed 4 years ago
Last modified 4 years ago
#12602 closed Bug Report - General (Duplicate)
lossless transcode that fills the disk loses the tail of the recording
Reported by: | Owned by: | JYA | |
---|---|---|---|
Priority: | minor | Milestone: | 32.0 |
Component: | MythTV - Mythtranscode | Version: | 0.27.5 |
Severity: | medium | Keywords: | disk full transcode truncate |
Cc: | Ticket locked: | no |
Description
Backup your test recording first! Or use a worthless recording.
Fill your disk so that free space exists but is less than the size of the large test recording you are about to transcode. Now queue a lossless transcode to permanently remove a cutlist. When disk gets completely full during the transcode, this error is not detected, and the rename of .tmp file is performed anyway, overwriting the original with the failed file. Of course this permanently loses the end of the recording. I see no evidence of the full disk or related error in mythtranscode.log, but it does clearly log the final rename.
I see in some of the code where the write is carefully checked for failure but the close is not:
1163 int ret = write(fh, data, size); 1164 if (ret < 0) 1165 LOG(VB_GENERAL, LOG_ERR, QString("write failed %1").arg(filename) + 1166 ENO); 1167 close(fh);
The problem could be that the write silently succeeds and the ENOSPC error would actually be reported by the close instead. From write(2) and close(2):
NOTES A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. NOTES Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a previous write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data.
If this is indeed the root cause, then the fix is easy: check the return value of the close just like you did the write previous to it and return an unsuccessful exit that will prevent the rename. I see this in at least mpeg2fix.cpp but it could be a good idea anywhere that writes are done to disk.
Of course my workaround is to check disk headroom before doing a transcode, or to maintain headroom larger than largest file I will transcode. But this requires more headroom than maximum recording rate requires. I think auto-expire happens only during recording, so it is not making more space during a transcode. Transcode of 4 hours of FHD could need 20GB or more of free space.
Attachments (1)
Change History (17)
comment:1 Changed 9 years ago by
comment:2 Changed 9 years ago by
Sorry, my assumption is that close is more likely to fail on full disk, even though the write just succeeded. I have seen this happen in completely unrelated code in other projects. Close may cause the OS to actually flush pending data to disk while write just buffers it into memory successfully.
Then, if close() fails, my expectation would be that instead of rename .mpg.tmp over the original .mpg, mythtv would instead log the error and remove the truncated .mpg.tmp transcode.
If you are right that close also succeeds, then we could get even more serious with fsync(2). Again from close(2):
A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)
But that might be overkill if close already detected the full disk. I'd first check close and see if its return is correct. If fsync is needed, hopefully it breaks out with an error instead of blocking which would be even worse. I haven't yet tried any of this since I'm using pre-built mythbuntu and haven't yet built it from source.
Thanks for your quick reply and consideration!
comment:3 Changed 9 years ago by
BTW: I think the code I found above is not actually doing the writing in this case. I have not yet found where mythtranscode is actually appending to the new file. But surely it is a similar issue that doesn't detect the full disk. I'll keep looking for the right code location.
comment:4 Changed 9 years ago by
Looks like mythtv code is at a higher abstraction layer that is not actually doing the open/write/close calls at all. transcode.cpp calls avfw->CloseFile?(); which in avformatwriter.cpp does an avio_close without checking its return value: https://code.mythtv.org/doxygen/avformatwriter_8cpp_source.html#l00211
avio_close is over in ffmpeg's libav*: https://ffmpeg.org/doxygen/trunk/aviobuf_8c_source.html#l00943
This calls an avio_flush which hopefully blocks until the data is written, then avio_close returns ffurl_close which returns h->prot->url_close: https://ffmpeg.org/doxygen/trunk/avio_8c_source.html#l00390
I still haven't tracked down the actual close() in libav, but so far all calls are passing the return value back up the chain, until the return of avio_close is ignored in mythtv's avformatwriter.cpp.
So my bug fix idea should still work only if libav detects the ENOSPC error and passes it back up to avio_close where it can be detected and handled better by mythtv code (log the error and remove truncated new file instead of rename over original).
comment:5 Changed 9 years ago by
To complete the reverse engineering, I think the actual open/write/close is here: https://ffmpeg.org/doxygen/trunk/libavformat_2file_8c_source.html
246 static int file_close(URLContext *h) 247 { 248 FileContext *c = h->priv_data; 249 return close(c->fd); 250 }
So yes, the actual return code from the OS should be available from avio_close back in avformatwriter.cpp. Of course I could be wrong, but I think it is worth a try to log the return code of avio_close and see what value it has when disk has filled during transcode.
comment:6 Changed 9 years ago by
Sorry, my attempt to read the source code was still wrong. But my theory of the root cause was right.
Thanks to your excellent Wiki documentation, I was able to build mythtranscode from fixes/0.27 from github. Then after adding a few debugging lines, I discovered the writes are actually in programs/mythtranscode/replex/multiplex.c, at least for my .mpg recordings. The return value of write and close in there are discarded. Sure enough, write detects the full disk right away:
if (write(mx->fd_out, outbuf, written) < 0) LOG(VB_GENERAL, LOG_ERR, "twitham: failed to write %d bytes: %d", written, errno); Jan 24 13:18:06 mythtv mythtranscode: mythtranscode[24118]: E MPEG2Replex replex/multiplex.c:444 (writeout_ext) twitham: failed to write 2048 bytes: 28 Jan 24 13:18:06 mythtv mythtranscode: mythtranscode[24118]: E MPEG2Replex replex/multiplex.c:299 (writeout_video) twitham: failed to write 2048 bytes: 28 Jan 24 13:18:06 mythtranscode: last message repeated 55 times #define ENOSPC 28 /* No space left on device */
The write and close is currently in functions that return void. So now I need to figure out how to improve this so the failure can be seen by the calling code and cancel the file rename and preserve the original seek table and cutlist, and probably unlink the failed .tmp file.
Changed 9 years ago by
Attachment: | mythtranscode.patch added |
---|
abort failed .mpg transcode upon full disk
comment:7 Changed 9 years ago by
Attached patch fixes the bug for my .mpg recordings. Experts can review this, but it works for me. You can remove my username if it goes production. I added it so I could easily see the new errors I logged upon write failures:
Jan 28 01:52:36 mythtv mythtranscode: mythtranscode[19202]: I MPEG2Replex replex/multiplex.c:840 (init_multiplex) Mux rate: 24.98 Mbit/s Jan 28 01:52:49 mythtv mythtranscode: mythtranscode[19202]: I CoreContext mpeg2fix.cpp:1796 (InsertFrame) Inserting 2 I-Frames after #8 Jan 28 01:52:49 mythtv mythtranscode: mythtranscode[19202]: I CoreContext mpeg2fix.cpp:1796 (InsertFrame) Inserting 1 I-Frames after #27 Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 1 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 2 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 3 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 4 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 5 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 6 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 7 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 8 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 9 write failed: No space left on device Jan 28 01:52:54 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex replex/multiplex.c:301 (writeout_video) twitham: 10 write failed: No space left on device Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: E MPEG2Replex mpeg2fix.cpp:503 (WaitBuffers) twitham: thread finished with 91246 write errors Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: E CoreContext mpeg2fix.cpp:2578 (Start) twitham: joined thread failed with 91246 write errors Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: E CoreContext main.cpp:724 (main) Transcoding /var/lib/mythtv1/recordings/1241_20160128000000.mpg failed Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: N CoreContext main.cpp:1090 (CompleteJob) Deleting /var/lib/mythtv1/recordings/1241_20160128000000.mpg.tmp Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: N CoreContext main.cpp:749 (transUnlink) Requesting delete for file 'myth://Default@mythtv/1241_20160128000000.mpg.tmp'. Jan 28 01:52:58 mythtv mythtranscode: mythtranscode[19202]: I CoreContext mythcontext.cpp:1194 (~MythContext) Waiting for threads to exit.
comment:8 Changed 9 years ago by
Explanation of the patch:
- multiplex.h:
- the mx multiplex structure gets a new "int error", the error count.
- write_out_packs and finish_mpg now return int, the error count (so 0 = success)
- currently unused: callers test the error count instead
- multiplex.c:
- if write() fails, increment error count
- log only first 10 failures so we don't flood syslog and cause it to drop messages
- if close() fails, increment error count
- mpeg2fix.cpp:
- copy error count to static variable and pass it to pthread_exit
- capture error count in pthread_join and log/fail if non-zero
- TODO:
- check all writes similarly, not just writeout_video
- fix replex.c to exit non-zero if write failure
In mpeg2fix.cpp there is also a commented section where I tried to exit on first write failure. This just deadlocked the other thread. It might be possible to cancel the other thread if you wish to abort at first write but I didn't attempt this. As-is, we just continue to fail and return the failure count when the reader is done. I monitored with top and the size didn't grow so I assume the data is just being dropped rather than filling memory, which is a good thing.
comment:9 Changed 7 years ago by
I put this bug fix in the form of a pull request:
fixes/0.27: https://github.com/MythTV/mythtv/pull/163
master: https://github.com/MythTV/mythtv/pull/164
I compiled and tested only on my 0.27 build. But the patch to master was clean so it should work there as well.
comment:10 Changed 4 years ago by
Bug is still fixed for me 4 years later. But not for anyone else since pull request 164 is not merged.
comment:11 Changed 4 years ago by
Milestone: | unknown → needs_triage |
---|
comment:13 Changed 4 years ago by
Replying to twitham1:
Tweaked PR 164 again to apply clean to master.
Could you open an issue on github and reference this ticket please?
We have moved over to github for issue tracking.
Many thanks Stuart
comment:15 Changed 4 years ago by
Milestone: | needs_triage → 32.0 |
---|---|
Resolution: | → Duplicate |
Status: | new → closed |
Migrated to https://github.com/MythTV/mythtv/issues/244
comment:16 Changed 4 years ago by
This is fixed in version 32 since pull request 164 was merged. Thanks!
and how would checking that close succeeded help in any shape or form: result would still be crap