Opened 13 years ago

Closed 11 years ago

#9836 closed Bug Report - General (Fixed)

UTF-16LE .srt subtitle files do not work

Reported by: joe@… Owned by: Jim Stichnoth
Priority: minor Milestone: 0.27
Component: MythTV - Captions Version: 0.24-fixes
Severity: medium Keywords: subtitle srt utf-16
Cc: Ticket locked: no

Description

I have several UTF-16LE .srt subtitle files. I had to convert them to UTF-8 with iconv before they were recognized during playback.

Attachments (5)

external-subs-utf.diff (9.1 KB) - added by markk 12 years ago.
Chalet Girl (2011) BluRay.edited.srt (98.6 KB) - added by maryam.amanaty@… 12 years ago.
srt.patch (4.3 KB) - added by Jim Stichnoth 11 years ago.
xaa (3.3 KB) - added by yiannividalis@… 11 years ago.
first 100 lines of a utl-16le encoded srt file.
srt2.patch (4.7 KB) - added by Jim Stichnoth 11 years ago.

Download all attachments as: .zip

Change History (28)

comment:1 Changed 13 years ago by Raymond Wagner

Milestone: 0.24.2unknown

Resetting milestone to default.

comment:2 Changed 13 years ago by dekarl@…

Please post the first bytes of a subtitle. E.g. the output of

hexdump -n 16 [filename.srt]

Google suggests that the presence/absence of a Byte Order Mark might irritate ffmpeg.

comment:3 Changed 13 years ago by joe@…

Here is the hex dump requested:

0000000 feff 0031 000d 000a 0030 0030 003a 0030
0000010

You should be able to reproduce the issue by converting any UTF-8 .srt file to 16bit LE using:

iconv -f UTF-8 -t UTF-16LE OldFilename.srt > NewFilename.srt

comment:4 Changed 12 years ago by markk

Owner: Janne Grunau deleted
Status: newassigned

Changed 12 years ago by markk

Attachment: external-subs-utf.diff added

comment:5 Changed 12 years ago by markk

I've attached a patch that should fix the issue.

The root problem is that the code imported from xine does not handle multi-byte encodings. The 'cleanest' route would be to ensure it does.

Instead I've taken the approach of probing the subtitle file for a UTF encoding, using it if detected and otherwise falling back to the encoding setting. The file is then loaded in full and converted to unicode, converted back to utf-8 for parsing and finally converted back to unicode for presentation. Not pretty but I couldn't see another way of doing it without touching the actual parsing code.

The only problem is that it doesn't seem to handle files converted to utf using iconv. Material converted inside QtCreator? (i.e. probably using Qt code) seems to work fine.

Changed 12 years ago by maryam.amanaty@…

comment:6 Changed 12 years ago by stuartm

Status: assignedinfoneeded

Does the patch attached by Mark fix the issue?

comment:7 Changed 12 years ago by Joe@…

I do not compile myth from source, so I cannot test any time soon... Would need a few hours to relearn the process because it has been a long time and spare time has been lacking lately. I will test when I can, but would appreciate it if other testers could give its try in case I don't get to it soon.

comment:8 Changed 11 years ago by paulh

Is this still a problem?

comment:9 Changed 11 years ago by joe@…

I can confirm that it is still an issue with 0.26-fixes mythbuntu nightly build from today.

I have not been able to test the patch submitted by markk.

comment:10 Changed 11 years ago by yiannividalis@…

I run master and I'm willing to test the patch, however most of it fails to apply (I guess because I... run master):

checking file mythtv/libs/libmythtv/textsubtitleparser.cpp
Hunk #1 succeeded at 19 with fuzz 2 (offset 2 lines).
Hunk #2 FAILED at 119.
Hunk #3 FAILED at 163.
2 out of 3 hunks FAILED
checking file mythtv/libs/libmythtv/xine_demux_sputext.cpp
Hunk #2 FAILED at 96.
Hunk #3 FAILED at 1108.
Hunk #4 FAILED at 1123.
Hunk #5 succeeded at 1151 (offset 2 lines).
3 out of 5 hunks FAILED
checking file mythtv/libs/libmythtv/xine_demux_sputext.h
Hunk #1 FAILED at 21.
1 out of 1 hunk FAILED

MythTV Version : v0.27-pre2-2004-g30d7cc4-dirty
MythTV Branch : master
Network Protocol : 77
Library API : 0.27.20130719-1
QT Version : 4.8.4
Options compiled in:
 linux profile use_hidesyms using_alsa using_oss using_pulse using_pulseoutput using_backend using_bindings_perl using_bindings_python using_bindings_php using_dvb using_frontend using_ceton using_hdpvr using_libcrypto using_libfftw3 using_libxml2 using_libudf using_lirc using_mheg using_opengl using_opengl_video using_qtwebkit using_qtscript using_qtdbus using_taglib using_v4l2 using_x11 using_xrandr using_xv using_profiletype using_bindings_perl using_bindings_python using_bindings_php using_mythtranscode using_opengl using_vdpau using_ffmpeg_threads using_mheg using_libass using_libxml2 using_libudf

If someone can provide a patch against master, I can test it.

comment:11 Changed 11 years ago by stuartm

Milestone: unknown0.27
Status: infoneededassigned

comment:12 Changed 11 years ago by stuartm

Owner: set to Jim Stichnoth

comment:13 Changed 11 years ago by Jim Stichnoth

Component: MythTV - Video PlaybackMythTV - Captions
Status: assignedaccepted

Changed 11 years ago by Jim Stichnoth

Attachment: srt.patch added

comment:14 Changed 11 years ago by Jim Stichnoth

Yianni (and others),

Please try srt.patch and see if it works. It is an adaptation of Mark's original patch. Like Mark, I found that it still wouldn't load files converted by iconv, so I couldn't test the attached Chalet Girl example.

comment:15 Changed 11 years ago by yiannividalis@…

Sorry for the late reply, but we were away for the weekend. Anyway, the patch seems to work with the provided Chalet Girl (2011) BluRay?.edited.srt​ file.

I just downloaded the srt file, I renamed it to fit one of my films, and I could see the Arabic (I think) subs with the film.

Can I be of more help?

Yianni.

comment:16 Changed 11 years ago by yiannividalis@…

Scratch that. I hadn't paid close attention to your post, Jim. It appears the Chalet Girl was downloaded as UTF-8, that's why it was appearing normally.

I have other srt files encoded as utl-16le that still don't play with the patch.

comment:17 Changed 11 years ago by Jim Stichnoth

Could you attach one of these .srt files, or an excerpt?

Changed 11 years ago by yiannividalis@…

Attachment: xaa added

first 100 lines of a utl-16le encoded srt file.

comment:18 Changed 11 years ago by Jim Stichnoth

Milestone: 0.270.27.1

Changed 11 years ago by Jim Stichnoth

Attachment: srt2.patch added

comment:19 Changed 11 years ago by Jim Stichnoth

Yianni, srt2.patch works for me with your sample, could you give it a try? (A little more cleanup is needed, e.g. it probably leaks memory.)

comment:20 Changed 11 years ago by yiannividalis@…

Yes, Jim, I can confirm it works here, too. Thanks for your effort. I'm off to work now!

comment:21 Changed 11 years ago by Jim Stichnoth <jstichnoth@…>

In 73c9c55bc202c01d2b6c2172a00293a475672830/mythtv:

Handle UTF-16 encoded text subtitle files. Refs #9836

The code imported from xine that parses external text subtitle files
(e.g. .srt files) uses a great number of sscanf() calls, which doesn't
work with UTF-16 character arrays, nor any other encoding that
includes NUL characters or expands/obfuscates ASCII characters.

Instead of modifying the parser code, we convert the original file
contents to UTF-8 (auto-detecting the original encoding where
possible), send it to the xine code for parsing, and then convert the
individual subtitle strings back to unicode for presentation.

This is adapted from a patch written by Mark Kendall.

comment:22 Changed 11 years ago by Jim Stichnoth <jstichnoth@…>

In 2efe2796efa68109f9d5635151186e60e4d5c0de/mythtv:

Handle UTF-16 encoded text subtitle files. Refs #9836

The code imported from xine that parses external text subtitle files
(e.g. .srt files) uses a great number of sscanf() calls, which doesn't
work with UTF-16 character arrays, nor any other encoding that
includes NUL characters or expands/obfuscates ASCII characters.

Instead of modifying the parser code, we convert the original file
contents to UTF-8 (auto-detecting the original encoding where
possible), send it to the xine code for parsing, and then convert the
individual subtitle strings back to unicode for presentation.

This is adapted from a patch written by Mark Kendall.
(cherry picked from commit 73c9c55bc202c01d2b6c2172a00293a475672830)

comment:23 Changed 11 years ago by Jim Stichnoth

Milestone: 0.27.10.27
Resolution: Fixed
Status: acceptedclosed
Note: See TracTickets for help on using tickets.