Opened 12 years ago

Closed 12 years ago

#4917 closed defect (fixed)

ofdb.py: charset issues

Reported by: laga@…
Owned by: Anduin Withers
Priority: minor
Milestone: 0.21.1
Component: mythtv
Version: unknown
Severity: medium
Keywords:
Cc:
Ticket locked: no

Description

ofdb.py has some charset issues.

When searching for a movie ID using ofdb.py -M, the query is sent encoded as UTF-8 and the server does not return any hits. It seems to expect ISO-8859-15 instead.

I tried to recode the query to iso-8859-15 inside ofdb.py, but urllib would complain because it was expecting ASCII. I'm not sure where to go from here.
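One possible way around the urllib complaint is to encode the title to ISO-8859-15 bytes first and then percent-escape those bytes, so that the URL itself contains only ASCII. The following is a minimal sketch in Python 3 syntax (ofdb.py itself runs on Python 2, where the corresponding call is urllib.quote). The helper name encode_query is made up for illustration, and that the server wants ISO-8859-15 is only the guess stated above.

```python
from urllib.parse import quote


def encode_query(title, charset="iso-8859-15"):
    """Percent-escape a search title using the given charset (hypothetical helper)."""
    # Encode to bytes first, then escape every non-ASCII byte as %XX,
    # so the resulting URL fragment contains ASCII characters only.
    return quote(title.encode(charset))


print(encode_query("identität"))           # escaped as ISO-8859-15
print(encode_query("identität", "utf-8"))  # escaped as UTF-8, for comparison
```

The two calls produce different escapes for the umlaut, which would explain why a server expecting one encoding finds no hits for the other.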

The movie metadata returned by ofdb.py doesn't look good in the video manager either. It looks like UTF-8 being displayed as Latin-1, e.g. you get two weird characters instead of one umlaut. The same happens in the console, but it goes away if I comment out

content = unicode(content, charset)

in line 105.

Here's a simple test case: ofdb.py -M 'identität'

My terminal and my environment are both set to UTF-8.
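The "two weird characters instead of one umlaut" symptom can be reproduced in isolation. This is a small sketch in Python 3 syntax, not code from ofdb.py. It only illustrates what happens when UTF-8 bytes are decoded as Latin-1.

```python
text = "identität"
utf8_bytes = text.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 turns the two-byte umlaut into two characters.
garbled = utf8_bytes.decode("latin-1")

# Decoding with the charset that was actually used restores the original text.
restored = utf8_bytes.decode("utf-8")

# ascii() shows the individual code points regardless of the terminal encoding.
print(ascii(garbled), len(garbled))
print(ascii(restored), len(restored))
```

The garbled string is one character longer than the original, because the single umlaut has become two separate characters.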

Change History (5)

comment:1 Changed 12 years ago by anonymous

I've created a small wrapper script which converts all input to iso8859-15:

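#!/bin/sh
# Re-encode all arguments from UTF-8 to ISO-8859-15 before calling ofdb.py.
ARG=`echo "$@" | iconv -f UTF-8 -t iso8859-15`
# $ARG is deliberately unquoted so that it splits back into separate arguments.
/usr/share/mythtv/mythvideo/scripts/ofdb.py $ARG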

When using this script and commenting out line 105 as shown above, ofdb.py almost works. Searching for titles works fine and umlauts show up correctly in MythVideo. For some IDs, ofdb.py still gives me errors, though:

laga@prometheus:~$  /usr/share/mythtv/mythvideo/scripts/ofdb.py -D "35767,Fluch-der-Karibik"
# Traceback (most recent call last):
#   File "/usr/share/mythtv/mythvideo/scripts/ofdb.py", line 313, in search_data
#     doc = reader.fromString(content)
#   File "/usr/lib/python2.5/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 69, in fromString
#     return self.fromStream(stream, ownerDoc, charset)
#   File "/usr/lib/python2.5/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 27, in fromStream
#     self.parser.parse(stream)
#   File "/usr/lib/python2.5/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 57, in parse
#     self._parser.parse(stream.read())
#   File "/usr/lib/python2.5/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 160, in finish_starttag
#     unicode(value, self._charset))
#   File "/usr/lib/python2.5/site-packages/_xmlplus/dom/Element.py", line 170, in setAttributeNS
#     raise InvalidCharacterErr()
# Traceback (most recent call last):
#   File "/usr/share/mythtv/mythvideo/scripts/ofdb.py", line 458, in <module>
#     main()
#   File "/usr/share/mythtv/mythvideo/scripts/ofdb.py", line 444, in main
#     search_data(options.data_search, options.ratings_from)
#   File "/usr/share/mythtv/mythvideo/scripts/ofdb.py", line 357, in search_data
#     print_exception(traceback.format_exc())
#   File "/usr/share/mythtv/mythvideo/scripts/ofdb.py", line 53, in print_exception
#     comment_out(line)
#   File "/usr/share/mythtv/mythvideo/scripts/ofdb.py", line 41, in comment_out
#     print("# %s" % (str,))
#   File "/usr/lib/python2.5/codecs.py", line 303, in write
#     data, consumed = self.encode(object, self.errors)
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 26: ordinal not in range(128)

This is without the wrapper script but with my patch from ticket #4916.
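The final UnicodeDecodeError suggests that a byte string containing 0xfc (ü in ISO-8859-1 and ISO-8859-15) reached an output stream that first tried to decode it as ASCII. One way to avoid that kind of failure is to decode bytes explicitly before printing. The sketch below uses Python 3 syntax. The helper name to_text and the sample bytes are invented for illustration and are not taken from ofdb.py.

```python
def to_text(raw, charset="iso-8859-15"):
    """Decode bytes explicitly instead of relying on an implicit ASCII decode
    (hypothetical helper)."""
    if isinstance(raw, bytes):
        # errors="replace" keeps the error reporting path itself from failing.
        return raw.decode(charset, errors="replace")
    return raw


raw_line = b"made-up message with byte \xfc"  # sample data, not real output

try:
    raw_line.decode("ascii")
except UnicodeDecodeError as exc:
    print("ASCII decode fails:", exc)

# ascii() shows the decoded code points regardless of the terminal encoding.
print(ascii(to_text(raw_line)))
```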

comment:2 Changed 12 years ago by Anduin Withers

Owner: changed from Isaac Richards to Anduin Withers
Status: changed from new to assigned

comment:3 Changed 12 years ago by Anduin Withers

Milestone: changed from unknown to 0.21.1

comment:4 Changed 12 years ago by Anduin Withers

(In [16988]) References #4917

  • Fixes character encoding issues (page reports UTF-8 but contains ISO-8859/UTF-8).
  • Fixes HTML parser error caused by bad meta/script tags.

Many thanks to Michael Haas (laga) for reporting these issues.
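When a page declares UTF-8 but may actually contain ISO-8859 bytes, a common approach is to try the declared charset first and fall back to a single-byte charset if decoding fails. The sketch below (Python 3 syntax) shows that general idea only. It is not necessarily what changeset [16988] does, and the helper name decode_page and its defaults are assumptions.

```python
def decode_page(raw, declared="utf-8", fallback="iso-8859-15"):
    """Decode raw page bytes, trying the declared charset first
    (hypothetical helper)."""
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        # The declared charset was wrong for this page. ISO-8859-15 maps
        # every byte value, so this second attempt cannot fail.
        return raw.decode(fallback)


print(ascii(decode_page("identität".encode("utf-8"))))
print(ascii(decode_page("identität".encode("iso-8859-15"))))
```

Both calls return the same text. This whole-document fallback does not handle a page that mixes both encodings, which would need the decoding to be applied per fragment.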

comment:5 Changed 12 years ago by Anduin Withers

Resolution: set to fixed
Status: changed from assigned to closed

(In [17064]) Closes #4917

Merges [16988], [16995], [17012], and [17051] from trunk.

  • Fixes character encoding issues (page reports UTF-8 but contains ISO-8859/UTF-8).
  • Fixes HTML parser error caused by bad meta/script tags.
  • Modifies the module search path to include oldxml if the directory exists. The previous workaround did not work because the xml modules were not reloaded. This is for Ubuntu Hardy, which is removing python-xml.
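The search-path change in the last item can be sketched roughly as follows. This is an illustration of the general technique, not the code from the changesets above. The helper name add_legacy_xml_path and the directory shown are hypothetical.

```python
import os
import sys


def add_legacy_xml_path(directory):
    """Put `directory` at the front of sys.path if it exists
    (hypothetical helper)."""
    if os.path.isdir(directory) and directory not in sys.path:
        sys.path.insert(0, directory)
        return True
    return False


# Hypothetical location; the real oldxml directory depends on the distribution.
add_legacy_xml_path("/path/to/oldxml")

# Import the xml package only after the path has been adjusted. A module that
# is already in sys.modules is not imported again, so a later path change
# would not affect it.
import xml.dom.minidom
```

The ordering matters: changing sys.path after the xml modules have been imported has no effect on them, which fits the note above about modules not being reloaded.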

Thanks to Michael Haas (laga) for finding these issues, and for finding them again when my initial fix attempts failed.
