Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#5946 closed patch (fixed)

Fix (unimportant) data corruption in database character set conversion

Reported by: sphery <mtdean@…> Owned by: Janne Grunau
Priority: minor Milestone: unknown
Component: mythtv Version: head
Severity: medium Keywords:
Cc: Ticket locked: no

Description

The conversion from CHAR(latin1)->BINARY->CHAR(utf8) performed for some fields in database upgrades 1216 and 1217 results in null-padded data in the CHAR field. MySQL expects CHAR field data to be space-padded, so it never strips the null characters from the data when returning query results or when manipulating data. The null characters then break comparisons (meaning we get new "uncorrupted" values placed in the DB in addition to the corrupted values) and can be displayed, for example, in MythWeb.

The attached patch changes the conversion of all multi-character CHAR fields to use CHAR(latin1)->VARBINARY->CHAR(utf8) to prevent the null-padding corruption from occurring. Specifically, the affected fields are people.name, programgenres.genre, programrating.{system,rating}, recordedrating.{system.rating} .

There are still columns of type CHAR(1) for which I didn't change the intermediate BINARY data type, and it's possible that in some databases values for fields which are not defined as "NOT NULL" may be converted to a single null character when they are meant to be blank (which would be represented internally as a single space). The additional CHAR(1) fields include cardinput.sharable, dtv_multiplex.{inversion,polarity,bandwidth,transmission_mode}, and programgenres.relevance ; however, in my database, none include empty values, so none are corrupted.

Though this will change the database upgrade history, it will simply prevent corruption of data for users who have not yet upgraded their databases. I will be submitting a patch that fixes the corrupt data for those who have already converted to trunk.

Users should not attempt to fix the data as it's really unimportant data and most likely there exist uncorrupted data for any current listings data and the upcoming patch will fix it properly. Fixing it properly is not trivial because of key uniqueness constraints which will be violated due to the existence of duplicates whose only difference is the null-padding corruption. The new, uncorrupted values will have been inserted by mythfilldatabase/EIT, so are likely to exist if MythTV (mythbackend/mythfilldatabase) has ever been run since the database conversion.

Attachments (3)

mythtv-5946-fix_database_utf8_conversion.patch (2.0 KB) - added by sphery <mtdean@…> 12 years ago.
Fixes database conversion (for users whose database have not yet been converted)
mythtv-5946-fix_database_utf8_conversion_corruption.patch (8.0 KB) - added by sphery <mtdean@…> 12 years ago.
Fixes corruption caused by original implementation of UTF-8 conversion
mythtv-5946-fix_database_utf8_conversion_corruption--better_dup_handling.patch (8.3 KB) - added by sphery <mtdean@…> 12 years ago.
Updated patch--better handles dups, fixing problems from theoretically possible duplicates in programrating and recordedrating

Download all attachments as: .zip

Change History (7)

Changed 12 years ago by sphery <mtdean@…>

Fixes database conversion (for users whose database have not yet been converted)

Changed 12 years ago by sphery <mtdean@…>

Fixes corruption caused by original implementation of UTF-8 conversion

comment:1 Changed 12 years ago by sphery <mtdean@…>

mythtv-5946-fix_database_utf8_conversion_corruption.patch fixes the corruption caused by the initial (pre mythtv-5946-fix_database_utf8_conversion.patch ) implementation of the UTF-8 conversion in DB update 1216.

Because of references from the credits and recordedcredits table to person (ID's) in the people table, the fix involves:

  1. query all corrupt records in people
  2. find all records that are duplicates once null-padding is removed
    1. if there are duplicates:
      1. update references in {,recorded}credits to refer to the original (corrupt) person
      2. delete the duplicate (not-null-padded) record from people
      3. update the corrupt (null-padded) name on the original person record
    2. if there are no duplicates
      1. update the corrupt (null-padded) name on the original person record

Though we are updating the corrupt (null-padded) name on the original person record whether there are duplicates or not, we cannot just move the code outside the duplicate-checking conditional because if there are duplicates, it's only safe to correct the name if we were able to update references (we will only delete the duplicate if we were able to update references).

Also, it seems inefficient to update the {,recorded}credits table to use the corrupt person, delete the not-corrupt person, then correct the corrupt person's name, but we need to use this approach to ensure that data is valid even if we have multiple duplicates. For example, if users have edited database data directly, they may have gotten additional duplicates with varying numbers of null-pad characters.

The update is rather long-running. For various test cases on my Athlon X2 5000+ dev system, it took:

  • 80229 records + 80229 dups = 58s
  • 80229 records + 39771 dups (40458 corrupt, not dup) = 43s
  • 80229 records + 0 dups (39771 corrupt records) = 11s
  • 80229 records + 0 dups (0 corrupt records) = 0.5s (because of the fixes for programgenres, programrating, and recordedrating)
  • 72650 records with mythtv-fix_database_utf8_conversion.patch applied to prevent corruption = 0.5s

comment:2 Changed 12 years ago by Janne Grunau

Owner: changed from Isaac Richards to Janne Grunau
Status: newassigned

Changed 12 years ago by sphery <mtdean@…>

Updated patch--better handles dups, fixing problems from theoretically possible duplicates in programrating and recordedrating

comment:3 Changed 12 years ago by Janne Grunau

Resolution: fixed
Status: assignedclosed

(In [19219]) Fix database coruption caused by the UTF-8 conversion. Fixes #5946

patch by: Micheal T Dean < mtdean thirdcontact com >

comment:4 Changed 12 years ago by Janne Grunau

(In [19220]) convert the remaining CHAR(x), x>1, colunms to VARCHAR. Refs #5946

Note: See TracTickets for help on using tickets.