Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#13517 closed Patch - Feature (fixed)

Reducing memory usage by mythfilldatabase to 1/6 of original usage

Reported by: Hans Dingemans
Owned by: Peter Bennett
Priority: minor
Milestone: 31.1
Component: MythTV - Mythfilldatabase
Version: Master Head
Severity: medium
Keywords:
Cc:
Ticket locked: no

Description

Mythfilldatabase uses QDomDocument to parse and store the XML data it reads. According to the QDomDocument documentation, this class is not meant to handle large XML files; QXmlStreamReader should be used in those situations.

This pull request replaces QDomDocument with QXmlStreamReader; loading a 222 MB XML file previously used 2.5 GB of memory, after these changes it uses only 420 MB, and the loading process is approximately 15% faster.

Pull request made on github: https://github.com/MythTV/mythtv/pull/189
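For illustration only (this is not code from the pull request, and the function name and element handling below are placeholders): QDomDocument builds the entire document tree in memory, whereas QXmlStreamReader walks the file one element at a time, so only the element currently being processed needs to be held in memory.

#include <QFile>
#include <QXmlStreamReader>

// Sketch only: stream an XMLTV file instead of loading it into a QDomDocument.
bool streamListings(const QString &path)    // hypothetical helper, not in the patch
{
    QFile file(path);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return false;

    QXmlStreamReader xml(&file);            // reads incrementally from the device
    if (xml.readNextStartElement() && xml.name() == QLatin1String("tv"))
    {
        while (xml.readNextStartElement())  // one <channel> or <programme> at a time
        {
            if (xml.name() == QLatin1String("channel"))
                xml.skipCurrentElement();   // placeholder: handle one channel here
            else if (xml.name() == QLatin1String("programme"))
                xml.skipCurrentElement();   // placeholder: handle one programme here
            else
                xml.skipCurrentElement();   // unknown tags are skipped
        }
    }
    return !xml.hasError();                 // false if the XML was malformed
}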

Attachments (3)

mythfilldatabase-reduce-mem-usage1.diff (48.2 KB) - added by Peter Bennett 4 years ago.
Proposed fix from OP
mythfilldatabase-reduce-mem-usage2.diff.gz (7.9 KB) - added by Hans Dingemans 4 years ago.
mythfilldatabase-reduce-mem-usage3.diff.gz (8.4 KB) - added by Hans Dingemans 4 years ago.


Change History (22)

comment:1 Changed 4 years ago by Hans Dingemans

Testing can be done very easily:


mysqldump mythconverg -u root >backup.sql
mythfilldatabase <your parameters> <your xml file>
mysqldump --skip-comments --skip-extended-insert mythconverg -u root >current.sql

# restore backup
mysql mythconverg -u root <backup.sql
<path-to-newly-compiled>mythfilldatabase <your parameters> <your xml file>
mysqldump --skip-comments --skip-extended-insert mythconverg -u root >new.sql

diff -Naur current.sql new.sql >output.diff

If you look at output.diff, you should see only a few differences in timestamps and some changes in non-relevant tables...

Version 2, edited 4 years ago by Hans Dingemans

Changed 4 years ago by Peter Bennett

Proposed fix from OP

comment:2 Changed 4 years ago by Peter Bennett

Owner: set to Peter Bennett
Status: new → assigned
Type: Bug Report - General → Patch - Feature

Changed 4 years ago by Hans Dingemans

comment:3 Changed 4 years ago by Hans Dingemans

Added modified version of the patch, which now obeys the Coding Standards.

comment:4 Changed 4 years ago by Peter Bennett

I ran a test with a full xmltv download, 729 channels for 2 weeks, in master.

Current master version: 7 GB virtual / 5.62 GB resident, 22 mins

With patch: 2 GB virtual / 689 MB resident, 17 mins

The patch reduced the resident memory requirement from 5.62 GB to 689 MB, a factor of 8. It also reduced the run time slightly.

With the patch, all listings look correct. It also added channels and channel icons, as expected.

comment:5 Changed 4 years ago by Gary Buhrmaster

Currently, invalid XMLTV files (which are a reality of life) will be rejected before the bad data is processed. It is not clear that this is still true (I have a hard time following the patch file results), and I would consider a failure to validate that the entire file is a valid XML document *before* processing it to be a serious regression (bad XMLTV files are a reality, even when the grabber itself returns a zero exit code).

comment:6 in reply to:  5 Changed 4 years ago by Hans Dingemans

Replying to Gary Buhrmaster:

Currently, invalid XMLTV files (which are a reality of life) will be rejected before the bad data is processed. It is not clear that this is still true (I have a hard time following the patch file results), and I would consider a failure to validate that the entire file is a valid XML document *before* processing it to be a serious regression (bad XMLTV files are a reality, even when the grabber itself returns a zero exit code).

I agree; the patch is a straight 1:1 recode of QDomDocument to QXmlStreamReader, which are in the same library. All checks have remained the same.

However, the proof of the pudding is in the eating; if you have some real-life malformed XMLTV files, please send them to me or attach them here.

comment:7 Changed 4 years ago by Gary Buhrmaster

Truncate the file halfway through, and see if any (any) part of the file is processed. Randomly replace (in the file) a few "channel" tags with XXXXXX (just a few, and never in the same group), and a few "programme" tags with YYYYY, so that the file is invalid XML.

It is not at all clear you have run these types of tests.

The stream processor explicitly states that it requires well-formed XML and that it is an incremental processor. If the XML is not well-formed, it will (eventually) return an error, but possibly only after handing back data that you have already processed.
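To illustrate that behaviour with a toy example (unrelated to the patch itself): the reader hands back the elements that precede the corruption without complaint, and hasError() only reports the problem once parsing reaches the malformed part.

#include <QXmlStreamReader>
#include <QDebug>

void incrementalErrorDemo()                 // illustration only
{
    // Hypothetical input: the second <programme> tag is missing its closing '>'.
    const QByteArray data =
        "<tv><programme start=\"a\"/><programme start=\"b\"</tv>";

    QXmlStreamReader xml(data);
    while (!xml.atEnd())
    {
        xml.readNext();
        if (xml.isStartElement())
            qDebug() << "delivered:" << xml.name();   // <tv> and the first <programme> arrive fine
    }
    // Only now, after the earlier elements were already handed back, is the error reported.
    if (xml.hasError())
        qDebug() << "error at line" << xml.lineNumber() << xml.errorString();
}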

comment:8 in reply to:  7 ; Changed 4 years ago by Hans Dingemans

Replying to Gary Buhrmaster:

So you mean that if the last line of the XMLTV file, the </tv> tag, is missing, you wouldn't want any of that file processed? That would be relatively easy to accomplish with MySQL's commit/rollback functionality; but personally I would prefer that mythfilldatabase load as much of the non-corrupt data as possible into the MySQL database.

Now that is a matter of personal taste, and I wouldn't want my taste to be the deciding factor here; what if we raise this on the mythtv-dev mailing list?

comment:9 in reply to:  8 Changed 4 years ago by Gary Buhrmaster

Replying to Hans Dingemans:

Now that is a matter of personal taste, and I wouldn't want my taste to be the deciding factor here; what if we raise this on the mythtv-dev mailing list?

This has already been "decided" by the devs. It is all, or nothing. Otherwise commit 990429dd would have been marked as a RELEASE BLOCKER, as it changed existing functionality for existing code (rejecting previously valid requests) due to the "all or nothing" approach promoted and agreed to. And while I had (and have) my concerns about that approach, it was the devs' decision to make, and that decision established precedent ("stare decisis").

comment:10 Changed 4 years ago by Hans Dingemans

Ehmm, I'm confused. That commit says: "Change channel.visible to a 4-value, enum type.

This change essentially adds two, new values, kChannelAlwaysVisible and kChannelNeverVisible, to the previous true/false values. The intent of these new values is to indicate that channel visibility is exclusively controlled by the user. Automated utilities like MythUtil-Channel-HDHR-visibilityCheck and MythTV itself should never override those values without user approval."

What does that have to do with this situation?

comment:11 Changed 4 years ago by Peter Bennett

Gary - thank you for your valuable input.

Hans - Regarding using database rollback - MythTV uses autocommit throughout, so I don't think that is an option.

Does the system complete parsing before updating the database? If so it should be easy to reject the whole run if there is an error.

When updating the database does it first delete old data? If so it would be dangerous to accept a half complete run because the program data may be deleted and not replaced.

comment:12 in reply to:  10 Changed 4 years ago by Gary Buhrmaster

Replying to Hans Dingemans:

Ehmm, I'm confused. That commit says: .....

What does that have to do with this situation?

There is a history, and my previous comment covered the basics of the reasons (no reason to repeat that again); I would encourage you to carefully review and understand the code, the email discussions on the -dev list, and their implications. Most importantly, that episode firmly reconfirmed and reinforced the precedent of "all or nothing" with the explicit concurrence of the developers. Flip-flopping on such decisions for each and every patch is not (well, has never been) the MythTV developers' way.

And of course the existing code would appear to do the appropriate thing, and (most importantly) the author seems to agree that the new code should do the appropriate thing.

I have not seriously evaluated all the possible alternatives, so I do not know whether fixing that would require reading (streaming) the file twice (the first time just to validate it). The MythTV developers have also explicitly chosen never to use MySQL's transaction capabilities, which you suggested as an alternative(1) (and, for that matter, not to use DB triggers, cascade deletes, blobs, ...), which results in various kinds of hoop-jumping throughout the code (including, possibly, in this case).

And while it may be appropriate to reconsider past decisions (especially those made decades ago), the forum for such developer discussions is not a patch ticket (it would be out of scope here).

(1) Given the often huge set of changes in a larger lineup guide update (and the elapsed time to process the data), a single transaction might cause a different set of issues. That is not even addressing that the table type would need to be changed: on a project-supported database the MySQL tables are explicitly forced to MyISAM, MyISAM does not support transactions, and that choice of table type is itself an intentional design decision.

comment:13 Changed 4 years ago by gigem

I'd be okay with reading the file twice. Alternatively, could all of the data be loaded into temporary MySQL tables while reading? If everything checks out, then replace the data in the normal tables with that from the temporary tables.
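A rough sketch of that idea, using plain QtSql and hypothetical table names (program / program_tmp stand in for whatever tables would actually be involved; MythTV's own MSqlQuery wrapper and schema are not used here):

#include <QSqlQuery>

// Sketch only: stage the data in a scratch table and swap it in on success.
bool loadViaScratchTable(bool wholeFileParsedOk)
{
    QSqlQuery query;
    query.exec("CREATE TEMPORARY TABLE program_tmp LIKE program");

    // ... stream the XMLTV file and INSERT rows into program_tmp here ...

    if (!wholeFileParsedOk)
        return false;                        // scratch table is simply discarded

    query.exec("DELETE FROM program");       // replace live data only after a clean parse
    query.exec("INSERT INTO program SELECT * FROM program_tmp");
    return true;
}

Even with MyISAM tables (no transactions), this shrinks the window in which live data is missing to the final swap.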

Changed 4 years ago by Hans Dingemans

comment:14 Changed 4 years ago by Hans Dingemans

Added new version of the patch; there are now checks on XML syntax and XMLTV-DTD layout; if the checks fail, the entire xml-file is rejected.

comment:15 Changed 4 years ago by Peter Bennett

I ran a bunch of tests on the latest patch and everything looks good.

With a valid file, it still performs as described above, using 698 MB instead of 5.6 GB.

Changing one closing tag in the file results in an error termination with a good message about an unterminated tag, and no database update.

Removing the final </tv> tag gives the same result, an error message with no database update.

Adding extra tags (validly formatted) results in a successful load with those tags ignored (as I would expect). The same result is obtained with the current master version.

Deleting complete tags from the file also results in a valid load with those fields not loaded (as expected). The same result is obtained with the current master version.

As far as I can see, the program performs exactly as current master, but uses only 12% of the memory. I recommend committing the changes.

comment:16 Changed 4 years ago by Klaas de Waal

Then I recommend committing it to master and, after two weeks or so with no problems appearing, committing it to v31 as well.

comment:17 Changed 4 years ago by Hans Dingemans <jpldingemans@…>

Resolution: fixed
Status: assigned → closed

In a9aa00613/mythtv:

mythfilldatabase: reduce memory usage.

Mythfilldatabase uses QDomDocument to parse and store the XML data that is read; according to QDomDocument documentation this object is not meant to handle large XML files; QXmlStreamReader should be used in these situations.

This commit replaces QDomDocument by QXmlStreamReader.

A test showed that memory usage dropped from 5.6 GB to 698 MB.

Fixes #13517

Signed-off-by: Peter Bennett <pbennett@…>

comment:18 Changed 4 years ago by Hans Dingemans <jpldingemans@…>

In f5d75a6de7/mythtv:

mythfilldatabase: reduce memory usage.

Mythfilldatabase uses QDomDocument to parse and store the XML data that is read; according to QDomDocument documentation this object is not meant to handle large XML files; QXmlStreamReader should be used in these situations.

This commit replaces QDomDocument by QXmlStreamReader.

A test showed that memory usage dropped from 5.6 GB to 698 MB.

Fixes #13517

Signed-off-by: Peter Bennett <pbennett@…>
(cherry picked from commit a9aa006139da24cbad861a2c25d57fa3f71aba2a)

comment:19 Changed 4 years ago by Stuart Auchterlonie

Milestone: needs_triage → 31.1