Opened 5 years ago
Closed 5 years ago
Last modified 5 years ago
#13517 closed Patch - Feature (fixed)
Reducing memory usage by mythfilldatabase to 1/6 of original usage
Reported by: | Hans Dingemans | Owned by: | Peter Bennett |
---|---|---|---|
Priority: | minor | Milestone: | 31.1 |
Component: | MythTV - Mythfilldatabase | Version: | Master Head |
Severity: | medium | Keywords: | |
Cc: | | Ticket locked: | no |
Description
Mythfilldatabase uses QDomDocument to parse and store the XML data it reads. According to the QDomDocument documentation, this class is not meant to handle large XML files; QXmlStreamReader should be used in those situations.
This pull request replaces QDomDocument with QXmlStreamReader. Loading a 222 MB XML file previously used 2.5 GB of memory; after these changes it uses only 420 MB, and the loading process is roughly 15% faster.
Pull request made on github: https://github.com/MythTV/mythtv/pull/189
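For readers unfamiliar with the streaming API, here is a minimal sketch of the general pattern the patch moves to. This is illustrative only, not the patch code; the function name and the element handling are assumptions.

```cpp
// Illustrative only -- not the actual patch. Shows the general
// QXmlStreamReader pattern that replaces building a full QDomDocument
// tree: each element is handled as it is read, so the whole file never
// has to be held in memory at once.
#include <QFile>
#include <QXmlStreamReader>
#include <QDebug>

static bool parseXmltv(const QString &fileName)
{
    QFile file(fileName);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return false;

    QXmlStreamReader xml(&file);
    while (!xml.atEnd())
    {
        if (xml.readNext() != QXmlStreamReader::StartElement)
            continue;
        if (xml.name() == QLatin1String("programme"))
        {
            // Handle one <programme> element at a time instead of
            // holding the whole document tree in memory (hypothetical
            // handling shown here).
            QString channel = xml.attributes().value("channel").toString();
            qDebug() << "programme for channel" << channel;
        }
    }
    if (xml.hasError())
    {
        qWarning() << "XML error:" << xml.errorString();
        return false;
    }
    return true;
}
```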
Attachments (3)
Change History (22)
Changed 5 years ago by
Attachment: | mythfilldatabase-reduce-mem-usage1.diff added |
---|---|
Proposed fix from OP
comment:2 Changed 5 years ago by
Owner: | set to Peter Bennett |
---|---|
Status: | new → assigned |
Type: | Bug Report - General → Patch - Feature |
Changed 5 years ago by
Attachment: | mythfilldatabase-reduce-mem-usage2.diff.gz added |
---|---|
comment:3 Changed 5 years ago by
Added a modified version of the patch; it now conforms to the Coding Standards.
comment:4 Changed 5 years ago by
I ran a test with a full xmltv download, 729 channels for 2 weeks, in master.
Version | virt | res | run time |
---|---|---|---|
Current master | 7 GB | 5.62 GB | 22 mins |
With patch | 2 GB | 689 MB | 17 mins |
The patch reduced the resident memory requirement from 5.62 GB to 689 MB, a factor of 8. It also reduced the run time slightly.
With the patch, all listings look correct. It also added channels and channel icons, as expected.
comment:5 follow-up: 6 Changed 5 years ago by
Currently, invalid XMLTV files (which are a reality of life) are rejected before the bad data is processed. It is not clear that this is still true (I have a hard time following the patch file), and I would consider a failure to validate that the entire file is a valid XML document *before* processing it to be a serious regression (bad XMLTV files are a reality, even when the grabber itself returns a zero exit code).
comment:6 Changed 5 years ago by
Replying to Gary Buhrmaster:
Currently, invalid XMLTV files (which are a reality of life) are rejected before the bad data is processed. It is not clear that this is still true (I have a hard time following the patch file), and I would consider a failure to validate that the entire file is a valid XML document *before* processing it to be a serious regression (bad XMLTV files are a reality, even when the grabber itself returns a zero exit code).
I agree; the patch is a straight 1:1 recode of QDomDocument to QXmlStreamReader, which are in the same library. All checks have remained the same.
However, the proof of the pudding is in the eating; if you have some real-life malformed XMLTV files, please send them to me or attach them here.
comment:7 follow-up: 8 Changed 5 years ago by
Truncate the file halfway through, and see if any (any) part of the file is processed. Randomly replace a few "channel" tags in the file with XXXXXX (just a few, and never in the same group), and a few "programme" tags with YYYYY, so that the file is invalid XML.
It is not at all clear you have run these types of tests.
The stream processor explicitly states that it requires well-formed XML and that it is an incremental processor. If you do not have well-formed XML it will (eventually) return an error, but possibly only after returning data that you have already processed.
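To make that behaviour concrete, here is a small self-contained illustration (a hypothetical snippet, not from the patch): the mismatched closing tag is only detected when the reader reaches it, so the earlier programme elements have already been handed to the caller by then.

```cpp
// Hypothetical illustration of QXmlStreamReader's incremental behaviour:
// the malformed </wrong> tag is only reported after earlier elements
// have already been delivered to the caller.
#include <QByteArray>
#include <QXmlStreamReader>
#include <QDebug>

int main()
{
    const QByteArray data =
        "<tv><programme channel=\"1\"/><programme channel=\"2\"></wrong></tv>";

    QXmlStreamReader xml(data);
    while (!xml.atEnd())
    {
        if (xml.readNext() == QXmlStreamReader::StartElement &&
            xml.name() == QLatin1String("programme"))
        {
            // This fires for channels "1" and "2" before the error is seen.
            qDebug() << "processed programme for channel"
                     << xml.attributes().value("channel").toString();
        }
    }
    if (xml.hasError())
        qWarning() << "error after partial processing:" << xml.errorString();
    return 0;
}
```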
comment:8 follow-up: 9 Changed 5 years ago by
Replying to Gary Buhrmaster:
So you mean that if the last line of the XMLTV file, the closing </tv> tag, is missing, you wouldn't want any of the file processed? That would be relatively easy to accomplish with the commit/rollback functionality of MySQL, but personally I would prefer that mythfilldatabase load as much of the non-corrupt data as possible into the MySQL database.
Now that is a matter of personal taste, and I wouldn't want my taste to be the deciding factor here; what if we raise this on the mythtv-dev mailing list?
comment:9 Changed 5 years ago by
Replying to Hans Dingemans:
Now that is a matter of personal taste, and I wouldn't want my taste to be the deciding factor here; what if we raise this on the mythtv-dev mailing list?
This has already been "decided" by the devs: it is all, or nothing. Otherwise commit 990429dd would have been marked as a RELEASE BLOCKER, as it changed existing functionality for existing codes (rejecting existing valid requests) due to the "all or nothing" approach promoted and agreed to. And while I had (and have) my concerns about that approach, it was the devs' decision to make, and that decision established precedent ("stare decisis").
comment:10 follow-up: 12 Changed 5 years ago by
Ehmm, I'm confused. That commit says: "Change channel.visible to a 4-value, enum type.
This change essentially adds two, new values, kChannelAlwaysVisible and kChannelNeverVisible, to the previous true/false values. The intent of these new values is to indicate that channel visibility is exclusively controlled by the user. Automated utilities like MythUtil-Channel-HDHR-visibilityCheck and MythTV itself should never override those values without user approval."
What does that have to do with this situation?
comment:11 Changed 5 years ago by
Gary - thank you for your valuable input.
Hans - Regarding using database rollback - MythTV uses autocommit throughout, so I don't think that is an option.
Does the system complete parsing before updating the database? If so, it should be easy to reject the whole run if there is an error.
When updating the database, does it first delete old data? If so, it would be dangerous to accept a half-complete run, because the program data may be deleted and not replaced.
comment:12 Changed 5 years ago by
Replying to Hans Dingemans:
Ehmm, I'm confused. That commit says: .....
What does that have to do with this situation?
There is a history, and my previous comment covered the basics of the reasons (no reason to repeat them again); I would encourage you to carefully review and understand the code, the email discussions on the -dev list, and their implications. Most importantly, that change firmly reconfirmed and reinforced the precedent of "all or nothing", with the explicit concurrence of the developers. Flip-flopping on such decisions for each and every patch is not (well, has never been) the MythTV developers' way.
And of course the existing code would appear to do the appropriate thing, and (most importantly) the author seems to agree that the new code should do the appropriate thing.
I have not seriously evaluated all the possible alternatives, so I do not know whether fixing that would require reading (streaming) the file twice (the first time just to validate the file). The MythTV developers have also explicitly chosen never to use the transaction capabilities of MySQL, which you suggested as an alternative(1) (and, for that matter, not to use DB triggers, cascade deletes, blobs, ...), which results in various sets of hoop-jumping throughout the code (including, possibly, this case).
And while it may be appropriate to reconsider past decisions (especially those made decades ago), the forum for such developer discussions is not a patch ticket (it would be out of scope here).
(1) Given the often huge set of changes in a larger lineup guide update (and the elapsed time to process the data), a single transaction might cause a different set of issues. And that is not even addressing that the table type would need to be changed (currently the MySQL tables are explicitly forced to be MyISAM for a project-supported DB, and MyISAM does not support transactions); the choice of MyISAM as the supported table type is, likewise, an intentional design decision.
comment:13 Changed 5 years ago by
I'd be okay with reading the file twice. Alternatively, could all of the data be loaded into temporary MySQL tables while reading? If everything checks out, then replace the data in the normal tables with that from the temporary tables.
Changed 5 years ago by
Attachment: | mythfilldatabase-reduce-mem-usage3.diff.gz added |
---|---|
comment:14 Changed 5 years ago by
Added a new version of the patch; there are now checks on XML syntax and on the XMLTV DTD layout. If the checks fail, the entire XML file is rejected.
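One way to achieve "reject the entire file" is the two-pass approach mentioned in comment:13. As a rough sketch only (the function name, error reporting, and the root-element check are assumptions, not the actual patch code), the file can be streamed once purely for validation before any database work begins:

```cpp
// Illustrative pre-validation pass: stream through the whole file once,
// rejecting it if it is not well-formed or does not look like XMLTV.
// Only if this succeeds would the real parsing/database pass run.
#include <QFile>
#include <QString>
#include <QXmlStreamReader>

static bool validateXmltvFile(const QString &fileName, QString &error)
{
    QFile file(fileName);
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
    {
        error = "cannot open file";
        return false;
    }

    QXmlStreamReader xml(&file);
    bool sawTvRoot = false;
    while (!xml.atEnd())
    {
        if (xml.readNext() == QXmlStreamReader::StartElement &&
            xml.name() == QLatin1String("tv"))
        {
            sawTvRoot = true;
        }
    }
    if (xml.hasError())
    {
        error = xml.errorString();
        return false;   // not well-formed XML: reject the whole file
    }
    if (!sawTvRoot)
    {
        error = "missing <tv> root element";
        return false;   // does not look like an XMLTV document
    }
    return true;
}
```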
comment:15 Changed 5 years ago by
I ran a bunch of tests on the latest patch and everything looks good.
With a valid file, it still performs as described above, using 698 MB instead of 5.6 GB.
Changing one closing tag in the file results in an error termination with a good message about an unterminated tag, and no database update.
Removing the final </tv> tag gives the same result, an error message with no database update.
Adding extra tags (validly formatted), results in a successful load with those tags ignored (as I would expect). The same result is obtained with the current master version.
Deleting complete tags from the file also results in a valid load with those fields not loaded (as expected). The same result is obtained with the current master version.
As far as I can see, the program performs exactly as current master, but uses only 12% of the memory. I recommend committing the changes.
comment:16 Changed 5 years ago by
Then I recommend committing it to master and, after two weeks or so, if no problems appear, committing it to v31 as well.
comment:17 Changed 5 years ago by
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
In a9aa00613/mythtv:
comment:19 Changed 5 years ago by
Milestone: | needs_triage → 31.1 |
---|---|
Testing can be done very easily:
If you look at output.diff you should only see a few differences in timestamps and some changes in non-relevant tables...
EDIT: typo