Opened 3 weeks ago

Closed 2 weeks ago

Last modified 2 weeks ago

#13514 closed Bug Report - General (Fixed)

ttvdb does not choose the best show when searching

Reported by: mspieth
Owned by: mspieth
Priority: minor
Milestone: needs_triage
Component: MythTV - General
Version: Unspecified
Severity: medium
Keywords:
Cc:
Ticket locked: no

Description

Updates on the thetvdb site mean that a search request can return shows that are not in the best order. A sorting algorithm is needed to pick the best show choice.

I propose adding Levenshtein text distance to choose the best match. This would make a new Python module, python-levenshtein, a requirement for ttvdb.
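A minimal sketch of the proposed sorting, for illustration only. It uses a small pure-Python Levenshtein function rather than python-levenshtein, and the result layout (a list of dicts with a `seriesName` key) and the helper name `sort_by_best_match` are assumptions, not the actual ttvdb code.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def sort_by_best_match(query, shows, key="seriesName"):
    """Sort search results so the title closest to the query comes first.

    The key name "seriesName" is an assumption about the result layout.
    """
    q = query.strip().lower()
    return sorted(shows, key=lambda show: levenshtein(q, show[key].strip().lower()))


# Hypothetical search results, returned in a poor order.
results = [
    {"seriesName": "The Office (US)", "id": 2},
    {"seriesName": "The Office", "id": 1},
    {"seriesName": "Office Girls", "id": 3},
]
ranked = sort_by_best_match("the office", results)
print([show["seriesName"] for show in ranked])
```

Because `sorted` is stable, results with equal distance keep the order the site returned them in.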

Attachments (1)

levenshtein_normalize_unicode.py (2.4 KB) - added by rcrdnalor 2 weeks ago.
Usage of levenshtein within python bindings


Change History (7)

Changed 2 weeks ago by rcrdnalor

Usage of levenshtein within python bindings

comment:1 Changed 2 weeks ago by rcrdnalor

You can use the levenshtein implementation from MythTV's Python bindings as a fallback.

This works best with normalized unicode strings, and can be fed with utf-8 encoded strings as well.

See usage in attached file 'levenshtein_normalize_unicode.py'. Works with python2 and python3.
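The attached file is not reproduced here. As a rough illustration of why normalization matters, the sketch below (an assumption about the general approach, not the attachment's contents) decodes UTF-8 input and applies Unicode NFC normalization before comparison. The helper name `normalize_title` is hypothetical.

```python
import unicodedata


def normalize_title(title, form="NFC"):
    """Return a lowercased, Unicode-normalized text string.

    UTF-8 encoded byte strings are decoded first, so both kinds of
    input end up comparable.
    """
    if isinstance(title, bytes):
        title = title.decode("utf-8")
    return unicodedata.normalize(form, title).lower().strip()


composed = u"Pok\u00e9mon"      # "e with acute" as a single code point
decomposed = u"Poke\u0301mon"   # "e" followed by a combining acute accent

# Without normalization the two spellings differ in length, so any
# edit-distance function would report a spurious difference.
print(len(composed), len(decomposed))
print(normalize_title(composed) == normalize_title(decomposed))
```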

comment:2 Changed 2 weeks ago by mspieth

Is it better to use this or fuzzywuzzy?

Opinions wanted.

comment:3 in reply to:  2 Changed 2 weeks ago by Gary Buhrmaster

Replying to mspieth:

Is it better to use this or fuzzywuzzy?

Opinions wanted.

Well, since you asked....

Do you have enough samples of the ttvdb (poor) choices to result in a meaningful comparison of the various algorithms to determine which is statistically better(*) for MythTV? Those might include Jaccard Similarity, Cosine Similarity, Levenshtein Distance. If you have enough samples, you could use something like the textdistance library (which provides for the choice of many algorithms) as the test platform for obtaining your statistics.

fuzzywuzzy is an implementation of Levenshtein distance, as I recall, as is python-Levenshtein, so in addition to the existing internal matching, they should be expected to all produce essentially equivalent results (and obviously the internal function is already in place, which has a different advantage).

I suppose that requiring textdistance, with an optional user override of the algorithm, provides the most flexibility, but I have not researched how widely textdistance has been packaged in the supported distros, and if (for example) Levenshtein is always good, using the internal functions might be good enough.

(*) The general problem of natural language searching is not yet a completely solved problem, of course, so "better" is as good as it gets.
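The comparison suggested above could be set up roughly as follows. This is only a sketch: it implements Jaccard and cosine similarity over character bigrams in plain Python instead of using textdistance, and the sample data and the `accuracy` helper are hypothetical. Any other scorer that returns a higher value for a closer match, such as a Levenshtein-based similarity, could be plugged in the same way.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Count the character bigrams of a lowercased string."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def jaccard(a, b):
    """Jaccard similarity of the two bigram sets (0.0 to 1.0)."""
    sa, sb = set(bigrams(a)), set(bigrams(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def cosine(a, b):
    """Cosine similarity of the two bigram count vectors (0.0 to 1.0)."""
    ca, cb = bigrams(a), bigrams(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def accuracy(scorer, samples):
    """Fraction of samples where the scorer ranks the expected title first."""
    hits = 0
    for query, candidates, expected in samples:
        best = max(candidates, key=lambda c: scorer(query, c))
        hits += best == expected
    return hits / len(samples)


# Hypothetical samples: (query, titles returned by the search, correct title).
samples = [
    ("the office", ["Office Girls", "The Office (US)", "The Office"], "The Office"),
    ("doctor who", ["Doctor Who Confidential", "Doctor Who"], "Doctor Who"),
]
for name, scorer in (("jaccard", jaccard), ("cosine", cosine)):
    print(name, accuracy(scorer, samples))
```

With only a handful of samples the numbers mean little; the point of the harness is that a larger set of known-bad search results can be run through every candidate algorithm unchanged.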

comment:4 Changed 2 weeks ago by mspieth

Resolution: Fixed
Status: assigned → closed

After some feedback, this has been changed to use the bindings' copy of levenshtein.

Works just as well as before.

comment:5 Changed 2 weeks ago by Mark Spieth <mspieth@…>

In 56d72164c5/mythtv:

ttvdb: Add best match sorting since thetvdb doesnt seem to do this correctly.

  • new requirement of fuzzywuzzy >= 0.7.0
    • Earlier may work too.
    • Tested with 0.17.0.

refs #13514

comment:6 Changed 2 weeks ago by Mark Spieth <mspieth@…>

In 776765800/mythtv:

ttvdb: use levenshtein from MythTV.utilities in mythbindings

  • less dependencies and works just as well

refs #13514
