Ported a similarity calculation algorithm #128

chilimangoes · 2015-11-08T22:33:53Z

To potentially help with #49.

I ported a similarity calculation function that I originally wrote to match YouTube video titles against TheTVDB episode titles. For that purpose it returned the correct episode more than 9 times out of 10, but I don't know how well it will work for matching abbreviations. For example, if I try to match "test function", the symbol "test" gets a higher score (39) than "tstfnc" (37) because of the token sequence boosting I added.

One thought I had was to change the sequence boosting to work with non-contiguous sequences. In that case, all of "tstfnc" would count as a matching sequence, even though most of its letters aren't contiguous in "test function", because the letters appear in the same order in both strings.
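As a rough illustration of the non-contiguous idea (not the code in this PR), the boost could be based on the longest common subsequence between the symbol and the query. The names `lcs_length` and `subsequence_coverage` are hypothetical, and dividing by the query length is just one possible normalization:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Letters must appear in the same order in both strings,
    but they do not have to be contiguous.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def subsequence_coverage(symbol: str, query: str) -> float:
    """Hypothetical boost: fraction of the query's letters that the
    symbol covers, in order, ignoring spaces and case."""
    symbol = symbol.replace(" ", "").lower()
    query = query.replace(" ", "").lower()
    if not query:
        return 0.0
    return lcs_length(symbol, query) / len(query)


if __name__ == "__main__":
    for symbol in ("test", "tstfnc"):
        print(symbol, subsequence_coverage(symbol, "test function"))
```

With this normalization "tstfnc" covers 6 of the 12 letters in "test function" while "test" covers only 4, so the abbreviation would rank higher on this component. How it should be weighted against the existing token sequence boost is an open question.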


synkarius · 2015-11-09T02:39:31Z

I think your proposed change to the sequence boosting is a good idea. The current PITA algorithm does something similar to detect non-contiguous sequences. Another thing to try might be removing all the vowels.
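A minimal sketch of the vowel-removal idea, assuming only ASCII vowels matter and that spaces and punctuation can be dropped before comparing. The helper name `strip_vowels` is hypothetical and is not part of the ported code:

```python
VOWELS = frozenset("aeiou")


def strip_vowels(text: str) -> str:
    """Lowercase the text and keep only alphanumeric, non-vowel characters."""
    return "".join(
        ch for ch in text.lower() if ch.isalnum() and ch not in VOWELS
    )


if __name__ == "__main__":
    reduced = strip_vowels("test function")
    print(reduced)                       # consonant skeleton of the query
    print(reduced.startswith("tstfnc"))  # abbreviation is a prefix of it
```

After stripping, "test function" reduces to a consonant skeleton that begins with "tstfnc", so a plain prefix or contiguous-sequence comparison would already favor the abbreviation. Abbreviations that keep some vowels would need separate handling.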

Ported a similarity calculation algorithm. Not integrated yet.

83ebb32

synkarius added a commit that referenced this pull request Nov 9, 2015

Merge pull request #128 from chilimangoes/pr-similarity-algo

99f5428

Ported a similarity calculation algorithm

synkarius merged commit 99f5428 into dictation-toolbox:develop Nov 9, 2015

LexiconCode added the Enhancement label Mar 30, 2018
