Ported a similarity calculation algorithm #128

Merged

chilimangoes
Collaborator

To potentially help with #49.

I ported a similarity calculation function that I originally wrote to match YouTube video titles against TheTVDB episode titles. For that purpose it returned the correct episode more than 9 times out of 10, but I don't know how well it will work for matching abbreviations. For example, when I try to match "test function", the symbol "test" gets a higher score (39) than "tstfnc" (37) because of the token sequence boosting I added.

One thought I had was to change the sequence boosting to work with non-contiguous sequences. In that case, all of "tstfnc" would count as a single sequence, even though most of its letters aren't contiguous in "test function", because the letters appear in the same order in both.
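As a rough illustration of the non-contiguous idea, here is a minimal sketch that uses the length of the longest common subsequence as the boost. This is not the code in this pull request: `lcs_length` and `subsequence_boost` are hypothetical names, and normalizing by the query length is an assumption made for the example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Matching letters must appear in the same order in both strings,
    but they do not have to be adjacent.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def subsequence_boost(query: str, symbol: str) -> float:
    """Hypothetical boost: share of the query's letters that the symbol
    covers, in order (assumed normalization, not the PR's scoring)."""
    query = query.replace(" ", "").lower()
    symbol = symbol.lower()
    if not query:
        return 0.0
    return lcs_length(query, symbol) / len(query)


# All six letters of "tstfnc" appear in order in "testfunction",
# so it covers more of the query than "test" does.
print(subsequence_boost("test function", "tstfnc"))  # 0.5
print(subsequence_boost("test function", "test"))
```

With a boost like this, "tstfnc" would rank above "test" for the query "test function", although how it should be weighted against the existing token scores is left open.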

synkarius added a commit that referenced this pull request Nov 9, 2015
Ported a similarity calculation algorithm
synkarius merged commit 99f5428 into dictation-toolbox:develop Nov 9, 2015
@synkarius
Collaborator

I think your proposed change to the sequence boosting is a good idea. The present PITA algorithm does something similar to detect non-contiguous sequences. Another thing to try might be removing all the vowels.
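A minimal sketch of the vowel-removal idea, assuming it would be applied to the spoken phrase before it is compared against symbols. `strip_vowels` is a hypothetical helper for illustration, not part of PITA or of this pull request.

```python
VOWELS = frozenset("aeiou")


def strip_vowels(text: str) -> str:
    """Drop vowels and whitespace so a spoken phrase looks more like a
    typical vowel-free abbreviation."""
    return "".join(
        ch for ch in text.lower() if ch not in VOWELS and not ch.isspace()
    )


print(strip_vowels("test function"))  # tstfnctn
print(strip_vowels("tstfnc"))         # tstfnc
```

After stripping, "tstfnc" is a prefix of the reduced phrase, so a prefix or sequence comparison on the reduced strings might favor it over "test".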

LexiconCode added the Enhancement label (Enhancement of an existing feature) on Mar 30, 2018