Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ARP matching heuristic #2179

Merged
merged 4 commits into from
May 25, 2022
Merged

Conversation

florelis
Copy link
Member

@florelis florelis commented May 24, 2022

This improves the ARP matching heuristic by replacing the match confidence algorithm that used the edit distance between strings with an algorithm based on the words on each string.

Code changes:

  • Moved the specific match confidence algorithms to a separate header from the overall ARP matching to reduce the recompilation needed when adding new heuristics.
  • Added a new version to the normalizer that preserves the whitespace in the input strings. The existing version would remove all whitespace, which would prevent us from normalizing before splitting a string into words.
  • Added a helper to split a string into words using ICU breaks.

Changes in the heuristic (and some notes):

  • The algorithm was changed to be based on words on the strings instead of individual characters, which were more prone to causing false positives than we'd like.
  • The score is calculated as the edit distance between the sequences of words, were the allowed operations are only Add and Remove (no Edit).
    • Using longest common subsequence was also considered, but it did not properly capture the difference between skipping a word and having completely different words.
    • The Edit operation was not considered because it made any two strings too similar
    • The implementation for edit distance was changed as my last version had bugs
  • Added a minimum requirement of how similar the package names should be to consider a match. This prevents us from matching two packages just because the publisher matched pretty well.
  • Made the case were name and publisher are compared as a single string not be part of the name matching so as to not consider the publisher twice.
  • There is room for improvement in considering adjacent words that may also be written as single words (e.g. "Screen Saver" vs "ScreenSaver").

Test changes:

  • Increased the expected match ratio in the tests to 75% and decreased the tolerance for false positives to 0%.
  • Added VS to the test data as it was known to cause a false match with VS Code, which no longer happens with this change.
  • Modified the test data to tag some entries as "new".
Microsoft Reviewers: Open in CodeFlow

@florelis florelis requested a review from a team as a code owner May 24, 2022 03:38
Comment on lines 237 to 240
dataSet.RequiredTrueMatchRatio = 0.75;
dataSet.RequiredFalseMatchRatio = 0;
dataSet.RequiredTrueMismatchRatio = 0; // There are no expected mismatches in this data set
dataSet.RequiredFalseMismatchRatio = 0.3;
dataSet.RequiredFalseMismatchRatio = 0.25;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these values are used in multiple tests, should they be abstracted to a constant to be sure that all tests always use the same ratios (in case someone forgets to edit one in the future)? I'm assuming we would want the algorithm to produce consistent ratios across all tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two datasets but they do not necessarily produce the same results even if they use the same data because one adds noise and the other doesn't. For example my original implementation used to have a couple of false matches when there was noise. In this version there are no false positives, but I can imagine a scenario where the true match ratio is lower with noise due to not being able to pick between multiple promising ARP entries (currently we just pick the highest score, but we could be smarter and look at how much higher it is).

@@ -628,4 +634,27 @@ namespace AppInstaller::Utility

return path.filename();
}

std::vector<std::string> SplitIntoWords(std::string_view input)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test with a few cases in it. Preferably also with some non-English strings, especially from a language without spaces.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a couple of test cases

@@ -376,11 +378,14 @@ namespace AppInstaller::Utility
// Repeatedly remove matches for the regexes to create the minimum name
while (RemoveAll(ProgramNameRegexes, result.Name));

auto tokens = Split(ProgramNameSplit, result.Name, LegalEntitySuffixes);
result.Name = Join(tokens);
if (!PreserveWhiteSpace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really PreserveNonLetterAndDigitsAndLegalEntitySuffixes, right? I'm not sure what you are going for, but I would think that to preserve the existence of some whitespace, it should be:

if (PreserveWhiteSpace)
{
  auto tokens = Split(ProgramNameSplit, result.Name, LegalEntitySuffixes);
  for (auto& token : tokens)
  {
    Remove(NonLettersAndDigits, token);
  }
  result.Name = JoinWithSpace(tokens);
}
else
// Do pre-PR code block

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. I have no idea how I missread the code this bad; I assumed the split was around whitespace. My bad.
I believe the only thing that needs to be conditional is the Remove(NonLettersAndDigits, ...) because it also removes whitespace. I'll see if I can make a version like NonLettersDigitsOrWhitespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to just preserve the whitespace. Now that it also removes legal entity suffixes, the matching rate increased 🥳

@ghost ghost added Needs-Author-Feedback Issue needs attention from issue or PR author and removed Needs-Author-Feedback Issue needs attention from issue or PR author labels May 24, 2022
@florelis florelis merged commit 8d703ff into microsoft:master May 25, 2022
@florelis florelis deleted the heuristic branch May 25, 2022 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants