Improve ARP matching heuristic #2179

florelis · 2022-05-24T03:38:04Z

This improves the ARP matching heuristic by replacing the match confidence algorithm, which used the edit distance between strings, with an algorithm based on the words in each string.

Code changes:

Moved the specific match confidence algorithms to a separate header from the overall ARP matching to reduce the recompilation needed when adding new heuristics.
Added a new version of the normalizer that preserves the whitespace in the input strings. The existing version removes all whitespace, which would prevent us from normalizing before splitting a string into words.
Added a helper to split a string into words using ICU breaks.

Changes in the heuristic (and some notes):

The algorithm was changed to be based on the words in the strings instead of individual characters, which were more prone to causing false positives than we'd like.
The score is calculated as the edit distance between the sequences of words, where the only allowed operations are Add and Remove (no Edit).
- Using longest common subsequence was also considered, but it did not properly capture the difference between skipping a word and having completely different words.
- The Edit operation was not considered because it made any two strings too similar.
- The implementation of edit distance was changed because my previous version had bugs.
Added a minimum requirement of how similar the package names should be to consider a match. This prevents us from matching two packages just because the publisher matched pretty well.
Made the case where name and publisher are compared as a single string not be part of the name matching, so that the publisher is not considered twice.
There is room for improvement in considering adjacent words that may also be written as single words (e.g. "Screen Saver" vs "ScreenSaver").
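
As an illustration of the word-based scoring described in this list, here is a minimal sketch of an edit distance over word sequences where only Add and Remove are allowed. The names `WordIndelDistance` and `WordMatchScore` are hypothetical and not taken from the PR, and the normalization to a 0..1 score is an assumption; the actual implementation may compute the score differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): edit distance between two word
// sequences where the only allowed operations are adding or removing a word.
std::size_t WordIndelDistance(const std::vector<std::string>& a, const std::vector<std::string>& b)
{
    // previous[j] is the distance between the first i-1 words of a and the first j words of b.
    std::vector<std::size_t> previous(b.size() + 1);
    std::vector<std::size_t> current(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
    {
        previous[j] = j; // Add j words to an empty sequence
    }

    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        current[0] = i; // Remove i words to reach an empty sequence
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            if (a[i - 1] == b[j - 1])
            {
                current[j] = previous[j - 1]; // Words match, no cost
            }
            else
            {
                // No Edit operation: either remove a[i-1] or add b[j-1]
                current[j] = 1 + std::min(previous[j], current[j - 1]);
            }
        }
        std::swap(previous, current);
    }

    return previous[b.size()];
}

// Assumed normalization: 1.0 for identical sequences, 0.0 when no words are shared.
double WordMatchScore(const std::vector<std::string>& a, const std::vector<std::string>& b)
{
    const std::size_t total = a.size() + b.size();
    if (total == 0)
    {
        return 1.0;
    }
    return 1.0 - static_cast<double>(WordIndelDistance(a, b)) / static_cast<double>(total);
}
```

Under this sketch, {"visual", "studio"} and {"visual", "studio", "code"} are at distance 1, while {"visual", "studio"} and {"screen", "saver"} are at distance 4 because every word has to be removed and re-added.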

Test changes:

Increased the expected match ratio in the tests to 75% and decreased the tolerance for false positives to 0%.
Added VS to the test data as it was known to cause a false match with VS Code, which no longer happens with this change.
Modified the test data to tag some entries as "new".

Trenly · 2022-05-24T14:32:48Z

src/AppInstallerCLITests/Correlation.cpp

+    dataSet.RequiredTrueMatchRatio = 0.75;
+    dataSet.RequiredFalseMatchRatio = 0;
    dataSet.RequiredTrueMismatchRatio = 0; // There are no expected mismatches in this data set
-    dataSet.RequiredFalseMismatchRatio = 0.3;
+    dataSet.RequiredFalseMismatchRatio = 0.25;


Since these values are used in multiple tests, should they be abstracted to a constant to be sure that all tests always use the same ratios (in case someone forgets to edit one in the future)? I'm assuming we would want the algorithm to produce consistent ratios across all tests

There are two datasets, but they do not necessarily produce the same results even if they use the same data, because one adds noise and the other doesn't. For example, my original implementation used to have a couple of false matches when there was noise. In this version there are no false positives, but I can imagine a scenario where the true match ratio is lower with noise due to not being able to pick between multiple promising ARP entries (currently we just pick the highest score, but we could be smarter and look at how much higher it is).
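
A rough sketch of the "look at how much higher it is" idea from the reply above. Everything here is hypothetical: `PickBestWithMargin`, `minScore`, and `minMargin` are illustrative names, and scores are assumed to lie in the range 0..1. The PR itself only picks the highest score.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical selection rule (not part of the PR): accept the best-scoring
// candidate only if it reaches minScore and beats the runner-up by at least
// minMargin. Scores are assumed to be in the range [0, 1].
std::optional<std::size_t> PickBestWithMargin(const std::vector<double>& scores, double minScore, double minMargin)
{
    std::optional<std::size_t> best;
    double bestScore = 0.0;
    double secondScore = 0.0;

    for (std::size_t i = 0; i < scores.size(); ++i)
    {
        if (!best || scores[i] > bestScore)
        {
            if (best)
            {
                secondScore = bestScore; // Previous best becomes the runner-up
            }
            bestScore = scores[i];
            best = i;
        }
        else if (scores[i] > secondScore)
        {
            secondScore = scores[i];
        }
    }

    if (!best || bestScore < minScore)
    {
        return std::nullopt; // Nothing is good enough
    }
    if (scores.size() > 1 && bestScore - secondScore < minMargin)
    {
        return std::nullopt; // Too close to call between the top candidates
    }
    return best;
}
```

With noisy data, a rule like this would trade some true matches for fewer ambiguous picks, so the threshold values would need to be tuned against the test data sets.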

JohnMcPMS · 2022-05-24T17:02:32Z

src/AppInstallerCommonCore/AppInstallerStrings.cpp

@@ -628,4 +634,27 @@ namespace AppInstaller::Utility

        return path.filename();
    }
+
+    std::vector<std::string> SplitIntoWords(std::string_view input)


Please add a test with a few cases in it. Preferably also with some non-English strings, especially from a language without spaces.

Added a couple of test cases

JohnMcPMS · 2022-05-24T17:09:13Z

src/AppInstallerCommonCore/NameNormalization.cpp

@@ -376,11 +378,14 @@ namespace AppInstaller::Utility
                // Repeatedly remove matches for the regexes to create the minimum name
                while (RemoveAll(ProgramNameRegexes, result.Name));

-                auto tokens = Split(ProgramNameSplit, result.Name, LegalEntitySuffixes);
-                result.Name = Join(tokens);
+                if (!PreserveWhiteSpace)


This is really PreserveNonLetterAndDigitsAndLegalEntitySuffixes, right? I'm not sure what you are going for, but I would think that to preserve the existence of some whitespace, it should be:

    if (PreserveWhiteSpace)
    {
        auto tokens = Split(ProgramNameSplit, result.Name, LegalEntitySuffixes);
        for (auto& token : tokens)
        {
            Remove(NonLettersAndDigits, token);
        }
        result.Name = JoinWithSpace(tokens);
    }
    else
        // Do pre-PR code block

Wow. I have no idea how I misread the code this badly; I assumed the split was around whitespace. My bad.
I believe the only thing that needs to be conditional is the Remove(NonLettersAndDigits, ...) because it also removes whitespace. I'll see if I can make a version like NonLettersDigitsOrWhitespace.

Updated to just preserve the whitespace. Now that it also removes legal entity suffixes, the matching rate increased 🥳
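
To make the effect of a whitespace-preserving filter concrete, here is a simplified, ASCII-only sketch. The function name is hypothetical and mirrors the NonLettersDigitsOrWhitespace idea from the comment above; the normalizer in the repository is regex-based and Unicode-aware, so this is only a stand-in for the behavior, not its implementation.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical, ASCII-only stand-in: drops every character that is not a
// letter, a digit, or whitespace, so word boundaries survive normalization.
std::string RemoveNonLettersDigitsOrWhitespace(std::string_view input)
{
    std::string result;
    result.reserve(input.size());
    for (unsigned char c : input)
    {
        if (std::isalnum(c) || std::isspace(c))
        {
            result.push_back(static_cast<char>(c));
        }
    }
    return result;
}
```

For example, "Screen-Saver 2.0 (x64)" becomes "ScreenSaver 20 x64": punctuation is removed, but the spaces remain, so the result can still be split into words afterwards.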

florelis added 2 commits May 23, 2022 20:05

Improve ARP matching heuristic

684a022

Fix test

86fed43

florelis requested a review from a team as a code owner May 24, 2022 03:38

Trenly reviewed May 24, 2022

JohnMcPMS requested changes May 24, 2022

ghost added and removed the Needs-Author-Feedback label May 24, 2022

florelis added 2 commits May 24, 2022 14:28

Fix name normalization

93dc2d8

Update tests

96edb2b

JohnMcPMS approved these changes May 24, 2022

florelis merged commit 8d703ff into microsoft:master May 25, 2022

florelis deleted the heuristic branch May 25, 2022 00:14
