Improve ARP matching heuristic #2179
  dataSet.RequiredTrueMatchRatio = 0.75;
  dataSet.RequiredFalseMatchRatio = 0;
  dataSet.RequiredTrueMismatchRatio = 0; // There are no expected mismatches in this data set
- dataSet.RequiredFalseMismatchRatio = 0.3;
+ dataSet.RequiredFalseMismatchRatio = 0.25;
Since these values are used in multiple tests, should they be abstracted to a constant to be sure that all tests always use the same ratios (in case someone forgets to edit one in the future)? I'm assuming we would want the algorithm to produce consistent ratios across all tests
There are two datasets, but they do not necessarily produce the same results even when they use the same data, because one adds noise and the other doesn't. For example, my original implementation had a couple of false matches when there was noise. In this version there are no false positives, but I can imagine a scenario where the true match ratio is lower with noise because the algorithm cannot pick between multiple promising ARP entries (currently we just pick the highest score, but we could be smarter and look at how much higher it is).
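To make the "look at how much higher it is" idea concrete, here is a minimal sketch of a selection step that rejects ambiguous winners. It is not code from this PR: the name `PickBestMatch`, the `minScore` and `minMargin` parameters, and the assumption that scores lie in [0, 1] are all hypothetical.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not part of the PR): returns the index of the
// best-scoring candidate, or std::nullopt when no candidate reaches minScore
// or when the runner-up is within minMargin of the winner (ambiguous match).
// Assumes scores are non-negative, e.g. confidences in [0, 1].
std::optional<std::size_t> PickBestMatch(
    const std::vector<double>& scores, double minScore, double minMargin)
{
    std::optional<std::size_t> best;
    double bestScore = 0.0;
    double secondScore = 0.0;

    for (std::size_t i = 0; i < scores.size(); ++i)
    {
        if (!best || scores[i] > bestScore)
        {
            // The previous winner becomes the runner-up.
            if (best)
            {
                secondScore = bestScore;
            }
            bestScore = scores[i];
            best = i;
        }
        else if (scores[i] > secondScore)
        {
            secondScore = scores[i];
        }
    }

    if (!best || bestScore < minScore)
    {
        return std::nullopt;
    }

    // With a single candidate there is no runner-up to compare against.
    if (scores.size() > 1 && bestScore - secondScore < minMargin)
    {
        return std::nullopt;
    }

    return best;
}
```

With noisy data, a rule like this would trade some true matches for fewer false ones, which is one reason the two data sets may reasonably need different required ratios.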
@@ -628,4 +634,27 @@ namespace AppInstaller::Utility

        return path.filename();
    }

    std::vector<std::string> SplitIntoWords(std::string_view input)
Please add a test with a few cases in it. Preferably also with some non-English strings, especially from a language without spaces.
Added a couple of test cases
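For readers following the thread, here is a rough sketch of why the request for languages without spaces matters. The function below is a hypothetical, whitespace-only splitter named `SplitOnWhitespace`; it is not the PR's `SplitIntoWords`, whose body is not shown in this conversation.

```cpp
#include <cctype>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical, simplified splitter: breaks the input on ASCII whitespace only.
// Bytes belonging to multi-byte UTF-8 sequences are never treated as separators
// (assuming the default "C" locale), so they stay inside their token.
std::vector<std::string> SplitOnWhitespace(std::string_view input)
{
    std::vector<std::string> words;
    std::string current;

    for (char c : input)
    {
        if (std::isspace(static_cast<unsigned char>(c)))
        {
            // A separator ends the current word, if there is one.
            if (!current.empty())
            {
                words.push_back(std::move(current));
                current.clear();
            }
        }
        else
        {
            current.push_back(c);
        }
    }

    if (!current.empty())
    {
        words.push_back(std::move(current));
    }

    return words;
}
```

A splitter like this returns a whole Japanese or Chinese phrase as a single token, because nothing in the text separates the words. Handling such input properly would need a Unicode-aware word-boundary algorithm, which is why test cases in those languages are worth having.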
@@ -376,11 +378,14 @@ namespace AppInstaller::Utility
        // Repeatedly remove matches for the regexes to create the minimum name
        while (RemoveAll(ProgramNameRegexes, result.Name));

        auto tokens = Split(ProgramNameSplit, result.Name, LegalEntitySuffixes);
        result.Name = Join(tokens);
        if (!PreserveWhiteSpace)
This is really PreserveNonLetterAndDigitsAndLegalEntitySuffixes, right? I'm not sure what you are going for, but I would think that to preserve the existence of some whitespace, it should be:
    if (PreserveWhiteSpace)
    {
        auto tokens = Split(ProgramNameSplit, result.Name, LegalEntitySuffixes);
        for (auto& token : tokens)
        {
            Remove(NonLettersAndDigits, token);
        }
        result.Name = JoinWithSpace(tokens);
    }
    else
        // Do pre-PR code block
Wow. I have no idea how I misread the code this badly; I assumed the split was around whitespace. My bad.
I believe the only thing that needs to be conditional is the Remove(NonLettersAndDigits, ...) because it also removes whitespace. I'll see if I can make a version like NonLettersDigitsOrWhitespace.
Updated to just preserve the whitespace. Now that it also removes legal entity suffixes, the matching rate increased 🥳
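As an illustration of the "non-letters, digits, or whitespace" idea discussed above, here is a minimal sketch. It is not the PR's code: the function name is hypothetical, it works on ASCII punctuation only, and it keeps every non-ASCII byte untouched instead of using the normalizer's regexes.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical sketch: drops ASCII characters that are not letters, digits, or
// whitespace, and collapses each whitespace run into a single space. Leading
// and trailing whitespace is dropped. Non-ASCII bytes are kept as-is.
std::string RemoveNonLettersDigitsOrWhitespace(std::string_view name)
{
    std::string result;
    bool pendingSpace = false;

    for (char c : name)
    {
        const auto uc = static_cast<unsigned char>(c);

        if (std::isspace(uc))
        {
            // Remember the gap, but only if something has already been emitted.
            pendingSpace = !result.empty();
        }
        else if (std::isalnum(uc) || uc >= 0x80)
        {
            if (pendingSpace)
            {
                result.push_back(' ');
                pendingSpace = false;
            }
            result.push_back(c);
        }
        // Any other ASCII character (punctuation, symbols) is dropped.
    }

    return result;
}
```

Keeping the spaces is what lets a later step split the normalized name into words; removing them, as the pre-PR normalization did, leaves only one long token to compare.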
This improves the ARP matching heuristic by replacing the match confidence algorithm that used the edit distance between strings with an algorithm based on the words in each string.
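The exact scoring rule is not spelled out in this description, so the following is only a sketch of what a word-based confidence could look like. It assumes a hypothetical `WordOverlapScore` that divides the number of shared words by the size of the longer word list; the PR's actual formula may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical scoring sketch (not the PR's formula): the fraction of words the
// two names share, relative to the longer of the two word lists. Comparison is
// case-sensitive; callers are assumed to pass already-normalized words.
double WordOverlapScore(std::vector<std::string> a, std::vector<std::string> b)
{
    if (a.empty() || b.empty())
    {
        return 0.0;
    }

    // std::set_intersection requires sorted ranges; duplicates are matched
    // pairwise, so a repeated word only counts as often as both sides have it.
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::vector<std::string> common;
    std::set_intersection(
        a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(common));

    return static_cast<double>(common.size()) /
        static_cast<double>(std::max(a.size(), b.size()));
}
```

Unlike an edit distance, a score of this kind ignores word order and is not dominated by one long extra token, such as a version string or an architecture suffix.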
Code changes:
Changes in the heuristic (and some notes):
Test changes: