Add heuristics for matching packages to ARP after installing #2044
Conversation
    arpEntries.push_back(allARPEntries[i]);
}

auto match = measure.GetBestMatchForManifest(manifest, arpEntries);
I think that the scoring algorithms should be only that: given the manifest and a single ARP entry, they compute a score in [0, 1.0]. Or, to allow for optimization, construct them with a manifest so that they can pre-compute anything that is manifest specific, then compute a score per ARP entry. A confidence threshold would be defined globally to standardize on the minimum for considering something a possible match.
Then the overall score would factor in other data, like new/updated entry, version matching, etc. But that is agnostic of the name and publisher matching algorithm, and should mostly just be attempting to tiebreak or boost our confidence.
This test harness is about rating the name and publisher matching algorithm specifically, and would be a good place to also understand other metrics, like the gap between the first- and second-highest confidence scores. Having all of the scores available enables that kind of investigation.
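For illustration, a minimal sketch of that shape, where the type names, the placeholder equality scorer, and the 0.5 threshold are all assumptions rather than the PR's actual code:

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not the PR's actual definitions.
struct ArpEntry { std::string Name; std::string Publisher; };
struct ManifestData { std::string Name; std::string Publisher; };

// A single, globally defined minimum confidence (value is illustrative).
constexpr double MinimumMatchConfidence = 0.5;

// Placeholder scoring algorithm: manifest + one ARP entry -> [0, 1].
double ScoreEntry(const ManifestData& manifest, const ArpEntry& entry)
{
    return (manifest.Name == entry.Name && manifest.Publisher == entry.Publisher) ? 1.0 : 0.0;
}

// The harness keeps all scores so it can also look at the gap between the
// best and second-best candidates, as suggested above.
std::optional<size_t> BestMatch(const ManifestData& manifest, const std::vector<ArpEntry>& entries)
{
    std::optional<size_t> best;
    double bestScore = 0.0;
    double secondBestScore = 0.0;

    for (size_t i = 0; i < entries.size(); ++i)
    {
        double score = ScoreEntry(manifest, entries[i]);
        if (score > bestScore)
        {
            secondBestScore = bestScore;
            bestScore = score;
            best = i;
        }
        else if (score > secondBestScore)
        {
            secondBestScore = score;
        }
    }

    // Accept only if the best candidate clears the global threshold.
    // (bestScore - secondBestScore is the gap the harness could report.)
    if (best && bestScore >= MinimumMatchConfidence)
    {
        return best;
    }
    return std::nullopt;
}
```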
This `GetBestMatchForManifest()` is defined in the base class. It loops over all the entries and finds the one with the highest score (that is also over the threshold). It's not defined by each algorithm; those only do what you say (assign a score).
The way I'm thinking about it, the scoring algorithms consider everything (name, version, whether it is new), and the fact that we are currently only looking at name string matching is incidental. If we add scores for other things, the string matching can just be pushed one layer below these algorithm classes.
I'm making the threshold specific to each algorithm and not restricted to [0,1] because I think that is hard to normalize across multiple algorithms. Especially if we consider how to mix multiple scores into one. For example, I'm thinking that we could compare mixing the scores for name and publisher (or anything else) by a weighted sum, by multiplying them, or by summing squares, and each of those would affect the threshold and range.
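For example, a sketch of those combination options (functions and weights are illustrative, not something the PR settles on) shows how the choice changes the output range and therefore the threshold:

```cpp
// Each strategy combines per-field scores differently, which changes the
// resulting range and therefore what a reasonable threshold looks like.
double CombineWeightedSum(double nameScore, double publisherScore)
{
    // Illustrative weights; stays in [0, 1] if both inputs are.
    return 0.7 * nameScore + 0.3 * publisherScore;
}

double CombineProduct(double nameScore, double publisherScore)
{
    // Stays in [0, 1], but punishes a weak score in either field.
    return nameScore * publisherScore;
}

double CombineSumOfSquares(double nameScore, double publisherScore)
{
    // Range is [0, 2] unless renormalized, which is exactly the
    // threshold/range concern described above.
    return nameScore * nameScore + publisherScore * publisherScore;
}
```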
Ok, I didn't look closely at that, but I did find the base class header odd, so that is why. I believe there may be value in being able to run multiple passes, such as:
- Run a direct equality matcher looking for the easy cases where we actually have exact data
- Run the existing normalized name and publisher matcher
- Run these more complex ones we are working on now, which are slower, but better at matching
With this design, using a base class with the implementation built in is more complicated (not impossible, just a bit more convoluted). If the matcher interface were separated out, then it would be easier to have a single orchestration class that can use one or more matchers. Again, not impossible to do with the current base class, but I think you end up needing to wrap the functionality anyway.
The overall scoring algorithm considers everything, but I don't think that each name+publisher matching algorithm should separately consider other fields. Not that they should be disallowed, as I could see potential value in versions occurring in the name, but generally those are separate algorithms. If we want to consider different version matching heuristics, we should do so as its own thing.
So I guess where I'm headed is that a pure interface class makes generic matchers possible, and a single orchestration class/function allows us to combine them as we see fit.
> I'm making the threshold specific to each algorithm and not restricted to [0,1] because I think that is hard to normalize across multiple algorithms.
Unless the algorithm actually returns completely unbounded values, numbers in the range [0, 1] are always possible. This kind of confidence scoring is very common, and having a normalized range is what makes it possible to combine scores from different aspects. Creating a standardized threshold for meaning also allows the different algorithms to be interchangeable to the code above. If mine thinks that 0.5+ is a solid match, but yours thinks that 0.9+ is, then how can the code above combine them? If the answer is "use the threshold", then the scoring algorithm could do the same to normalize to the standardized value. And it can do it with knowledge of its statistics; a linear mapping may not be the best approach for all cases.
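One way to read "use the threshold" concretely: each algorithm maps its raw score onto the shared [0, 1] scale so that its own threshold lands on the global one. A sketch, assuming 0 < rawThreshold < rawMax and a piecewise-linear mapping (which, as noted, may not be the best choice for every algorithm):

```cpp
#include <algorithm>

// Illustrative global "possible match" bar on the shared [0, 1] scale.
constexpr double GlobalThreshold = 0.5;

// Map an algorithm-specific raw score into [0, 1] so that rawThreshold maps
// exactly onto GlobalThreshold. Assumes 0 < rawThreshold < rawMax.
double NormalizeScore(double raw, double rawThreshold, double rawMax)
{
    raw = std::clamp(raw, 0.0, rawMax);
    if (raw <= rawThreshold)
    {
        return GlobalThreshold * (raw / rawThreshold);
    }
    return GlobalThreshold + (1.0 - GlobalThreshold) * ((raw - rawThreshold) / (rawMax - rawThreshold));
}
```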
I made some changes based on what you said (but after re-reading it, I think it doesn't exactly match what you wanted...).
I kept all of your original code for matching ARP using product code and normalized names, and then the heuristics only run if that doesn't return anything. That's intended to be part of the orchestrator you mentioned.
I added a base class for name+publisher matchers that take the strings and return a value in [0,1]. I kept the 3 sample matchers I had earlier. I think this is where you can plug in the matching algorithm you wanted to implement.
I kept the base class for the overall algorithm, and added an implementation that for now considers only the string matching. This was intended to be the orchestrator for mix-and-matching smaller matchers, but I just realized that I didn't do it the way you suggested because it doesn't allow for doing a full pass with a single matcher, then another one with other matcher if needed.
I also added implementations of that base class to the tests, which use only a single string matching algorithm. That way we can evaluate just the accuracy of those, without considering anything else we may end up adding on top in the orchestrator.
I still need to test this E2E and play with the matching algorithms.
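As a rough sketch of the name + publisher matcher shape described above (class and method names are illustrative, not the PR's actual ones):

```cpp
#include <string_view>

// String matchers take the manifest's and the ARP entry's name + publisher
// and return a confidence in [0, 1]; everything else (version, new entry,
// etc.) would be layered on top by the overall algorithm / orchestrator.
struct INameAndPublisherMatcher
{
    virtual ~INameAndPublisherMatcher() = default;

    virtual double Score(
        std::string_view manifestName,
        std::string_view manifestPublisher,
        std::string_view arpName,
        std::string_view arpPublisher) const = 0;
};

// Example implementation: exact equality only.
// (The PR's sample matchers would plug in here instead.)
struct ExactNameAndPublisherMatcher final : INameAndPublisherMatcher
{
    double Score(
        std::string_view manifestName,
        std::string_view manifestPublisher,
        std::string_view arpName,
        std::string_view arpPublisher) const override
    {
        return (manifestName == arpName && manifestPublisher == arpPublisher) ? 1.0 : 0.0;
    }
};
```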
Yes, it is now worse for my purposes, as I need more ARP information than you are providing. For my algorithm I want something more like:
```cpp
struct IARPMatchConfidenceAlgorithm
{
    virtual ~IARPMatchConfidenceAlgorithm() = default;
    virtual double Compute(const ARPEntry& entry) const = 0;
};

struct WordsARPMatchConfidenceAlgorithm : public IARPMatchConfidenceAlgorithm
{
    WordsARPMatchConfidenceAlgorithm(const Manifest& manifest);
    double Compute(const ARPEntry& entry) const override;
};
```
Where the ARPEntry contains other information, like language, that I was going to leverage.
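For illustration, such an entry might carry fields like these; only the language field is explicitly mentioned above, the rest is assumed:

```cpp
#include <string>

// Hypothetical ARP entry carrying more than just name and publisher.
struct ARPEntry
{
    std::string Name;
    std::string Publisher;
    std::string DisplayVersion;
    std::string Language;        // mentioned above as data the algorithm could leverage
    bool IsNewOrChanged = false; // whether the entry appeared or changed during install
};
```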
I'll leave those changes for the future as we update the matching algorithm.
Scratch that last comment; I updated it to use the interface you suggested. Ended up cleaner with it :D
* Moved all the logic for detecting+reporting changes from InstallFlow to ARPCorrelation
* Made matching threshold a global value
* Separated the overall matching algorithm and the string matching algorithms
* Removed hacks used for testing (having name + publisher in the ARPEntry, instead of using the data from the IPackageVersion)
* Expanded test infrastructure to be able to extend with just adding new datasets or algorithms
* Changed correlation to first try to use the Source search using normalized name and publisher, and only use the heuristics if that fails.
template <>
struct DataMapping<Data::ProductCodeFromARP>
{
    using value_t = std::vector<Utility::LocIndString>;
This needs to be the entire `AppsAndFeaturesEntry`, not just the product code.
Updated to include the whole AppsAndFeaturesEntry, although I'm not entirely sure if I did it The Right Way®
    sourceIdentifier = context.Get<Execution::Data::PackageVersion>()->GetProperty(PackageVersionProperty::SourceIdentifier);
}

Logging::Telemetry().LogSuccessfulInstallARPChange(
Did this telemetry event get moved somewhere else? It should still be done in this function when one is found rather than being done in the helper method that could be used for other purposes.
That might mean changing the output of the helper to return additional information, although the count fields in this event are less meaningful with different algorithms. But we could still calculate the number of changes, how many manifests were above the threshold, and how many of those were changed, and use those as the values here, in that order.
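A sketch of how a helper could surface those three values (the types, names, and per-entry scoring are hypothetical; the real telemetry call is not shown):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-candidate result produced by the correlation helper.
struct ScoredEntry
{
    double Score = 0.0;
    bool ChangedDuringInstall = false;
};

// The three counts the telemetry event could report, in order.
struct CorrelationCounts
{
    std::size_t ChangedEntries = 0;        // number of new/updated ARP entries
    std::size_t AboveThreshold = 0;        // candidates scoring above the threshold
    std::size_t ChangedAboveThreshold = 0; // of those, how many also changed
};

CorrelationCounts Summarize(const std::vector<ScoredEntry>& entries, double threshold)
{
    CorrelationCounts counts;
    for (const auto& entry : entries)
    {
        if (entry.ChangedDuringInstall) { ++counts.ChangedEntries; }
        if (entry.Score >= threshold)
        {
            ++counts.AboveThreshold;
            if (entry.ChangedDuringInstall) { ++counts.ChangedAboveThreshold; }
        }
    }
    return counts;
}
```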
I had moved it down to the function doing the correlation, but now it's back here. I changed the helper to return the count of changes/matches, although I'm keeping that count limited to the exact matches from the source search, as I couldn't figure out a good way to keep it consistent across the multiple "passes".
Do you have any ideas how to count the matching manifests when sometimes we use the exact matching and sometimes the confidence measures?
As long as we can reason about the meaning, anything is fine. It can stay using the exact same values as before, just with a better guess. I thought we might use these values in some way to find things that weren't correlating, but it turned out to be very easy to find them 😉
So basically, these numbers are probably not important. Don't spend time trying to improve them, and if you think they are broken, we might consider just reporting 0 for all of them.
}

// Cross reference the changes with the search results
std::vector<std::shared_ptr<IPackage>> packagesInBoth;
auto names = arpEntry->GetMultiProperty(PackageVersionMultiProperty::Name);
These will be the normalized name and publisher from the index. Not that it really matters, since normalizing them again for insertion into the next index is fine, but it might be worth a comment.
Also, for these ARP only snapshots I wouldn't expect to have more than one of any of these multi-properties.
@@ -204,9 +205,35 @@ struct TestContext : public Context
    }
};

// Override the correlation heuristic by an empty one to ensure that these tests
// consider only the exact matching.
struct TestHeuristicOverride
Really these tests should probably be removed at some point in the future, but it doesn't need to be now since you have them working.
auto s2 = Utility::FoldCase(sv2);

// distance[i][j] = distance between s1[0:i] and s2[0:j]
std::vector<std::vector<double>> distance{ s1.size(), std::vector<double>(s2.size(), 0.0) };
I would suggest using a single allocation for efficiency.
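For reference, a sketch of the single-allocation layout, using plain Levenshtein distance as a stand-in for whatever weighting the PR's algorithm actually applies; the point is only the flat indexing:

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// distance(i, j) = edit distance between s1[0:i] and s2[0:j], stored in one
// contiguous buffer instead of a vector of vectors.
std::size_t EditDistance(std::string_view s1, std::string_view s2)
{
    const std::size_t rows = s1.size() + 1;
    const std::size_t cols = s2.size() + 1;
    std::vector<std::size_t> buffer(rows * cols, 0);
    auto distance = [&](std::size_t i, std::size_t j) -> std::size_t& { return buffer[i * cols + j]; };

    for (std::size_t i = 0; i < rows; ++i) { distance(i, 0) = i; }
    for (std::size_t j = 0; j < cols; ++j) { distance(0, j) = j; }

    for (std::size_t i = 1; i < rows; ++i)
    {
        for (std::size_t j = 1; j < cols; ++j)
        {
            const std::size_t cost = (s1[i - 1] == s2[j - 1]) ? 0 : 1;
            distance(i, j) = std::min({ distance(i - 1, j) + 1,           // deletion
                                        distance(i, j - 1) + 1,           // insertion
                                        distance(i - 1, j - 1) + cost }); // substitution
        }
    }

    return distance(rows - 1, cols - 1);
}
```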
    }
}

// Also attempt to find the entry based on the manifest data
I would like this older matching mechanism to fit into the same infrastructure so that the overall could be something like:
```
foreach (algorithm : { ... })
{
    algorithm.Init(manifest)
    if (Correlate(algorithm, ARP) > threshold)
        return that;
}
```
Thoughts for the future, doesn't need to change now.
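For reference, one concrete shape that loop could take in C++ (the interface, names, and threshold value are illustrative):

```cpp
#include <cstddef>
#include <memory>
#include <optional>
#include <vector>

// Minimal stand-ins for the real types.
struct Manifest {};
struct ARPEntry {};

struct IARPMatchConfidenceAlgorithm
{
    virtual ~IARPMatchConfidenceAlgorithm() = default;
    virtual void Init(const Manifest& manifest) = 0;
    virtual double Compute(const ARPEntry& entry) const = 0;
};

constexpr double MatchThreshold = 0.5; // illustrative global threshold

// Run each matcher as a full pass over the ARP entries; the first one that
// produces a confident match wins, so cheap/exact matchers can go first.
std::optional<std::size_t> Correlate(
    const std::vector<std::unique_ptr<IARPMatchConfidenceAlgorithm>>& algorithms,
    const Manifest& manifest,
    const std::vector<ARPEntry>& arpEntries)
{
    for (const auto& algorithm : algorithms)
    {
        algorithm->Init(manifest);

        std::optional<std::size_t> best;
        double bestScore = 0.0;
        for (std::size_t i = 0; i < arpEntries.size(); ++i)
        {
            double score = algorithm->Compute(arpEntries[i]);
            if (score > bestScore)
            {
                bestScore = score;
                best = i;
            }
        }

        if (best && bestScore > MatchThreshold)
        {
            return best;
        }
    }

    return std::nullopt;
}
```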
@@ -82,13 +76,10 @@ XPDP2X1MMZ4KR8|Ashampoo Burning Studio 23|Ashampoo|Ashampoo Burning Studio 23|23
XPFCFBB4FB3D6D|Emjysoft Cleaner|Emjysoft|Emjysoft Cleaner 2022 v4.1|4.1|Emjysoft|{167B1302-A739-42DE-BBD2-4C2F13D1FF51}_is1
XPFCFKCNNTXGQD|Yandex Browser|Yandex|Yandex|21.9.1.686|ООО «ЯНДЕКС»|YandexBrowser
XPFCFL5ZTNFGD7|Wondershare Anireel|WONDERSHARE GLOBAL LIMITED|Wondershare Anireel(Build 1.6.2)||Wondershare Software|Wondershare Anireel_is1
XPFCFL5ZTNFGD7|Wondershare Anireel|WONDERSHARE GLOBAL LIMITED|Wondershare Helper Compact 2.6.0|2.6.0|Wondershare|{5363CE84-5F09-48A1-8B6C-6BB590FFEDF2}_is1
XPFCG86X2PGLDJ|Christmas Elf by Pothos|Pothos|Christmas Elf|||ChristmasElf
XPFCGHHXNH4WBW|Biblioteca e Prestiti Librari|Esposito Software di M. G. Caputo|Gestione Biblioteca e Prestiti Librari 3.0 Demo||Copyright Esposito Software|Gestione Biblioteca e Prestiti Librari 3.0 Demo_is1
XPFCWP0SQWXM3V|CCleaner|Piriform Software Ltd|CCleaner|5.89|Piriform|CCleaner
XPFCXFRDJ8VGPT|Домашняя Бухгалтерия|Keepsoft|Äîìàøíÿÿ áóõãàëòåðèÿ Lite|7.2|Keepsoft|Äîìàøíÿÿ áóõãàëòåðèÿ Lite
Äîìàøíÿÿ áóõãàëòåðèÿ
These look like Cyrillic text in a legacy code page (e.g. Windows-1251) decoded as Latin-1; it could be bad data in the original registry entries, or someone didn't preserve the encoding along the way.
}

UErrorCode errorCode = UErrorCode::U_ZERO_ERROR;
auto utf32ByteCount = ucnv_convert("UTF-32", "UTF-8", nullptr, 0, input.data(), static_cast<int32_t>(input.size()), &errorCode);
If there is a Windows API to convert to UTF-32, they hid it well...
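For reference, a sketch of the ICU preflight-then-convert pattern that the snippet above starts; the surrounding function is assumed, and note that ICU's "UTF-32" converter may also emit a byte-order mark that a caller would need to handle:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

#include <unicode/ucnv.h>
#include <unicode/utypes.h>

// Convert UTF-8 input to UTF-32 using ICU's ucnv_convert.
std::u32string ConvertToUTF32(std::string_view input)
{
    UErrorCode errorCode = U_ZERO_ERROR;

    // Preflight: with no target buffer, ucnv_convert reports the required
    // output size in bytes and sets U_BUFFER_OVERFLOW_ERROR (expected here).
    int32_t utf32ByteCount = ucnv_convert(
        "UTF-32", "UTF-8", nullptr, 0,
        input.data(), static_cast<int32_t>(input.size()), &errorCode);
    if (errorCode != U_BUFFER_OVERFLOW_ERROR || utf32ByteCount <= 0)
    {
        return {};
    }

    // Second call with a buffer of the reported size performs the conversion.
    std::vector<char> buffer(static_cast<std::size_t>(utf32ByteCount));
    errorCode = U_ZERO_ERROR;
    ucnv_convert(
        "UTF-32", "UTF-8", buffer.data(), static_cast<int32_t>(buffer.size()),
        input.data(), static_cast<int32_t>(input.size()), &errorCode);
    if (U_FAILURE(errorCode))
    {
        return {};
    }

    // Reinterpret the byte buffer as UTF-32 code units (possibly including a
    // leading BOM, depending on the converter's behavior).
    return std::u32string(
        reinterpret_cast<const char32_t*>(buffer.data()),
        buffer.size() / sizeof(char32_t));
}
```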
Co-authored-by: JohnMcPMS <[email protected]>
This adds heuristics for matching ARP data with a package after installing it, for better correlation later.
Note that this is still a work in progress. It doesn't work yet and it is not very clean.