
Add heuristics for matching packages to ARP after installing #2044

Merged · 41 commits · Apr 8, 2022

Conversation

florelis
Member

@florelis florelis commented Mar 24, 2022

This adds heuristics for matching ARP data with a package after installing it, for better correlation later.

  • Defined a type for algorithms measuring how well an ARP entry matches a package. We will ultimately use only one, but several are kept for comparison and experimentation.
  • After installing, we try to use the matching heuristic to correlate one of the ARP changes with the package. If we find a match, we store the product code in the tracking catalog for future use.
  • Added tests to evaluate each heuristic. Each test takes some data, creates a summary of how many true/false positives/negatives we got, and reports them per heuristic. For the heuristic we ultimately use, we will also want to add a check on how good the results have to be.
  • Added test data with package names and publishers as they appear in the manifest and as seen in ARP after installing. This was obtained by installing several Store apps and inspecting ARP after the installation, with some manual editing of the data.

Note that this is still a work in progress. It doesn't work yet, and it is not very clean.



arpEntries.push_back(allARPEntries[i]);
}

auto match = measure.GetBestMatchForManifest(manifest, arpEntries);
Member

I think that the scoring algorithms should be only that: given the manifest and a single ARP entry, they compute a score in [0, 1.0]. Or, to allow for optimization, construct them with a manifest so that they can pre-compute anything that is manifest-specific, then compute a score per ARP entry. A confidence threshold would be defined globally to standardize on the minimum for considering something a possible match.

Then the overall score would factor in other data, like new/updated entry, version matching, etc. But that is agnostic of the name and publisher matching algorithm, and should mostly just be attempting to tiebreak or boost our confidence.

This test harness is about rating the name and publisher matching algorithm specifically, and would be a good place to also understand other metrics, like the gap between the first- and second-highest confidence scores. Having all of the scores enables that kind of investigation.
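The shape proposed above can be sketched roughly as follows. All names here are hypothetical, not the PR's actual types: an algorithm is constructed with the manifest so it can pre-compute, then scores each ARP entry in [0, 1.0], and a global threshold defines the minimum for a possible match.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

struct Manifest { std::string name; std::string publisher; };
struct ArpEntry { std::string displayName; std::string publisher; };

// Global confidence threshold: the minimum for considering something a possible match.
constexpr double MatchConfidenceThreshold = 0.5;

struct IMatchConfidenceAlgorithm
{
    virtual ~IMatchConfidenceAlgorithm() = default;
    // Score a single ARP entry against the manifest this algorithm was built with.
    virtual double ComputeConfidence(const ArpEntry& entry) const = 0;
};

// Trivial example algorithm: exact name equality scores 1.0, anything else 0.0.
struct ExactNameAlgorithm : IMatchConfidenceAlgorithm
{
    explicit ExactNameAlgorithm(const Manifest& manifest) : m_name(manifest.name) {}
    double ComputeConfidence(const ArpEntry& entry) const override
    {
        return entry.displayName == m_name ? 1.0 : 0.0;
    }
private:
    std::string m_name;  // pre-computed manifest-specific state
};

// Generic best-match search, agnostic of the scoring algorithm:
// returns the entry with the highest score that is also over the threshold.
std::optional<ArpEntry> FindBestMatch(
    const IMatchConfidenceAlgorithm& algorithm, const std::vector<ArpEntry>& entries)
{
    std::optional<ArpEntry> best;
    double bestScore = MatchConfidenceThreshold;
    for (const auto& entry : entries)
    {
        double score = algorithm.ComputeConfidence(entry);
        if (score > bestScore) { bestScore = score; best = entry; }
    }
    return best;
}
```

The orchestration concern (new/updated entry, version matching, tiebreaks) then lives entirely in the caller, outside the per-entry scorer.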

Member Author

This GetBestMatchForManifest() is defined in the base class. It loops over all the entries and finds the one with the highest score (that is also over the threshold). It's not defined by each algorithm; those only do what you say (assign a score).

The way I'm thinking about it, the scoring algorithms consider everything (name, version, if it is new), and the fact that we are currently only looking at name string matching is incidental. If we add scores for other things, the string matching can just be pushed one layer below these algorithm classes.

I'm making the threshold specific to each algorithm and not restricted to [0,1] because I think that is hard to normalize across multiple algorithms. Especially if we consider how to mix multiple scores into one. For example, I'm thinking that we could compare mixing the scores for name and publisher (or anything else) by a weighted sum, or by multiplying them, or by summing squares, and that would all affect the threshold and range.

Member

Ok, I didn't look closely at that, but I did find the base class header odd, so that is why. I believe there may be value in being able to run multiple passes, such as:

  1. Run a direct equality matcher looking for the easy cases where we actually have exact data
  2. Run the existing normalized name and publisher matcher
  3. Run these more complex ones we are working on now, which are slower, but better at matching

With this design, it is more complicated to use a base class with the implementation built in (not impossible, just a bit more convoluted). If the matcher interface were separated out, it would be easier to have a single orchestration class that can use one or more matchers. Again, not impossible to do with the current base class, but I think you end up needing to wrap the functionality anyway.

The overall scoring algorithm considers everything, but I don't think that each name+publisher matching algorithm should separately consider other fields. Not that they should be disallowed, as I could see potential value in versions occurring in the name, but generally those are separate algorithms. If we want to consider different version matching heuristics, we should do so as its own thing.

So I guess where I'm headed is that a pure interface class makes generic matchers possible, and a single orchestration class/function allows us to combine them as we see fit.

I'm making the threshold specific to each algorithm and not restricted to [0,1] because I think that is hard to normalize across multiple algorithms.

Unless the algorithm actually returns completely unbounded values, numbers in the range [0, 1] are always possible. This confidence scoring is very common, and having a normalized range is what makes it possible to combine scores from different aspects. Creating a standardized threshold for meaning also allows the different algorithms to be interchangeable to the code above. If mine thinks that 0.5+ is a solid match, but yours thinks that 0.9+ is, then how can the code above combine them? If the answer is "use the threshold", then the scoring algorithm could do the same to normalize to the standardized value. And it can do so with knowledge of its statistics; a linear mapping may not be the best approach for all cases.
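The normalization argument above can be made concrete with a small sketch (a hypothetical helper, not code from the PR): each algorithm knows its own raw score range and its own "solid match" point, and maps them onto a standardized [0, 1] scale where 0.5 always means the minimum possible match. A piecewise-linear mapping is used here for simplicity; an algorithm with knowledge of its statistics could pick a better curve.

```cpp
#include <algorithm>
#include <cassert>

// The standardized threshold shared by all algorithms after normalization.
constexpr double StandardThreshold = 0.5;

// Map a raw score from an algorithm-specific range [rawMin, rawMax], with the
// algorithm's own match threshold rawThreshold, onto the standardized [0, 1]
// scale. Requires rawMin < rawThreshold < rawMax.
double NormalizeScore(double raw, double rawMin, double rawThreshold, double rawMax)
{
    raw = std::clamp(raw, rawMin, rawMax);
    if (raw < rawThreshold)
    {
        // [rawMin, rawThreshold) maps onto [0, StandardThreshold)
        return StandardThreshold * (raw - rawMin) / (rawThreshold - rawMin);
    }
    // [rawThreshold, rawMax] maps onto [StandardThreshold, 1]
    return StandardThreshold
        + (1.0 - StandardThreshold) * (raw - rawThreshold) / (rawMax - rawThreshold);
}
```

With this, an algorithm whose "solid match" starts at 0.9 and one whose starts at 0.5 both report 0.5 at their respective thresholds, so the code above can combine them uniformly.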

Member Author

I made some changes based on what you said (but after re-reading it, I think it doesn't exactly match what you wanted...).

I kept all of your original code for matching ARP using product code and normalized names, and then the heuristics only run if that doesn't return anything. That's intended to be part of the orchestrator you mentioned.

I added a base class for name+publisher matchers that take the strings and return a value in [0,1]. I kept the 3 sample matchers I had earlier. I think this is where you can plug in the matching algorithm you wanted to implement.

I kept the base class for the overall algorithm, and added an implementation that for now considers only the string matching. This was intended to be the orchestrator for mix-and-matching smaller matchers, but I just realized that I didn't do it the way you suggested because it doesn't allow for doing a full pass with a single matcher, then another one with other matcher if needed.

I also added implementations of that base class to the tests, which use only a single string matching algorithm. That way we can evaluate just the accuracy of those, without considering anything else we may end up adding on top in the orchestrator.

I still need to test this E2E and play with the matching algorithms.

Member

Yes, it is now worse for my purposes, as I need more ARP information than you are providing. For my algorithm I want something more like:

struct IARPMatchConfidenceAlgorithm
{
    double Compute(const ARPEntry& entry) const;
};

struct WordsARPMatchConfidenceAlgorithm : public IARPMatchConfidenceAlgorithm
{
    WordsARPMatchConfidenceAlgorithm(const Manifest& manifest);
    double Compute(const ARPEntry& entry) const override;
};

Where the ARPEntry contains other information, like language, that I was going to leverage.

Member Author

I'll leave those changes for the future as we update the matching algorithm.

Member Author

Scratch that last comment; I updated it to use the interface you suggested. Ended up cleaner with it :D

Flor Elisa Chacon Ochoa added 3 commits March 28, 2022 16:32
* Moved all the logic for detecting+reporting changes from InstallFlow to ARPCorrelation
* Made matching threshold a global value
* Separated the overall matching algorithm and the string matching algorithms
* Removed hacks used for testing (having name + publisher in the ARPEntry, instead of using the data from the IPackageVersion)
* Expanded test infrastructure to be able to extend with just adding new datasets or algorithms
* Changed correlation to first try the Source search using normalized name and publisher, and only use the heuristics if that fails.

template <>
struct DataMapping<Data::ProductCodeFromARP>
{
    using value_t = std::vector<Utility::LocIndString>;
Member

This needs to be the entire AppsAndFeaturesEntry, not just the product code.

Member Author

Updated to include the whole AppsAndFeaturesEntry, although I'm not entirely sure if I did it The Right Way®

@florelis florelis marked this pull request as ready for review April 7, 2022 19:05
@florelis florelis requested a review from a team as a code owner April 7, 2022 19:05
src/AppInstallerCLICore/ExecutionContextData.h
sourceIdentifier = context.Get<Execution::Data::PackageVersion>()->GetProperty(PackageVersionProperty::SourceIdentifier);
}

Logging::Telemetry().LogSuccessfulInstallARPChange(
Member

Did this telemetry event get moved somewhere else? It should still be done in this function when one is found rather than being done in the helper method that could be used for other purposes.

That might mean changing the output of the helper to return additional information, although the count fields in this event are less meaningful with different algorithms. But we could still calculate the number of changes, how many manifests were above the threshold, and how many of those were changed as the values used here, in that order.

Member Author

I had moved it down to the function doing the correlation, but now it's back here. I changed the helper to return the count of changes/matches, although I'm keeping that count to only consider the exact matches from the source search, as I couldn't figure out a good way to keep the count consistent across the multiple "passes".

Do you have any ideas how to count the matching manifests when sometimes we use the exact matching and sometimes the confidence measures?

Member

As long as we can reason about the meaning, anything is fine. It can stay using the exact same values as before, just with a better guess. I thought we might use these values in some way to find things that weren't correlating, but it turned out to be very easy to find them 😉

So basically, these numbers are probably not important. Don't spend time trying to improve them, and if you think they are broken, we might consider just reporting 0 for all of them.

}

// Cross reference the changes with the search results
std::vector<std::shared_ptr<IPackage>> packagesInBoth;
auto names = arpEntry->GetMultiProperty(PackageVersionMultiProperty::Name);
Member

These will be the normalized name and publisher from the index. Not that it really matters, since normalizing them again for insertion into the next index is fine, but it might be worth a comment.

Also, for these ARP only snapshots I wouldn't expect to have more than one of any of these multi-properties.

src/AppInstallerCLICore/Workflows/InstallFlow.cpp
@@ -204,9 +205,35 @@ struct TestContext : public Context
}
};

// Override the correlation heuristic by an empty one to ensure that these tests
// consider only the exact matching.
struct TestHeuristicOverride
Member

Really these tests should probably be removed at some point in the future, but it doesn't need to be now since you have them working.

src/AppInstallerRepositoryCore/ARPCorrelation.cpp
auto s2 = Utility::FoldCase(sv2);

// distance[i][j] = distance between s1[0:i] and s2[0:j]
std::vector<std::vector<double>> distance{ s1.size(), std::vector<double>(s2.size(), 0.0) };
Member

I would suggest using a single allocation for efficiency.
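The single-allocation suggestion amounts to storing the same DP table as one flat vector indexed as `[i * width + j]` instead of a vector of vectors (one allocation total rather than one per row). A sketch, using plain Levenshtein distance for illustration; the PR's actual scoring may differ:

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>
#include <vector>

size_t EditDistance(std::string_view s1, std::string_view s2)
{
    const size_t width = s2.size() + 1;
    // distance[i * width + j] = edit distance between s1[0:i] and s2[0:j],
    // stored in a single contiguous allocation.
    std::vector<size_t> distance((s1.size() + 1) * width, 0);
    for (size_t i = 0; i <= s1.size(); ++i) { distance[i * width] = i; }
    for (size_t j = 0; j <= s2.size(); ++j) { distance[j] = j; }

    for (size_t i = 1; i <= s1.size(); ++i)
    {
        for (size_t j = 1; j <= s2.size(); ++j)
        {
            size_t substitution = distance[(i - 1) * width + (j - 1)]
                + (s1[i - 1] == s2[j - 1] ? 0 : 1);
            size_t insertion = distance[i * width + (j - 1)] + 1;
            size_t deletion = distance[(i - 1) * width + j] + 1;
            distance[i * width + j] = std::min({ substitution, insertion, deletion });
        }
    }

    return distance[s1.size() * width + s2.size()];
}
```

Beyond avoiding per-row allocations, the flat layout is also friendlier to the cache, since the whole table is contiguous.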

src/AppInstallerRepositoryCore/ARPCorrelation.cpp
}
}

// Also attempt to find the entry based on the manifest data
Member

I would like this older matching mechanism to fit into the same infrastructure so that the overall could be something like:

foreach (algorithm : { ... })
{
   algorithm.Init(manifest)
   if (Correlate(algorithm, ARP) > threshold)
       return that;
}

Thoughts for the future, doesn't need to change now.
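The multi-pass loop sketched above could look something like the following in C++ (hypothetical names, not the PR's actual types): cheap or exact matchers run first, and the slower heuristics only run if no earlier pass found anything over the threshold.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Manifest { std::string name; };
struct ArpEntry { std::string displayName; };

constexpr double Threshold = 0.5;

struct Matcher
{
    // Pre-compute anything manifest-specific for this pass.
    std::function<void(const Manifest&)> init;
    // Score one ARP entry in [0, 1].
    std::function<double(const ArpEntry&)> score;
};

// Run each matcher as a full pass over the ARP entries, in order from
// cheapest/most exact to slowest; stop at the first pass with a match.
std::optional<ArpEntry> Correlate(
    const Manifest& manifest,
    const std::vector<ArpEntry>& arp,
    const std::vector<Matcher>& matchers)
{
    for (const auto& matcher : matchers)
    {
        matcher.init(manifest);
        const ArpEntry* best = nullptr;
        double bestScore = Threshold;
        for (const auto& entry : arp)
        {
            double s = matcher.score(entry);
            if (s > bestScore) { bestScore = s; best = &entry; }
        }
        if (best) { return *best; }
    }
    return std::nullopt;
}
```

This keeps the orchestration in one place while each matcher stays a pure score function, as discussed above.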

src/AppInstallerCLITests/Correlation.cpp

src/AppInstallerCLITests/Correlation.cpp
@@ -82,13 +76,10 @@ XPDP2X1MMZ4KR8|Ashampoo Burning Studio 23|Ashampoo|Ashampoo Burning Studio 23|23
XPFCFBB4FB3D6D|Emjysoft Cleaner|Emjysoft|Emjysoft Cleaner 2022 v4.1|4.1|Emjysoft|{167B1302-A739-42DE-BBD2-4C2F13D1FF51}_is1
XPFCFKCNNTXGQD|Yandex Browser|Yandex|Yandex|21.9.1.686|ООО «ЯНДЕКС»|YandexBrowser
XPFCFL5ZTNFGD7|Wondershare Anireel|WONDERSHARE GLOBAL LIMITED|Wondershare Anireel(Build 1.6.2)||Wondershare Software|Wondershare Anireel_is1
XPFCFL5ZTNFGD7|Wondershare Anireel|WONDERSHARE GLOBAL LIMITED|Wondershare Helper Compact 2.6.0|2.6.0|Wondershare|{5363CE84-5F09-48A1-8B6C-6BB590FFEDF2}_is1
XPFCG86X2PGLDJ|Christmas Elf by Pothos|Pothos|Christmas Elf|||ChristmasElf
XPFCGHHXNH4WBW|Biblioteca e Prestiti Librari|Esposito Software di M. G. Caputo|Gestione Biblioteca e Prestiti Librari 3.0 Demo||Copyright Esposito Software|Gestione Biblioteca e Prestiti Librari 3.0 Demo_is1
XPFCWP0SQWXM3V|CCleaner|Piriform Software Ltd|CCleaner|5.89|Piriform|CCleaner
XPFCXFRDJ8VGPT|Домашняя Бухгалтерия|Keepsoft|Äîìàøíÿÿ áóõãàëòåðèÿ Lite|7.2|Keepsoft|Äîìàøíÿÿ áóõãàëòåðèÿ Lite
Member

Äîìàøíÿÿ áóõãàëòåðèÿ

These look like Cyrillic bytes in a legacy code page interpreted as Latin-1 chars; could be bad data in the original registry entries, or someone didn't preserve the encoding along the way.

Member Author

It's like this in the original registry:
(screenshot of the registry entry)

}

UErrorCode errorCode = UErrorCode::U_ZERO_ERROR;
auto utf32ByteCount = ucnv_convert("UTF-32", "UTF-8", nullptr, 0, input.data(), static_cast<int32_t>(input.size()), &errorCode);
Member

If there is a Windows API to convert to UTF-32, they hid it well...
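For illustration of what the ICU two-call pattern above computes (first a pre-flight call with a null buffer to get the required size, then the actual conversion), here is a minimal hand-rolled UTF-8 to UTF-32 decode using only the standard library. It has no error recovery for invalid sequences, which is exactly why the real code should keep using ICU.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Decode well-formed UTF-8 into code points. Invalid input is not detected;
// this is a sketch, not a replacement for ICU's validating converters.
std::vector<char32_t> Utf8ToUtf32(std::string_view input)
{
    std::vector<char32_t> result;
    for (size_t i = 0; i < input.size(); )
    {
        unsigned char lead = static_cast<unsigned char>(input[i]);
        // Number of continuation bytes implied by the lead byte.
        size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Mask off the length-marker bits of the lead byte.
        char32_t cp = lead < 0x80 ? lead : (lead & (0x3F >> extra));
        for (size_t k = 1; k <= extra && i + k < input.size(); ++k)
        {
            cp = (cp << 6) | (static_cast<unsigned char>(input[i + k]) & 0x3F);
        }
        result.push_back(cp);
        i += extra + 1;
    }
    return result;
}
```

The decode is the usual accumulate-6-bits-per-continuation-byte loop; ICU additionally validates sequences and handles unpaired or truncated input, which this sketch deliberately omits.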

@florelis florelis merged commit 4059f2b into microsoft:master Apr 8, 2022
@florelis florelis deleted the matching branch April 8, 2022 23:54
Successfully merging this pull request may close these issues.

Improve package matching against Apps & Features Entry