
Add heuristics for matching packages to ARP after installing #2044

Merged · 41 commits · Apr 8, 2022

Conversation

florelis
Member

@florelis florelis commented Mar 24, 2022

This adds heuristics for matching ARP data with a package after installing it, for better correlation later.

  • Defined a type for algorithms measuring how well an ARP entry matches a package. We will ultimately use only one, but several are kept for comparison and experimentation.
  • After installing, we try to use the matching heuristic to correlate one of the ARP changes with the package. If we find a match, we store the product code in the tracking catalog for future use.
  • Added tests to evaluate each heuristic. Each test takes some data, creates a summary of how many true/false positives/negatives we got, and reports them per heuristic. For the heuristic we ultimately use, we will also want to add a check on how good the results have to be.
  • Added test data with package names and publishers as they appear in the manifest and as seen in ARP after installing. This was obtained by installing several Store apps and inspecting ARP after the installation, with some manual editing of the data.

Note that this is still a work in progress. It doesn't work yet, and it is not very clean.



arpEntries.push_back(allARPEntries[i]);
}

auto match = measure.GetBestMatchForManifest(manifest, arpEntries);
Member

I think that the scoring algorithms should be only that: given the manifest and a single ARP entry, they compute a score in [0, 1.0]. Or, to allow for optimization, construct them with a manifest so that they can pre-compute anything that is manifest-specific, then compute a score per ARP entry. A confidence threshold would be defined globally to standardize on the minimum for considering something a possible match.

Then the overall score would factor in other data, like new/updated entry, version matching, etc. But that is agnostic of the name and publisher matching algorithm, and should mostly just be attempting to tiebreak or boost our confidence.

This test harness is about rating the name and publisher matching algorithm specifically, and would be a good place to also understand other metrics, like the gap between the first- and second-highest confidence scores. Having all of the scores enables that kind of investigation.
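The shape proposed above can be sketched roughly as follows. All names here are hypothetical, not the PR's actual types: an algorithm is constructed with the manifest so it can pre-compute, then scores each ARP entry in [0, 1.0], and a global threshold defines the minimum for a possible match.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

struct Manifest { std::string name; std::string publisher; };
struct ArpEntry { std::string displayName; std::string publisher; };

// Global confidence threshold: the minimum for considering something a possible match.
constexpr double MatchConfidenceThreshold = 0.5;

struct IMatchConfidenceAlgorithm
{
    virtual ~IMatchConfidenceAlgorithm() = default;
    // Score a single ARP entry against the manifest this algorithm was built with.
    virtual double ComputeConfidence(const ArpEntry& entry) const = 0;
};

// Trivial example algorithm: exact name equality scores 1.0, anything else 0.0.
struct ExactNameAlgorithm : IMatchConfidenceAlgorithm
{
    explicit ExactNameAlgorithm(const Manifest& manifest) : m_name(manifest.name) {}
    double ComputeConfidence(const ArpEntry& entry) const override
    {
        return entry.displayName == m_name ? 1.0 : 0.0;
    }
private:
    std::string m_name;  // pre-computed manifest-specific state
};

// Generic best-match search, agnostic of the scoring algorithm:
// returns the entry with the highest score that is also over the threshold.
std::optional<ArpEntry> FindBestMatch(
    const IMatchConfidenceAlgorithm& algorithm, const std::vector<ArpEntry>& entries)
{
    std::optional<ArpEntry> best;
    double bestScore = MatchConfidenceThreshold;
    for (const auto& entry : entries)
    {
        double score = algorithm.ComputeConfidence(entry);
        if (score > bestScore) { bestScore = score; best = entry; }
    }
    return best;
}
```

The orchestration concern (new/updated entry, version matching, tiebreaks) then lives entirely in the caller, outside the per-entry scorer.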

Member Author

This GetBestMatchForManifest() is defined in the base class. It loops over all the entries and finds the one with the highest score (that is also over the threshold). It's not defined by each algorithm; those only do what you say (assign a score).

The way I'm thinking about it, the scoring algorithms consider everything (name, version, if it is new), and the fact that we are currently only looking at name string matching is incidental. If we add scores for other things, the string matching can just be pushed one layer below these algorithm classes.

I'm making the threshold specific to each algorithm and not restricted to [0,1] because I think that is hard to normalize across multiple algorithms. Especially if we consider how to mix multiple scores into one. For example, I'm thinking that we could compare mixing the scores for name and publisher (or anything else) by a weighted sum, or by multiplying them, or by summing squares, and that would all affect the threshold and range.

Member

Ok, I didn't look closely at that, but I did find the base class header odd, so that is why. I believe there may be value in being able to run multiple passes, such as:

  1. Run a direct equality matcher looking for the easy cases where we actually have exact data
  2. Run the existing normalized name and publisher matcher
  3. Run these more complex ones we are working on now, which are slower, but better at matching

With this design, it is more complicated to use a base class with the implementation built in (not impossible, just a bit more convoluted). If the matcher interface were separated out, it would be easier to have a single orchestration class that can use one or more matchers. Again, not impossible to do with the current base class, but I think you end up needing to wrap the functionality anyway.

The overall scoring algorithm considers everything, but I don't think that each name+publisher matching algorithm should separately consider other fields. Not that they should be disallowed, as I could see potential value in versions occurring in the name, but generally those are separate algorithms. If we want to consider different version matching heuristics, we should do so as its own thing.

So I guess where I'm headed is that a pure interface class makes generic matchers possible, and a single orchestration class/function allows us to combine them as we see fit.

I'm making the threshold specific to each algorithm and not restricted to [0,1] because I think that is hard to normalize across multiple algorithms.

Unless the algorithm actually returns completely unbounded values, numbers in the range [0, 1] are always possible. This confidence scoring is very common, and having a normalized range is what makes it possible to combine scores from different aspects. Creating a standardized threshold for meaning also allows the different algorithms to be interchangeable to the code above. If mine thinks that 0.5+ is a solid match, but yours thinks that 0.9+ is, then how can the code above combine them? If the answer is "use the threshold", then the scoring algorithm could do the same to normalize to the standardized value. And it can do so with knowledge of its statistics; a linear mapping may not be the best approach for all cases.
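The normalization argument above can be made concrete with a small sketch (a hypothetical helper, not code from the PR): each algorithm knows its own raw score range and its own "solid match" point, and maps them onto a standardized [0, 1] scale where 0.5 always means the minimum possible match. A piecewise-linear mapping is used here for simplicity; an algorithm with knowledge of its statistics could pick a better curve.

```cpp
#include <algorithm>
#include <cassert>

// The standardized threshold shared by all algorithms after normalization.
constexpr double StandardThreshold = 0.5;

// Map a raw score from an algorithm-specific range [rawMin, rawMax], with the
// algorithm's own match threshold rawThreshold, onto the standardized [0, 1]
// scale. Requires rawMin < rawThreshold < rawMax.
double NormalizeScore(double raw, double rawMin, double rawThreshold, double rawMax)
{
    raw = std::clamp(raw, rawMin, rawMax);
    if (raw < rawThreshold)
    {
        // [rawMin, rawThreshold) maps onto [0, StandardThreshold)
        return StandardThreshold * (raw - rawMin) / (rawThreshold - rawMin);
    }
    // [rawThreshold, rawMax] maps onto [StandardThreshold, 1]
    return StandardThreshold
        + (1.0 - StandardThreshold) * (raw - rawThreshold) / (rawMax - rawThreshold);
}
```

With this, an algorithm whose "solid match" starts at 0.9 and one whose starts at 0.5 both report 0.5 at their respective thresholds, so the code above can combine them uniformly.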

Member Author

I made some changes based on what you said (but after re-reading it, I think it doesn't exactly match what you wanted...).

I kept all of your original code for matching ARP using product code and normalized names, and then the heuristics only run if that doesn't return anything. That's intended to be part of the orchestrator you mentioned.

I added a base class for name+publisher matchers that take the strings and return a value in [0,1]. I kept the 3 sample matchers I had earlier. I think this is where you can plug in the matching algorithm you wanted to implement.

I kept the base class for the overall algorithm, and added an implementation that for now considers only the string matching. This was intended to be the orchestrator for mix-and-matching smaller matchers, but I just realized that I didn't do it the way you suggested because it doesn't allow for doing a full pass with a single matcher, then another one with other matcher if needed.

I also added implementations of that base class to the tests, which use only a single string matching algorithm. That way we can evaluate just the accuracy of those, without considering anything else we may end up adding on top in the orchestrator.

I still need to test this E2E and play with the matching algorithms.

Member

Yes, it is now worse for my purposes, as I need more ARP information than you are providing. For my algorithm I want something more like:

struct IARPMatchConfidenceAlgorithm
{
    double Compute(const ARPEntry& entry) const;
};

struct WordsARPMatchConfidenceAlgorithm : public IARPMatchConfidenceAlgorithm
{
    WordsARPMatchConfidenceAlgorithm(const Manifest& manifest);
    double Compute(const ARPEntry& entry) const override;
};

Where the ARPEntry contains other information, like language, that I was going to leverage.

Member Author

I'll leave those changes for the future as we update the matching algorithm.

Member Author

Scratch that last comment; I updated it to use the interface you suggested. Ended up cleaner with it :D

Flor Elisa Chacon Ochoa added 3 commits March 28, 2022 16:32
* Moved all the logic for detecting+reporting changes from InstallFlow to ARPCorrelation
* Made matching threshold a global value
* Separated the overall matching algorithm and the string matching algorithms
* Removed hacks used for testing (having name + publisher in the ARPEntry, instead of using the data from the IPackageVersion)
* Expanded test infrastructure to be able to extend with just adding new datasets or algorithms
* Changed correlation to first try the Source search using normalized name and publisher, and only use the heuristics if that fails.

template <>
struct DataMapping<Data::ProductCodeFromARP>
{
    using value_t = std::vector<Utility::LocIndString>;
Member

This needs to be the entire AppsAndFeaturesEntry, not just the product code.

Member Author

Updated to include the whole AppsAndFeaturesEntry, although I'm not entirely sure if I did it The Right Way®

@florelis florelis marked this pull request as ready for review April 7, 2022 19:05
@florelis florelis requested a review from a team as a code owner April 7, 2022 19:05
src/AppInstallerCLICore/ExecutionContextData.h
sourceIdentifier = context.Get<Execution::Data::PackageVersion>()->GetProperty(PackageVersionProperty::SourceIdentifier);
}

Logging::Telemetry().LogSuccessfulInstallARPChange(
Member

Did this telemetry event get moved somewhere else? It should still be done in this function when one is found rather than being done in the helper method that could be used for other purposes.

That might mean changing the output of the helper to return additional information, although the count fields in this event are less meaningful with different algorithms. But we could still calculate the number of changes, how many manifests were above the threshold, and how many of those were changed as the values used here, in that order.

Member Author

I had moved it down to the function doing the correlation, but now it's back here. I changed the helper to return the count of changes/matches, although I'm keeping that count to only consider the exact matches from the source search, as I couldn't figure out a good way to keep the count consistent across the multiple "passes".

Do you have any ideas how to count the matching manifests when sometimes we use the exact matching and sometimes the confidence measures?

Member

As long as we can reason about the meaning, anything is fine. It can stay using the exact same values as before, just with a better guess. I thought we might use these values in some way to find things that weren't correlating, but it turned out to be very easy to find them 😉

So basically, these numbers are probably not important. Don't spend time trying to improve them, and if you think they are broken, we might consider just reporting 0 for all of them.

}

// Cross reference the changes with the search results
std::vector<std::shared_ptr<IPackage>> packagesInBoth;
auto names = arpEntry->GetMultiProperty(PackageVersionMultiProperty::Name);
Member

These will be the normalized name and publisher from the index. Not that it really matters, since normalizing them again for insertion into the next index is fine, but it might be worth a comment.

Also, for these ARP only snapshots I wouldn't expect to have more than one of any of these multi-properties.

src/AppInstallerCLICore/Workflows/InstallFlow.cpp
@@ -204,9 +205,35 @@ struct TestContext : public Context
}
};

// Override the correlation heuristic by an empty one to ensure that these tests
// consider only the exact matching.
struct TestHeuristicOverride
Member

Really these tests should probably be removed at some point in the future, but it doesn't need to be now since you have them working.

src/AppInstallerRepositoryCore/ARPCorrelation.cpp
auto s2 = Utility::FoldCase(sv2);

// distance[i][j] = distance between s1[0:i] and s2[0:j]
std::vector<std::vector<double>> distance{ s1.size(), std::vector<double>(s2.size(), 0.0) };
Member

I would suggest using a single allocation for efficiency.
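The single-allocation suggestion amounts to storing the same DP table as one flat vector indexed as `[i * width + j]` instead of a vector of vectors (one allocation total rather than one per row). A sketch, using plain Levenshtein distance for illustration; the PR's actual scoring may differ:

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>
#include <vector>

size_t EditDistance(std::string_view s1, std::string_view s2)
{
    const size_t width = s2.size() + 1;
    // distance[i * width + j] = edit distance between s1[0:i] and s2[0:j],
    // stored in a single contiguous allocation.
    std::vector<size_t> distance((s1.size() + 1) * width, 0);
    for (size_t i = 0; i <= s1.size(); ++i) { distance[i * width] = i; }
    for (size_t j = 0; j <= s2.size(); ++j) { distance[j] = j; }

    for (size_t i = 1; i <= s1.size(); ++i)
    {
        for (size_t j = 1; j <= s2.size(); ++j)
        {
            size_t substitution = distance[(i - 1) * width + (j - 1)]
                + (s1[i - 1] == s2[j - 1] ? 0 : 1);
            size_t insertion = distance[i * width + (j - 1)] + 1;
            size_t deletion = distance[(i - 1) * width + j] + 1;
            distance[i * width + j] = std::min({ substitution, insertion, deletion });
        }
    }

    return distance[s1.size() * width + s2.size()];
}
```

Beyond avoiding per-row allocations, the flat layout is also friendlier to the cache, since the whole table is contiguous.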

src/AppInstallerRepositoryCore/ARPCorrelation.cpp
}
}

// Also attempt to find the entry based on the manifest data
Member

I would like this older matching mechanism to fit into the same infrastructure so that the overall could be something like:

foreach (algorithm : { ... })
{
   algorithm.Init(manifest)
   if (Correlate(algorithm, ARP) > threshold)
       return that;
}

Thoughts for the future, doesn't need to change now.
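The multi-pass loop sketched above could look something like the following in C++ (hypothetical names, not the PR's actual types): cheap or exact matchers run first, and the slower heuristics only run if no earlier pass found anything over the threshold.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct Manifest { std::string name; };
struct ArpEntry { std::string displayName; };

constexpr double Threshold = 0.5;

struct Matcher
{
    // Pre-compute anything manifest-specific for this pass.
    std::function<void(const Manifest&)> init;
    // Score one ARP entry in [0, 1].
    std::function<double(const ArpEntry&)> score;
};

// Run each matcher as a full pass over the ARP entries, in order from
// cheapest/most exact to slowest; stop at the first pass with a match.
std::optional<ArpEntry> Correlate(
    const Manifest& manifest,
    const std::vector<ArpEntry>& arp,
    const std::vector<Matcher>& matchers)
{
    for (const auto& matcher : matchers)
    {
        matcher.init(manifest);
        const ArpEntry* best = nullptr;
        double bestScore = Threshold;
        for (const auto& entry : arp)
        {
            double s = matcher.score(entry);
            if (s > bestScore) { bestScore = s; best = &entry; }
        }
        if (best) { return *best; }
    }
    return std::nullopt;
}
```

This keeps the orchestration in one place while each matcher stays a pure score function, as discussed above.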

src/AppInstallerCLITests/Correlation.cpp

src/AppInstallerCLITests/Correlation.cpp
@@ -82,13 +76,10 @@ XPDP2X1MMZ4KR8|Ashampoo Burning Studio 23|Ashampoo|Ashampoo Burning Studio 23|23
XPFCFBB4FB3D6D|Emjysoft Cleaner|Emjysoft|Emjysoft Cleaner 2022 v4.1|4.1|Emjysoft|{167B1302-A739-42DE-BBD2-4C2F13D1FF51}_is1
XPFCFKCNNTXGQD|Yandex Browser|Yandex|Yandex|21.9.1.686|ООО «ЯНДЕКС»|YandexBrowser
XPFCFL5ZTNFGD7|Wondershare Anireel|WONDERSHARE GLOBAL LIMITED|Wondershare Anireel(Build 1.6.2)||Wondershare Software|Wondershare Anireel_is1
XPFCFL5ZTNFGD7|Wondershare Anireel|WONDERSHARE GLOBAL LIMITED|Wondershare Helper Compact 2.6.0|2.6.0|Wondershare|{5363CE84-5F09-48A1-8B6C-6BB590FFEDF2}_is1
XPFCG86X2PGLDJ|Christmas Elf by Pothos|Pothos|Christmas Elf|||ChristmasElf
XPFCGHHXNH4WBW|Biblioteca e Prestiti Librari|Esposito Software di M. G. Caputo|Gestione Biblioteca e Prestiti Librari 3.0 Demo||Copyright Esposito Software|Gestione Biblioteca e Prestiti Librari 3.0 Demo_is1
XPFCWP0SQWXM3V|CCleaner|Piriform Software Ltd|CCleaner|5.89|Piriform|CCleaner
XPFCXFRDJ8VGPT|Домашняя Бухгалтерия|Keepsoft|Äîìàøíÿÿ áóõãàëòåðèÿ Lite|7.2|Keepsoft|Äîìàøíÿÿ áóõãàëòåðèÿ Lite
Member

Äîìàøíÿÿ áóõãàëòåðèÿ

These look like Cyrillic bytes in a legacy code page interpreted as Latin-1 chars; could be bad data in the original registry entries, or someone didn't preserve the encoding along the way.

Member Author

It's like this in the original registry:
(screenshot of the registry entry)

}

UErrorCode errorCode = UErrorCode::U_ZERO_ERROR;
auto utf32ByteCount = ucnv_convert("UTF-32", "UTF-8", nullptr, 0, input.data(), static_cast<int32_t>(input.size()), &errorCode);
Member

If there is a Windows API to convert to UTF-32, they hid it well...
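For illustration of what the ICU two-call pattern above computes (first a pre-flight call with a null buffer to get the required size, then the actual conversion), here is a minimal hand-rolled UTF-8 to UTF-32 decode using only the standard library. It has no error recovery for invalid sequences, which is exactly why the real code should keep using ICU.

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Decode well-formed UTF-8 into code points. Invalid input is not detected;
// this is a sketch, not a replacement for ICU's validating converters.
std::vector<char32_t> Utf8ToUtf32(std::string_view input)
{
    std::vector<char32_t> result;
    for (size_t i = 0; i < input.size(); )
    {
        unsigned char lead = static_cast<unsigned char>(input[i]);
        // Number of continuation bytes implied by the lead byte.
        size_t extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        // Mask off the length-marker bits of the lead byte.
        char32_t cp = lead < 0x80 ? lead : (lead & (0x3F >> extra));
        for (size_t k = 1; k <= extra && i + k < input.size(); ++k)
        {
            cp = (cp << 6) | (static_cast<unsigned char>(input[i + k]) & 0x3F);
        }
        result.push_back(cp);
        i += extra + 1;
    }
    return result;
}
```

The decode is the usual accumulate-6-bits-per-continuation-byte loop; ICU additionally validates sequences and handles unpaired or truncated input, which this sketch deliberately omits.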

@florelis florelis merged commit 4059f2b into microsoft:master Apr 8, 2022
@florelis florelis deleted the matching branch April 8, 2022 23:54
Successfully merging this pull request may close these issues.

Improve package matching against Apps & Features Entry