-
Notifications
You must be signed in to change notification settings - Fork 587
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: duplicate packages when identical except location #1249
Conversation
Signed-off-by: a <[email protected]>
Signed-off-by: a <[email protected]>
@@ -99,16 +99,6 @@ func TestIDUniqueness(t *testing.T) { | |||
}, | |||
expectedIDComparison: assert.NotEqual, | |||
}, | |||
{ | |||
name: "location is reflected", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed as the location shouldn't be reflected in the ID anymore
@spiffcs to polish the unit/integration tests snapshots. Please ping me if its a bigger issue than just re-snapping them |
This PR introduces a hash:ignore tag for the virtual path in Java metadata. It also introduces a hash:ignore for the Locations field of a given package. Because these two are no longer considered as part of the struct hash golden snapshots needed to be regenerated. No other data should have changed beyond the ID/related fields or timestamps Signed-off-by: Christopher Phillips <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @cpendery and sorry I didn't catch my comment when we spoke earlier today. I ran into it when doing the integration testing updates. In your testing what situation did you see where this field being hashed was causing duplicates that were semantically the same?
@@ -20,7 +20,7 @@ var jenkinsPluginPomPropertiesGroupIDs = []string{ | |||
|
|||
// JavaMetadata encapsulates all Java ecosystem metadata for a package as well as an (optional) parent relationship. | |||
type JavaMetadata struct { | |||
VirtualPath string `json:"virtualPath" cyclonedx:"virtualPath"` // we need to include the virtual path in cyclonedx documents to prevent deduplication of jars within jars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can ignore this since it can help distinguish between two logically separate packages.
Example:
- We have a jenkins plugin that is tested where they are duplicates EXCEPT for the virtual path they are installed. For the first one
VirtualPath: /java/example-jenkins-plugin.hpi
. For the second oneVirtualPath: /java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar
<-- This is what allows syft to detect separate packages nested within their parents with the same name.
I think ignoring the Locations list is ok as far as duplicates are concerned, but this field shows is similarly named packages are located within each other or at separate locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syft/test/integration/catalog_packages_cases_test.go
Lines 352 to 359 in 48f0a46
{ | |
name: "find jenkins plugins", | |
pkgType: pkg.JenkinsPluginPkg, | |
pkgLanguage: pkg.Java, | |
duplicates: 1, // there is a "example-jenkins-plugin" HPI, and nested within that a JAR of the same name | |
pkgInfo: map[string]string{ | |
"example-jenkins-plugin": "1.0-SNAPSHOT", | |
}, |
This test case.
The debugger output below shows a catalog where the package has been added already with /java/example-jenkins-plugin.hpi
. The package to be added has /java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar
Since we're no longer hashing this syft sees them as identical even though one is the parent while the other is nested within a JAR of the same name.
virtualPath needs to be considered as a part of ID generation for packages that use the JavaMetadata struct. Consider a package that has a virtualPath of "/java/example-jenkins-plugin.hpi" vs one that has the virtualPath of "/java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar". Syft views these as separate packages and thus they should be cataloged and sorted separately and in a stable fashion. Conversely, if an apkdb package is found to have exactly the same metadata, it is considered the same package and the location lists are to be merged. Signed-off-by: Christopher Phillips <[email protected]>
Here's the latest commit message I wrote, but I wanted to throw it out here on the PR as well since it feels like there is some contention here:
^ That's the current behavior, but it looks to me like we're treating the two ecosystems differently. In the example provided by #1162 In the example given above for our Given VirtualPath is location data, should we be merging these packages under one object and showing all locations it's installed, or should we keep duplicating this data and generating 4 separate objects in our different SBOM outputs? |
cc @kzantow and cc @wagoodman for other maintainer 👀 |
My 2c here: all other things the same, I would consider two packages with the same ecosystem, name, and version to be the same(-ish) and we should try to merge all locations where we find evidence of said package together. I think the rub here is for example we might find one with RPM metadata and another with Java metadata, and merging these would lose some information from one of the catalogers, even if realistically the packages might be referring to the same thing (e.g. apache tomcat found via rpmdb but also due to jars on the filesystem). It seems like to do this effectively we might want to account for merging things like this, too (which seems like potentially a bigger change). |
Signed-off-by: Christopher Phillips <[email protected]>
remove specific sort case now that VirtualPath is ignored in ID generation. Digests have been added to the hash consideration which should fix the case this sort was added for where ID were identical when they should not have been Signed-off-by: Christopher Phillips <[email protected]>
Thanks @kzantow for the feedback! I spoke with @wagoodman offline as well. "It seems like to do this effectively we might want to account for merging things like this (cross ecosystem), too (which seems like potentially a bigger change)." I think we can speak about solving this one as its own PR but really good to highlight here. Specifically, to solve for #1162, we're going to ignore We'll also update |
Signed-off-by: Christopher Phillips <[email protected]>
Signed-off-by: Christopher Phillips <[email protected]>
I've pushed two fixes here that update I tested this with the example provided in the issue. Instead of 4 separate packages
@wagoodman if this is what we want here then I'll update the schema and tests to cover this new behavior |
* main: refactor: Remove experimental Anchore Enterprise upload functionality (anchore#1257) Update syft bootstrap tools to latest versions. (anchore#1254) Update Stereoscope to d24c9d626b33fa720210b007a20767801827b532 (anchore#1253) Update syft bootstrap tools to latest versions. (anchore#1244) fix apkdb checksum representation (anchore#1247) feat: add identifiable field to source object (anchore#1243) feat: attest support for Singularity images (anchore#1201) Update syft bootstrap tools to latest versions. (anchore#1239) Update Stereoscope to 1b1b744a919964f38d14e1416fb3f25221b761ce (anchore#1240) fix: Follow symlinks when searching for globs in all-layers scope (anchore#1221)
Signed-off-by: Christopher Phillips <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple questions about hash ignores
@@ -397,7 +407,7 @@ func packageIdentitiesMatch(p pkg.Package, parentPkg *pkg.Package) bool { | |||
metadata := p.Metadata.(pkg.JavaMetadata) | |||
|
|||
// the virtual path matches... | |||
if parentPkg.Metadata.(pkg.JavaMetadata).VirtualPath == metadata.VirtualPath { | |||
if len(metadata.VirtualPath) > 0 && metadata.VirtualPath[0] == parentPkg.Metadata.(pkg.JavaMetadata).VirtualPath[0] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check all VirtualPaths match, maybe in an unordered fashion?
ArchiveDigests []file.Digest `hash:"ignore" json:"digest,omitempty"` | ||
ArchiveDigests []file.Digest `json:"digest,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this remain hash:"ignore"
? If we add a virtual path, are we adding an archive digest, which would then alter the package ID?
virtualPaths := p.Metadata.(JavaMetadata).VirtualPath | ||
virtualPaths = append(virtualPaths, other.Metadata.(JavaMetadata).VirtualPath...) | ||
if t, ok := p.Metadata.(JavaMetadata); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're doing this check if t, ok := p.Metadata.(JavaMetadata); ok {
anyway, why not do this first instead of a possible panic here virtualPaths := p.Metadata.(JavaMetadata).VirtualPath
(albeit very unlikely). And we should probably test other.Metadata.(JavaMetadata)
, too.
Unfortunately this issue/PR is in conflict with some upcoming work, specifically #572 . Today syft reports out a rather "flat" package list, with some package relationships to files to convey which packages own which files. The infrastructure for allowing catalogers to discover and report up package-to-package relationships is done and #572 is meant to utilize this to allow SBOMs generated by Syft to convey package-to-package dependency graphs. Before picking up #572 I was very much in favor of deduplicating packages based on the metadata we find out about that package (and ignore where the package is from... in this way the "same bytes" found in multiple places would resolve to a single logical package). This method lends itself well when describing lists of unrelated packages, however, in the context of a graph, deduplicating (or merging) nodes means that the graph topology could be changing in unexpected ways. Take, for instance, two different and unrelated python projects in the same container image installed in two different locations, I think it would be surprising if the dependency graph for both projects had edges to the same node. Logically speaking the projects may share the same dependency at the same version, but in reality they do so in separate contexts, so the graph should reflect this. This is further complicated if there are installation details that result in different child dependencies that are not easily visible in the packaging metadata (I think python package I'm going to close this PR in the meantime and follow up in the issue with other options. I'd be happy to talk about this further here, in the issue, or in the community meetings. |
📝 Description
This pr fixes the issue of the same package being found in multiple locations producing multiple separate entries in the sbom.
Closes #1162