Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: duplicate packages when identical except location #1249

Closed
wants to merge 10 commits into from

Conversation

cpendery
Copy link
Contributor

@cpendery cpendery commented Oct 5, 2022

📝 Description

This pr fixes the issue of the same package being found in multiple locations producing multiple separate entries in the sbom.

Closes #1162

@@ -99,16 +99,6 @@ func TestIDUniqueness(t *testing.T) {
},
expectedIDComparison: assert.NotEqual,
},
{
name: "location is reflected",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed as the location shouldn't be reflected in the ID anymore

@cpendery
Copy link
Contributor Author

cpendery commented Oct 5, 2022

@spiffcs to polish the unit/integration tests snapshots. Please ping me if its a bigger issue than just re-snapping them

This PR introduces a hash:ignore tag for the virtual path in Java
metadata. It also introduces a hash:ignore for the Locations field of a
given package. Because these two are no longer considered as part of the
struct hash golden snapshots needed to be regenerated. No other data
should have changed beyond the ID/related fields or timestamps

Signed-off-by: Christopher Phillips <[email protected]>
Copy link
Contributor

@spiffcs spiffcs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @cpendery and sorry I didn't catch my comment when we spoke earlier today. I ran into it when doing the integration testing updates. In your testing what situation did you see where this field being hashed was causing duplicates that were semantically the same?

@@ -20,7 +20,7 @@ var jenkinsPluginPomPropertiesGroupIDs = []string{

// JavaMetadata encapsulates all Java ecosystem metadata for a package as well as an (optional) parent relationship.
type JavaMetadata struct {
VirtualPath string `json:"virtualPath" cyclonedx:"virtualPath"` // we need to include the virtual path in cyclonedx documents to prevent deduplication of jars within jars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can ignore this since it can help distinguish between two logically separate packages.

Example:

  • We have a jenkins plugin that is tested where they are duplicates EXCEPT for the virtual path they are installed. For the first one VirtualPath: /java/example-jenkins-plugin.hpi. For the second one VirtualPath: /java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar <-- This is what allows syft to detect separate packages nested within their parents with the same name.

I think ignoring the Locations list is ok as far as duplicates are concerned, but this field shows is similarly named packages are located within each other or at separate locations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{
name: "find jenkins plugins",
pkgType: pkg.JenkinsPluginPkg,
pkgLanguage: pkg.Java,
duplicates: 1, // there is a "example-jenkins-plugin" HPI, and nested within that a JAR of the same name
pkgInfo: map[string]string{
"example-jenkins-plugin": "1.0-SNAPSHOT",
},

This test case.

The debugger output below shows a catalog where the package has been added already with /java/example-jenkins-plugin.hpi. The package to be added has /java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar

Since we're no longer hashing this syft sees them as identical even though one is the parent while the other is nested within a JAR of the same name.

Screen Shot 2022-10-05 at 9 38 38 PM

virtualPath needs to be considered as a part of ID generation for
packages that use the JavaMetadata struct. Consider a package that has a
virtualPath of "/java/example-jenkins-plugin.hpi" vs one that has the
virtualPath of "/java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar".
Syft views these as separate packages and thus they should be cataloged
and sorted separately and in a stable fashion. Conversely, if an apkdb
package is found to have exactly the same metadata, it is considered the
same package and the location lists are to be merged.

Signed-off-by: Christopher Phillips <[email protected]>
@spiffcs
Copy link
Contributor

spiffcs commented Oct 6, 2022

Here's the latest commit message I wrote, but I wanted to throw it out here on the PR as well since it feels like there is some contention here:

virtualPath needs to be considered as a part of ID generation for
packages that use the JavaMetadata struct. Consider a package that has a
virtualPath of "/java/example-jenkins-plugin.hpi" vs one that has the
virtualPath of "/java/example-jenkins-plugin.hpi:WEB-INF/lib/example-jenkins-plugin.jar".
Syft views these as separate packages and thus they should be cataloged
and sorted separately and in a stable fashion. 

Conversely, if an apkdb package is found to have exactly the same metadata, it is considered the
same package and the location lists are to be merged.

^ That's the current behavior, but it looks to me like we're treating the two ecosystems differently.

In the example provided by #1162 VirtualPath is the only field that is different among the 4 packages listed as "slf4j-api". The sha1 digests are identical, with only the metadata.VirtualPath field showing any difference.

In the example given above for our example-jenkins-plugin scenario, the sha1 digests for the packages are different values, which demonstrates the value in separating them based on their virtual paths.

Given VirtualPath is location data, should we be merging these packages under one object and showing all locations it's installed, or should we keep duplicating this data and generating 4 separate objects in our different SBOM outputs?

@spiffcs spiffcs marked this pull request as ready for review October 6, 2022 03:09
@spiffcs spiffcs requested review from kzantow and wagoodman October 6, 2022 03:09
@spiffcs
Copy link
Contributor

spiffcs commented Oct 6, 2022

cc @kzantow and cc @wagoodman for other maintainer 👀
cc @zhill

@kzantow
Copy link
Contributor

kzantow commented Oct 6, 2022

My 2c here: all other things the same, I would consider two packages with the same ecosystem, name, and version to be the same(-ish) and we should try to merge all locations where we find evidence of said package together. I think the rub here is for example we might find one with RPM metadata and another with Java metadata, and merging these would lose some information from one of the catalogers, even if realistically the packages might be referring to the same thing (e.g. apache tomcat found via rpmdb but also due to jars on the filesystem). It seems like to do this effectively we might want to account for merging things like this, too (which seems like potentially a bigger change).

remove specific sort case now that VirtualPath is ignored in ID
generation. Digests have been added to the hash consideration which
should fix the case this sort was added for where ID were identical when
they should not have been

Signed-off-by: Christopher Phillips <[email protected]>
@spiffcs
Copy link
Contributor

spiffcs commented Oct 6, 2022

Thanks @kzantow for the feedback! I spoke with @wagoodman offline as well.

"It seems like to do this effectively we might want to account for merging things like this (cross ecosystem), too (which seems like potentially a bigger change)."

I think we can speak about solving this one as its own PR but really good to highlight here.

Specifically, to solve for #1162, we're going to ignore VirtualPath, but include ArchiveDigests in the generation of the ID.
This should reduce the output from 4 > 1 for pkg:maven/org.slf4j/[email protected].

We'll also update VirtualPath from string => []string on the JavaMetadata so when we call merge we can account for all VirtualLocations for a single Java package, with the same digest, in the same way we treat locations for other packages.

@spiffcs spiffcs self-assigned this Oct 6, 2022
@spiffcs
Copy link
Contributor

spiffcs commented Oct 7, 2022

I've pushed two fixes here that update JavaMetadata.VirtualPath from string => []string. The merge function for the catalog has been updated to reflect this change.

I tested this with the example provided in the issue. Instead of 4 separate packages syft-json is now a single package that looks like this:

  {
   "id": "6711fde2ffdba0f0",
   "name": "slf4j-api",  "version": "1.6.2",
   "type": "java-archive",
   "foundBy": "java-cataloger",
    ................
   "metadataType": "JavaMetadata",
   "metadata": {
    "virtualPath": [
     "/usr/share/elasticsearch/modules/vector-tile/slf4j-api-1.6.2.jar",
     "/usr/share/elasticsearch/modules/x-pack-security/slf4j-api-1.6.2.jar",
     "/usr/share/elasticsearch/modules/x-pack-identity-provider/slf4j-api-1.6.2.jar",
     "/usr/share/elasticsearch/modules/repository-azure/slf4j-api-1.6.2.jar"
    ],
    "manifest": {
     "main": {
      "Archiver-Version": "Plexus Archiver",
      "Build-Jdk": "1.6.0_16",

@wagoodman if this is what we want here then I'll update the schema and tests to cover this new behavior

* main:
  refactor: Remove experimental Anchore Enterprise upload functionality (anchore#1257)
  Update syft bootstrap tools to latest versions. (anchore#1254)
  Update Stereoscope to d24c9d626b33fa720210b007a20767801827b532 (anchore#1253)
  Update syft bootstrap tools to latest versions. (anchore#1244)
  fix apkdb checksum representation (anchore#1247)
  feat: add identifiable field to source object (anchore#1243)
  feat: attest support for Singularity images (anchore#1201)
  Update syft bootstrap tools to latest versions. (anchore#1239)
  Update Stereoscope to 1b1b744a919964f38d14e1416fb3f25221b761ce (anchore#1240)
  fix: Follow symlinks when searching for globs in all-layers scope (anchore#1221)
Signed-off-by: Christopher Phillips <[email protected]>
Copy link
Contributor

@kzantow kzantow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple questions about hash ignores

@@ -397,7 +407,7 @@ func packageIdentitiesMatch(p pkg.Package, parentPkg *pkg.Package) bool {
metadata := p.Metadata.(pkg.JavaMetadata)

// the virtual path matches...
if parentPkg.Metadata.(pkg.JavaMetadata).VirtualPath == metadata.VirtualPath {
if len(metadata.VirtualPath) > 0 && metadata.VirtualPath[0] == parentPkg.Metadata.(pkg.JavaMetadata).VirtualPath[0] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check all VirtualPaths match, maybe in an unordered fashion?

ArchiveDigests []file.Digest `hash:"ignore" json:"digest,omitempty"`
ArchiveDigests []file.Digest `json:"digest,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this remain hash:"ignore"? If we add a virtual path, are we adding an archive digest, which would then alter the package ID?

Comment on lines +72 to +74
virtualPaths := p.Metadata.(JavaMetadata).VirtualPath
virtualPaths = append(virtualPaths, other.Metadata.(JavaMetadata).VirtualPath...)
if t, ok := p.Metadata.(JavaMetadata); ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're doing this check if t, ok := p.Metadata.(JavaMetadata); ok { anyway, why not do this first instead of a possible panic here virtualPaths := p.Metadata.(JavaMetadata).VirtualPath (albeit very unlikely). And we should probably test other.Metadata.(JavaMetadata), too.

@wagoodman
Copy link
Contributor

Unfortunately this issue/PR is in conflict with some upcoming work, specifically #572 .

Today syft reports out a rather "flat" package list, with some package relationships to files to convey which packages own which files. The infrastructure for allowing catalogers to discover and report up package-to-package relationships is done and #572 is meant to utilize this to allow SBOMs generated by Syft to convey package-to-package dependency graphs.

Before picking up #572 I was very much in favor of deduplicating packages based on the metadata we find out about that package (and ignore where the package is from... in this way the "same bytes" found in multiple places would resolve to a single logical package). This method lends itself well when describing lists of unrelated packages, however, in the context of a graph, deduplicating (or merging) nodes means that the graph topology could be changing in unexpected ways.

Take, for instance, two different and unrelated python projects in the same container image installed in two different locations, I think it would be surprising if the dependency graph for both projects had edges to the same node. Logically speaking the projects may share the same dependency at the same version, but in reality they do so in separate contexts, so the graph should reflect this. This is further complicated if there are installation details that result in different child dependencies that are not easily visible in the packaging metadata (I think python package extras might be a good example of this -- the same package is installed with different options and could result in different dependencies needed... though I'd need to dig to come to a solid conclusion on this).

I'm going to close this PR in the meantime and follow up in the issue with other options. I'd be happy to talk about this further here, in the issue, or in the community meetings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Similar Packages Should Be Aggregated
4 participants