Introduce notion of Package set #95

Open
JonoYang opened this issue May 10, 2023 · 11 comments
@JonoYang
Member

We have cases where several distinct artifacts belong to the same package. For example, a maven Package can have source, test, and doc JARs. These JARs are part of the same package but are treated as different packages in the PackageDB because they have different download URLs. They have similar purl values (same type, namespace, name, and version) but different subpath and qualifiers. We want to relate this group of Packages so that we can combine the findings from the different varieties of a Package and return the combined data to the user.

After some discussion with @pombredanne, we will add two new fields to the Package model:

  • package_set (working name)
    • This field contains a UUID or other unique identifier for a group of packages. The ID is unique to a particular purl (type, namespace, name, version) combination. It is generated when a Package whose purl is not yet in the PackageDB is created. When another Package with the same type, namespace, name, and version is created, it reuses that ID.
  • package_content
    • This field contains a tag that describes what is in this package, either source, binary, doc, tests, etc.

There would be an API endpoint (or equivalent) where we query for a package; we would then combine the metadata based on a set precedence (curated package data, then source data, then binary, etc.) and return the combined package data to the user.

@pombredanne @DennisClark

I'd like to hear your thoughts on grouping like packages via ID

@JonoYang JonoYang self-assigned this May 10, 2023
@DennisClark
Member

@JonoYang Package Set is a great idea. I think it would help to get some concrete examples of the proposed new field package_content, since I am having trouble visualizing the detailed values in such a field.

@JonoYang
Member Author

@DennisClark

There are multiple JARs of log4j-core v2.0 at https://repo1.maven.org/maven2/org/apache/logging/log4j/log4j-core/2.0/

We would create an entry in the PackageDB for each JAR. The tag in package_content describes the category that the files within the JAR fall under. log4j-core-2.0-javadoc.jar would have a package_content value of doc, log4j-core-2.0-sources.jar would be source, log4j-core-2.0-tests.jar would be test, and log4j-core-2.0.jar would be binary.
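As a rough illustration of the labeling described above, here is a minimal sketch that derives a package_content tag from a Maven JAR filename. The function name `guess_package_content` and the mapping `CLASSIFIER_TO_CONTENT` are hypothetical and only cover the four log4j-core examples; the actual mining code may decide this differently.

```python
from typing import Optional

# Hypothetical mapping from a Maven classifier suffix to a package_content tag.
# Only the classifiers mentioned in this thread are covered.
CLASSIFIER_TO_CONTENT = {
    "javadoc": "doc",
    "sources": "source",
    "tests": "test",
}


def guess_package_content(filename: str, name: str, version: str) -> Optional[str]:
    """Return a package_content tag for a JAR filename, or None if unknown."""
    prefix = f"{name}-{version}"
    if not filename.startswith(prefix) or not filename.endswith(".jar"):
        return None
    # Whatever sits between "<name>-<version>" and ".jar" is the classifier part.
    rest = filename[len(prefix):-len(".jar")]
    if rest == "":
        # No classifier: assume this is the main (binary) JAR.
        return "binary"
    return CLASSIFIER_TO_CONTENT.get(rest.lstrip("-"))


for jar in (
    "log4j-core-2.0-javadoc.jar",
    "log4j-core-2.0-sources.jar",
    "log4j-core-2.0-tests.jar",
    "log4j-core-2.0.jar",
):
    print(jar, "->", guess_package_content(jar, "log4j-core", "2.0"))
```

Unknown classifiers return None rather than a guess, so such packages could be left unlabeled until someone decides how to categorize them.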

@DennisClark
Member

@JonoYang thanks -- my original problem was thinking that package_content referred to the whole set, but instead we are planning to choose a specific label for each jar -- that makes a lot of sense!

@DennisClark
Member

@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.

@pombredanne
Member

@JonoYang and of course, I also wondered if the package_set concept might be applied to technologies other than Java, where a group of packages might be considered part of a set.

@DennisClark I think so. For instance, these could be sets:

@pombredanne
Member

@JonoYang you wrote:

There would be an API endpoint (or equivalent), where we query for a package, then we would combine the metadata based on some set precedence (curated package data, source data, then binary, etc) and return the combined package data to the user.

I guess some examples of how we could combine metadata could be either generic or package type specific or even package name or PURL-specific.

  • a generic way license data may flow could be: Git repo -> source archive -> binary build
  • a type-specific way may be: maven JAR -> maven binary JAR

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented May 15, 2023

Btw, adding another example for this: https://repo1.maven.org/maven2/commons-daemon/commons-daemon/1.3.3/
We have a lot of variations of Java source JARs, native JARs, and other types of packages here.

JonoYang added a commit that referenced this issue May 16, 2023
    * Add package_set and package_content
    * Add migration to set package_content field, if possible

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue May 17, 2023
    * package_set is now a UUIDField
    * package_content is now an IntegerField

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue May 17, 2023
    * get_mixed_package is now get_enhanced_package
    * Perform query for each package, not each field on the package

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue May 17, 2023
@JonoYang
Member Author

The package_content field is an IntegerField with the following choices as values:

  • CURATION = 1
    • This is a special package whose data maps to a particular package in a package set. Curations contain package data that has been curated by a user or from some other source.
  • PATCH = 2
  • SOURCE_REPO = 3
  • SOURCE_ARCHIVE = 4
  • BINARY = 5
  • TEST = 6
  • DOC = 7

For now, this field will be populated for new Packages created by the maven or npm on-demand Package mining code. Existing Packages will have a null value in this field. In the future, we will have "improvers", tasks that update Package data, and one of them will set package_content. (Related: another possible improver would group maven Packages with the same SHA1s into the same package_set.)
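To make the choices above concrete, here is a minimal standalone sketch using Python's standard `IntEnum`. The class name `PackageContentType` and the helper `parse_package_content` are assumptions for illustration; the project's model presumably declares these as choices on the IntegerField rather than as a plain enum.

```python
from enum import IntEnum
from typing import Optional


class PackageContentType(IntEnum):
    """Integer values as listed in the comment above (class name is assumed)."""
    CURATION = 1
    PATCH = 2
    SOURCE_REPO = 3
    SOURCE_ARCHIVE = 4
    BINARY = 5
    TEST = 6
    DOC = 7


def parse_package_content(value: Optional[int]) -> Optional[PackageContentType]:
    """Convert a stored value to an enum member.

    Existing Packages have a null value in this field, so None passes through.
    """
    if value is None:
        return None
    return PackageContentType(value)


print(parse_package_content(5))
print(parse_package_content(None))
```

Because the members are integers, they can be compared and sorted directly, which is convenient when ordering packages by package_content.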

In get_enhanced_package, we have a package and the other packages from the same package set. We order the other packages by their package_content and purl fields, and skip any package that has the same package_content as ours. We then update the following fields, if the package we are looking at has no value in them already:

UPDATEABLE_FIELDS = [
    'primary_language',
    'copyright',

    'declared_license_expression',
    'declared_license_expression_spdx',
    'license_detections',
    'other_license_expression',
    'other_license_expression_spdx',
    'other_license_detections',
    # TODO: update extracted license statement and other fields together
    # all license fields are based off of `extracted_license_statement` and should be treated as a unit
    # hold off for now
    'extracted_license_statement',

    'notice_text',
    'api_data_url',
    'bug_tracking_url',
    'code_view_url',
    'vcs_url',
    'source_packages',
    'repository_homepage_url',
    'dependencies',
    'parties',
]
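The merging step described above can be sketched roughly as follows. This is a simplified illustration on plain dicts, not the actual get_enhanced_package code: the function name `enhance`, the shortened `FIELDS` list, the sample purls, and the choice to ignore peers without a package_content are all assumptions.

```python
from typing import Any, Dict, List

# Shortened stand-in for UPDATEABLE_FIELDS, to keep the example small.
FIELDS = ["primary_language", "copyright", "vcs_url"]

Package = Dict[str, Any]


def enhance(package: Package, peers: List[Package]) -> Package:
    """Return a copy of `package` with empty fields filled in from its peers."""
    enhanced = dict(package)
    # Assumption: peers without a package_content value are ignored.
    labeled = [p for p in peers if p.get("package_content") is not None]
    # Order the peers by package_content, then by purl.
    ordered = sorted(labeled, key=lambda p: (p["package_content"], p["purl"]))
    for peer in ordered:
        # Skip peers with the same package_content as the package itself.
        if peer["package_content"] == package.get("package_content"):
            continue
        for field in FIELDS:
            # Only fill fields that have no value yet; the first peer wins.
            if not enhanced.get(field) and peer.get(field):
                enhanced[field] = peer[field]
    return enhanced


binary = {"purl": "pkg:maven/org.example/demo@1.0", "package_content": 5,
          "primary_language": "Java", "copyright": "", "vcs_url": None}
peers = [
    {"purl": "pkg:maven/org.example/demo@1.0?classifier=sources",
     "package_content": 4, "copyright": "Copyright Archive"},
    {"purl": "pkg:github/example/demo@1.0", "package_content": 3,
     "primary_language": "Kotlin", "copyright": "Copyright Repo",
     "vcs_url": "git+https://example.org/demo.git"},
    {"purl": "pkg:maven/org.example/demo@1.0?type=zip",
     "package_content": 5, "copyright": "Copyright Other Binary"},
]
print(enhance(binary, peers))
```

In this sketch, values already present on the package are never overwritten, and a lower package_content value takes precedence among peers because of the sort order.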

@DennisClark
Member

@JonoYang glad to see the progress report here! My only concern is how a consumer of scan results will know that, for example, package_content=3 means the package is a SOURCE_REPO. Tools can be made smart enough to figure that out, of course, but someone looking at the raw JSON output might be unable to interpret the "3".

JonoYang added a commit that referenced this issue May 18, 2023
    * Update ResourceAPISerializer
    * Update scan_and_fingerprint_package pipeline

Signed-off-by: Jono Yang <[email protected]>
@JonoYang
Member Author

@DennisClark

Good catch! I'll update the Package serializers to return the package_content value's label, instead of the value.
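A minimal sketch of what returning the label could look like, independent of any serializer framework. The names `PACKAGE_CONTENT_LABELS` and `serialize_package` are hypothetical, and the lowercase label spelling is an assumption; the real serializers may format the label differently.

```python
from typing import Any, Dict, Optional

# Hypothetical value-to-label table; the label spelling is an assumption.
PACKAGE_CONTENT_LABELS = {
    1: "curation",
    2: "patch",
    3: "source_repo",
    4: "source_archive",
    5: "binary",
    6: "test",
    7: "doc",
}


def package_content_label(value: Optional[int]) -> Optional[str]:
    """Map the stored integer to its label; null values stay null."""
    if value is None:
        return None
    return PACKAGE_CONTENT_LABELS[value]


def serialize_package(package: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the package data with package_content as a label."""
    data = dict(package)
    data["package_content"] = package_content_label(package.get("package_content"))
    return data


print(serialize_package({"purl": "pkg:github/example/demo@1.0", "package_content": 3}))
```

With this shape, someone reading the raw JSON sees "source_repo" instead of "3".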

JonoYang added a commit that referenced this issue May 19, 2023
    * Show package_content label instead of value in API results

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue May 20, 2023
    * Add test for get_enhanced_package

Signed-off-by: Jono Yang <[email protected]>
@JonoYang JonoYang mentioned this issue May 20, 2023
JonoYang added a commit that referenced this issue May 22, 2023
    * Update code to reflect changes

Signed-off-by: Jono Yang <[email protected]>
@JonoYang
Member Author

We're running into an issue where multiple packages share the same source_repo package. This conflicts with the database constraint that requires the combination of purl values and download_url to be unique.

A solution may be to allow packages to be in multiple package sets. This would involve adding the field package_sets to the Package model.

Right now, we set the package_set value at package insert time, where we check to see if an existing package with the same purl values as the package we're about to make exists. If it does, we use that package's package_set value for our new package. If it doesn't, we create a new package_set value using uuid4().
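The insert-time behavior described above can be sketched with an in-memory stand-in for the database. The class `InMemoryPackageDB` and its `insert` method are hypothetical and only model the lookup by purl values and the `uuid4()` fallback; the real code works against the Package model.

```python
import uuid
from typing import Dict, Optional, Tuple

# (type, namespace, name, version); namespace may be absent for some types.
PurlKey = Tuple[str, Optional[str], str, str]


class InMemoryPackageDB:
    """Toy stand-in for the PackageDB, tracking one package_set per purl key."""

    def __init__(self) -> None:
        self._set_by_purl: Dict[PurlKey, uuid.UUID] = {}

    def insert(self, type: str, namespace: Optional[str],
               name: str, version: str) -> uuid.UUID:
        """Record a package and return the package_set value it receives."""
        key = (type, namespace, name, version)
        package_set = self._set_by_purl.get(key)
        if package_set is None:
            # No existing package with these purl values: start a new set.
            package_set = uuid.uuid4()
            self._set_by_purl[key] = package_set
        # Otherwise reuse the existing package's package_set value.
        return package_set


db = InMemoryPackageDB()
binary_set = db.insert("maven", "org.apache.logging.log4j", "log4j-core", "2.0")
sources_set = db.insert("maven", "org.apache.logging.log4j", "log4j-core", "2.0")
other_set = db.insert("maven", "org.apache.logging.log4j", "log4j-core", "2.1")
print(binary_set == sources_set, binary_set == other_set)
```

Because the key only uses type, namespace, name, and version, packages that differ in qualifiers or subpath land in the same set, while a package with a different purl gets its own set.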

The source_repo packages we want to create have a different purl than the packages they are for, so it is not always a straightforward process to automatically associate a source_repo package with its binary or source_archive package. We would have to do this by hand, knowing which Packages to associate our new source_repo package with.

JonoYang added a commit that referenced this issue Jun 24, 2023
JonoYang added a commit that referenced this issue Jun 28, 2023
Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Jun 28, 2023
JonoYang added a commit that referenced this issue Jun 29, 2023
JonoYang added a commit that referenced this issue Jul 19, 2023
JonoYang added a commit that referenced this issue Jul 19, 2023
    * Create package set only when there is an existing related package

Signed-off-by: Jono Yang <[email protected]>
JonoYang added a commit that referenced this issue Jul 19, 2023
    * Show hyperlinks to packages in package sets within a package in the Package API
    * Show package uids of packages in package sets within a Package in the Package metadata

Signed-off-by: Jono Yang <[email protected]>