Licenses missing in most report format #933

Closed
WhyJee opened this issue Mar 31, 2022 · 10 comments · Fixed by #1540
Labels: bug (Something isn't working), license (relating to software licensing)

WhyJee commented Mar 31, 2022

What happened:

Scanning the same image leads to different results depending on the output format.

| Type | Components | cpe | purl | Versions | Licenses | Notes |
|---|---|---|---|---|---|---|
| cyclonedx | 154 | 154 | 153 | 154 | 0 | WARN unable to convert relationship from CycloneDX 1.3 JSON, dropping... |
| cyclonedx-json | 154 | 154 | 153 | 154 | 0 | WARN unable to convert relationship from CycloneDX 1.3 JSON, dropping... |
| json | 153 | 806 | 153 | 153 | 153 | with 2 UNKNOWN License for python artifacts |
| spdx-json | 153 | 806 | 153 | 153 | 0 | |
| spdx-tag-value | 153 | 806 | 153 | 153 | 0 | |
| text | 153 | 0 | 0 | 153 | 0 | |
| table | 153 | 0 | 0 | 153 | 0 | |

Scanning the same image using tern

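| Type | Components | cpe | purl | Versions | Licenses | Notes |
|---|---|---|---|---|---|---|
| cyclonedxjson | 149 | 0 | 149 | 149 | 149 | |
| html | 149 | 0 | 0 | 149 | 149 | |
| json | 149 | 0 | 0 | 149 | 149 | |
| spdxtagvalue | 149 | 0 | 0 | 149 | 149 | |
| spdxjson | 149 | 0 | 0 | 149 | 149 | |
| yaml | 149 | 0 | 0 | 149 | 149 | |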

So the presence or absence of licenses is not a limitation of the formats themselves: for the common SPDX and CycloneDX formats, tern fills this field correctly. Since Syft has all the information in its json output, the loss probably occurs in the format converters (which I think is reflected by the WARN logs).

What you expected to happen:

The expectation is that the content is independent of the output format (except, of course, for table and text), and that everything a format can represent appears in the output.

How to reproduce it (as minimally and precisely as possible):

docker run \
            --rm \
            -it \
            -v /var/run/docker.sock:/var/run/docker.sock \
            -v $PWD:/tmp/workdir \
            anchore/syft:latest \
            -v \
            packages \
            -s Squashed \
            -o <format> \
            --file /tmp/workdir/bom.format \
            docker:almalinux:latest
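
The thread does not say how the per-format license counts were obtained. As a rough sketch, assuming the CycloneDX JSON and SPDX JSON outputs have been loaded with `json.load`, the counts could be compared like this. The function names are made up for illustration; only the `components[].licenses`, `packages[].licenseDeclared` and `packages[].licenseConcluded` fields come from the two specifications.

```python
from typing import Tuple

# SPDX values that mean "no license information was recorded".
_SPDX_EMPTY = {None, "", "NONE", "NOASSERTION"}


def count_cyclonedx_licenses(bom: dict) -> Tuple[int, int]:
    """Return (number of components, components with a non-empty licenses list)."""
    components = bom.get("components", [])
    licensed = sum(1 for c in components if c.get("licenses"))
    return len(components), licensed


def count_spdx_licenses(doc: dict) -> Tuple[int, int]:
    """Return (number of packages, packages with a declared or concluded license)."""
    packages = doc.get("packages", [])
    licensed = sum(
        1
        for p in packages
        if p.get("licenseDeclared") not in _SPDX_EMPTY
        or p.get("licenseConcluded") not in _SPDX_EMPTY
    )
    return len(packages), licensed
```

Running both functions on the files written by `--file` for each `-o` value would give the Components and Licenses columns for those two formats.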

Anything else we need to know?:

Environment:

  • Output of syft version: 0.42.4
  • OS (e.g: cat /etc/os-release or similar):
@WhyJee WhyJee added the bug Something isn't working label Mar 31, 2022
@spiffcs spiffcs added this to OSS Apr 27, 2022
@wagoodman wagoodman added the license relating to software licensing label Apr 28, 2022

mj commented Jun 7, 2022

The licenses are most likely missing because their names are not listed in internal/spdxlicense/license_list.go, and syft uses that list to validate licenses when converting to formats like CycloneDX JSON. This behaviour seems to make sense because, for example, Dependency-Track also does not process invalid license names like "GPL" (missing the version) or "BSD-3-clause-with-weird-numbering" (honestly WTF?) in SBOM files.

The warnings mentioned in this issue's description do not affect the license processing.
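
To illustrate the kind of validation described in this comment, here is a minimal sketch of a lookup that drops any name it does not know. It is not Syft's code: the table is a hypothetical excerpt whose keys are taken from examples quoted elsewhere in this thread, and the exact-match lookup is an assumption.

```python
# Hypothetical excerpt of a name -> SPDX ID table. The real list lives in
# internal/spdxlicense/license_list.go and is much longer.
LICENSE_IDS = {
    "apache-2": "Apache-2.0",
    "apache-2.0": "Apache-2.0",
    "apache-2.0.0": "Apache-2.0",
    "gpl-2": "GPL-2.0",
    "gpl-2.0": "GPL-2.0",
    "gpl-2.0.0": "GPL-2.0",
}


def filter_valid(names):
    """Split names into (recognized SPDX IDs, names that would be dropped)."""
    kept, dropped = [], []
    for name in names:
        spdx_id = LICENSE_IDS.get(name)  # assumption: exact-match lookup
        if spdx_id is None:
            dropped.append(name)
        else:
            kept.append(spdx_id)
    return kept, dropped
```

With a filter of this shape, a name such as "GPL" or "GPLv2+" ends up in the dropped list, and nothing is emitted for it in the converted output.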

@spiffcs spiffcs moved this to Backlog (Pulled Forward for Priority) in OSS Jul 8, 2022
@spiffcs spiffcs assigned spiffcs and unassigned spiffcs Jul 21, 2022

spiffcs commented Jul 21, 2022

Tagging @cpendery

@spiffcs spiffcs moved this from Backlog (Pulled Forward for Priority) to In Progress (Actively Resolving) in OSS Jul 21, 2022
@cpendery

@WhyJee I'm having trouble replicating your license counts. When trying to recreate your values using the almalinux image tag closest to your post and Syft 0.42.4, I'm only finding 4 licenses. While there is a difference between that and the other formats, it's entirely based on the filtering @mj mentioned.

docker run \
            --rm \
            -it \
            -v /var/run/docker.sock:/var/run/docker.sock \
            -v $PWD:/tmp/workdir \
            anchore/syft:v0.42.4 \
            -v \
            packages \
            -s Squashed \
            -o json \
            --file /tmp/workdir/bom.json \
            docker:almalinux:8.5-20220306

I'm able to replicate this filtering out of licenses in 0.52.0, with licenses like (Apache-2.0 OR MPL-1.1), Proprietary, and BSD being filtered out.


WhyJee commented Aug 3, 2022

Based on your comments @mj / @spiffcs, I have re-run the analysis with the latest Syft and the latest Almalinux image.

There are 153 license entries in the json output, which break down as follows:

| License | Count |
|---|---|
| BSD | 10 |
| BSD and GPLv2 | 1 |
| BSD and GPLv2+ | 3 |
| BSD and LGPLv2+ | 1 |
| BSD and LGPLv2 and Sleepycat | 2 |
| BSD or GPLv2 | 1 |
| BSD or GPLv2+ | 1 |
| BSD with advertising | 1 |
| GPLv2 | 4 |
| GPLv2+ | 14 |
| GPLv2+ and BSD | 1 |
| GPLv2 and GPLv2+ and LGPLv2+ and BSD with advertising and Public Domain | 1 |
| GPLv2+ and LGPLv2+ | 2 |
| GPLv2+ and LGPLv2+ with exceptions | 2 |
| GPLv2+ and Public Domain | 1 |
| (GPLv2+ or AFL) and GPLv2+ | 5 |
| GPLv2+ or LGPLv3+ | 4 |
| (GPLv2+ or LGPLv3+) and GPLv3+ | 1 |
| GPLv3+ | 12 |
| GPLv3+ and GFDL | 1 |
| GPLv3+ and GPLv2+ and LGPLv2+ and BSD | 1 |
| GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions and LGPLv2+ and BSD | 2 |
| GPLv3+ and LGPLv2+ | 2 |
| GPLv3+ or BSD | 1 |
| LGPL2.1+ (the library), GPL2+ (tests and examples) | 1 |
| LGPLv2 | 2 |
| LGPLv2+ | 24 |
| LGPLv2+ and BSD and Public Domain | 1 |
| LGPLv2+ and GPLv3+ | 3 |
| LGPLv2+ and LGPLv2+ with exceptions and GPLv2+ and GPLv2+ with exceptions and BSD and Inner-Net and ISC and Public Domain and GFDL | 3 |
| LGPLv2+ and MIT | 1 |
| LGPLv2+ and MIT and GPLv2+ | 2 |
| LGPLv3+ and GPLv3+ and GFDL | 1 |
| LGPLv3+ or GPLv2+ | 2 |
| (LGPLv3+ or GPLv2+) and GPLv3+ | 1 |
| MIT | 18 |
| MIT and Python and ASL 2.0 and BSD and ISC and LGPLv2 and MPLv2.0 and (ASL 2.0 or BSD) | 1 |
| OpenLDAP | 1 |
| OpenSSL and ASL 2.0 | 1 |
| pubkey | 1 |
| Public Domain | 9 |
| Python | 2 |
| SISSL and BSD | 1 |
| UNKNOWN | 2 |
| Vim and MIT | 1 |
| zlib and Boost | 1 |

From this we can split the problem into several categories and eventually solve some of them.

Multiple licenses

This is the tricky case, as the scanner would need a robust split algorithm (see the table above).
Then, of course, you still have the name-matching issue below to solve.

Note: a commercial tool our company is also investigating transforms:

MIT and Python and ASL 2.0 and BSD and ISC and LGPLv2 and MPLv2.0 and (ASL 2.0 or BSD)

into:

Python Software Foundation License 2.0 AND GNU Library General Public License v2 or later AND MIT License AND ISC License AND Apache License 2.0 AND BSD 3-clause "New" or "Revised" License AND Mozilla Public License 2.0

This is not a split, but it seems to be parsed fairly correctly.
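
As a starting point for such a split, here is a minimal sketch that pulls the individual license names out of an RPM-style expression. It assumes names are joined by the words "and" / "or" and that parentheses only group operands; `split_licenses` is a hypothetical helper, not something Syft provides. It deliberately discards the and/or structure, and free-form values such as "LGPL2.1+ (the library), GPL2+ (tests and examples)" are outside what it can handle.

```python
import re


def split_licenses(expression: str) -> list:
    """Return the distinct license names in an expression, in first-seen order."""
    # Parentheses only group operands here, so drop them before splitting.
    flat = expression.replace("(", " ").replace(")", " ")
    # Split on the standalone words "and" / "or", whatever their case.
    parts = re.split(r"\s+(?:and|or)\s+", flat.strip(), flags=re.IGNORECASE)
    names = []
    for part in parts:
        name = " ".join(part.split())  # collapse repeated whitespace
        if name and name not in names:
            names.append(name)
    return names


print(split_licenses("(GPLv2+ or AFL) and GPLv2+"))
# prints ['GPLv2+', 'AFL']
```

Names containing spaces, such as "ASL 2.0" or "Public Domain", stay intact because only the operator words act as separators.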

Single license

Name mismatch

This is the most common issue for a single license name.

ASL 2.0 not matching one of "apache-2" "apache-2.0" "apache-2.0.0", not leading to license Apache-2.0
GPLv2 not matching one of "gpl-2" "gpl-2.0" "gpl-2.0.0", not leading to license GPL-2.0
GPL2+ or GPLv2+ not matching one of "gpl-2+" "gpl-2.0+" "gpl-2.0.0+", not leading to license GPL-2.0+
GPLv3+ not matching one of "gpl-3+" "gpl-3.0+" "gpl-3.0.0+", not leading to license GPL-3.0+
LGPLv2 not matching one of "lgpl-2" "lgpl-2.0" "lgpl-2.0.0", not leading to license LGPL-2.0
LGPL2.1+ not matching one of "lgpl-2+" "lgpl-2.0+" "lgpl-2.0.0+", not leading to license LGPL-2.0+
LGPLv2+ not matching one of "lgpl-2+" "lgpl-2.0+" "lgpl-2.0.0+", not leading to license LGPL-2.0+
LGPLv3+ not matching one of "lgpl-3+" "lgpl-3.0+" "lgpl-3.0.0+", not leading to license LGPL-3.0+
MPLv2.0 not matching one of "mpl-2" "mpl-2.0" "mpl-2.0.0", not leading to license MPL-2.0

I don't know what the other packagers (Debian, ...) put in the license field, but it seems the solution could be to update license_list.go so that these names match. I am not sure we can ask Red Hat to rewrite all its RPMs to comply with Syft.
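
An alternative to adding every spelling to the list would be to normalize the RPM-style names before the lookup. The sketch below is only an illustration under that assumption: `to_spdx_id`, the alias table and the regular expression are hypothetical, and the target IDs follow the forms quoted above (for example GPL-2.0+). Note that it maps LGPL2.1+ to LGPL-2.1+, keeping the minor version.

```python
import re
from typing import Optional

# Names that need an explicit alias rather than a pattern (hypothetical table).
ALIASES = {"asl 2.0": "Apache-2.0"}

# Matches GPLv2, GPL2+, LGPLv2+, LGPL2.1+, MPLv2.0 and similar spellings.
_VERSIONED = re.compile(r"^(L?GPL|MPL)v?(\d+)(?:\.(\d+))?(\+?)$", re.IGNORECASE)


def to_spdx_id(name: str) -> Optional[str]:
    """Map an RPM-style license name to an SPDX-style ID, or None if unknown."""
    key = name.strip()
    alias = ALIASES.get(key.lower())
    if alias is not None:
        return alias
    match = _VERSIONED.match(key)
    if match is None:
        return None
    family, major, minor, plus = match.groups()
    # A missing minor version is treated as ".0", e.g. GPLv2 -> GPL-2.0.
    return f"{family.upper()}-{major}.{minor or '0'}{plus}"


for raw in ["GPLv2", "GPL2+", "LGPLv3+", "MPLv2.0", "ASL 2.0"]:
    print(raw, "->", to_spdx_id(raw))
```

Anything the function cannot map returns None, so the caller can still decide whether to drop the value or pass it through as free text.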

Name not recognized

This one occurs only if the single license is "MIT" (18 occurrences) or "BSD" (10 occurrences).
It seems, though I have not checked, that the lookup is an exact match; we might have expected something case-insensitive.

Solving these two issues would be a first step.

@flemminglau

For CycloneDX, note that the format allows
"license": {"id": "SPDX ref"}
or, if the license cannot be matched:
"license": {"name": "any text you want"}

It would be great if, at least in the CycloneDX case, the available information could be returned. A free-text license "name" is better than nothing.
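
A minimal sketch of that fallback, assuming the caller already has a set of recognized SPDX IDs (`known_ids` and `cyclonedx_license` are hypothetical names, not part of Syft):

```python
def cyclonedx_license(raw: str, known_ids: set) -> dict:
    """Build one entry for a CycloneDX component's "licenses" array."""
    if raw in known_ids:
        # Recognized SPDX identifier: use the structured "id" field.
        return {"license": {"id": raw}}
    # Unrecognized: keep the original text instead of dropping it.
    return {"license": {"name": raw}}


known = {"MIT", "Apache-2.0"}
print(cyclonedx_license("MIT", known))
# prints {'license': {'id': 'MIT'}}
print(cyclonedx_license("BSD with advertising", known))
# prints {'license': {'name': 'BSD with advertising'}}
```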


TTMaZa commented Nov 3, 2022

Hi @WhyJee,

thanks for bringing the different categories up!

One could argue that in a component licensed as "MIT AND LGPL-2.1-only" there are actually "sub components" that are licensed differently. So from the perspective of CycloneDX, this should somehow be two components (that are smooshed together) and not one component with the license "MIT AND LGPL-2.1-only".

But I'm pretty sure that this is just a rough estimation.

I've seen people release software under "GPL AND MIT" to tell you that you can choose one, and sometimes people release software under "GPL OR MIT" to give you this very choice.

On the other hand, SPDX and Fedora seem to agree that only "OR" should be used for this, and that "AND" should be used if different parts of the component have different licenses.

Fedora has a guideline for this: https://docs.fedoraproject.org/en-US/legal/license-field/#_license_expressions
SPDX has something similar: https://spdx.github.io/spdx-spec/v2.3/SPDX-license-expressions/

@flemminglau

But while we are waiting for the "real" solution would it not be better to report unknown (unmapped) licenses as "whatever" than not reporting them at all?


TTMaZa commented Nov 9, 2022

Hi there,

I did some cross-checking and found that other CycloneDX tools seem to struggle with license expressions such as "(LGPLv3+ or GPLv2+) and GPLv3+" as well:

CycloneDX/cyclonedx-python#377
DependencyTrack/dependency-track#170

The cyclonedx-python people went one step further than just struggling here:

CycloneDX/cyclonedx-python-lib#304

CycloneDX seems to have a precise way of doing this by embracing SPDX license expressions:

CycloneDX/specification#1

One more thing: Dependency Track plans to support these SPDX-License-Expressions as stated here: DependencyTrack/dependency-track#170 (comment)
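
For reference, CycloneDX lets an entry in a component's "licenses" array be an `{"expression": "..."}` object instead of a single license. The sketch below shows one way an RPM-style expression could be turned into that form. The name mapping is a hypothetical three-entry table and `to_cyclonedx_expression` is an illustrative helper; it returns None whenever a name is not in the mapping rather than guessing.

```python
import re
from typing import Optional

# Hypothetical mapping from RPM-style names to current SPDX identifiers.
SPDX_IDS = {
    "LGPLv3+": "LGPL-3.0-or-later",
    "GPLv2+": "GPL-2.0-or-later",
    "GPLv3+": "GPL-3.0-or-later",
}


def to_cyclonedx_expression(expression: str, ids: dict = SPDX_IDS) -> Optional[dict]:
    """Translate e.g. "(A or B) and C" into a CycloneDX expression entry."""
    # Keep the separators (parentheses and and/or) by using a capturing group.
    tokens = re.split(r"(\(|\)|\s+(?:and|or)\s+)", expression, flags=re.IGNORECASE)
    out = []
    for token in tokens:
        text = token.strip()
        if not text:
            continue
        if text in ("(", ")"):
            out.append(text)
        elif text.lower() in ("and", "or"):
            out.append(f" {text.upper()} ")  # SPDX operators are upper-case
        elif text in ids:
            out.append(ids[text])
        else:
            return None  # unknown name: do not emit a partial expression
    return {"expression": "".join(out)}


print(to_cyclonedx_expression("(LGPLv3+ or GPLv2+) and GPLv3+"))
# prints {'expression': '(LGPL-3.0-or-later OR GPL-2.0-or-later) AND GPL-3.0-or-later'}
```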

@kzantow kzantow self-assigned this Nov 16, 2022

dawez commented Jan 26, 2023

Hi there, do we have any progress on this?


kzantow commented Feb 7, 2023

Hi @dawez -- I think this will be fixed with #1540

gszr added a commit to Kong/public-shared-actions that referenced this issue Apr 18, 2023