Improve: Parse POM from maven repo to extract and save license information in Java DB #21

namandf · 2023-08-07T10:26:29Z

Discussion: aquasecurity/trivy#4236 (reply in thread)
Unit Tests: Working as expected


CLAassistant · 2023-08-07T10:26:34Z

All committers have signed the CLA.

ChristianCiach · 2023-08-07T12:42:21Z

Sorry for chiming in. I am just an interested outsider who knows next to nothing about the architecture of this project.

I see that you save the license as a TEXT column of the indices table. I just want to add my random thoughts just in case, even though these may have already been addressed:

Maven artifacts can have multiple licenses, and you are joining them into a single string with a comma as the separator. I don't think this is the best way to model this from a database perspective.
I think it would be better to create a dedicated licenses table and to join it to the indices table via an index_license table to resolve the n:m relation. This would also enable massive string deduplication, which is currently impossible because of the comma-joined licenses, and would possibly save a lot of space.

namandf · 2023-08-07T12:54:13Z

Hi @ChristianCiach ,
That's a valid point.

I didn't want to overcomplicate the change (additional tables to track licenses, mapping GAV, i.e. creating a unique id in the indices table, and additional joins in read queries). It would definitely be an optimization, but I'm not sure the benefits are worth the effort, especially because the difference in size isn't large and the DB is cached/refreshed every 3 days, if I am not wrong.

Even in the case of CI/CD, there might be alternatives, such as mounting the cache directory, to avoid downloading the DB multiple times.

But if everybody feels it's a mandatory change, we can work on it.

Regarding licenses being a TEXT column: my bad. I had my use case in mind. It could be an array (SQLite might not support it directly), similar to the existing license type in the Package model. @DmitriyLewen let me know if you have any suggestions.

ChristianCiach · 2023-08-07T13:02:43Z

The savings in storage are actually the least of my concerns, but I wanted to mention this because these concerns have been mentioned multiple times in the linked issue.

I have a strong background in database administration, so I always try to normalize the data instead of storing them in a pre-processed way (like the joined strings here). I also think it would be easier to move from a normalized schema to a de-normalized schema later instead of the other way around.

Lastly, storing the licences in an atomic way (meaning un-joined) would also address the concerns about the length of the attribute.

I want to stress again that I know next to nothing about the architecture of Trivy, so all my points may very well be, well, pointless.

Anyway, thank you so much for addressing this issue! I was actually tasked by my company to develop a trivy-sbom-postprocessor to add all the missing licenses, so you might save me a lot of time :D

DmitriyLewen

left small comments

pkg/crawler/crawler.go

DmitriyLewen · 2023-08-08T04:54:49Z

Hello @ChristianCiach
Thanks for your tips!

> I think it would be better to create a dedicated licenses table and to join it to the indices table via an index_license table to resolve the n:m relation. This would also lead to massive string-deduplication that would be currently impossible due to the comma-joined licenses, but will possibly save a lot of space.

This makes sense. I don't have much experience with SQL. Can you point out the timing nuances?
Getting GAV + licenses from the indices table alone (as created by @namandf) takes less time than getting GAV from indices and then matching licenses from the new table against the found GAV, right?
Or am I wrong?

I ask because we have 2 important points:

database size
time to get information from database

DmitriyLewen · 2023-08-08T06:52:55Z

pkg/crawler/crawler.go

+	}
+
+	// TODO: Check if we can limit the length of license string i.e trim and save
+	return strings.Join(licenses, ","), nil


Looks like we need to think about this, because some license names (e.g. The Apache License, Version 2.0) contain commas. Splitting by commas would then be ambiguous (e.g. The Apache License, Version 2.0,The 2-Clause BSD License).

Regarding the , separator, we can probably use | instead.
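
A small Go sketch of why the separator matters. `joinLicenses` and `splitLicenses` are hypothetical helpers written for illustration, not functions from this PR; they only wrap `strings.Join` and `strings.Split`.

```go
package main

import "strings"

// joinLicenses joins license names with sep before they are stored.
func joinLicenses(licenses []string, sep string) string {
	return strings.Join(licenses, sep)
}

// splitLicenses reverses joinLicenses for the same separator.
// An empty input yields no licenses rather than one empty name.
func splitLicenses(joined, sep string) []string {
	if joined == "" {
		return nil
	}
	return strings.Split(joined, sep)
}
```

Round-tripping the two names from the example above through "," gives back three fragments, while "|" gives back the original two names. This only holds as long as no license name contains "|" itself.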

Gave the license classifier a try.

It expects a license file. In most cases, using just the license name from the POM does not produce a match.
We essentially need to extract the license URL from the POM, fetch the contents of the URL, dump them in a file, and then run the license classifier on that file to get the right results.

This adds overhead. We might also have to add an in-memory cache to avoid redundant processing.
In cases where the URL is missing from the POM, we'll have to rely on the license name in the POM.

Effectively a 3-step process:

1. Fetch the license URL contents and classify the license.

2. If the URL is missing or the license couldn't be classified in step 1, classify the license using the license name in the POM.

3. If step 2 also fails, dump the license name as is (with a precautionary check to limit the license string length).

In all cases where the license classifier is used, the confidence threshold could be 80%.
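
The fallback order can be sketched in Go as follows. This is an illustration under assumptions, not the PR's code: `classifyFunc` is a hypothetical stand-in for the license classifier, `resolveLicense` is a hypothetical helper, and the length cap is passed in because no limit is fixed at this point in the discussion.

```go
package main

// classifyFunc is a hypothetical stand-in for the license classifier.
// It returns the matched license name and a confidence in [0, 1].
type classifyFunc func(text string) (name string, confidence float64)

// minConfidence is the 80% threshold suggested in the comment.
const minConfidence = 0.8

// resolveLicense applies the three steps in order.
// urlContent is the text fetched from the POM's license URL and is
// empty when the URL is missing or could not be fetched.
func resolveLicense(classify classifyFunc, urlContent, pomName string, maxLen int) string {
	// Step 1: classify the content behind the license URL.
	if urlContent != "" {
		if name, conf := classify(urlContent); conf >= minConfidence {
			return name
		}
	}
	// Step 2: classify the license name taken from the POM.
	if name, conf := classify(pomName); conf >= minConfidence {
		return name
	}
	// Step 3: keep the POM name as is, trimmed to maxLen characters.
	runes := []rune(pomName)
	if len(runes) > maxLen {
		return string(runes[:maxLen])
	}
	return pomName
}
```

Trimming by runes rather than bytes is a choice made here so that a multi-byte character is never cut in half.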

Integrating this into the current design of java db might be a problem, since concurrent goroutines for license classification might not work. The classifier instantiates a backend instance, expects a list of license files, spawns goroutines to process them, and accumulates results in the backend.
I still need to figure out the specifics.

=========

The other perspective could be to not overcomplicate things and claim support only for what's declared in pom.xml. Since the user controls the pom.xml, it's their responsibility to specify the license information accurately. We will show whatever resides in pom.xml (license.Name) with some precautionary checks.

> Integrating this in the current design of java db might be a problem, since concurrent goroutines for license classification might not work. It instantiates a backend instance , expects a list of license files, spawns go routines to process them and accumulates results in the backend.

This really is a problem. We can add a new step before the crawler: parse all pom.xml files (as with sha1 files) as you said (Effectively a 3 step process:) and save the result to a map/file (url -> license name), then use this map/file in the crawler.
But there may be problems with CI/CD (run time, memory, etc.).

> The other perspective could be not over complicate things and claim support for whats highlighted in pom.xml. Since the user controls the pom.xml , it's their responsibility to specify the license information accurately. We will show whatever resides in pom.xml (license.Name) with some precautionary checks.

I think it is a good solution. At least we can start with this.
The Maven docs say (https://maven.apache.org/pom.html#Licenses): Using an SPDX identifier as the license name is recommended.
It is not required, but we can rely on it.

> This is really problem. We can add new step before crawler: parse all pom.xml files (as with sha1 files) as you said (Effectively a 3 step process:) and save to map/file (url -> license name). And use this map/file in crawler.
> But there may be problem with CI/CD (work time, memory, etc...).

Let me give this a try.

Updated the PR.

1. Crawl to fetch GAV information and parse the POM to extract license keys (hash of the URL if available, else of the name).

2. Save license keys along with GAV information in the indexes cache directory, and simultaneously build a map of unique license keys with meta information about the license name and URL.

3. Use the license classifier to find license info for the unique license keys found in step 2, and save it, referenced by license key, in a licenses cache directory.

4. As part of the build, parse the cached data in the indexes directory, extract the license keys from it, and use them to look up license metadata in the licenses cache directory.

5. Write the aggregated data to the DB.
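
The build-stage lookup in the last two steps could look roughly like this in Go. It is a sketch under assumptions: `resolveKeys` is a hypothetical helper, the `classified` map stands in for the licenses cache directory, and the deduplication and sorting of names are choices made here for a stable result, not behaviour confirmed by the PR.

```go
package main

import (
	"sort"
	"strings"
)

// resolveKeys maps the license keys stored with one artifact to their
// classified names and joins the result with "|".
// classified stands in for the licenses cache (license key -> name).
func resolveKeys(keys []string, classified map[string]string) string {
	seen := make(map[string]struct{})
	var names []string
	for _, key := range keys {
		name, ok := classified[key]
		if !ok || name == "" {
			continue // key was never classified; skip it
		}
		if _, dup := seen[name]; dup {
			continue // two keys resolved to the same license
		}
		seen[name] = struct{}{}
		names = append(names, name)
	}
	sort.Strings(names) // assumption: sorted output keeps rows reproducible
	return strings.Join(names, "|")
}
```

Because classification happens once per unique key, the per-artifact work at build time is reduced to map lookups.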

The DB size is now just 50 MB larger: 800 MB with license information vs. 750 MB without.

|-separated list of unique licenses:
licenses.csv

The only problem seems to be time taken for building java db. https://github.com/aquasecurity/trivy-java-db/pull/21/files#diff-526a7142f7b15a72eab215dcfa540f5b6d502951a2f9f3b34dd900013e82c675R426 seems to be a major contributor other than additional POM parsing.

Don't think it should be a problem though since we refresh the database once a week.

I think this is good way for us. And changes of place look good. 50mb is not much.

pkg/crawler/crawler.go

ChristianCiach · 2023-08-09T11:41:31Z

> This make sense. I don't have much experience with sql can you point out the timing nuances: Getting GAV + licenses from only indices table(as created @namandf ) takes less time than getting GAV from indices and also matching licenses from new table with found GAV ? Or i am wrong?

Sorry for the late reply. I am on vacation and can only skim this PR using my smartphone.

To answer your question: I expect the difference in performance to be barely measurable. Every database management system worth its salt will join the tables using highly specialised indexes. I wouldn't even be too surprised to see an improvement in performance, because most database management systems handle large TEXT columns badly. I don't know which database management system is used here. I assume something like SQLite? It shouldn't make a difference anyway.

Edit: I will be mostly offline for the next two weeks or so. I am also not a Go developer, but I am highly skilled in Java and Python. Even if this PR gets merged as-is, I am curious enough to compare it to the model I would've preferred, so maybe you will see a follow-up PR by me later.

DmitriyLewen

Now it looks better!
I left 1 more idea.

pkg/builder/builder.go

pkg/crawler/crawler.go

DmitriyLewen

I left some thoughts about optimization.
Can you take a look?

DmitriyLewen · 2023-08-11T03:27:08Z

pkg/crawler/crawler.go

+	return nil
+}
+
+func (c *Crawler) prepareClassifierData() cmap.ConcurrentMap[string, License] {


I looked at the build action. This function ran for 1 hour 25 minutes.
Can we update this logic to speed it up?

What if we use logic similar to the Crawl function?
We can use a semaphore, take the goroutine limit from options, etc.
We can put the URLs in a channel and consume them in a loop.

wdyt?

I'm suggesting this because I'm worried about maintaining this code. It will be hard to find a bug if you have to wait 1.5 hours for a single Crawler run.
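
A minimal sketch of the bounded-concurrency idea in Go, using only the standard library. The Crawl function itself is not shown in this thread, so this does not reproduce it: a buffered channel is used here as the semaphore, and `processAll`, `limit` and `work` are hypothetical names.

```go
package main

import "sync"

// processAll calls work once per URL, with at most limit calls
// running at the same time. It returns after all calls finish.
func processAll(urls []string, limit int, work func(url string)) {
	if limit < 1 {
		limit = 1 // guard: a zero-capacity semaphore would block forever
	}
	sem := make(chan struct{}, limit) // buffered channel as a semaphore
	var wg sync.WaitGroup
	for _, u := range urls {
		sem <- struct{}{} // acquire a slot; blocks while limit calls are active
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			work(u)
		}(u)
	}
	wg.Wait()
}
```

In such a setup the limit would come from the crawler options, and `work` would download one license URL and write its contents to a file.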

Yup. I have been trying to figure out why this function takes so much time. License file text shouldn't be huge, so downloading the files and writing their contents concurrently shouldn't take this long. Wondering if I am missing some nuance.
My guess is that it's because of the concurrent maps (uniqueLicenseKeys and filesLicenseMap) used within the function. They are thread safe, so they block other goroutines.

> What if we will use similar logic as in Crawl function?
> We can use semafore, use limit of gorutines from options, etc...
> We can save urls in channel and take them in loop.

Are you suggesting we use a similar framework? Not sure if this is going to make a difference.

I can avoid filesLicenseMap. It need not be a concurrent map. There are other ways to achieve it.

Regarding uniqueLicenseKeys, we can extract the contents into a normal map, since we just need a get.

Let me give that a try

> Not sure if this is going to make a difference.

If I understand correctly, you use 10 parallel goroutines to write license files. Can we increase this value to 1000?

Yup. The HTTP client timeout error was not being handled. Updated it.

Seems to be fine now. The crawl takes 31m.

Yep, that is really good.
Let's wait for the resulting DB size.

New sizes:
454.36 MB is the size before unpacking (Trivy will download an image of this size).
792 MB is the DB size in the cache dir.

Current sizes:
452.25 MB is the size before unpacking.
743 MB is the DB size.

CI/CD time changes:
Crawl before: 8-10 min
Crawl after: 30 min
Build action: no change.

LGTM.
Good job, @namandf !

pkg/crawler/crawler.go

DmitriyLewen · 2023-08-11T08:04:01Z

@knqyf263 hello!
@namandf created PR. We updated it.
Test run - https://github.com/namandf/trivy-java-db/actions/runs/5829778250/job/15809887486
size changes - #21 (comment)

Can you take a look?
What do you think of this PR?
Can we start working on changes for Trivy, go-dep-parser, etc...

pkg/builder/builder.go

namandf · 2023-08-16T07:43:21Z

> @knqyf263 hello! @namandf created PR. We updated it. Test run - https://github.com/namandf/trivy-java-db/actions/runs/5829778250/job/15809887486 size changes - #21 (comment)
>
> Can you take a look? What do you think of this PR? Can we start working on changes for Trivy, go-dep-parser, etc...

Hi @DmitriyLewen @knqyf263

Hope you guys are doing great. Wondering if there is any update. Looking forward to getting this change in as soon as possible.

Thank you
Naman

namandf · 2023-08-22T05:54:03Z

Hi @DmitriyLewen @knqyf263

Hope you guys are doing great. Wondering if there is any update. Looking forward to getting this change in as soon as possible.

Thank you
Naman

knqyf263 · 2023-08-22T07:51:03Z

@namandf We have tons of things to do. Please wait a little longer.

namandf · 2023-08-23T10:59:14Z

Thank you for the update @knqyf263 .

FYI noticed a related bug in the latest trivy version aquasecurity/trivy#5027

knqyf263 · 2023-09-18T14:11:39Z

pkg/db/db.go

@@ -59,7 +59,7 @@ func (db *DB) Init() error {
 	if _, err := db.client.Exec("CREATE TABLE artifacts(id INTEGER PRIMARY KEY, group_id TEXT, artifact_id TEXT)"); err != nil {
 		return xerrors.Errorf("unable to create 'artifacts' table: %w", err)
 	}
-	if _, err := db.client.Exec("CREATE TABLE indices(artifact_id INTEGER, version TEXT, sha1 BLOB, archive_type TEXT, foreign key (artifact_id) references artifacts(id))"); err != nil {
+	if _, err := db.client.Exec("CREATE TABLE indices(artifact_id INTEGER, version TEXT, sha1 BLOB, archive_type TEXT, license TEXT, foreign key (artifact_id) references artifacts(id))"); err != nil {


Looks like we should create a license table and store foreign keys here.

We could have done that, but it would require 2 tables: one to track licenses and the other to map the logical id of an indices row to the logical id of a license, since it isn't a 1-1 mapping.

Querying the data would involve JOINs, which would mean a change to all queries. Moreover, the effort didn't seem to yield proportionate benefits, since the DB size with the current change is just 50 MB more than before and the queries stay simpler.

The intent was to keep the changes to existing code as minimal as possible.

> We could have done that but it would require 2 tables. One to track licenses and the other to maintain a mapping of indices logical id to license logical id since it isn't a 1-1 mapping.

It is RDBMS. I don't think this change would be complex.

I agree. Since there is no use case for filtering by license strings, I thought normalization could be deferred and queries kept simpler, with a direct access to the row in the indices table returning all required info.

It does affect the database size since information is replicated, but it didn't seem to have a major impact, hence I decided to stick with it.

But I agree that in the long run it might blow up and cause issues w.r.t. database size. I don't foresee query performance issues, since indexes will help access the artifact metadata efficiently. Will let you guys take a call.

knqyf263 · 2023-09-18T14:12:28Z

pkg/crawler/crawler.go

+	// license classifier
+	classifier *backend.ClassifierBackend
+
+	// uniqueLicenseKeys : key is hash of license url or name in POM, whichever available


Why do we need hash here?

We saw scenarios where individuals have dumped full license files and invalid characters in the license name/url field within pom.xml so a hash ensures the key is consistent and short.

We dump the same hash in the indexes cache file and use it during the build stage to map/update the license information from the license cache files which track the license key hash to classified license string mapping.
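
For illustration, a key of this kind could be derived as in the Go sketch below. The hash algorithm is not named in this thread, so SHA-256 is an assumption, and `licenseKey` is a hypothetical helper rather than the function used in the PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strings"
)

// licenseKey derives a short, fixed-length key from the license URL
// if one is present in the POM, and from the license name otherwise.
// SHA-256 is an assumption; any stable hash would serve the same purpose.
func licenseKey(url, name string) string {
	src := strings.TrimSpace(url)
	if src == "" {
		src = strings.TrimSpace(name)
	}
	if src == "" {
		return "" // nothing to key on
	}
	sum := sha256.Sum256([]byte(src))
	return hex.EncodeToString(sum[:])
}
```

Whatever ends up in the name or URL field, including a whole license text or unusual characters, the key stays 64 hex characters long and is safe to use as a file name.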

I believe this approach allows us to avoid using digests.
#21 (comment)

knqyf263 · 2023-09-18T14:13:42Z

pkg/crawler/crawler.go

+	defer close(prepStatus)
+
+	// process license keys channel
+	go func() {


I don't think we need concurrency for classifying licenses. We can fetch licenses when crawling URLs in parallel similar to sha1.

This is a side effect of using the Google license classifier. If we were to process only the content of the licenses element in pom.xml directly, this wouldn't be needed, as you mentioned.

We are deferring license classification so that we don't repeat it while crawling URLs and sha1 files. There are around 2k unique URLs/names across Maven. We'd be unnecessarily repeating the classification process unless we introduce some thread-safe cache, which might slow down the crawl due to locking.

We are trying to aggregate unique license URLs from pom.xml, dump the license URL contents into files concurrently, and later process them as a batch using the Google license classifier to get standardised license strings.

> unless we introduce some thread safe cache which might slow down crawl due to locking.

Using cache looks much simpler.

Yup, simpler but slower. Crawling that concurrently processes Maven indexes will be slow due to blocking lock/unlock operations while reading/updating the cache.

Did you measure it? I don't think it is so slow.
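
For reference, a thread-safe cache of the kind being discussed could be sketched in Go as below. It says nothing about actual crawl timings, which would still need to be measured. `licenseCache`, `cacheEntry` and `get` are hypothetical names, and the `classify` callback stands in for the real classification step.

```go
package main

import "sync"

// cacheEntry holds the result for one license key.
// once ensures the classification runs a single time per key.
type cacheEntry struct {
	once  sync.Once
	value string
}

// licenseCache is a hypothetical thread-safe cache keyed by license key.
type licenseCache struct {
	mu      sync.Mutex
	entries map[string]*cacheEntry
}

func newLicenseCache() *licenseCache {
	return &licenseCache{entries: make(map[string]*cacheEntry)}
}

// get returns the cached value for key, calling classify only on the
// first request. The mutex is held just for the map lookup, so slow
// classifications of different keys do not block each other.
func (c *licenseCache) get(key string, classify func(key string) string) string {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &cacheEntry{}
		c.entries[key] = e
	}
	c.mu.Unlock()

	e.once.Do(func() { e.value = classify(key) })
	return e.value
}
```

Goroutines asking for a key that is still being classified wait inside `once.Do` until the first call finishes, so each unique key is classified once even under concurrent crawling.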

knqyf263 · 2023-09-18T14:17:22Z

pkg/crawler/crawler.go

+
+				licenseFileName := getLicenseFileName(c.licensedir, licenseKey)
+				licenseMeta := uniqLicenseKeyMap[licenseKey]
+				ok, err := c.generateLicenseFile(client, licenseFileName, licenseMeta)


Why do we need to generate license files? It accepts []byte or io.Reader.
https://github.com/google/licenseclassifier/blob/c1ed8fcf4babdf7c37d872cf6da5b1c32907d34c/v2/classifier.go#L321

The v2 license classifier tool loads assets, i.e. it uses embedded license data https://github.com/google/licenseclassifier/blob/main/v2/tools/identify_license/backend/backend.go#L50 to find appropriate matches for the supplied input.

The code you pointed out would require us to provide a local directory with all assets/training data, if I am not wrong. It is doable, but I believe we'd have to maintain a copy locally.

> The code you pointed out would require us providing a local directory with all assets/training data if I am not wrong.

No, we don't have to do that.

Didn't get you.

https://github.com/google/licenseclassifier/blob/main/v2/classifier.go#L278
https://github.com/google/licenseclassifier/blob/main/v2/classifier_test.go#L44

Am I missing something?

MPV · 2023-11-10T08:29:34Z

Just wanted to chime in and say thanks for all your efforts on this. 🙏

otbe · 2024-01-20T15:18:18Z

Scared to ask, but any progress on this? :)
Would love to have better license support.

improve: [aquasecurity/trivy#4236] Parse POM from maven repo to extra…

c3c9e2c

…ct and save license information in Java DB

namandf changed the title ~~improve: [https://github.com/aquasecurity/trivy/discussions/4236] Par…~~ Improve: Parse POM from maven repo to extract and save license information in Java DB Aug 7, 2023

knqyf263 requested a review from DmitriyLewen August 7, 2023 10:36

Add notes to trim license string beyond a certain limit

e5e60fa

DmitriyLewen reviewed Aug 8, 2023

pkg/crawler/crawler.go (outdated, resolved)

pkg/crawler/crawler.go (resolved)

DmitriyLewen reviewed Aug 8, 2023

namandf added 3 commits August 8, 2023 14:03

Remove dependency on an external module for POM parsing.

eb56cc3

Update separator for license strings to | instead of ,

ecab1d4

go mod tidy

8ff0829

DmitriyLewen reviewed Aug 9, 2023

pkg/crawler/crawler.go (outdated, resolved)

namandf added 2 commits August 9, 2023 10:59

Deduplicate licenses fetched from POM

6ea37a5

Use license classifier for information obtained from POM

239ddc8

namandf marked this pull request as draft August 9, 2023 14:13

Cache license information and use it on the fly while writing data to db

4c23e78

namandf marked this pull request as ready for review August 9, 2023 14:28

namandf added 9 commits August 9, 2023 20:07

Cache directory creation

302e619

Cache directory creation

153a133

Cache directory creation

4dbd769

Add test cases for license classifier

9bc1dbb

sort license keys before they are written to indexes cache directory

28ef2b3

update comment

ff74564

update comment

9a557a4

precautionary check to limit licnense strings to 30 characters

e214d84

Update timeout for license classification

21885b1

namandf added 3 commits August 10, 2023 13:07

Get rid of processFilesMap

d483d27

Add constant for normalized license json file name

7be6cf4

Update test cases

2f4c07c

DmitriyLewen reviewed Aug 10, 2023

pkg/builder/builder.go (outdated, resolved)

pkg/crawler/crawler.go (outdated, resolved)

pkg/crawler/crawler.go (outdated, resolved)

namandf added 8 commits August 10, 2023 13:49

clean up

1640f0a

clean up

512dba5

clean up

c4108cd

Add a utils file in crawler package

7cc48c1

Update comment

33f2473

Set classification confidence

b001b9f

Set classification confidence

029dcfe

Add check for license string length

176caa8

DmitriyLewen reviewed Aug 11, 2023

namandf added 4 commits August 11, 2023 11:47

Speed up license file generation

87c7b41

Add comment

a0769b5

reuse http client

d7a9bdf

skip file copy error

3f3344b

DmitriyLewen reviewed Aug 11, 2023

pkg/builder/builder.go (resolved)

Handle empty normalized license strings

e991f76

DmitriyLewen mentioned this pull request Aug 15, 2023

feat(java): add license support for jar files aquasecurity/trivy#4734

Open

namandf added 2 commits August 16, 2023 13:23

Add temporary log to check license strings being trimmed

674534e

Increase license string length threshold to 150 characters

8a924f3

knqyf263 reviewed Sep 18, 2023
