Improve: Parse POM from maven repo to extract and save license information in Java DB #21

Open
wants to merge 45 commits into
base: main

@namandf namandf commented Aug 7, 2023

Discussion: aquasecurity/trivy#4236 (reply in thread)
Unit Tests: Working as expected

Screenshot 2023-08-07 at 4 01 42 PM

CLAassistant commented Aug 7, 2023

CLA assistant check
All committers have signed the CLA.

@namandf namandf changed the title improve: [https://github.com/aquasecurity/trivy/discussions/4236] Par… Improve: Parse POM from maven repo to extract and save license information in Java DB Aug 7, 2023
ChristianCiach commented Aug 7, 2023

Sorry for chiming in. I am just an interested outsider who knows next to nothing about the architecture of this project.

I see that you save the license as a TEXT column of the indices table. I just want to add my random thoughts just in case, even though these may have already been addressed:

  • Maven artifacts can have multiple licenses, and you are joining them into a single string with a comma as the separator. I don't think this is the best way to model this from a database perspective.
  • I think it would be better to create a dedicated licenses table and to join it to the indices table via a index_license table to resolve the n:m relation. This would also lead to massive string-deduplication that would be currently impossible due to the comma-joined licenses, but will possibly save a lot of space.
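The normalized layout suggested above can be sketched in Go as follows. This is only an illustration under stated assumptions: the DDL string, the column names, and an `id` column on `indices` are hypothetical (the current `indices` table has no such column), and the in-memory `LicenseTable` merely mimics what a `licenses` table plus an `index_license` junction table would store.

```go
package main

import "fmt"

// Hypothetical DDL for the normalized layout. Table names follow the comment
// above; column names and indices(id) are assumptions for illustration.
const schema = `
CREATE TABLE licenses(id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE index_license(index_id INTEGER, license_id INTEGER,
    FOREIGN KEY (index_id) REFERENCES indices(id),
    FOREIGN KEY (license_id) REFERENCES licenses(id));`

// LicenseTable mimics the licenses table: each distinct name is stored once.
type LicenseTable struct {
	ids   map[string]int
	names []string
}

func NewLicenseTable() *LicenseTable {
	return &LicenseTable{ids: make(map[string]int)}
}

// ID returns the existing id for name, or assigns the next free one.
func (lt *LicenseTable) ID(name string) int {
	if id, ok := lt.ids[name]; ok {
		return id
	}
	id := len(lt.names) + 1
	lt.ids[name] = id
	lt.names = append(lt.names, name)
	return id
}

// Link is one row of the index_license junction table.
type Link struct {
	IndexID   int
	LicenseID int
}

// Normalize turns per-index license lists into junction rows.
// The position in perIndex (plus one) stands in for the index id.
func Normalize(lt *LicenseTable, perIndex [][]string) []Link {
	var links []Link
	for i, names := range perIndex {
		for _, n := range names {
			links = append(links, Link{IndexID: i + 1, LicenseID: lt.ID(n)})
		}
	}
	return links
}

func main() {
	table := NewLicenseTable()
	links := Normalize(table, [][]string{
		{"Apache-2.0"},
		{"Apache-2.0", "MIT"},
		{"MIT"},
	})
	// Each license string is stored once, however many indices reference it.
	fmt.Println(len(table.names), "licenses,", len(links), "links")
}
```

The deduplication comes from `ID`: a license name that appears on many artifacts costs one row in `licenses` and one small integer pair per reference.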

Author

namandf commented Aug 7, 2023

Hi @ChristianCiach ,
That's a valid point.

I didn't want to overcomplicate the change (additional tables to track licenses, mapping GAV, i.e. creating a unique ID in the indices table, and additional joins in read queries). It would definitely be an optimization, but I'm not sure the benefits are worth the effort, especially because the difference in size isn't large and the DB is cached and refreshed every 3 days, if I am not wrong.

Even in the case of CI/CD, there might be alternatives, such as mounting the cache directory, to avoid downloading the DB multiple times.

But if everybody feels it's a mandatory change, we can work on it.

Regarding the license being a TEXT column: my bad, I had my use case in mind. It could be an array (SQLite might not support it directly), similar to the existing license type in the Package model. @DmitriyLewen let me know if you have any suggestions.

ChristianCiach commented Aug 7, 2023

The savings in storage are actually the least of my concerns, but I wanted to mention this because these concerns have been mentioned multiple times in the linked issue.

I have a strong background in database administration, so I always try to normalize the data instead of storing them in a pre-processed way (like the joined strings here). I also think it would be easier to move from a normalized schema to a de-normalized schema later instead of the other way around.

Lastly, storing the licences in an atomic way (meaning un-joined) would also address the concerns about the length of the attribute.

I want to stress again that I know next to nothing about the architecture of Trivy, so all my points may very well be, well, pointless.

Anyway, thank you so much for addressing this issue! I was actually tasked by my company to develop a trivy-sbom-postprocessor to add all the missing licenses, so you might save me a lot of time :D

Collaborator

@DmitriyLewen DmitriyLewen left a comment

left small comments

pkg/crawler/crawler.go (outdated, resolved)
pkg/crawler/crawler.go (resolved)
Collaborator

DmitriyLewen commented Aug 8, 2023

Hello @ChristianCiach
Thanks for your tips!

I think it would be better to create a dedicated licenses table and to join it to the indices table via a index_license table to resolve the n:m relation. This would also lead to massive string-deduplication that would be currently impossible due to the comma-joined licenses, but will possibly save a lot of space.

This makes sense. I don't have much experience with SQL; can you point out the timing nuances?
Does getting GAV + licenses from the indices table alone (as @namandf implemented it) take less time than getting the GAV from indices and then matching licenses from a new table against the found GAV?
Or am I wrong?

I ask because we have 2 important points:

  • database size
  • time to get information from database

}

// TODO: Check if we can limit the length of license string i.e trim and save
return strings.Join(licenses, ","), nil
Collaborator

Looks like we need to think about this, because some license names (e.g. The Apache License, Version 2.0) contain commas. Splitting the stored string by commas would then be ambiguous (e.g. The Apache License, Version 2.0,The 2-Clause BSD License).
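The ambiguity can be reproduced with a few lines of Go. This is a minimal sketch: `joinLicenses` and `splitLicenses` are hypothetical helpers that wrap `strings.Join` and `strings.Split`, and `|` is used only as an example of a separator that does not appear in these two names.

```go
package main

import (
	"fmt"
	"strings"
)

// joinLicenses mirrors the join in the reviewed snippet, with a configurable separator.
func joinLicenses(licenses []string, sep string) string {
	return strings.Join(licenses, sep)
}

// splitLicenses is what a reader of the stored column would have to do.
func splitLicenses(joined, sep string) []string {
	return strings.Split(joined, sep)
}

func main() {
	licenses := []string{"The Apache License, Version 2.0", "The 2-Clause BSD License"}

	// Comma separator: the comma inside the Apache name is split as well.
	byComma := splitLicenses(joinLicenses(licenses, ","), ",")
	fmt.Println(len(byComma), byComma)

	// A separator that does not occur in the names round-trips cleanly.
	byPipe := splitLicenses(joinLicenses(licenses, "|"), "|")
	fmt.Println(len(byPipe), byPipe)
}
```

Any single-character separator has the same weakness if that character can occur in a license name, so the choice depends on what actually appears in POM files.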

Author

@namandf namandf Aug 8, 2023

Regarding the , separator, we can probably use | instead.

I gave the license classifier a try.

It expects a license file. In most cases it does not produce matches from just the license name in the POM.
We essentially need to extract the license URL from the POM, fetch the contents of the URL, dump them in a file, and then run the license classifier on that file to get the right results.

This adds overhead. We might also have to add an in-memory cache to avoid redundant processing.
In cases where the URL is missing from the POM, we'll have to rely on the license name in the POM.

Effectively a 3 step process:

  1. Fetch the license URL contents and classify the license
  2. If the URL is missing or the license couldn't be classified in step 1, classify the license using the license name in the POM
  3. If step 2 also fails, dump the license name as is (with a precautionary check to limit the license string length)
    In all cases where the license classifier is used, the confidence threshold could be 80%
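The fallback order of these three steps can be sketched as below. Everything here is an assumption for illustration: `Classifier` is a hypothetical stand-in for whatever call wraps the license classifier (it is not the classifier's real API), and `maxLicenseLen` is an arbitrary cap chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Classifier is a hypothetical stand-in for a license classifier call.
// It returns the matched license name, a confidence in [0, 1], and whether
// anything matched at all.
type Classifier func(input string) (name string, confidence float64, ok bool)

const (
	confidenceThreshold = 0.8 // the 80% threshold mentioned above
	maxLicenseLen       = 100 // assumed cap for the raw fallback string
)

// ResolveLicense applies the three steps in order and returns the first usable result.
func ResolveLicense(pomName, pomURL string, byURL, byName Classifier) string {
	// Step 1: classify the content behind the license URL, if there is one.
	if pomURL != "" {
		if name, conf, ok := byURL(pomURL); ok && conf >= confidenceThreshold {
			return name
		}
	}
	// Step 2: classify using the license name from the POM.
	if name, conf, ok := byName(pomName); ok && conf >= confidenceThreshold {
		return name
	}
	// Step 3: keep the raw name, trimmed and length-limited (rune-safe).
	raw := []rune(strings.TrimSpace(pomName))
	if len(raw) > maxLicenseLen {
		raw = raw[:maxLicenseLen]
	}
	return string(raw)
}

func main() {
	byURL := func(string) (string, float64, bool) { return "Apache-2.0", 0.95, true }
	byName := func(string) (string, float64, bool) { return "", 0, false }
	fmt.Println(ResolveLicense(
		"The Apache License, Version 2.0",
		"https://www.apache.org/licenses/LICENSE-2.0.txt",
		byURL, byName,
	))
}
```

Passing the classifiers in as functions keeps the fallback logic testable without network access or classifier assets.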

Integrating this into the current design of the Java DB might be a problem, since concurrent goroutines for license classification might not work. The classifier instantiates a backend instance, expects a list of license files, spawns goroutines to process them, and accumulates the results in the backend.
I still need to figure out the specifics.

=========

The other perspective could be to not overcomplicate things and claim support only for what is declared in pom.xml. Since the user controls the pom.xml, it's their responsibility to specify the license information accurately. We would show whatever resides in pom.xml (license.Name) with some precautionary checks.

Collaborator

Integrating this in the current design of java db might be a problem, since concurrent goroutines for license classification might not work. It instantiates a backend instance , expects a list of license files, spawns go routines to process them and accumulates results in the backend.

This is really a problem. We can add a new step before the crawler: parse all pom.xml files (as with sha1 files) using the 3-step process you described and save the result to a map/file (url -> license name), then use this map/file in the crawler.
But there may be problems with CI/CD (run time, memory, etc.).

The other perspective could be not over complicate things and claim support for whats highlighted in pom.xml. Since the user controls the pom.xml , it's their responsibility to specify the license information accurately. We will show whatever resides in pom.xml (license.Name) with some precautionary checks.

I think this is a good solution. At least we can start with it.
The Maven docs say (https://maven.apache.org/pom.html#Licenses): Using an SPDX identifier as the license name is recommended.
It is not required, but we can rely on it.

Author

@namandf namandf Aug 9, 2023

This is really problem. We can add new step before crawler: parse all pom.xml files (as with sha1 files) as you said (Effectively a 3 step process:) and save to map/file (url -> license name). And use this map/file in crawler.
But there may be problem with CI/CD (work time, memory, etc...).

Let me give this a try.

Author

Updated the PR.

  1. Crawl to fetch GAV information + parse POM to extract license keys (hash of url if available , else name)
  2. Save license keys along with GAV information in the indexes cache directory and simultaneously build a map of unique license keys with meta information about license name and url
  3. Use license classifier to find license info associated with unique license keys found in step 2 and save them referenced by license key in a licenses cache directory
  4. As part of build, parse the cached data in indexes directory. Extract license keys from the same and use it to extract license metadata from the licenses cache directory.
  5. Write aggregated data to DB
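The key handling in steps 1, 2 and 4 can be sketched roughly as follows. This is not the PR's code: `LicenseMeta`, `LicenseKey`, `CollectUnique` and `Resolve` are hypothetical names, SHA-256 is an assumed hash function (the thread does not say which one is used), and plain in-memory maps stand in for the cache directories.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// LicenseMeta holds what a POM <license> element provides.
type LicenseMeta struct {
	Name string
	URL  string
}

// LicenseKey hashes the URL when present, otherwise the name (step 1).
func LicenseKey(m LicenseMeta) string {
	src := m.URL
	if src == "" {
		src = m.Name
	}
	sum := sha256.Sum256([]byte(src))
	return hex.EncodeToString(sum[:])
}

// CollectUnique returns the keys per artifact and the map of unique keys (step 2).
func CollectUnique(perArtifact [][]LicenseMeta) ([][]string, map[string]LicenseMeta) {
	keys := make([][]string, len(perArtifact))
	unique := make(map[string]LicenseMeta)
	for i, metas := range perArtifact {
		for _, m := range metas {
			k := LicenseKey(m)
			keys[i] = append(keys[i], k)
			if _, seen := unique[k]; !seen {
				unique[k] = m
			}
		}
	}
	return keys, unique
}

// Resolve maps an artifact's keys to classified license names (step 4).
// Keys without a classification result are skipped.
func Resolve(keys []string, classified map[string]string) []string {
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		if name, ok := classified[k]; ok {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	apache := LicenseMeta{Name: "Apache 2", URL: "https://www.apache.org/licenses/LICENSE-2.0.txt"}
	keys, unique := CollectUnique([][]LicenseMeta{{apache}, {apache, {Name: "MIT"}}})

	// Step 3 would classify each unique key once; the result is faked here.
	classified := map[string]string{LicenseKey(apache): "Apache-2.0"}

	fmt.Println(len(unique), "unique keys")
	fmt.Println(Resolve(keys[1], classified))
}
```

Because classification runs once per unique key rather than once per artifact, its cost depends on the number of distinct license URLs and names, not on the number of indexed artifacts.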

Author

The DB size now is just 50MB more with license information(800MB) vs 750MB without license information.

| separated list of unique licenses :
licenses.csv

The only problem seems to be time taken for building java db. https://github.com/aquasecurity/trivy-java-db/pull/21/files#diff-526a7142f7b15a72eab215dcfa540f5b6d502951a2f9f3b34dd900013e82c675R426 seems to be a major contributor other than additional POM parsing.

Don't think it should be a problem though since we refresh the database once a week.

Collaborator

@DmitriyLewen DmitriyLewen Aug 10, 2023

I think this is good way for us. And changes of place look good. 50mb is not much.

pkg/crawler/crawler.go (outdated, resolved)
ChristianCiach commented Aug 9, 2023

This make sense. I don't have much experience with sql can you point out the timing nuances: Getting GAV + licenses from only indices table(as created @namandf ) takes less time than getting GAV from indices and also matching licenses from new table with found GAV ? Or i am wrong?

Sorry for the late reply. I am on vacation and can only skim this PR using my smartphone.

To answer your question: I expect the difference in performance to be barely measurable. Every database management system worth its salt will join the tables using highly specialised indexes. I wouldn't even be too surprised to see an improvement in performance, because most database management systems handle large TEXT columns badly. I don't know which database management system is used here. I assume something like SQLite? It shouldn't make a difference anyway.

Edit: I will be mostly offline for the next two weeks or so. I am also not a Go developer, but I am highly skilled in Java and Python. Even if this PR gets merged as-is, I am curious enough to compare it to the model I would've preferred, so maybe you will see a follow-up PR by me later.

@namandf namandf marked this pull request as draft August 9, 2023 14:13
@namandf namandf marked this pull request as ready for review August 9, 2023 14:28
Collaborator

@DmitriyLewen DmitriyLewen left a comment

Now it looks better!
I left 1 more idea.

pkg/builder/builder.go (outdated, resolved)
pkg/crawler/crawler.go (outdated, resolved)
pkg/crawler/crawler.go (outdated, resolved)
Collaborator

@DmitriyLewen DmitriyLewen left a comment

I left some thoughts about optimization.
Can you take a look?

return nil
}

func (c *Crawler) prepareClassifierData() cmap.ConcurrentMap[string, License] {
Collaborator

I looked at the build action. This function ran for 1 hour 25 minutes.
Can we update this logic to speed it up?

What if we use logic similar to the Crawl function?
We can use a semaphore, take the goroutine limit from the options, etc.
We can put the URLs in a channel and consume them in a loop.

wdyt?
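A minimal sketch of the pattern suggested above, assuming nothing about the actual Crawl implementation: URLs arrive on a channel, and a buffered channel acts as a counting semaphore so that at most `limit` goroutines run at once. `ProcessAll` and its parameters are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// ProcessAll calls fn for every URL received from urls, running at most
// limit calls concurrently, and returns the errors that occurred.
func ProcessAll(urls <-chan string, limit int, fn func(url string) error) []error {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex // guards errs
		errs []error
	)
	for u := range urls {
		sem <- struct{}{} // blocks while limit workers are already running
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := fn(u); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(u)
	}
	wg.Wait()
	return errs
}

func main() {
	urls := make(chan string, 3)
	for _, u := range []string{"https://example.com/a", "https://example.com/b", "https://example.com/c"} {
		urls <- u
	}
	close(urls)

	errs := ProcessAll(urls, 2, func(u string) error {
		fmt.Println("fetching", u)
		return nil
	})
	fmt.Println("failures:", len(errs))
}
```

The limit would typically come from the same option that bounds the crawler's goroutines, so one setting controls both stages.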

Collaborator

I'm suggesting this because I'm worried about maintaining this code. It will be hard to find a bug if you have to wait 1.5 hours for 1 Crawler launch.

Author

@namandf namandf Aug 11, 2023

Yup. I have been trying to figure out why this function takes so much time. Ideally the license file text shouldn't be huge, and downloading the files and writing their contents concurrently shouldn't take this long. Wondering if I am missing some nuance.
My guess is that it's because of the concurrent maps (uniqueLicenseKeys and filesLicenseMap) used within the function. They are thread-safe, so they block other goroutines.

What if we will use similar logic as in Crawl function?
We can use semafore, use limit of gorutines from options, etc...
We can save urls in channel and take them in loop.

Are you suggesting to use a similar framework? Not sure if this is going to make a difference.

Author

I can avoid filesLicenseMap. It need not be a concurrent map. There are other ways to achieve it.

Regarding uniqueLicenseKeys, we can extract the contents into a normal map, since we just need a get.

Let me give that a try

Collaborator

Not sure if this is going to make a difference.

If i understand correctly - you use 10 parallel goroutines to write license files. Can we increase this value to 1000?

Author

Yup. The HTTP client timeout error was not being handled. Updated it.
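How the PR handles the timeout is not shown in this thread. As a hedged illustration only, one common way to detect a client timeout with the standard library is to check for a `net.Error` whose `Timeout()` is true. `FetchLicense`, `isTimeout` and `slowServerTimesOut` are hypothetical helpers, and the local test server exists only to make the sketch runnable.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// isTimeout reports whether err (or an error it wraps) is a network timeout.
func isTimeout(err error) bool {
	var nerr net.Error
	return errors.As(err, &nerr) && nerr.Timeout()
}

// FetchLicense downloads the license text at url. timedOut is true when the
// request failed because the client timeout elapsed, so the caller can skip
// or retry that URL instead of aborting the whole run.
func FetchLicense(client *http.Client, url string) (body []byte, timedOut bool, err error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, isTimeout(err), err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, false, fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}
	// Client.Timeout also covers reading the body, so check again here.
	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return nil, isTimeout(err), err
	}
	return body, false, nil
}

// slowServerTimesOut is a demo helper: it starts a local test server that
// waits delayMs before answering and reports whether a client with a
// timeoutMs timeout classified the failure as a timeout.
func slowServerTimesOut(delayMs, timeoutMs int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(time.Duration(delayMs) * time.Millisecond):
			fmt.Fprint(w, "license text")
		case <-r.Context().Done(): // the client gave up
		}
	}))
	defer srv.Close()

	client := &http.Client{Timeout: time.Duration(timeoutMs) * time.Millisecond}
	_, timedOut, _ := FetchLicense(client, srv.URL)
	return timedOut
}

func main() {
	fmt.Println("slow server timed out:", slowServerTimesOut(300, 30))
	fmt.Println("fast server timed out:", slowServerTimesOut(0, 2000))
}
```

Separating timeouts from other failures lets a crawler fall back to the license name for unreachable URLs without hiding genuine errors.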

Author

Seems to be fine now. Takes 31m for crawl

Collaborator

Yep, it is really good.
Let's wait for the resulting DB size.

Collaborator

@DmitriyLewen DmitriyLewen Aug 11, 2023

new sizes:
454.36 MB - size before unpacking (Trivy will download an image of this size).
792 MB - DB size in the cache dir.

current sizes:
452.25 MB - size before unpacking.
743 MB - DB size.

CI/CD time changes:
Crawl before - 8-10 min
Crawl after - 30 min
Build action - unchanged.

LGTM.
Good job, @namandf !

Author

Great :)

pkg/crawler/crawler.go (outdated, resolved)
@DmitriyLewen
Collaborator

@knqyf263 hello!
@namandf created PR. We updated it.
Test run - https://github.com/namandf/trivy-java-db/actions/runs/5829778250/job/15809887486
size changes - #21 (comment)

Can you take a look?
What do you think of this PR?
Can we start working on changes for Trivy, go-dep-parser, etc...

Author

namandf commented Aug 16, 2023

@knqyf263 hello! @namandf created PR. We updated it. Test run - https://github.com/namandf/trivy-java-db/actions/runs/5829778250/job/15809887486 size changes - #21 (comment)

Can you take a look? What do you think of this PR? Can we start working on changes for Trivy, go-dep-parser, etc...

Hi @DmitriyLewen @knqyf263

Hope you guys are doing great. Wondering if there is any update. Looking forward to getting this change in as soon as possible.

Thank you
Naman

Author

namandf commented Aug 22, 2023

Hi @DmitriyLewen @knqyf263

Hope you guys are doing great. Wondering if there is any update. Looking forward to getting this change in as soon as possible.

Thank you
Naman

@knqyf263
Collaborator

@namandf We have tons of things to do. Please wait a little longer.

Author

namandf commented Aug 23, 2023

Thank you for the update @knqyf263 .

FYI noticed a related bug in the latest trivy version aquasecurity/trivy#5027

@@ -59,7 +59,7 @@ func (db *DB) Init() error {
 	if _, err := db.client.Exec("CREATE TABLE artifacts(id INTEGER PRIMARY KEY, group_id TEXT, artifact_id TEXT)"); err != nil {
 		return xerrors.Errorf("unable to create 'artifacts' table: %w", err)
 	}
-	if _, err := db.client.Exec("CREATE TABLE indices(artifact_id INTEGER, version TEXT, sha1 BLOB, archive_type TEXT, foreign key (artifact_id) references artifacts(id))"); err != nil {
+	if _, err := db.client.Exec("CREATE TABLE indices(artifact_id INTEGER, version TEXT, sha1 BLOB, archive_type TEXT, license TEXT, foreign key (artifact_id) references artifacts(id))"); err != nil {
Collaborator

Looks like we should create a license table and store foreign keys here.

Author

We could have done that but it would require 2 tables. One to track licenses and the other to maintain a mapping of indices logical id to license logical id since it isn't a 1-1 mapping.

Querying the data would involve JOINs, which would mean changing all queries. Moreover, the effort involved didn't seem to bring proportionate benefits, since the DB size with the current change is just 50MB more than before and the queries are simpler.

Intended to keep the changes to existing code as minimal as possible.

Collaborator

We could have done that but it would require 2 tables. One to track licenses and the other to maintain a mapping of indices logical id to license logical id since it isn't a 1-1 mapping.

It is an RDBMS. I don't think this change would be complex.

Author

@namandf namandf Sep 19, 2023

I agree. Since there is no use case of filtering by license strings, I thought normalization can be deferred and queries can be kept simpler where a direct access to the row in the indices table will have all required info.

It does affect the database size since information is replicated, but it didn't seem to have a major impact hence decided to stick to it.

But I agree, in the long run it might blow up and cause issues w.r.t database size. Don't foresee query performance issues since indexes will help efficiently access the artifact metadata. Will let you guys take a call.

// license classifier
classifier *backend.ClassifierBackend

// uniqueLicenseKeys : key is hash of license url or name in POM, whichever available
Collaborator

Why do we need hash here?

Author

We saw scenarios where individuals have dumped full license files and invalid characters in the license name/url field within pom.xml so a hash ensures the key is consistent and short.

We dump the same hash in the indexes cache file and use it during the build stage to map/update the license information from the license cache files which track the license key hash to classified license string mapping.

Collaborator

I believe this approach allows us to avoid using digests.
#21 (comment)

defer close(prepStatus)

// process license keys channel
go func() {
Collaborator

I don't think we need concurrency for classifying licenses. We can fetch licenses when crawling URLs in parallel similar to sha1.

Author

This is a side effect of google license classifier usage. If we were to only process content in pom.xml licenses header directly, this won't be needed as you mentioned.

We defer license classification so that we don't repeat it while crawling URLs and sha1 files. There are around 2k unique URLs/names across Maven. We would be unnecessarily repeating the classification process unless we introduce some thread-safe cache, which might slow down the crawl due to locking.

We are trying to aggregate unique license urls from pom.xml, dump license url content in files concurrently and later on process them as a batch using google license classifier to get standardised license strings

Collaborator

unless we introduce some thread safe cache which might slow down crawl due to locking.

Using cache looks much simpler.

Author

@namandf namandf Sep 19, 2023

Yup, simpler but slower. A crawl that processes the Maven indexes concurrently will be slowed down by blocking lock/unlock operations while reading and updating the cache.

Collaborator

Did you measure it? I don't think it is so slow.
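For reference, the thread-safe cache debated in this thread could look roughly like the sketch below. It is an assumption-laden illustration, not code from the PR: `LicenseCache` is a hypothetical type, `classify` stands in for the real classification call, and a per-key `sync.Once` is used so the mutex is held only for the map lookup, never during classification.

```go
package main

import (
	"fmt"
	"sync"
)

// entry holds the result for one license key; once ensures it is computed a single time.
type entry struct {
	once   sync.Once
	result string
}

// LicenseCache memoizes classification results per license URL or name.
type LicenseCache struct {
	mu       sync.Mutex
	entries  map[string]*entry
	classify func(key string) string
}

func NewLicenseCache(classify func(key string) string) *LicenseCache {
	return &LicenseCache{entries: make(map[string]*entry), classify: classify}
}

// Get returns the classification for key, running classify at most once per
// key even when many goroutines ask for the same key concurrently.
func (c *LicenseCache) Get(key string) string {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &entry{}
		c.entries[key] = e
	}
	c.mu.Unlock() // the lock covers only the map access

	// Goroutines asking for the same key wait here; other keys are not blocked.
	e.once.Do(func() { e.result = c.classify(key) })
	return e.result
}

func main() {
	cache := NewLicenseCache(func(key string) string {
		fmt.Println("classifying", key) // printed once per distinct key
		return "classified:" + key
	})

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			cache.Get("https://www.apache.org/licenses/LICENSE-2.0.txt")
		}()
	}
	wg.Wait()
	fmt.Println(cache.Get("https://www.apache.org/licenses/LICENSE-2.0.txt"))
}
```

Whether the remaining lock contention matters in practice is the open question in this thread; a benchmark of the crawl with and without such a cache would settle it.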


licenseFileName := getLicenseFileName(c.licensedir, licenseKey)
licenseMeta := uniqLicenseKeyMap[licenseKey]
ok, err := c.generateLicenseFile(client, licenseFileName, licenseMeta)
Collaborator

Author

@namandf namandf Sep 18, 2023

The v2 license classifier tool loads assets i.e uses embedded license data https://github.com/google/licenseclassifier/blob/main/v2/tools/identify_license/backend/backend.go#L50 which is used to find appropriate matches for supplied input.

The code you pointed out would require us providing a local directory with all assets/training data if I am not wrong. It is doable but we'll have to maintain a copy locally I believe.

Collaborator

The code you pointed out would require us providing a local directory with all assets/training data if I am not wrong.

No, we don't have to do that.

Author

MPV commented Nov 10, 2023

Just wanted to chime in and say thanks for all your efforts on this. 🙏

otbe commented Jan 20, 2024

Scared to ask, but any progress on this? :)
Would love to have better license support.

7 participants