-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ingest NPM dependencies using Semgrep API (v0.96.0rc2) #1385
Conversation
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
PTAL @heryxpc @achantavy - before I add tests and docs, does this look like a reasonable approach to making the semgrep ecosystem list configurable? |
cleanup(neo4j_session, common_job_parameters) | ||
for ecosystem in ecosystems: | ||
schema = ECOSYSTEM_TO_SCHEMA[ecosystem] | ||
raw_deps = get_dependencies(semgrep_app_token, deployment_id, ecosystem) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably an early performance improvement, but what do you think about calling get_dependencies
from all the ecosystems in a single call and then splitting the results via transform_dependencies
?
Semgrep API can sometimes be unreliable and reducing the number of open HTTP connections might help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered that, it just seemed like it would make the code more complex, especially if we ever need to do ecosystem-specific handling for dependencies. But I guess we can cross that bridge when we reach it. I'll take a stab at rearranging the logic so that get_dependencies is only called once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Semgrep API can sometimes be unreliable and reducing the number of open HTTP connections might help.
Getting all the dependencies in a single function won't reduce the number of API calls needed though because it'll still do the same pagination across all ecoystems.
especially if we ever need to do ecosystem-specific handling for dependencies
For that reason I'm a big fan of repeating ourselves and avoiding generics early on since we don't know yet if some of the data types will need special treatment.
I wouldn't be against splitting it out to sync_npm_dependencies(), sync_go_dependencies(), .. etc..
The general pattern is to have the sync function call a get+transform+load where the get is a fairly thin wrapper around the upstream API, the transform gets the data to a shape that makes sense for db ingestion, and load does its thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion, organize it in a way that makes sense to you as long as the tests pass and the structure roughly resembles get+transform+load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first approach was to have separate code blocks for "sync go deps", "sync npm deps" etc, but it felt unnecessarily hardcoded at this point in time. The nice thing about the way the code is structured in this PR is that adding a third ecosystem (and more) will be as simple as adding an entry to ECOSYSTEM_TO_SCHEMA
and adding a Semgrep*LibrarySchema
class
That is, assuming that special handling isn't needed for those future ecosystems. I guess I'm inclined to leave the code like I have it here until we do need ecosystem-specific handling, then we can refactor as needed? All of this code is pretty malleable, shouldn't be hard to change it in the future
Signed-off-by: Hans Wernetti <[email protected]>
|
||
@dataclass(frozen=True) | ||
class SemgrepJavascriptLibrarySchema(CartographyNodeSchema): | ||
label: str = 'JavascriptLibrary' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small suggestion: the name JavascriptLibrary
might be a bit too broad, as it could imply any external JavaScript code or library (e.g., from a CDN or other sources). Since this specifically refers to packages managed via npm, renaming it to NpmtLibrary
would make it more precise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll use NpmLibrary
. Thanks for the feedback!
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
…ncf#1385) ### Summary This PR adds support to ingest dependencies from Semgrep for the NPM ecosystem, as well as introducing a CLI flag allowing users to specify which ecosystems to ingest. ### Related issues or links cartography-cncf#1368 added support for ingesting dependencies from Semgrep (only for the `gomod` ecosystem) ### Demo Before these changes, a project with both Go and NPM dependencies will only have GoLibrary nodes in the dependency graph: <img width="1036" alt="image" src="https://github.com/user-attachments/assets/31d97626-be70-4c80-9a5b-71c26056a53b"> After these changes, for the same project the graph contains both GoLibrary and NpmLibrary nodes: <img width="1039" alt="image" src="https://github.com/user-attachments/assets/d09cc265-ccd6-463e-bd01-2b3e7c6d1778"> <details> <summary>Logs from semgrep module before these changes</summary> ``` INFO:cartography.sync:Starting sync stage 'semgrep' INFO:cartography.intel.semgrep.deployment:Loading Semgrep deployment info {'id': ...} into the graph... INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job. INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep dependencies for deployment 'X'. INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep dependencies. ... INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep dependencies. INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep dependencies in X pages. INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph. INFO:cartography.intel.semgrep.dependencies:Running Semgrep Go Library cleanup job. INFO:cartography.graph.statement:Completed GoLibrary statement #1 ... INFO:cartography.graph.statement:Completed GoLibrary statement #X INFO:cartography.graph.job:Finished job GoLibrary INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job. ... INFO:cartography.sync:Finishing sync stage 'semgrep' INFO:cartography.sync:Finishing sync with update tag '1730497895' ``` </details> <details> <summary>Logs from semgrep module after these changes</summary> ``` INFO:cartography.intel.semgrep.deployment:Loading SemgrepDeployment {'id': ...} into the graph. INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job. INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep gomod dependencies for deployment 'X'. INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep gomod dependencies. INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep gomod dependencies. INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep gomod dependencies in X pages. INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph. INFO:cartography.intel.semgrep.dependencies:Running Semgrep Dependencies cleanup job for GoLibrary. INFO:cartography.graph.statement:Completed GoLibrary statement #1 INFO:cartography.graph.statement:Completed GoLibrary statement cartography-cncf#2 INFO:cartography.graph.statement:Completed GoLibrary statement cartography-cncf#3 INFO:cartography.graph.job:Finished job GoLibrary INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep npm dependencies for deployment 'X'. INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep npm dependencies. ... INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep npm dependencies. INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep npm dependencies in X pages. INFO:cartography.intel.semgrep.dependencies:Loading X NpmLibrary objects into the graph. INFO:cartography.intel.semgrep.dependencies:Running Semgrep Dependencies cleanup job for NpmLibrary. INFO:cartography.graph.statement:Completed NpmLibrary statement #1 INFO:cartography.graph.statement:Completed NpmLibrary statement cartography-cncf#2 INFO:cartography.graph.statement:Completed NpmLibrary statement cartography-cncf#3 INFO:cartography.graph.job:Finished job NpmLibrary INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job. ... INFO:cartography.sync:Finishing sync stage 'semgrep' INFO:cartography.sync:Finishing sync with update tag '1731969699' ``` </details> ### Checklist Provide proof that this works (this makes reviews move faster). Please perform one or more of the following: - [x] Update/add unit or integration tests. - [x] Include a screenshot showing what the graph looked like before and after your changes. - [x] Include console log trace showing what happened before and after your changes. If you are changing a node or relationship: - [x] Update the [schema](https://github.com/lyft/cartography/tree/master/docs/root/modules) and [readme](https://github.com/lyft/cartography/blob/master/docs/schema/README.md). If you are implementing a new intel module: - [x] Use the NodeSchema [data model](https://cartography-cncf.github.io/cartography/dev/writing-intel-modules.html#defining-a-node). --------- Signed-off-by: Hans Wernetti <[email protected]> Signed-off-by: chandanchowdhury <[email protected]>
Summary
This PR adds support to ingest dependencies from Semgrep for the NPM ecosystem, as well as introducing a CLI flag allowing users to specify which ecosystems to ingest.
Related issues or links
#1368 added support for ingesting dependencies from Semgrep (only for the
gomod
ecosystem)Demo
Before these changes, a project with both Go and NPM dependencies will only have GoLibrary nodes in the dependency graph:
After these changes, for the same project the graph contains both GoLibrary and NpmLibrary nodes:
Logs from semgrep module before these changes
Logs from semgrep module after these changes
Checklist
Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:
If you are changing a node or relationship:
If you are implementing a new intel module: