Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest NPM dependencies using Semgrep API (v0.96.0rc2) #1385

Merged
merged 16 commits into from
Nov 20, 2024
Merged

Conversation

hanzo
Copy link
Contributor

@hanzo hanzo commented Nov 15, 2024

Summary

This PR adds support to ingest dependencies from Semgrep for the NPM ecosystem, as well as introducing a CLI flag allowing users to specify which ecosystems to ingest.

Related issues or links

#1368 added support for ingesting dependencies from Semgrep (only for the gomod ecosystem)

Demo

Before these changes, a project with both Go and NPM dependencies will only have GoLibrary nodes in the dependency graph:

image

After these changes, for the same project the graph contains both GoLibrary and NpmLibrary nodes:
image

Logs from semgrep module before these changes
INFO:cartography.sync:Starting sync stage 'semgrep'
INFO:cartography.intel.semgrep.deployment:Loading Semgrep deployment info {'id': ...} into the graph...
INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job.
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep dependencies.
...
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Go Library cleanup job.
INFO:cartography.graph.statement:Completed GoLibrary statement #1
...
INFO:cartography.graph.statement:Completed GoLibrary statement #X
INFO:cartography.graph.job:Finished job GoLibrary
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
...
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1730497895'
Logs from semgrep module after these changes
INFO:cartography.intel.semgrep.deployment:Loading SemgrepDeployment {'id': ...} into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job.
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep gomod dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep gomod dependencies.
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep gomod dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep gomod dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Dependencies cleanup job for GoLibrary.
INFO:cartography.graph.statement:Completed GoLibrary statement #1
INFO:cartography.graph.statement:Completed GoLibrary statement #2
INFO:cartography.graph.statement:Completed GoLibrary statement #3
INFO:cartography.graph.job:Finished job GoLibrary
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep npm dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep npm dependencies.
...
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep npm dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep npm dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X NpmLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Dependencies cleanup job for NpmLibrary.
INFO:cartography.graph.statement:Completed NpmLibrary statement #1
INFO:cartography.graph.statement:Completed NpmLibrary statement #2
INFO:cartography.graph.statement:Completed NpmLibrary statement #3
INFO:cartography.graph.job:Finished job NpmLibrary
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
...
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1731969699'

Checklist

Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:

  • Update/add unit or integration tests.
  • Include a screenshot showing what the graph looked like before and after your changes.
  • Include console log trace showing what happened before and after your changes.

If you are changing a node or relationship:

If you are implementing a new intel module:

cartography/cli.py Outdated Show resolved Hide resolved
@hanzo
Copy link
Contributor Author

hanzo commented Nov 15, 2024

PTAL @heryxpc @achantavy - before I add tests and docs, does this look like a reasonable approach to making the semgrep ecosystem list configurable?

@hanzo hanzo requested review from heryxpc and achantavy November 15, 2024 21:52
cleanup(neo4j_session, common_job_parameters)
for ecosystem in ecosystems:
schema = ECOSYSTEM_TO_SCHEMA[ecosystem]
raw_deps = get_dependencies(semgrep_app_token, deployment_id, ecosystem)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably an early performance improvement, but what do you think about calling get_dependencies from all the ecosystems in a single call and then splitting the results via transform_dependencies?

Semgrep API can sometimes be unreliable and reducing the number of open HTTP connections might help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, it just seemed like it would make the code more complex, especially if we ever need to do ecosystem-specific handling for dependencies. But I guess we can cross that bridge when we reach it. I'll take a stab at rearranging the logic so that get_dependencies is only called once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semgrep API can sometimes be unreliable and reducing the number of open HTTP connections might help.

Getting all the dependencies in a single function won't reduce the number of API calls needed though because it'll still do the same pagination across all ecoystems.

especially if we ever need to do ecosystem-specific handling for dependencies

For that reason I'm a big fan of repeating ourselves and avoiding generics early on since we don't know yet if some of the data types will need special treatment.

I wouldn't be against splitting it out to sync_npm_dependencies(), sync_go_dependencies(), .. etc..

The general pattern is to have the sync function call a get+transform+load where the get is a fairly thin wrapper around the upstream API, the transform gets the data to a shape that makes sense for db ingestion, and load does its thing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion, organize it in a way that makes sense to you as long as the tests pass and the structure roughly resembles get+transform+load

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first approach was to have separate code blocks for "sync go deps", "sync npm deps" etc, but it felt unnecessarily hardcoded at this point in time. The nice thing about the way the code is structured in this PR is that adding a third ecosystem (and more) will be as simple as adding an entry to ECOSYSTEM_TO_SCHEMA and adding a Semgrep*LibrarySchema class

That is, assuming that special handling isn't needed for those future ecosystems. I guess I'm inclined to leave the code like I have it here until we do need ecosystem-specific handling, then we can refactor as needed? All of this code is pretty malleable, shouldn't be hard to change it in the future

Signed-off-by: Hans Wernetti <[email protected]>

@dataclass(frozen=True)
class SemgrepJavascriptLibrarySchema(CartographyNodeSchema):
label: str = 'JavascriptLibrary'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small suggestion: the name JavascriptLibrary might be a bit too broad, as it could imply any external JavaScript code or library (e.g., from a CDN or other sources). Since this specifically refers to packages managed via npm, renaming it to NpmtLibrary would make it more precise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll use NpmLibrary. Thanks for the feedback!

cartography/cli.py Outdated Show resolved Hide resolved
cartography/intel/semgrep/dependencies.py Outdated Show resolved Hide resolved
cartography/models/semgrep/dependencies.py Outdated Show resolved Hide resolved
cartography/models/semgrep/dependencies.py Outdated Show resolved Hide resolved
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
@hanzo hanzo changed the title [WIP] Ingest NPM dependencies using Semgrep API Ingest NPM dependencies using Semgrep API Nov 19, 2024
@hanzo hanzo marked this pull request as ready for review November 19, 2024 00:20
@hanzo hanzo changed the title Ingest NPM dependencies using Semgrep API Ingest NPM dependencies using Semgrep API (version bump to 0.96.0rc2) Nov 20, 2024
@hanzo hanzo changed the title Ingest NPM dependencies using Semgrep API (version bump to 0.96.0rc2) Ingest NPM dependencies using Semgrep API (v0.96.0rc2) Nov 20, 2024
@achantavy achantavy enabled auto-merge (squash) November 20, 2024 00:08
@achantavy achantavy merged commit f9d0c58 into master Nov 20, 2024
7 checks passed
@achantavy achantavy deleted the npm-semgrep branch November 20, 2024 00:10
chandanchowdhury pushed a commit to chandanchowdhury/cartography that referenced this pull request Nov 27, 2024
…ncf#1385)

### Summary
This PR adds support to ingest dependencies from Semgrep for the NPM
ecosystem, as well as introducing a CLI flag allowing users to specify
which ecosystems to ingest.

### Related issues or links
cartography-cncf#1368 added support
for ingesting dependencies from Semgrep (only for the `gomod` ecosystem)

### Demo

Before these changes, a project with both Go and NPM dependencies will
only have GoLibrary nodes in the dependency graph:

<img width="1036" alt="image"
src="https://github.com/user-attachments/assets/31d97626-be70-4c80-9a5b-71c26056a53b">

After these changes, for the same project the graph contains both
GoLibrary and NpmLibrary nodes:
<img width="1039" alt="image"
src="https://github.com/user-attachments/assets/d09cc265-ccd6-463e-bd01-2b3e7c6d1778">

<details>
<summary>Logs from semgrep module before these changes</summary>

```
INFO:cartography.sync:Starting sync stage 'semgrep'
INFO:cartography.intel.semgrep.deployment:Loading Semgrep deployment info {'id': ...} into the graph...
INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job.
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep dependencies.
...
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Go Library cleanup job.
INFO:cartography.graph.statement:Completed GoLibrary statement #1
...
INFO:cartography.graph.statement:Completed GoLibrary statement #X
INFO:cartography.graph.job:Finished job GoLibrary
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
...
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1730497895'
```
</details>

<details>
<summary>Logs from semgrep module after these changes</summary>

```
INFO:cartography.intel.semgrep.deployment:Loading SemgrepDeployment {'id': ...} into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job.
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep gomod dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep gomod dependencies.
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep gomod dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep gomod dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Dependencies cleanup job for GoLibrary.
INFO:cartography.graph.statement:Completed GoLibrary statement #1
INFO:cartography.graph.statement:Completed GoLibrary statement cartography-cncf#2
INFO:cartography.graph.statement:Completed GoLibrary statement cartography-cncf#3
INFO:cartography.graph.job:Finished job GoLibrary
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep npm dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep npm dependencies.
...
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep npm dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep npm dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X NpmLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Dependencies cleanup job for NpmLibrary.
INFO:cartography.graph.statement:Completed NpmLibrary statement #1
INFO:cartography.graph.statement:Completed NpmLibrary statement cartography-cncf#2
INFO:cartography.graph.statement:Completed NpmLibrary statement cartography-cncf#3
INFO:cartography.graph.job:Finished job NpmLibrary
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
...
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1731969699'
```
</details>

### Checklist

Provide proof that this works (this makes reviews move faster). Please
perform one or more of the following:
- [x] Update/add unit or integration tests.
- [x] Include a screenshot showing what the graph looked like before and
after your changes.
- [x] Include console log trace showing what happened before and after
your changes.

If you are changing a node or relationship:
- [x] Update the
[schema](https://github.com/lyft/cartography/tree/master/docs/root/modules)
and
[readme](https://github.com/lyft/cartography/blob/master/docs/schema/README.md).

If you are implementing a new intel module:
- [x] Use the NodeSchema [data
model](https://cartography-cncf.github.io/cartography/dev/writing-intel-modules.html#defining-a-node).

---------

Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: chandanchowdhury <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants