Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest Go dependencies using Semgrep API #1368

Merged
merged 30 commits into from
Nov 5, 2024
Merged

Conversation

hanzo
Copy link
Contributor

@hanzo hanzo commented Oct 15, 2024

Summary

Motivation: The library dependency graph is already populated with PythonLibrary nodes via Cartography's Github module, but no other languages are supported. This PR adds GoLibrary nodes to the library dependency graph to bring support for Go up to parity with Python. I concur with the recommendation from @heryxpc that rather than writing code to manually parse go.mod files, we should instead use the dependency data returned by the Semgrep API.

Cartography's Semgrep module is already able to import supply chain vulnerability data from the Findings endpoint of the Semgrep API. Semgrep also provides a List Dependencies endpoint that returns a list of every known dependency for a given ecosystem (e.g. specifying the “gomod” ecosystem returns all dependencies found in go.mod files). The response contains useful information including the transitivity of the dependency and a link to where it’s defined in source code.

The dependency nodes imported from the Semgrep API will be labelled GoLibrary::SemgrepDependency::Dependency and will match the properties of existing PythonLibrary::Dependency nodes as closely as possible. This PR only imports Go dependencies from Semgrep, but I've structured the code to make it easy to import additional languages from Semgrep in the future.

Before these changes, a project with both Python and Go dependencies will only have PythonLibrary nodes in the dependency graph:
image

After these changes, for the same project the graph contains both PythonLibrary and GoLibrary nodes:
image

Logs from semgrep module before these changes
INFO:cartography.sync:Starting sync with update tag '1730497895'
INFO:cartography.sync:Starting sync stage 'create-indexes'
INFO:cartography.intel.create_indexes:Creating indexes for cartography node types.
INFO:cartography.sync:Finishing sync stage 'create-indexes'
INFO:cartography.sync:Starting sync stage 'semgrep'
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
INFO:cartography.intel.semgrep.findings:Loading Semgrep deployment info {'id': ...} into the graph...
INFO:cartography.intel.semgrep.findings:Retrieving Semgrep SCA vulns for deployment 'X'.
INFO:cartography.intel.semgrep.findings:Processed page 0 of Semgrep SCA vulnerabilities.
...
INFO:cartography.intel.semgrep.findings:Processed page X of Semgrep SCA vulnerabilities.
INFO:cartography.intel.semgrep.findings:Retrieved X Semgrep SCA vulns in X pages.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA vulns info into the graph.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA usages info into the graph.
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #1
...
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #X
INFO:cartography.graph.job:Finished job semgrep_sca_risk_analysis
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #X
INFO:cartography.graph.job:Finished job SemgrepSCAFinding
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA Locations cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #X
INFO:cartography.graph.job:Finished job SemgrepSCALocation
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1730497895'
Logs from semgrep module after these changes
INFO:cartography.sync:Starting sync with update tag '1730505324'
INFO:cartography.sync:Starting sync stage 'create-indexes'
INFO:cartography.intel.create_indexes:Creating indexes for cartography node types.
INFO:cartography.sync:Finishing sync stage 'create-indexes'
INFO:cartography.sync:Starting sync stage 'semgrep'
INFO:cartography.intel.semgrep.deployment:Loading Semgrep deployment info {'id': ...} into the graph...
INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job.
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep dependencies.
...
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Go Library cleanup job.
INFO:cartography.graph.statement:Completed GoLibrary statement #1
...
INFO:cartography.graph.statement:Completed GoLibrary statement #X
INFO:cartography.graph.job:Finished job GoLibrary
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
INFO:cartography.intel.semgrep.findings:Retrieving Semgrep SCA vulns for deployment 'lyft'.
INFO:cartography.intel.semgrep.findings:Processed page 0 of Semgrep SCA vulnerabilities.
...
INFO:cartography.intel.semgrep.findings:Processed page X of Semgrep SCA vulnerabilities.
INFO:cartography.intel.semgrep.findings:Retrieved X Semgrep SCA vulns in X pages.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA vulns info into the graph.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA usages info into the graph.
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #1
...
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #X
INFO:cartography.graph.job:Finished job semgrep_sca_risk_analysis
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #X
INFO:cartography.graph.job:Finished job SemgrepSCAFinding
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA Locations cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #X
INFO:cartography.graph.job:Finished job SemgrepSCALocation
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1730497895'

Checklist

Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:

  • Update/add unit or integration tests.
  • Include a screenshot showing what the graph looked like before and after your changes.
  • Include console log trace showing what happened before and after your changes.

If you are changing a node or relationship:

If you are implementing a new intel module:

TODO

  • Clean up TODO comments in code
  • Add/update files like cartography/data/jobs/scoped_analysis/semgrep_sca_risk_analysis.json?

@hanzo hanzo changed the title [WIP] Ingest dependencies using Semgrep API [WIP] Ingest Go dependencies using Semgrep API Oct 21, 2024
@@ -282,6 +286,12 @@ def sync(
load_semgrep_sca_vulns(neo4j_sesion, vulns, deployment_id, update_tag)
load_semgrep_sca_usages(neo4j_sesion, usages, deployment_id, update_tag)
run_scoped_analysis_job('semgrep_sca_risk_analysis.json', neo4j_sesion, common_job_parameters)

# fetch and load dependencies for the Go ecosystem
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this sync() function is being used for more than just findings, should I move it to a separate file, e.g. common.py? Or I could create a separate sync() function in dependencies.py, but it would need to duplicate a lot of this function's code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I would say an approach like the github ingestion is good, multiple resource syncs called from the start_semgrep_ingestion.

The only code you need to repeat is the cleanup and merge_module_sync_metadata process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other thing is that the findings API requires the deployment ID, currently retrieved here:

semgrep_deployment = get_deployment(semgrep_app_token)
deployment_id = semgrep_deployment["id"]
deployment_slug = semgrep_deployment["slug"]
load_semgrep_deployment(neo4j_sesion, semgrep_deployment, update_tag)

I could update start_semgrep_ingestion to fetch the deployment ID and pass it as an argument to the other sync functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heryxpc I've done some refactoring, what do you think? I don't love the way it's set up now but couldn't figure out a cleaner approach, open to suggestions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken another pass at organizing the code, this way is much cleaner but it feels a little brittle that sync_deployment sets values in common_job_parameters that are required by sync_findings and sync_dependencies, so it would break if you called the sync functions in a different order

Copy link
Contributor Author

@hanzo hanzo Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named this deployment.py instead of deployments.py to match the existing file models/deployment.py.

The contents of this file have been moved here from intel/semgrep/findings.py without changes

hanzo and others added 9 commits October 29, 2024 11:44
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
### Summary
> Describe your changes.

Now that cartography has been donated to the CNCF, time to update the
docs

Signed-off-by: Alex Chantavy <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
First release after CNCF donation, lets see if the CI tooling works

Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Hans Wernetti <[email protected]>

# We could call a different endpoint to get all repo IDs and store a mapping of repo ID to URL,
# but it's much simpler to just extract the URL from the definedAt field.
repo_url = raw_dep["definedAt"]["url"].split("/blob/", 1)[0]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered what might cause this string split to give the wrong result, but I think it's very unlikely. Even if a repo stored its go.mod file inside a directory named /blob/ (which would be really strange), the url returned from semgrep would be something like https://github.com/org/repo/blob/sha/blob/go.mod#L112, so repo_url would still be set to https://github.com/org/repo as expected.

@hanzo hanzo changed the title [WIP] Ingest Go dependencies using Semgrep API Ingest Go dependencies using Semgrep API Nov 1, 2024
@hanzo hanzo marked this pull request as ready for review November 1, 2024 21:03

deployment_id = common_job_parameters.get("DEPLOYMENT_ID")
deployment_slug = common_job_parameters.get("DEPLOYMENT_SLUG")
if not deployment_id or not deployment_slug:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this mechanism for getting the required parameters, would love to hear suggestions for improvement

achantavy
achantavy previously approved these changes Nov 4, 2024
Copy link
Contributor

@achantavy achantavy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left nits but nothing blocking. Thanks for doing this and for including the screenshots and tests!!

logger.warning(f"Failed to retrieve Semgrep dependencies for page {page}. Retrying...")
retries += 1
if retries >= _MAX_RETRIES:
raise e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, why not just raise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only because this is copied from

I'll update both lines to raise

cartography/intel/semgrep/dependencies.py Outdated Show resolved Hide resolved
deployment_id: str,
update_tag: int,
) -> None:
logger.info(f"Loading {len(dependencies)} Semgrep dependencies into the graph.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[non-block] i'm not a huge fan of metaprogramming here but I won't block on this.

That aside, if we decide to keep this bit of metaprogramming, it'd be good to log the label of the dependency_schema object so that the log message shows what asset is getting written to the graph.

Copy link
Contributor Author

@hanzo hanzo Nov 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from

logger.info(f"Loading {len(vulns)} Semgrep SCA vulns info into the graph.")

I'll update all of these log lines to use the label of the schema object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated logs:

INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep dependencies in Y pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Go Library cleanup job.

cartography/models/semgrep/dependencies.py Outdated Show resolved Hide resolved

@dataclass(frozen=True)
# (:SemgrepDependency)<-[:REQUIRES]-(:GitHubRepository)
class SemgrepDependencyToGithubRepoRel(CartographyRelSchema):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non-block]
since this rel is specifically for a go-library to GitHub repo, I'd prefer to add some copypasta to make it separate. This is a style choice and I won't block here though since in future we can decouple that out if we decide that another semgrep dependency needs to have a different relationship definition than a go library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I've seen, all semgrep dependencies share the same relationship with the github repo where they were detected by semgrep. I'd be inclined to leave it as is for now and we can decouple it later if we discover that a different relationship is needed

Signed-off-by: Hans Wernetti <[email protected]>
Copy link
Contributor

@olivia-hong olivia-hong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will defer approval to the cartography experts but overall lgtm!

@achantavy achantavy merged commit 5029b00 into master Nov 5, 2024
5 checks passed
@achantavy achantavy deleted the semgrep-dependencies branch November 5, 2024 18:53
@hanzo hanzo mentioned this pull request Nov 5, 2024
achantavy pushed a commit that referenced this pull request Nov 6, 2024
### Summary

Small followup to
#1368, noticed that
the schema doc is not being rendered correctly:


![image](https://github.com/user-attachments/assets/e3f34e94-a7c0-4b39-b243-460fe39acb57)

I'm not sure how to test that it will render correctly after this
change, but this format matches the existing format used by the github
schema:

https://github.com/cartography-cncf/cartography/blob/f11d7b2ff87331ed736a5de2547cde812face1a0/docs/root/modules/github/schema.md?plain=1#L210-L218

---------

Signed-off-by: Hans Wernetti <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants