Ingest Go dependencies using Semgrep API (#1368)
### Summary

Motivation: The library dependency graph is already populated with
`PythonLibrary` nodes via Cartography's Github module, but no other
languages are supported. This PR adds `GoLibrary` nodes to the library
dependency graph to bring support for Go up to parity with Python. I
concur with the recommendation from @heryxpc that rather than writing
code to manually parse go.mod files, we should instead use the
dependency data returned by the Semgrep API.

Cartography's Semgrep module is already able to import supply chain
vulnerability data from the
[Findings](https://semgrep.dev/api/v1/docs/#tag/Finding/operation/semgrep_app.core_exp.findings.handlers.issue.openapi_list_recent_issues)
endpoint of the Semgrep API. Semgrep also provides a [List
Dependencies](https://semgrep.dev/api/v1/docs/#tag/SupplyChainService/operation/semgrep_app.products.sca.handlers.dependency.list_dependencies_conexxion)
endpoint that returns a list of every known dependency for a given
ecosystem (e.g. specifying the “gomod” ecosystem returns all
dependencies found in go.mod files). The response contains useful
information including the transitivity of the dependency and a link to
where it’s defined in source code.

The dependency nodes imported from the Semgrep API will be labelled
`GoLibrary::SemgrepDependency::Dependency` and will match the properties
of existing `PythonLibrary::Dependency` nodes as closely as possible.
This PR only imports Go dependencies from Semgrep, but I've structured
the code to make it easy to import additional languages from Semgrep in
the future.
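
One way the "additional languages" extension could look is a lookup from Semgrep ecosystem to node label. This is only a rough sketch: `ECOSYSTEM_LABELS` and `group_by_label` are hypothetical names, and the `pypi` entry is an assumed future mapping. This PR itself only wires up `gomod`.

```python
from typing import Any, Dict, List

# Hypothetical mapping from Semgrep ecosystem name to graph node label.
# Only "gomod" -> "GoLibrary" is implemented in this PR; "pypi" is an assumption.
ECOSYSTEM_LABELS: Dict[str, str] = {
    "gomod": "GoLibrary",
    "pypi": "PythonLibrary",
}


def group_by_label(raw_deps: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Bucket raw dependency dicts by the node label their ecosystem maps to."""
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for dep in raw_deps:
        label = ECOSYSTEM_LABELS.get(dep["ecosystem"])
        if label is None:
            # Unknown ecosystems are skipped rather than guessed at.
            continue
        grouped.setdefault(label, []).append(dep)
    return grouped
```

With a table like this, supporting another language would mostly mean adding one entry and one node schema.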

Before these changes, a project with both Python and Go dependencies
had only PythonLibrary nodes in the dependency graph:
<img width="1019" alt="image"
src="https://github.com/user-attachments/assets/9e291012-103e-4dae-a2bb-2da5205421b7">



After these changes, for the same project the graph contains both
PythonLibrary and GoLibrary nodes:
<img width="1015" alt="image"
src="https://github.com/user-attachments/assets/f945e489-6a3e-4edf-85d4-424bacd763b2">


<details>
<summary>Logs from semgrep module before these changes</summary>

```
INFO:cartography.sync:Starting sync with update tag '1730497895'
INFO:cartography.sync:Starting sync stage 'create-indexes'
INFO:cartography.intel.create_indexes:Creating indexes for cartography node types.
INFO:cartography.sync:Finishing sync stage 'create-indexes'
INFO:cartography.sync:Starting sync stage 'semgrep'
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
INFO:cartography.intel.semgrep.findings:Loading Semgrep deployment info {'id': ...} into the graph...
INFO:cartography.intel.semgrep.findings:Retrieving Semgrep SCA vulns for deployment 'X'.
INFO:cartography.intel.semgrep.findings:Processed page 0 of Semgrep SCA vulnerabilities.
...
INFO:cartography.intel.semgrep.findings:Processed page X of Semgrep SCA vulnerabilities.
INFO:cartography.intel.semgrep.findings:Retrieved X Semgrep SCA vulns in X pages.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA vulns info into the graph.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA usages info into the graph.
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #1
...
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #X
INFO:cartography.graph.job:Finished job semgrep_sca_risk_analysis
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #X
INFO:cartography.graph.job:Finished job SemgrepSCAFinding
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA Locations cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #X
INFO:cartography.graph.job:Finished job SemgrepSCALocation
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1730497895'
```
</details>

<details>
<summary>Logs from semgrep module after these changes</summary>

```
INFO:cartography.sync:Starting sync with update tag '1730505324'
INFO:cartography.sync:Starting sync stage 'create-indexes'
INFO:cartography.intel.create_indexes:Creating indexes for cartography node types.
INFO:cartography.sync:Finishing sync stage 'create-indexes'
INFO:cartography.sync:Starting sync stage 'semgrep'
INFO:cartography.intel.semgrep.deployment:Loading Semgrep deployment info {'id': ...} into the graph...
INFO:cartography.intel.semgrep.dependencies:Running Semgrep dependencies sync job.
INFO:cartography.intel.semgrep.dependencies:Retrieving Semgrep dependencies for deployment 'X'.
INFO:cartography.intel.semgrep.dependencies:Processed page 0 of Semgrep dependencies.
...
INFO:cartography.intel.semgrep.dependencies:Processed page X of Semgrep dependencies.
INFO:cartography.intel.semgrep.dependencies:Retrieved X Semgrep dependencies in X pages.
INFO:cartography.intel.semgrep.dependencies:Loading X GoLibrary objects into the graph.
INFO:cartography.intel.semgrep.dependencies:Running Semgrep Go Library cleanup job.
INFO:cartography.graph.statement:Completed GoLibrary statement #1
...
INFO:cartography.graph.statement:Completed GoLibrary statement #X
INFO:cartography.graph.job:Finished job GoLibrary
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings sync job.
INFO:cartography.intel.semgrep.findings:Retrieving Semgrep SCA vulns for deployment 'lyft'.
INFO:cartography.intel.semgrep.findings:Processed page 0 of Semgrep SCA vulnerabilities.
...
INFO:cartography.intel.semgrep.findings:Processed page X of Semgrep SCA vulnerabilities.
INFO:cartography.intel.semgrep.findings:Retrieved X Semgrep SCA vulns in X pages.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA vulns info into the graph.
INFO:cartography.intel.semgrep.findings:Loading X Semgrep SCA usages info into the graph.
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #1
...
INFO:cartography.graph.statement:Completed semgrep_sca_risk_analysis statement #X
INFO:cartography.graph.job:Finished job semgrep_sca_risk_analysis
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA findings cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCAFinding statement #X
INFO:cartography.graph.job:Finished job SemgrepSCAFinding
INFO:cartography.intel.semgrep.findings:Running Semgrep SCA Locations cleanup job.
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #1
...
INFO:cartography.graph.statement:Completed SemgrepSCALocation statement #X
INFO:cartography.graph.job:Finished job SemgrepSCALocation
INFO:cartography.sync:Finishing sync stage 'semgrep'
INFO:cartography.sync:Finishing sync with update tag '1730497895'
```
</details>

### Checklist

Provide proof that this works (this makes reviews move faster). Please
perform one or more of the following:
- [x] Update/add unit or integration tests.
- [x] Include a screenshot showing what the graph looked like before and
after your changes.
- [x] Include console log trace showing what happened before and after
your changes.

If you are changing a node or relationship:
- [x] Update the
[schema](https://github.com/lyft/cartography/tree/master/docs/root/modules)
and
[readme](https://github.com/lyft/cartography/blob/master/docs/schema/README.md).

If you are implementing a new intel module:
- [x] Use the NodeSchema [data
model](https://cartography-cncf.github.io/cartography/dev/writing-intel-modules.html#defining-a-node).


### TODO
- [ ] Clean up TODO comments in code
- [ ] Add/update files like
cartography/data/jobs/scoped_analysis/semgrep_sca_risk_analysis.json?

---------

Signed-off-by: Hans Wernetti <[email protected]>
Signed-off-by: Alex Chantavy <[email protected]>
Co-authored-by: Alex Chantavy <[email protected]>
hanzo and achantavy authored Nov 5, 2024
1 parent 3e7941b commit 5029b00
Showing 13 changed files with 671 additions and 144 deletions.
11 changes: 9 additions & 2 deletions cartography/intel/semgrep/__init__.py
@@ -3,7 +3,9 @@
 import neo4j

 from cartography.config import Config
-from cartography.intel.semgrep.findings import sync
+from cartography.intel.semgrep.dependencies import sync_dependencies
+from cartography.intel.semgrep.deployment import sync_deployment
+from cartography.intel.semgrep.findings import sync_findings
 from cartography.util import timeit


@@ -20,4 +22,9 @@ def start_semgrep_ingestion(
     if not config.semgrep_app_token:
         logger.info('Semgrep import is not configured - skipping this module. See docs to configure.')
         return
-    sync(neo4j_session, config.semgrep_app_token, config.update_tag, common_job_parameters)
+
+    # sync_deployment must be called first since it populates common_job_parameters
+    # with the deployment ID and slug, which are required by the other sync functions
+    sync_deployment(neo4j_session, config.semgrep_app_token, config.update_tag, common_job_parameters)
+    sync_dependencies(neo4j_session, config.semgrep_app_token, config.update_tag, common_job_parameters)
+    sync_findings(neo4j_session, config.semgrep_app_token, config.update_tag, common_job_parameters)
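
The ordering requirement can be illustrated with stand-in functions that never touch Neo4j or the Semgrep API. This is a simplified sketch: both `*_stub` functions and the literal ID and slug values are hypothetical, and only the shared-dict contract mirrors the real code.

```python
from typing import Any, Dict


def sync_deployment_stub(common_job_parameters: Dict[str, Any]) -> None:
    # Stand-in for sync_deployment: records the deployment ID and slug for later stages.
    common_job_parameters["DEPLOYMENT_ID"] = "123"
    common_job_parameters["DEPLOYMENT_SLUG"] = "example-org"


def sync_dependencies_stub(common_job_parameters: Dict[str, Any]) -> str:
    # Stand-in for sync_dependencies: skips when the deployment ID is missing.
    deployment_id = common_job_parameters.get("DEPLOYMENT_ID")
    if not deployment_id:
        return "skipped"
    return f"synced deployment {deployment_id}"


params: Dict[str, Any] = {"UPDATE_TAG": 1}
print(sync_dependencies_stub(params))  # skipped
sync_deployment_stub(params)
print(sync_dependencies_stub(params))  # synced deployment 123
```

Because the dict is mutated in place, calling the stages out of order skips the dependency sync rather than raising.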
201 changes: 201 additions & 0 deletions cartography/intel/semgrep/dependencies.py
@@ -0,0 +1,201 @@
import logging
from typing import Any
from typing import Callable
from typing import Dict
from typing import List

import neo4j
import requests
from requests.exceptions import HTTPError
from requests.exceptions import ReadTimeout

from cartography.client.core.tx import load
from cartography.graph.job import GraphJob
from cartography.models.semgrep.dependencies import SemgrepGoLibrarySchema
from cartography.stats import get_stats_client
from cartography.util import merge_module_sync_metadata
from cartography.util import timeit

logger = logging.getLogger(__name__)
stat_handler = get_stats_client(__name__)
_PAGE_SIZE = 10000
_TIMEOUT = (60, 60)
_MAX_RETRIES = 3


@timeit
def get_dependencies(semgrep_app_token: str, deployment_id: str, ecosystems: List[str]) -> List[Dict[str, Any]]:
    """
    Gets all dependencies for the given ecosystems within the given Semgrep deployment ID.
    param: semgrep_app_token: The Semgrep App token to use for authentication.
    param: deployment_id: The Semgrep deployment ID to use for retrieving dependencies.
    param: ecosystems: One or more ecosystems to import dependencies from, e.g. "gomod" or "pypi".
    The list of supported ecosystems is defined here:
    https://semgrep.dev/api/v1/docs/#tag/SupplyChainService/operation/semgrep_app.products.sca.handlers.dependency.list_dependencies_conexxion
    """
    all_deps = []
    deps_url = f"https://semgrep.dev/api/v1/deployments/{deployment_id}/dependencies"
    has_more = True
    page = 0
    retries = 0
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {semgrep_app_token}",
    }

    request_data: dict[str, Any] = {
        "pageSize": _PAGE_SIZE,
        "dependencyFilter": {
            "ecosystem": ecosystems,
        },
    }

    logger.info(f"Retrieving Semgrep dependencies for deployment '{deployment_id}'.")
    while has_more:
        try:
            response = requests.post(deps_url, json=request_data, headers=headers, timeout=_TIMEOUT)
            response.raise_for_status()
            data = response.json()
        except (ReadTimeout, HTTPError):
            logger.warning(f"Failed to retrieve Semgrep dependencies for page {page}. Retrying...")
            retries += 1
            if retries >= _MAX_RETRIES:
                raise
            continue
        deps = data.get("dependencies", [])
        has_more = data.get("hasMore", False)
        logger.info(f"Processed page {page} of Semgrep dependencies.")
        all_deps.extend(deps)
        retries = 0
        page += 1
        request_data["cursor"] = data.get("cursor")

    logger.info(f"Retrieved {len(all_deps)} Semgrep dependencies in {page} pages.")
    return all_deps
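
The cursor-and-retry loop above can be exercised without network access by injecting the page fetcher. This is a minimal sketch, not the module's code: `fetch_all`, `make_fake_api`, and the two-page fake response are hypothetical, and `IOError` stands in for the `requests` exceptions caught above.

```python
from typing import Any, Callable, Dict, List, Optional

MAX_RETRIES = 3  # mirrors _MAX_RETRIES above

PageFetcher = Callable[[Optional[str]], Dict[str, Any]]


def fetch_all(fetch_page: PageFetcher) -> List[Dict[str, Any]]:
    """Follow cursors until hasMore is false, retrying a failing page up to MAX_RETRIES times."""
    items: List[Dict[str, Any]] = []
    cursor: Optional[str] = None
    retries = 0
    has_more = True
    while has_more:
        try:
            data = fetch_page(cursor)
        except IOError:
            retries += 1
            if retries >= MAX_RETRIES:
                raise
            continue
        items.extend(data.get("dependencies", []))
        has_more = data.get("hasMore", False)
        cursor = data.get("cursor")
        retries = 0  # the retry budget applies per page, not per sync
    return items


def make_fake_api() -> PageFetcher:
    """Hypothetical two-page API whose very first call fails once."""
    state = {"calls": 0}
    pages: Dict[Optional[str], Dict[str, Any]] = {
        None: {"dependencies": [{"name": "a"}], "hasMore": True, "cursor": "p2"},
        "p2": {"dependencies": [{"name": "b"}], "hasMore": False},
    }

    def fetch_page(cursor: Optional[str]) -> Dict[str, Any]:
        state["calls"] += 1
        if state["calls"] == 1:
            raise IOError("simulated timeout")
        return pages[cursor]

    return fetch_page


print(fetch_all(make_fake_api()))  # [{'name': 'a'}, {'name': 'b'}]
```

Resetting `retries` after each successful page means a long sync is not aborted by isolated failures spread across many pages.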


def transform_dependencies(raw_deps: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Transforms the raw dependencies response from Semgrep API into a list of dicts
    that can be used to create the Dependency nodes.
    """

    """
    sample raw_dep as of November 2024:
    {
        "repositoryId": "123456",
        "definedAt": {
            "path": "go.mod",
            "startLine": "6",
            "endLine": "6",
            "url": "https://github.com/org/repo-name/blob/00000000000000000000000000000000/go.mod#L6",
            "committedAt": "1970-01-01T00:00:00Z",
            "startCol": "0",
            "endCol": "0"
        },
        "transitivity": "DIRECT",
        "package": {
            "name": "github.com/foo/bar",
            "versionSpecifier": "1.2.3"
        },
        "ecosystem": "gomod",
        "licenses": [],
        "pathToTransitivity": []
    },
    """
    deps = []
    for raw_dep in raw_deps:

        # We could call a different endpoint to get all repo IDs and store a mapping of repo ID to URL,
        # but it's much simpler to just extract the URL from the definedAt field.
        repo_url = raw_dep["definedAt"]["url"].split("/blob/", 1)[0]

        name = raw_dep["package"]["name"]
        version = raw_dep["package"]["versionSpecifier"]
        id = f"{name}|{version}"

        # As of November 2024, Semgrep does not import dependencies with version specifiers such as >, <, etc.
        # For now, hardcode the specifier to ==<version> to align with GitHub-sourced Python dependencies.
        # If Semgrep eventually supports version specifiers, update this line accordingly.
        specifier = f"=={version}"

        deps.append({
            # existing dependency properties:
            "id": id,
            "name": name,
            "specifier": specifier,
            "version": version,
            "repo_url": repo_url,

            # Semgrep-specific properties:
            "ecosystem": raw_dep["ecosystem"],
            "transitivity": raw_dep["transitivity"].lower(),
            "url": raw_dep["definedAt"]["url"],
        })

    return deps
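
To see what the mapping produces, the sample `raw_dep` from the docstring can be run through a condensed, standalone copy of the per-dependency logic. `to_node_props` is a hypothetical helper written for this illustration; it is not part of the module.

```python
from typing import Any, Dict


def to_node_props(raw_dep: Dict[str, Any]) -> Dict[str, Any]:
    """Condensed, standalone version of the per-dependency mapping shown above."""
    url = raw_dep["definedAt"]["url"]
    name = raw_dep["package"]["name"]
    version = raw_dep["package"]["versionSpecifier"]
    return {
        "id": f"{name}|{version}",
        "name": name,
        "specifier": f"=={version}",
        "version": version,
        # Everything before "/blob/" is the repository URL.
        "repo_url": url.split("/blob/", 1)[0],
        "ecosystem": raw_dep["ecosystem"],
        "transitivity": raw_dep["transitivity"].lower(),
        "url": url,
    }


sample = {
    "repositoryId": "123456",
    "definedAt": {
        "path": "go.mod",
        "url": "https://github.com/org/repo-name/blob/00000000000000000000000000000000/go.mod#L6",
    },
    "transitivity": "DIRECT",
    "package": {"name": "github.com/foo/bar", "versionSpecifier": "1.2.3"},
    "ecosystem": "gomod",
}

props = to_node_props(sample)
print(props["id"])            # github.com/foo/bar|1.2.3
print(props["repo_url"])      # https://github.com/org/repo-name
print(props["transitivity"])  # direct
```

Because the node `id` is `name|version`, two repositories pinning the same module version share an `id`, while different versions of one module get distinct ones.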


@timeit
def load_dependencies(
    neo4j_session: neo4j.Session,
    dependency_schema: Callable,
    dependencies: List[Dict],
    deployment_id: str,
    update_tag: int,
) -> None:
    logger.info(f"Loading {len(dependencies)} {dependency_schema().label} objects into the graph.")
    load(
        neo4j_session,
        dependency_schema(),
        dependencies,
        lastupdated=update_tag,
        DEPLOYMENT_ID=deployment_id,
    )


@timeit
def cleanup(
    neo4j_session: neo4j.Session,
    common_job_parameters: Dict[str, Any],
) -> None:
    logger.info("Running Semgrep Go Library cleanup job.")
    go_libraries_cleanup_job = GraphJob.from_node_schema(
        SemgrepGoLibrarySchema(), common_job_parameters,
    )
    go_libraries_cleanup_job.run(neo4j_session)


@timeit
def sync_dependencies(
    neo4j_session: neo4j.Session,
    semgrep_app_token: str,
    update_tag: int,
    common_job_parameters: Dict[str, Any],
) -> None:

    deployment_id = common_job_parameters.get("DEPLOYMENT_ID")
    if not deployment_id:
        logger.warning(
            "Missing Semgrep deployment ID, ensure that sync_deployment() has been called. "
            "Skipping Semgrep dependencies sync job.",
        )
        return

    logger.info("Running Semgrep dependencies sync job.")

    # fetch and load dependencies for the Go ecosystem
    raw_go_deps = get_dependencies(semgrep_app_token, deployment_id, ecosystems=["gomod"])
    go_deps = transform_dependencies(raw_go_deps)
    load_dependencies(neo4j_session, SemgrepGoLibrarySchema, go_deps, deployment_id, update_tag)

    cleanup(neo4j_session, common_job_parameters)

    merge_module_sync_metadata(
        neo4j_session=neo4j_session,
        group_type='Semgrep',
        group_id=deployment_id,
        synced_type='SemgrepDependency',
        update_tag=update_tag,
        stat_handler=stat_handler,
    )
67 changes: 67 additions & 0 deletions cartography/intel/semgrep/deployment.py
@@ -0,0 +1,67 @@
import logging
from typing import Any
from typing import Dict

import neo4j
import requests

from cartography.client.core.tx import load
from cartography.models.semgrep.deployment import SemgrepDeploymentSchema
from cartography.stats import get_stats_client
from cartography.util import timeit

logger = logging.getLogger(__name__)
stat_handler = get_stats_client(__name__)
_TIMEOUT = (60, 60)


@timeit
def get_deployment(semgrep_app_token: str) -> Dict[str, Any]:
    """
    Gets the deployment associated with the passed Semgrep App token.
    param: semgrep_app_token: The Semgrep App token to use for authentication.
    """
    deployment = {}
    deployment_url = "https://semgrep.dev/api/v1/deployments"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {semgrep_app_token}",
    }
    response = requests.get(deployment_url, headers=headers, timeout=_TIMEOUT)
    response.raise_for_status()

    data = response.json()
    deployment["id"] = data["deployments"][0]["id"]
    deployment["name"] = data["deployments"][0]["name"]
    deployment["slug"] = data["deployments"][0]["slug"]

    return deployment
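
`get_deployment` indexes `data["deployments"][0]` directly, so an empty list would surface as a bare `IndexError`. A possible follow-up, sketched here under the assumption that the response shape matches what the code above reads, is a small helper that fails with a clearer message. `extract_first_deployment` is a hypothetical name and is not part of this commit.

```python
from typing import Any, Dict


def extract_first_deployment(data: Dict[str, Any]) -> Dict[str, Any]:
    """Pick id, name and slug from the first deployment, failing loudly if none is returned."""
    deployments = data.get("deployments") or []
    if not deployments:
        raise ValueError("Semgrep API returned no deployments for this token")
    first = deployments[0]
    return {key: first[key] for key in ("id", "name", "slug")}


response_body = {"deployments": [{"id": "123", "name": "Example Org", "slug": "example-org", "extra": True}]}
print(extract_first_deployment(response_body))
# {'id': '123', 'name': 'Example Org', 'slug': 'example-org'}
```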


@timeit
def load_semgrep_deployment(
    neo4j_session: neo4j.Session, deployment: Dict[str, Any], update_tag: int,
) -> None:
    logger.info(f"Loading SemgrepDeployment {deployment} into the graph.")
    load(
        neo4j_session,
        SemgrepDeploymentSchema(),
        [deployment],
        lastupdated=update_tag,
    )


@timeit
def sync_deployment(
    neo4j_session: neo4j.Session,
    semgrep_app_token: str,
    update_tag: int,
    common_job_parameters: Dict[str, Any],
) -> None:

    semgrep_deployment = get_deployment(semgrep_app_token)
    deployment_id = semgrep_deployment["id"]
    deployment_slug = semgrep_deployment["slug"]
    load_semgrep_deployment(neo4j_session, semgrep_deployment, update_tag)
    common_job_parameters["DEPLOYMENT_ID"] = deployment_id
    common_job_parameters["DEPLOYMENT_SLUG"] = deployment_slug