Update dev docs on how to use new object model #1054
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,16 @@ | ||
# How to write a new intel module | ||
|
||
This doc contains guidelines on creating a Cartography intel module. If you want to add a new data type to Cartography, | ||
this is the guide for you! It is fairly straightforward to copy the structure of an existing intel module and test it, | ||
but we'll share some best practices in this doc to save you some time. We look forward to receiving your PR! | ||
If you want to add a new data type to Cartography, this is the guide for you. We look forward to receiving your PR! | ||
|
||
## Before getting started... | ||
|
||
Read through and follow the setup steps in [the Cartography developer guide](developer-guide.html). Learn the basics of | ||
running, testing, and linting your code there. | ||
|
||
## The fast way | ||
|
||
To get started coding without reading this doc, just copy the structure of our [AWS EMR module](https://github.com/lyft/cartography/blob/master/cartography/intel/aws/emr.py) and use it as an example. For a longer written explanation of the "how" and "why", read on. | ||
|
||
## Configuration and credential management | ||
|
||
### Supplying credentials and arguments to your module | ||
|
@@ -25,16 +27,18 @@ A cartography intel module consists of one `sync` function. `sync` should call ` | |
|
||
### Get | ||
|
||
The `get` function [retrieves necessary data](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L98) | ||
The `get` function [returns data as a list of dicts](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L98) | ||
from a resource provider API, which is GCP in this particular example. | ||
|
||
`get` should be "dumb" in the sense that it should not handle retry logic or data | ||
manipulation. It should also raise an exception if it's not able to complete successfully. | ||
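
As a rough illustration of a "dumb" `get`, here is a minimal sketch. `FakeWidgetClient`, `list_widgets()`, and `get_widgets()` are hypothetical names invented for this example; they are not part of cartography or any provider SDK. The point is only that `get` returns the raw results as a list of dicts and lets any exception propagate.

```python
from typing import Any, Dict, List


class FakeWidgetClient:
    """Hypothetical stand-in for a resource provider API client."""

    def __init__(self, pages: List[List[Dict[str, Any]]]) -> None:
        self._pages = pages

    def list_widgets(self) -> List[List[Dict[str, Any]]]:
        # A real client would make network calls here and may raise on failure.
        return self._pages


def get_widgets(client: FakeWidgetClient) -> List[Dict[str, Any]]:
    """Return the raw API results as a flat list of dicts.

    No retries, no reshaping, no try/except: any exception raised by the
    client propagates to the caller.
    """
    widgets: List[Dict[str, Any]] = []
    for page in client.list_widgets():
        widgets.extend(page)
    return widgets
```

Any reshaping of the returned dicts belongs in `transform`, not here.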
|
||
### Transform | ||
|
||
The `transform` function [manipulates data](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L193) | ||
to make it easier to ingest to the graph. We have some best practices on handling transforms: | ||
The `transform` function [manipulates the list of dicts](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L193) | ||
to make it easier to ingest to the graph. `transform` functions are sometimes omitted when a module author decides that the output from the `get` is already in the shape that they need. | ||
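
A minimal sketch of what a `transform` might look like, assuming a hypothetical "widget" resource whose field names (`Id`, `Description`, `Owner`) are invented for illustration. It follows the convention in this doc of setting missing fields to `None` rather than `""`; treating a missing required field as an error is an assumption of this sketch.

```python
from typing import Any, Dict, List


def transform_widgets(raw_widgets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Reshape raw API dicts into flat dicts that are easy to ingest."""
    result: List[Dict[str, Any]] = []
    for raw in raw_widgets:
        result.append({
            # Required field: a missing key raises KeyError, so bad data is noticed early.
            'Id': raw['Id'],
            # Optional field: .get() yields None (not "") when the key is absent.
            'Description': raw.get('Description'),
            # Flatten a nested value into a top-level key.
            'OwnerName': raw.get('Owner', {}).get('Name'),
        })
    return result
```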
|
||
We have some best practices on handling transforms: | ||
|
||
#### Handling required versus optional fields | ||
|
||
|
@@ -54,19 +58,144 @@ For the sake of consistency, if a field does not exist, set it to `None` and not | |
|
||
### Load | ||
|
||
The `load` function ingests the processed data to Neo4j, [as seen in this GCP VPC example](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L442). | ||
There are many best practices to consider here. | ||
[As seen in our AWS EMR example](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L113-L132), the `load` function ingests a list of dicts to Neo4j by calling [cartography.client.core.tx.load_graph_data()](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/client/core/tx.py#L191-L212): | ||
```python | ||
def load_emr_clusters( | ||
neo4j_session: neo4j.Session, | ||
cluster_data: List[Dict[str, Any]], | ||
region: str, | ||
current_aws_account_id: str, | ||
aws_update_tag: int, | ||
) -> None: | ||
logger.info(f"Loading EMR {len(cluster_data)} clusters for region '{region}' into graph.") | ||
|
||
ingestion_query = build_ingestion_query(EMRClusterSchema()) | ||
|
||
load_graph_data( | ||
neo4j_session, | ||
ingestion_query, | ||
cluster_data, | ||
lastupdated=aws_update_tag, | ||
Region=region, | ||
AccountId=current_aws_account_id, | ||
) | ||
|
||
``` | ||
|
||
|
||
`load_graph_data()` requires an `ingestion_query` to be generated from `CartographyNodeSchema` and `CartographyRelSchema` objects. [cartography.graph.querybuilder.build_ingestion_query()](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/querybuilder.py#L312) does just that: it accepts those schema objects as input and returns a well-formed and optimized cypher query. | ||
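
To make the data flow concrete, here is a simplified sketch of how the list of dicts and the keyword arguments could reach the query as parameters. `RecordingSession` and `load_graph_data_sketch()` are hypothetical stand-ins written for this example, not cartography's implementation; the only detail taken from this doc is that the list is bound to `$DictList` and that values such as `lastupdated` are supplied as keyword arguments.

```python
from typing import Any, Dict, List, Tuple


class RecordingSession:
    """Hypothetical stand-in for a Neo4j session; it only records what it is asked to run."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def run(self, query: str, **params: Any) -> None:
        self.calls.append((query, params))


def load_graph_data_sketch(
    session: RecordingSession,
    query: str,
    dict_list: List[Dict[str, Any]],
    **kwargs: Any,
) -> None:
    # The list of dicts is bound to $DictList; every keyword argument becomes
    # a query parameter of the same name (for example $lastupdated, $Region).
    session.run(query, DictList=dict_list, **kwargs)


session = RecordingSession()
load_graph_data_sketch(
    session,
    'UNWIND $DictList AS item ...',  # placeholder query text
    [{'Id': 'j-123'}],
    lastupdated=1,
    Region='us-east-1',
    AccountId='000000000000',
)
```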
|
||
|
||
#### Defining a node | ||
|
||
As an example of a `CartographyNodeSchema`, you can view our [EMRClusterSchema code](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L106-L110): | ||
|
||
```python | ||
@dataclass(frozen=True) | ||
class EMRClusterSchema(CartographyNodeSchema): | ||
label: str = 'EMRCluster' # The label of the node | ||
properties: EMRClusterNodeProperties = EMRClusterNodeProperties() # An object representing all properties on the EMR Cluster node | ||
sub_resource_relationship: EMRClusterToAWSAccount = EMRClusterToAWSAccount() | ||
``` | ||
|
||
An `EMRClusterSchema` object inherits from the `CartographyNodeSchema` class and contains a node label, properties, and connection to its [sub-resource](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228): an `AWSAccount`. | ||
|
||
Note that the type hints are necessary for Python dataclasses to work properly: only annotated attributes become dataclass fields. | ||
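
The effect of omitting a type hint can be checked with the standard library alone. The two classes below are toy examples, not cartography classes: an attribute without an annotation stays a plain class attribute and is not registered as a dataclass field.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class WithHint:
    label: str = 'EMRCluster'  # annotated: becomes a dataclass field


@dataclass(frozen=True)
class WithoutHint:
    label = 'EMRCluster'  # no annotation: plain class attribute, not a field


with_hint_fields = [f.name for f in fields(WithHint)]
without_hint_fields = [f.name for f in fields(WithoutHint)]
print(with_hint_fields)     # ['label']
print(without_hint_fields)  # []
```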
|
||
|
||
#### Defining node properties | ||
|
||
Here's our [EMRClusterNodeProperties code](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L106-L110): | ||
|
||
```python | ||
@dataclass(frozen=True) | ||
class EMRClusterNodeProperties(CartographyNodeProperties): | ||
arn: PropertyRef = PropertyRef('ClusterArn') | ||
firstseen: PropertyRef = PropertyRef('firstseen') | ||
id: PropertyRef = PropertyRef('Id') | ||
# ... | ||
lastupdated: PropertyRef = PropertyRef('lastupdated', set_in_kwargs=True) | ||
region: PropertyRef = PropertyRef('Region', set_in_kwargs=True) | ||
security_configuration: PropertyRef = PropertyRef('SecurityConfiguration') | ||
``` | ||
|
||
A `CartographyNodeProperties` object consists of [`PropertyRef`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L37) objects. `PropertyRefs` tell `querybuilder.build_ingestion_query()` where to find appropriate values for each field from the list of dicts. | ||
|
||
For example, `id: PropertyRef = PropertyRef('Id')` above tells the querybuilder to set a field called `id` on the `EMRCluster` node using the value located at key `'Id'` on each dict in the list. | ||
|
||
As another example, `region: PropertyRef = PropertyRef('Region', set_in_kwargs=True)` tells the querybuilder to set a field called `region` on the `EMRCluster` node using a keyword argument called `Region` supplied to `cartography.client.core.tx.load_graph_data()`. `set_in_kwargs=True` is useful in cases where we want every object loaded by a single call to `load_graph_data()` to have the same value for a given attribute. | ||
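
The difference between the two kinds of `PropertyRef` can be sketched as follows. `PropertyRefSketch` and `render_set_clause()` are simplified, hypothetical stand-ins written for this example; they are not the real cartography classes, and the exact query text the real querybuilder emits may differ.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PropertyRefSketch:
    """Simplified, hypothetical stand-in for cartography's PropertyRef."""
    name: str
    set_in_kwargs: bool = False

    def render(self) -> str:
        # kwargs-backed values become query parameters ($Name);
        # everything else is read off each dict in the list (item.Name).
        return f'${self.name}' if self.set_in_kwargs else f'item.{self.name}'


def render_set_clause(node_var: str, props: Dict[str, PropertyRefSketch]) -> str:
    """Render one 'node.field = value' assignment per property."""
    return ',\n'.join(
        f'{node_var}.{field} = {ref.render()}' for field, ref in props.items()
    )


props = {
    'arn': PropertyRefSketch('ClusterArn'),
    'region': PropertyRefSketch('Region', set_in_kwargs=True),
}
set_clause = render_set_clause('i', props)
print(set_clause)
# i.arn = item.ClusterArn,
# i.region = $Region
```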
|
||
|
||
#### Defining relationships | ||
|
||
Relationships can be defined on `CartographyNodeSchema` on either their [`sub_resource_relationship`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228) field or their [`other_relationships`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L230-L237) field. | ||
|
||
As seen above, an `EMRClusterSchema` only has a single relationship defined: an [`EMRClusterToAWSAccount`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L94-L103): | ||
|
||
#### Handling cartography's `update_tag`: | ||
```python | ||
@dataclass(frozen=True) | ||
# (:EMRCluster)<-[:RESOURCE]-(:AWSAccount) | ||
class EMRClusterToAWSAccount(CartographyRelSchema): | ||
target_node_label: str = 'AWSAccount' # (1) | ||
target_node_matcher: TargetNodeMatcher = make_target_node_matcher( # (2) | ||
{'id': PropertyRef('AccountId', set_in_kwargs=True)}, | ||
) | ||
direction: LinkDirection = LinkDirection.INWARD # (3) | ||
rel_label: str = "RESOURCE" # (4) | ||
properties: EMRClusterToAwsAccountRelProperties = EMRClusterToAwsAccountRelProperties() # (5) | ||
``` | ||
|
||
This class is best described by explaining how it is processed: `build_ingestion_query()` will traverse the `EMRClusterSchema` to its `sub_resource_relationship` field and find the above `EMRClusterToAWSAccount` object. With this information, we know to | ||
- draw a relationship to an `AWSAccount` node (1) using the label "`RESOURCE`" (4) | ||
- by matching on the AWSAccount's "`id`" field (2) | ||
- where the relationship [directionality](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L12-L34) is pointed _inward_ toward the EMRCluster (3) | ||
- making sure to define a set of properties for the relationship (5). The [full example RelProperties](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L89-L91) is very short: | ||
|
||
```python | ||
@dataclass(frozen=True) | ||
class EMRClusterToAwsAccountRelProperties(CartographyRelProperties): | ||
lastupdated: PropertyRef = PropertyRef('lastupdated', set_in_kwargs=True) | ||
``` | ||
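
To illustrate how a relationship's direction could translate into a Cypher pattern, here is a small sketch. `LinkDirectionSketch` and `render_rel_pattern()` are hypothetical helpers written for this example; the `OUTWARD` member is an assumption, since only `INWARD` appears in the code above.

```python
from enum import Enum, auto


class LinkDirectionSketch(Enum):
    """Hypothetical stand-in for the relationship direction enum."""
    INWARD = auto()   # arrow points toward the node being loaded
    OUTWARD = auto()  # assumed counterpart: arrow points away from it


def render_rel_pattern(
    direction: LinkDirectionSketch,
    rel_label: str,
    node_var: str = 'i',
    rel_var: str = 'r',
    target_var: str = 'j',
) -> str:
    """Render the relationship part of a MERGE clause for the given direction."""
    if direction is LinkDirectionSketch.INWARD:
        return f'({node_var})<-[{rel_var}:{rel_label}]-({target_var})'
    return f'({node_var})-[{rel_var}:{rel_label}]->({target_var})'


inward = render_rel_pattern(LinkDirectionSketch.INWARD, 'RESOURCE')
print(inward)  # (i)<-[r:RESOURCE]-(j)
```

With `i` as the `EMRCluster` and `j` as the `AWSAccount`, the inward pattern corresponds to `(:EMRCluster)<-[:RESOURCE]-(:AWSAccount)`.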
|
||
#### The result | ||
|
||
And those are all the objects necessary for this example! The resulting query will look something like this: | ||
|
||
```cypher | ||
UNWIND $DictList AS item | ||
MERGE (i:EMRCluster{id: item.Id}) | ||
ON CREATE SET i.firstseen = timestamp() | ||
SET | ||
i.lastupdated = $lastupdated, | ||
i.arn = item.ClusterArn | ||
// ... | ||
|
||
WITH i, item | ||
CALL { | ||
WITH i, item | ||
|
||
OPTIONAL MATCH (j:AWSAccount{id: $AccountId}) | ||
WITH i, item, j WHERE j IS NOT NULL | ||
MERGE (i)<-[r:RESOURCE]-(j) | ||
ON CREATE SET r.firstseen = timestamp() | ||
SET | ||
r.lastupdated = $lastupdated | ||
} | ||
``` | ||
|
||
And that's basically all you need to know to understand how to define your own nodes and relationships using cartography's data objects. For more information, you can view the [object model API documentation](https://github.com/lyft/cartography/blob/master/cartography/graph/model.py) as a reference. | ||
|
||
### Additional concepts | ||
|
||
This section explains cartography's general patterns, conventions, and design decisions. | ||
|
||
#### cartography's `update_tag`: | ||
|
||
`cartography`'s global [config object carries around an `update_tag` property](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/cli.py#L91-L98) | ||
which is a tag/label associated with the current sync. | ||
Cartography's CLI code [sets this to a Unix timestamp of when the CLI was run](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/sync.py#L131-L134). | ||
|
||
All `cartography` intel modules need to set the `lastupdated` property on all nodes and all relationships to this | ||
`update_tag`. You can see a couple examples of this in our | ||
[AWS ingestion code](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/aws/__init__.py#L106) and our | ||
[GCP ingestion code](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/__init__.py#L134). | ||
All `cartography` intel modules set the `lastupdated` property on all nodes and all relationships to this `update_tag`. | ||
|
||
|
||
#### All nodes need these fields | ||
|
@@ -80,21 +209,21 @@ All `cartography` intel modules need to set the `lastupdated` property on all no | |
|
||
When setting an `id`, ensure that you also include the field name that it came from. For example, since we've | ||
decided to use `partial_uri`s as an id for a GCPVpc, we should include both `partial_uri` _and_ `id` on the node. | ||
This way, a user can tell what fields were used to derive the `id`. This is accomplished [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L455-L457). | ||
This way, a user can tell what fields were used to derive the `id`. This is accomplished [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L455-L457) | ||
|
||
- `lastupdated` - See [below](#lastupdated-and-firstseen) on how to set this. | ||
- `firstseen` - See [below](#lastupdated-and-firstseen) on how to set this. | ||
- `lastupdated` - See [below](#lastupdated-and-firstseen) on how this gets set automatically. | ||
- `firstseen` - See [below](#lastupdated-and-firstseen) on how this gets set automatically. | ||
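
As a sketch of the `id` guidance above, a transform step can carry the source field alongside the derived `id`, so both end up on the node. `transform_vpc()` and the sample values are hypothetical and written only for this example.

```python
from typing import Any, Dict


def transform_vpc(raw_vpc: Dict[str, Any]) -> Dict[str, Any]:
    """Keep the field the id was derived from next to the id itself."""
    partial_uri = raw_vpc['partial_uri']
    return {
        'partial_uri': partial_uri,  # the original field, kept for traceability
        'id': partial_uri,           # the id, derived from partial_uri
        'name': raw_vpc.get('name'),
    }


vpc = transform_vpc({
    'partial_uri': 'projects/my-project/global/networks/default',
    'name': 'default',
})
```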
|
||
#### All relationships need these fields | ||
|
||
Cartography currently does not create indexes on relationships, so we should keep relationships lightweight with only these two fields: | ||
Cartography currently does not create indexes on relationships, so in most cases we should keep relationships lightweight with only these two fields: | ||
|
||
- `lastupdated` - See [below](#lastupdated-and-firstseen) on how to set this. | ||
- `firstseen` - See [below](#lastupdated-and-firstseen) on how to set this. | ||
- `lastupdated` - See [below](#lastupdated-and-firstseen) on how this gets set automatically. | ||
- `firstseen` - See [below](#lastupdated-and-firstseen) on how this gets set automatically. | ||
|
||
#### Run queries only on indexed fields for best performance | ||
|
||
In this example of ingesting GCP VPCs, we connect VPCs with GCPProjects | ||
In this older example of ingesting GCP VPCs, we connect VPCs with GCPProjects | ||
[based on GCPProject `id`s and GCPVpc `id`s](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L451). | ||
`id`s are indexed, as seen [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/data/indexes.cypher#L45) | ||
and [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/data/indexes.cypher#L42). | ||
|
@@ -103,52 +232,12 @@ All of these queries use indexes for faster lookup. | |
#### Create an index for new nodes | ||
|
||
Be sure to [update the indexes.cypher file](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/data/indexes.cypher) | ||
with your new node type. Indexing on ID is required, and indexing on anything else that will be frequently queried is | ||
with your new node type. Indexing on "`id`" is required, and indexing on anything else that will be frequently queried is | ||
encouraged. | ||
|
||
#### lastupdated and firstseen | ||
|
||
Set the `lastupdated` and `firstseen` fields on both nodes and relationships. Suppose we are creating | ||
the following chain: | ||
|
||
```cypher | ||
MERGE (n:NodeType)-[r:RELATIONSHIP]->(n2:NodeType2) | ||
``` | ||
|
||
- To handle nodes in this case, | ||
|
||
- Every `MERGE` query that creates a new node should look like this | ||
|
||
```cypher | ||
ON CREATE SET n.firstseen = $UpdateTag | ||
SET | ||
n.lastupdated = $UpdateTag, | ||
node.field1 = $value1, | ||
node.field2 = $value2, | ||
... | ||
node.fieldN = $valueN | ||
``` | ||
|
||
- To handle relationships in this case, | ||
|
||
- Every `MERGE` query that creates a new relationship should look like this | ||
|
||
```cypher | ||
ON CREATE SET r.firstseen = $UpdateTag | ||
SET | ||
r.lastupdated = $UpdateTag | ||
``` | ||
|
||
#### Connecting different node types with the `_attach` pattern | ||
Review comment: I feel like there may be cases where we want to attach after loading. But in the cases I can think of, it's still best to transform the data and use a NodeSchema that properly defines those extra relationships.
Reply: Good point, and yes that's the approach I would encourage. It's also possible to first generate a query using |
||
|
||
Node connections can be complex. In many cases we need to connect many different node types together, so we use an | ||
`_attach` function to manage this. | ||
|
||
The best way to explain `_attach` is through an example, like when [we connect GCP instances to their VPCs](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L439). | ||
In this case, we create a [helper `_attach` function](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L660) | ||
that accepts the instance's `id` and connects the instance to the VPC using a `MERGE` query. | ||
|
||
This pattern can also be seen when [attaching AWS RDS instances to EC2 security groups](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/aws/rds.py#L108). | ||
On every cartography node and relationship, we set the `lastupdated` field to the `update_tag` and the `firstseen` field to `timestamp()` (a built-in Neo4j function that returns the current time in milliseconds since the Unix epoch). This is automatically handled by the cartography object model. | ||
|
||
### Cleanup | ||
|
||
|
Review comment: I think it's worth adding an example here on the `other_relationships`, even if it's only pointing to the test cases we have.
Reply: Added below