From 41a7e3589bf81a0c2c6c7fd0e25142e0cea1acf3 Mon Sep 17 00:00:00 2001
From: Alex Chantavy <achantavy@lyft.com>
Date: Tue, 20 Dec 2022 22:48:25 -0800
Subject: [PATCH 1/3] Update dev docs on how to use new object model

---
 docs/root/dev/writing-intel-modules.md | 213 ++++++++++++++++++-------
 1 file changed, 151 insertions(+), 62 deletions(-)

diff --git a/docs/root/dev/writing-intel-modules.md b/docs/root/dev/writing-intel-modules.md
index 680f40bb2d..ce2491031b 100644
--- a/docs/root/dev/writing-intel-modules.md
+++ b/docs/root/dev/writing-intel-modules.md
@@ -1,14 +1,16 @@
 # How to write a new intel module
 
-This doc contains guidelines on creating a Cartography intel module. If you want to add a new data type to Cartography,
-this is the guide for you!  It is fairly straightforward to copy the structure of an existing intel module and test it,
-but we'll share some best practices in this doc to save you some time. We look forward to receiving your PR!
+If you want to add a new data type to Cartography, this is the guide for you. We look forward to receiving your PR!
 
 ## Before getting started...
 
 Read through and follow the setup steps in [the Cartography developer guide](developer-guide.html). Learn the basics of
 running, testing, and linting your code there.
 
+## The fast way
+
+To get started coding without reading this doc, just copy the structure of our [AWS EMR module](https://github.com/lyft/cartography/blob/master/cartography/intel/aws/emr.py) and use it as an example. For a longer written explanation of the "how" and "why", read on.
+
 ## Configuration and credential management
 
 ### Supplying credentials and arguments to your module
@@ -25,7 +27,7 @@ A cartography intel module consists of one `sync` function. `sync` should call `
 
 ### Get
 
-The `get` function [retrieves necessary data](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L98)
+The `get` function [returns data as a list of dicts](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L98)
 from a resource provider API, which is GCP in this particular example.
 
 `get` should be "dumb" in the sense that it should not handle retry logic or data
@@ -33,8 +35,10 @@ manipulation. It should also raise an exception if it's not able to complete suc
 
 ### Transform
 
-The `transform` function [manipulates data](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L193)
-to make it easier to ingest to the graph. We have some best practices on handling transforms:
+The `transform` function [manipulates the list of dicts](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L193)
+to make it easier to ingest to the graph. `transform` functions are sometimes omitted when a module author decides that the output from the `get` is already in the shape that they need.
+
+We have some best practices on handling transforms:
 
 #### Handling required versus optional fields
 
@@ -54,19 +58,144 @@ For the sake of consistency, if a field does not exist, set it to `None` and not
 
 ### Load
 
-The `load` function ingests the processed data to Neo4j, [as seen in this GCP VPC example](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L442).
-There are many best practices to consider here.
+[As seen in our AWS EMR example](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L113-L132), the `load` function ingests a list of dicts to Neo4j by calling [cartography.client.core.tx.load_graph_data()](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/client/core/tx.py#L191-L212):
+```python
+def load_emr_clusters(
+        neo4j_session: neo4j.Session,
+        cluster_data: List[Dict[str, Any]],
+        region: str,
+        current_aws_account_id: str,
+        aws_update_tag: int,
+) -> None:
+    logger.info(f"Loading EMR {len(cluster_data)} clusters for region '{region}' into graph.")
+
+    ingestion_query = build_ingestion_query(EMRClusterSchema())
+
+    load_graph_data(
+        neo4j_session,
+        ingestion_query,
+        cluster_data,
+        lastupdated=aws_update_tag,
+        Region=region,
+        AccountId=current_aws_account_id,
+    )
+
+```
+
+
+`load_graph_data()` requires an `ingestion_query` to be generated from `CartographyNodeSchema` and `CartographyRelSchema` objects. [cartography.graph.querybuilder.build_ingestion_query()](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/querybuilder.py#L312) does just that: it accepts those schema objects as input and returns a well-formed and optimized cypher query.
+
+
+#### Defining a node
+ 
+As an example of a `CartographyNodeSchema`, you can view our [EMRClusterSchema code](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L106-L110):
+
+```python
+@dataclass(frozen=True)
+class EMRClusterSchema(CartographyNodeSchema):
+    label: str = 'EMRCluster'  # The label of the node
+    properties: EMRClusterNodeProperties = EMRClusterNodeProperties()  # An object representing all properties on the EMR Cluster node
+    sub_resource_relationship: EMRClusterToAWSAccount = EMRClusterToAWSAccount()  
+```
+
+An `EMRClusterSchema` object inherits from the `CartographyNodeSchema` class and contains a node label, properties, and connection to its [sub-resource](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228): an `AWSAccount`.
+
+Note that the typehints are necessary for Python dataclasses to work properly.
 
-#### Handling cartography's `update_tag`:
+
+#### Defining node properties
+
+Here's our [EMRClusterNodeProperties code](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L106-L110):
+
+```python
+@dataclass(frozen=True)
+class EMRClusterNodeProperties(CartographyNodeProperties):
+    arn: PropertyRef = PropertyRef('ClusterArn')
+    firstseen: PropertyRef = PropertyRef('firstseen')
+    id: PropertyRef = PropertyRef('Id')
+    # ...
+    lastupdated: PropertyRef = PropertyRef('lastupdated', set_in_kwargs=True)
+    region: PropertyRef = PropertyRef('Region', set_in_kwargs=True)
+    security_configuration: PropertyRef = PropertyRef('SecurityConfiguration')
+```
+
+A `CartographyNodeProperties` object consists of [`PropertyRef`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L37) objects. `PropertyRefs` tell `querybuilder.build_ingestion_query()` where to find appropriate values for each field from the list of dicts.
+
+For example, `id: PropertyRef = PropertyRef('Id')` above tells the querybuilder to set a field called `id` on the `EMRCluster` node using the value located at key `'id'` on each dict in the list.
+
+As another example, `region: PropertyRef = PropertyRef('Region', set_in_kwargs=True)` tells the querybuilder to set a field called `region` on the `EMRCluster` node using a keyword argument called `Region` supplied to `cartography.client.core.tx.load_graph_data()`. `set_in_kwargs=True` is useful in cases where we want every object loaded by a single call to `load_graph_data()` to have the same value for a given attribute.
+
+
+#### Defining relationships
+
+Relationships can be defined on `CartographyNodeSchema` on either their [`sub_resource_relationship`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228) field or their [`other_relationships`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L230-L237) field.
+
+As seen above, an `EMRClusterSchema` only has a single relationship defined: an [`EMRClusterToAWSAccount`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L94-L103):
+
+```python
+@dataclass(frozen=True)
+# (:EMRCluster)<-[:RESOURCE]-(:AWSAccount)
+class EMRClusterToAWSAccount(CartographyRelSchema):
+    target_node_label: str = 'AWSAccount'  # (1)
+    target_node_matcher: TargetNodeMatcher = make_target_node_matcher(  # (2)
+        {'id': PropertyRef('AccountId', set_in_kwargs=True)},
+    )
+    direction: LinkDirection = LinkDirection.INWARD  # (3)
+    rel_label: str = "RESOURCE"  # (4)
+    properties: EMRClusterToAwsAccountRelProperties = EMRClusterToAwsAccountRelProperties()  #  (5)
+```
+
+This class is best described by explaining how it is processed: `build_ingestion_query()` will traverse the `EMRClusterSchema` to its `sub_resource_relationship` field and find the above `EMRClusterToAWSAccount` object. With this information, we know to
+- draw a relationship to an `AWSAccount` node (1) using the label "`RESOURCE`" (4)
+- by matching on the AWSAccount's "`id`" field" (2)
+- where the relationship [directionality](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L12-L34) is pointed _inward_ toward the EMRCluster (3)
+- making sure to define a set of properties for the relationship (5). The [full example RelProperties](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L89-L91) is very short:
+
+```python
+@dataclass(frozen=True)
+class EMRClusterToAwsAccountRelProperties(CartographyRelProperties):
+    lastupdated: PropertyRef = PropertyRef('lastupdated', set_in_kwargs=True)
+```
+
+#### The result
+
+And those are all the objects necessary for this example! The resulting query will look something like this:
+
+```cypher
+UNWIND $DictList AS item
+    MERGE (i:EMRCluster{id: item.Id})
+    ON CREATE SET i.firstseen = timestamp()
+    SET
+        i.lastupdated = $lastupdated,
+        i.arn = item.ClusterArn
+        // ...
+            
+        WITH i, item
+        CALL {
+            WITH i, item
+            
+            OPTIONAL MATCH (j:AWSAccount{id: $AccountId})
+            WITH i, item, j WHERE j IS NOT NULL
+            MERGE (i)<-[r:RESOURCE]-(j)
+            ON CREATE SET r.firstseen = timestamp()
+            SET
+                r.lastupdated = $lastupdated
+        }
+```
+
+And that's basically all you need to know to understand how to define your own nodes and relationships using cartography's data objects. For more information, you can view the [object model API documentation](https://github.com/lyft/cartography/blob/master/cartography/graph/model.py) as a reference.
+
+### Additional concepts
+
+This section explains cartography general patterns, conventions, and design decisions. 
+
+#### cartography's `update_tag`:
 
 `cartography`'s global [config object carries around an `update_tag` property](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/cli.py#L91-L98)
 which is a tag/label associated with the current sync.
 Cartography's CLI code [sets this to a Unix timestamp of when the CLI was run](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/sync.py#L131-L134).
 
-All `cartography` intel modules need to set the `lastupdated` property on all nodes and all relationships to this
-`update_tag`. You can see a couple examples of this in our
-[AWS ingestion code](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/aws/__init__.py#L106) and our
-    [GCP ingestion code](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/__init__.py#L134).
+All `cartography` intel modules set the `lastupdated` property on all nodes and all relationships to this `update_tag`.
 
 
 #### All nodes need these fields
@@ -80,21 +209,21 @@ All `cartography` intel modules need to set the `lastupdated` property on all no
 
     When setting an `id`, ensure that you also include the field name that it came from. For example, since we've
     decided to use `partial_uri`s as an id for a GCPVpc,  we should include both `partial_uri` _and_ `id` on the node.
-    This way, a user can tell what fields were used to derive the `id`. This is accomplished [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L455-L457).
+    This way, a user can tell what fields were used to derive the `id`. This is accomplished [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L455-L457)
 
-- `lastupdated` - See [below](#lastupdated-and-firstseen) on how to set this.
-- `firstseen` - See [below](#lastupdated-and-firstseen) on how to set this.
+- `lastupdated` - See [below](#lastupdated-and-firstseen) on how this gets set automatically.
+- `firstseen` - See [below](#lastupdated-and-firstseen) on how this gets set automatically.
 
 #### All relationships need these fields
 
-Cartography currently does not create indexes on relationships, so we should keep relationships lightweight with only these two fields:
+Cartography currently does not create indexes on relationships, so in most cases we should keep relationships lightweight with only these two fields:
 
-- `lastupdated` - See [below](#lastupdated-and-firstseen) on how to set this.
-- `firstseen` - See [below](#lastupdated-and-firstseen) on how to set this.
+- `lastupdated` - See [below](#lastupdated-and-firstseen) on how this gets set automatically.
+- `firstseen` - See [below](#lastupdated-and-firstseen) on how this gets set automatically.
 
 #### Run queries only on indexed fields for best performance
 
-In this example of ingesting GCP VPCs, we connect VPCs with GCPProjects
+In this older example of ingesting GCP VPCs, we connect VPCs with GCPProjects
 [based on GCPProject `id`s and GCPVpc `id`s](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L451).
 `id`s are indexed, as seen [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/data/indexes.cypher#L45)
 and [here](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/data/indexes.cypher#L42).
@@ -103,52 +232,12 @@ All of these queries use indexes for faster lookup.
 #### Create an index for new nodes
 
 Be sure to [update the indexes.cypher file](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/data/indexes.cypher)
-with your new node type. Indexing on ID is required, and indexing on anything else that will be frequently queried is
+with your new node type. Indexing on "`id`" is required, and indexing on anything else that will be frequently queried is
 encouraged.
 
 #### lastupdated and firstseen
 
-Set the `lastupdated` and `firstseen` fields on both nodes and relationships. Suppose we are creating
-the following chain:
-
-```cypher
-MERGE (n:NodeType)-[r:RELATIONSHIP]->(n2:NodeType2)
-```
-
-- To handle nodes in this case,
-
-    - Every `MERGE` query that creates a new node should look like this
-
-        ```cypher
-        ON CREATE SET n.firstseen = $UpdateTag
-        SET
-        n.lastupdated = $UpdateTag,
-        node.field1 = $value1,
-        node.field2 = $value2,
-        ...
-        node.fieldN = $valueN
-        ```
-
-- To handle relationships in this case,
-
-    - Every `MERGE` query that creates a new relationship should look like this
-
-        ```cypher
-        ON CREATE SET r.firstseen = $UpdateTag
-        SET
-        r.lastupdated = $UpdateTag
-        ```
-
-#### Connecting different node types with the `_attach` pattern
-
-Node connections can be complex. In many cases we need to connect many different node types together, so we use an
-`_attach` function to manage this.
-
-The best way to explain `_attach` is through an example, like when [we connect GCP instances to their VPCs](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L439).
-In this case, we create a [helper `_attach` function](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/gcp/compute.py#L660)
-that accepts the instance's `id` and connects the instance to the VPC using a `MERGE` query.
-
-This pattern can also be seen when [attaching AWS RDS instances to EC2 security groups](https://github.com/lyft/cartography/blob/8d60311a10156cd8aa16de7e1fe3e109cc3eca0f/cartography/intel/aws/rds.py#L108).
+On every cartography node and relationship, we set the `lastupdated` field to the `UPDATE_TAG` and `firstseen` field to `timestamp()` (a built-in Neo4j function equivalent to epoch time in milliseconds). This is automatically handled by the cartography object model.
 
 ### Cleanup
 

From 9c2ec2dd221c6644183cb9da29623a9e24c594b7 Mon Sep 17 00:00:00 2001
From: Alex Chantavy <achantavy@lyft.com>
Date: Tue, 20 Dec 2022 22:53:53 -0800
Subject: [PATCH 2/3] linter

---
 docs/root/dev/writing-intel-modules.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/root/dev/writing-intel-modules.md b/docs/root/dev/writing-intel-modules.md
index ce2491031b..9ad0039d46 100644
--- a/docs/root/dev/writing-intel-modules.md
+++ b/docs/root/dev/writing-intel-modules.md
@@ -87,7 +87,7 @@ def load_emr_clusters(
 
 
 #### Defining a node
- 
+
 As an example of a `CartographyNodeSchema`, you can view our [EMRClusterSchema code](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L106-L110):
 
 ```python
@@ -95,7 +95,7 @@ As an example of a `CartographyNodeSchema`, you can view our [EMRClusterSchema c
 class EMRClusterSchema(CartographyNodeSchema):
     label: str = 'EMRCluster'  # The label of the node
     properties: EMRClusterNodeProperties = EMRClusterNodeProperties()  # An object representing all properties on the EMR Cluster node
-    sub_resource_relationship: EMRClusterToAWSAccount = EMRClusterToAWSAccount()  
+    sub_resource_relationship: EMRClusterToAWSAccount = EMRClusterToAWSAccount()
 ```
 
 An `EMRClusterSchema` object inherits from the `CartographyNodeSchema` class and contains a node label, properties, and connection to its [sub-resource](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228): an `AWSAccount`.
@@ -169,11 +169,11 @@ UNWIND $DictList AS item
         i.lastupdated = $lastupdated,
         i.arn = item.ClusterArn
         // ...
-            
+
         WITH i, item
         CALL {
             WITH i, item
-            
+
             OPTIONAL MATCH (j:AWSAccount{id: $AccountId})
             WITH i, item, j WHERE j IS NOT NULL
             MERGE (i)<-[r:RESOURCE]-(j)
@@ -187,7 +187,7 @@ And that's basically all you need to know to understand how to define your own n
 
 ### Additional concepts
 
-This section explains cartography general patterns, conventions, and design decisions. 
+This section explains cartography general patterns, conventions, and design decisions.
 
 #### cartography's `update_tag`:
 

From 492210927ac3838f49347ec96ddcf5e9530ea42c Mon Sep 17 00:00:00 2001
From: Alex Chantavy <achantavy@lyft.com>
Date: Thu, 22 Dec 2022 14:59:38 -0800
Subject: [PATCH 3/3] example of other_relationships

---
 docs/root/dev/writing-intel-modules.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/root/dev/writing-intel-modules.md b/docs/root/dev/writing-intel-modules.md
index 9ad0039d46..afef157281 100644
--- a/docs/root/dev/writing-intel-modules.md
+++ b/docs/root/dev/writing-intel-modules.md
@@ -128,7 +128,7 @@ As another example, `region: PropertyRef = PropertyRef('Region', set_in_kwargs=T
 
 #### Defining relationships
 
-Relationships can be defined on `CartographyNodeSchema` on either their [`sub_resource_relationship`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228) field or their [`other_relationships`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L230-L237) field.
+Relationships can be defined on `CartographyNodeSchema` on either their [`sub_resource_relationship`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L216-L228) field or their [`other_relationships`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/graph/model.py#L230-L237) field (you can find an example of `other_relationships` [here in our test data](https://github.com/lyft/cartography/blob/4bfafe0e0c205909d119cc7f0bae84b9f6944bdd/tests/data/graph/querybuilder/sample_models/interesting_asset.py#L89-L94)).
 
 As seen above, an `EMRClusterSchema` only has a single relationship defined: an [`EMRClusterToAWSAccount`](https://github.com/lyft/cartography/blob/e6ada9a1a741b83a34c1c3207515a1863debeeb9/cartography/intel/aws/emr.py#L94-L103):