Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#349: Generalize graph write queries #1038

Merged
merged 31 commits into from
Dec 16, 2022
Merged

#349: Generalize graph write queries #1038

merged 31 commits into from
Dec 16, 2022

Conversation

achantavy
Copy link
Contributor

@achantavy achantavy commented Nov 28, 2022

Addresses #349, also a good step working toward #1024.

  • Provides module authors with an automated way to load data to the graph so that they don't have to hand write Neo4j queries. Makes the process significantly less error prone by generating queries, and helps with reliability because transactions and batching are abstracted away.
  • Defines schema objects so that we can automate index creation and cleanup job creation in a future PR.
  • Demonstrates this functionality by having the AWS EMR sync adopt it.
  • Allows module authors to connect relationships to nodes using one or more attributes.

Why not use an existing ORM? Existing ones:

  • don't use UNWIND + MERGE for speed and batching
  • don't allow configuring Neo4j driver details such as connection timeouts, which are absolutely necessary if you must send traffic across a load balancer
  • don't run Neo4j managed transaction functions for automatic retries
  • in some cases use their own custom driver (or even require you to run a separate webserver for IPC!) and not the official Neo4j driver

@achantavy achantavy changed the title Generalize graph write queries #349: Generalize graph write queries Nov 28, 2022
.pre-commit-config.yaml Outdated Show resolved Hide resolved
) -> None:
query = """
Copy link
Contributor Author

@achantavy achantavy Nov 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR refactors the EMR sync to demonstrate using the new functionality. Rather than hand write queries, module authors now define schema objects and call build_ingestion_query() and load_graph_data().

Note that the existing EMR integration tests still pass.

Copy link
Collaborator

@ramonpetgrave64 ramonpetgrave64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not yet finished reviewing, and I want to add a helper function to make it a bit easier to create a CartographyNodeSchema . like this, using the types library. But this can come later.

create_node_schema(
  label='EMRCluster',
  properties=create_node_properties(
    arn='ClusterArn',
    auto_terminate='AutoTerminate',
    ...
  ),
  ...
)

cartography/graph/model.py Outdated Show resolved Hide resolved
cartography/graph/model.py Show resolved Hide resolved
cartography/graph/model.py Show resolved Hide resolved
@dataclass
class EMRClusterSchema(CartographyNodeSchema):
label: str = 'EMRCluster'
properties: EMRClusterNodeProperties = EMRClusterNodeProperties()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
properties: EMRClusterNodeProperties = EMRClusterNodeProperties()
properties: CartographyNodeProperties = EMRClusterNodeProperties()

Copy link
Contributor Author

@achantavy achantavy Dec 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we typehint with the specific type and not the generic one?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's helpful just as a clue to devs on what the supertype should be. When editing in an IDE, though, the effect is the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing this we're saying that EMRClusterSchema.properties will accept any CartographyNodeProperties type, which is not correct, so I think we should keep this as-is.

cartography/graph/querybuilder.py Outdated Show resolved Hide resolved
ramonpetgrave64
ramonpetgrave64 previously approved these changes Dec 7, 2022
Copy link
Collaborator

@ramonpetgrave64 ramonpetgrave64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! The type tricks I mentioned can come later.

Co-authored-by: Ramon Petgrave <[email protected]>
Copy link
Collaborator

@ramonpetgrave64 ramonpetgrave64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but that one test case needs to fix.

@achantavy achantavy merged commit 9179f59 into master Dec 16, 2022
@achantavy achantavy deleted the generalizequerywrites branch December 16, 2022 19:39
@achantavy achantavy mentioned this pull request Jan 30, 2023
chandanchowdhury pushed a commit to juju4/cartography that referenced this pull request Jun 26, 2024
…f#1038)

* Build ingest query

* Linter

* Save cleanup query for another PR

* Implement schema

* bump mypy to 0.981 for python/mypy#13398

* linter

* make load_graph_data interface make more sense

* fix comment

* Docs and some better names

* add a todo

* Doc updates, rename some fields

* Fix pre-commit

* Code commment suggestions

Co-authored-by: Ramon Petgrave <[email protected]>

* Stackoverflow comment for clarity)

* Support ingesting only parts of a schema without breaking the others

* Doc comment

* Linter

* Support matching on one or more properties

* Correctly name test

* Change key_refs to TargetNodeMatcher to enforce it as a mandatory field

* Remove use of hacky default_field()

* Support subset of schema relationships for query generation, test multiple node labels

* Docstrings

* Comments in tests

* Better comments

* Test for exception conditions

* Remove irrelevant comment

Co-authored-by: Ramon Petgrave <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants