Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

696 update dagster op for materialize alldocs #817

Merged
merged 6 commits into from
Dec 13, 2024

Conversation

mkshah605
Copy link
Collaborator

In this branch, I updated materialize_alldocs in the dagster file.

Details

In this PR, I:

  • updated the "materialize_alldocs" in the dagster op to reflect functionality in bulk_validation_referential_integrity_check.ipynb
  • added necessary utils to utils.py
  • replaced pick function to use keyfilter directly

Related issue(s)

Related subsystem(s)

  • Dagster
  • Project documentation (in the docs directory)

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by making sure the ensure_alldocs job runs in my local dagster environment: Succesful Dagster Run

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@mkshah605 mkshah605 self-assigned this Dec 6, 2024
@dwinston dwinston merged commit 1b1a25c into main Dec 13, 2024
2 checks passed
@eecavanna
Copy link
Collaborator

Hi @mkshah605, thanks for implementing this.

I have a question about something in the diff (CC: @dwinston):

            slots_to_include = ["id", "type"] + document_reference_ranged_slots[
                doc_type
            ]
            new_doc = keyfilter(lambda slot: slot in slots_to_include, doc)
            new_doc["_type_and_ancestors"] = schema_view.class_ancestors(doc_type)

Will the documents in the alldocs collection populated via this algorithm have all of the fields they have in the upstream, source-of-truth collections; or will they only have the fields id, type, _type_and_ancestors, and any fields that the schema says can contain references to documents?

@dwinston
Copy link
Collaborator

Hi @mkshah605, thanks for implementing this.

I have a question about something in the diff (CC: @dwinston):

            slots_to_include = ["id", "type"] + document_reference_ranged_slots[
                doc_type
            ]
            new_doc = keyfilter(lambda slot: slot in slots_to_include, doc)
            new_doc["_type_and_ancestors"] = schema_view.class_ancestors(doc_type)

Will the documents in the alldocs collection populated via this algorithm have all of the fields they have in the upstream, source-of-truth collections; or will they only have the fields id, type, _type_and_ancestors, and any fields that the schema says can contain references to documents?

The latter. I verified that the current direct usage of alldocs in the API only seeks document-reference-containing fields.

@eecavanna
Copy link
Collaborator

Thanks for confirming, @dwinston. I will update my mental model (and the one-liner description of it that I have ended up giving during various meetings) accordingly.

@eecavanna eecavanna deleted the 696-update-dagster-op-for-materialize_alldocs branch December 14, 2024 01:04
@PeopleMakeCulture PeopleMakeCulture mentioned this pull request Dec 17, 2024
16 tasks
@shreddd
Copy link
Collaborator

shreddd commented Dec 19, 2024

@dwinston - we should be careful when changing fields in all-docs since other parts of the API depend on it. Let's do a sanity check during one if the sync or squad meetings if there is a potentially breaking change to all-docs format or interface.

@dwinston
Copy link
Collaborator

Gotcha. I'm glad we have a sanity-checking "happy path" automated test for GET /data_objects/study/{study_id} incoming via #725.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants