Update bulk_validation_referential_integrity_check notebook to concur with refscan (no false positives) #796

Merged: 8 commits, Nov 26, 2024

Conversation

dwinston
Collaborator

In this branch, I updated docs/nb/bulk_validation_referential_integrity_check.ipynb so that, when a LinkML slot has any_of ranges, the notebook uses the union of those ranges in preference to the slot's range value (cf. refscan's get_names_of_classes_in_effective_range_of_slot). I also dropped concurrent.futures.ThreadPoolExecutor, because using it produced inconsistent collections of errors from run to run. Any efficiency tuning needed will be raised as a separate issue.
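For readers unfamiliar with the any_of-over-range preference, here is a minimal sketch of the idea. It is not the notebook's or refscan's code: it assumes a slot is available as a plain dict shaped like a LinkML slot definition, the function name is hypothetical, and it does not expand classes to their descendants.

```python
from typing import Any


def names_of_classes_in_effective_range(slot: dict[str, Any]) -> set[str]:
    """Return the class names a slot may reference.

    Assumption (illustrative only): `slot` is a plain dict such as
    {"range": "NamedThing", "any_of": [{"range": "Biosample"}, ...]}.
    """
    # Collect every range named by an any_of expression, if there are any.
    any_of_ranges = {
        expr["range"] for expr in slot.get("any_of", []) if expr.get("range")
    }
    # Prefer the union of any_of ranges; fall back to the plain `range`.
    if any_of_ranges:
        return any_of_ranges
    return {slot["range"]} if slot.get("range") else set()
```

With this rule, a slot whose range is a broad class but whose any_of lists narrower classes is checked against the narrower classes only.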

Details

The notebook now returns 33 "not found" errors and zero "invalid type" errors when run with nmdc-schema v11.1.0 against /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-prod_2024-11-25_20-12-02/nmdc, in alignment with the corresponding refscan report, refscan_report.20241126_083003_UTC.schema_v11.1.0.nmdc.violations.tsv.
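To make the two error kinds concrete, here is a rough sketch of a sequential (single-threaded) reference check over in-memory data. Everything in it is an assumption for illustration: the function name, the tuple layout of an error, and the shape of the inputs are hypothetical and are not taken from the notebook.

```python
from collections import Counter


def check_references(
    docs: list[dict],
    id_to_types: dict[str, set[str]],
    slot_ranges: dict[str, set[str]],
) -> list[tuple[str, str, str, str]]:
    """Return (source_id, slot_name, target_id, error_kind) tuples.

    Assumptions (hypothetical shapes):
    - each doc has an "id" plus zero or more referencing slots;
    - id_to_types maps each known id to its class name and ancestors;
    - slot_ranges maps a slot name to its effective range of class names.
    """
    errors = []
    # A plain loop visits documents in a fixed order, so repeated runs
    # over the same input collect the same errors in the same order.
    for doc in docs:
        for slot_name, allowed in slot_ranges.items():
            values = doc.get(slot_name, [])
            if not isinstance(values, list):
                values = [values]
            for target_id in values:
                if target_id not in id_to_types:
                    errors.append((doc["id"], slot_name, target_id, "not found"))
                elif not (id_to_types[target_id] & allowed):
                    errors.append((doc["id"], slot_name, target_id, "invalid type"))
    return errors


def tally(errors: list[tuple[str, str, str, str]]) -> Counter:
    """Count errors by kind, e.g. to compare totals against another report."""
    return Counter(kind for _, _, _, kind in errors)
```

Tallying by kind gives the sort of totals quoted above, which can then be compared against the row count of a refscan violations file.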

...

Related issue(s)

Fixes #576

...

Related subsystem(s)

  • Runtime API (except the Minter)
  • Minter
  • Dagster
  • Project documentation (in the docs directory)
  • Translators (metadata ingest pipelines)
  • MongoDB migrations
  • Other

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by...

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate
  • Other (explain below)

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@dwinston merged commit d7da63f into main on Nov 26, 2024
1 check passed
@dwinston
Collaborator Author

Merged in order to focus reviewer bandwidth on #696. As is, this PR does not "operationalize" anything.

Comment on lines 103 to 107
 ssh -i ~/.ssh/nersc ${NERSC_USERNAME}@dtn01.nersc.gov \
-'tar -czf - -C /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-prod_2024-07-29_20-12-07/nmdc .' \
+'tar -czf - -C /global/cfs/projectdirs/m3408/nmdc-mongodumps/dump_nmdc-prod_2024-11-25_20-12-02/nmdc .' \
 | tar -xzv -C /tmp/remote-mongodump/nmdc
 mongorestore -v -h localhost:27018 -u admin -p root --authenticationDatabase=admin \
 --drop --nsInclude='nmdc.*' --dir /tmp/remote-mongodump
Collaborator

FYI, @dwinston, as of about 24 hours ago (via this PR), the daily dumps are being created with mongodump's --gzip CLI option; before that, the option was not used. So the mongorestore command here may need a change (e.g. adding its own --gzip CLI option), assuming you're restoring a dump created after that change.

Collaborator Author

Interestingly, mongorestore works without an explicit --gzip. Smart.
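If someone would rather pass --gzip explicitly than rely on the behavior observed above, one option is to look for compressed dump files before building the command. The sketch below is only an illustration: the helper name is hypothetical, it assumes a gzip-enabled mongodump writes files ending in .bson.gz, and it reuses the host and credentials from the snippet under review. It builds the argument list but does not run anything.

```python
from pathlib import Path


def build_mongorestore_args(dump_dir: str) -> list[str]:
    """Build a mongorestore argument list, adding --gzip only when needed.

    Assumption: compressed dumps contain files ending in ".bson.gz".
    """
    args = [
        "mongorestore", "-v", "-h", "localhost:27018",
        "-u", "admin", "-p", "root", "--authenticationDatabase=admin",
        "--drop", "--nsInclude=nmdc.*", "--dir", dump_dir,
    ]
    # Search the dump directory recursively for any gzip-compressed BSON file.
    has_gzipped_bson = next(Path(dump_dir).rglob("*.bson.gz"), None) is not None
    if has_gzipped_bson:
        args.insert(1, "--gzip")
    return args
```

The returned list can be handed to subprocess.run as-is, which also avoids shell quoting issues around the nsInclude pattern.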

Development

Successfully merging this pull request may close these issues.

False positive type errors from alldocs ref integrity code
3 participants