Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code for NMDC to NCBI export #518

Merged
merged 30 commits into from
Jun 25, 2024
Merged

Code for NMDC to NCBI export #518

merged 30 commits into from
Jun 25, 2024

Conversation

sujaypatil96
Copy link
Collaborator

@sujaypatil96 sujaypatil96 commented May 8, 2024

Description

Code that creates NCBI submission.xml using an NMDC slot-to-NCBI BioSample Attribute mapping file.

Fixes #503

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Tests are still to come.

Configuration Details: none

Checklist:

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
  • I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
  • New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

@turbomam
Copy link
Member

turbomam commented May 8, 2024

where did you get those checklists? I may want to use them in the future

nmdc_runtime/site/export/ncbi_xml.py looks like you're following good practices!

@sujaypatil96
Copy link
Collaborator Author

@sujaypatil96
Copy link
Collaborator Author

CC: @pkalita-lbl if I could get your eyes on the code that I'm writing as part of this PR for a quick glance/review that would be super helpful! 🙏🏼

@sujaypatil96 sujaypatil96 requested a review from aclum May 16, 2024 00:53
nmdc_runtime/site/export/nmdc_api_client.py Outdated Show resolved Hide resolved
nmdc_runtime/site/export/ncbi_xml.py Show resolved Hide resolved
Comment on lines +152 to +154
attribute_mappings, slot_range_mappings = load_mappings(
self.nmdc_ncbi_attribute_mapping_file_url
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to not tie the process of obtaining the mapping directly into this class. Let this class be just about transformation. Let the client of the class worry about how to get the mapping and pass it (the mapping itself, not a URL) to this class.

from nmdc_runtime.site.export.nmdc_api_client import NMDCApiClient


class NCBISubmissionXML:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quibbles about naming. NCBISubmissionXML feels more representative of what this class produces. So what is the class itself? Something like NCBISubmissionGenerator or NCBISubmissionTranslator?

Also this class has a lot of methods named set_*. When I see a method with such a name I expect it to set some piece of internal class state and return nothing. Instead, some of these methods (set_element, set_descriptor) produce XML elements and return them. Other (set_description, set_bioproject, set_biosample) produce XML elements and append them to the XML root. I think this code might be a bit more readable if all those methods were named something like build_* or generate_* and always return what they produced. Then the caller (get_submission_xml) would be responsible for taking the return values and use them as needed (appending to the root element).

nmdc_runtime/site/export/ncbi_xml.py Outdated Show resolved Hide resolved
nmdc_runtime/site/export/ncbi_xml_utils.py Outdated Show resolved Hide resolved
nmdc_runtime/site/ops.py Outdated Show resolved Hide resolved
requirements/main.in Outdated Show resolved Hide resolved
requirements/dev.in Outdated Show resolved Hide resolved
Comment on lines 285 to 293
# ============= Uncomment the following code to validate the XML against NCBI XSDs ============ #
# submission_xsd_url = "https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/common/submission.xsd?view=co"
# submission_xsd_validation = validate_xml(submission_xml, submission_xsd_url)

# bioproject_xsd_url = "https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/common/bioproject.xsd?view=co"
# bioproject_xsd_validation = validate_xml(submission_xml, bioproject_xsd_url)

# biosample_xsd_url = "https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/common/biosample.xsd?view=co"
# biosample_xsd_validation = validate_xml(submission_xml, biosample_xsd_url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't leave commented-out code like this in. But I would turn these into unit tests.

@sujaypatil96 sujaypatil96 marked this pull request as ready for review June 4, 2024 23:31
Copy link
Contributor

@aclum aclum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some logic to skip records or error if biosample metadata exists already on biosample_set
'insdc_biosample_identifiers':{$exists:true,$not: {$size: 0}}
Same would be true for insdc_experiment_identifiers, the can currently be on OmicsProcessing. In Berkeley they can be on DataGeneration or DataObject.

@sujaypatil96
Copy link
Collaborator Author

@pkalita-lbl i think this code is good to go for the most part. I've addressed most of your code review comments in this PR, but i haven't addressed all. I was hoping to get this PR merged in because I want to avoid it from growing any more and becoming increasingly more painful to review.

nmdc_runtime/site/export/ncbi_xml.py Show resolved Hide resolved
nmdc_runtime/site/export/ncbi_xml.py Outdated Show resolved Hide resolved
nmdc_runtime/site/export/ncbi_xml.py Outdated Show resolved Hide resolved
)
def ncbi_submission_xml_asset(context: OpExecutionContext, data: str):
filename = "ncbi_submission.xml"
file_path = os.path.join(context.instance.storage_directory(), filename)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen this storage_directory function before. Is there any kind of documentation on it? I can't seem to find any.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch again!

So some context as to what I'm trying to do/experiment with here. A hard requirement that came from @aclum was that we should be able to download the content, in this case, XML content that is rendered through the Dagit interface as a file on our local system.

This storage_directory() is part of https://docs.dagster.io/deployment/dagster-instance#local-artifact-storage which we can use to store files. Then those files need to be allowed/able to be "served up" through Dagit somehow. I'm still trying to iron out the kinks with respect to that detail, but do let me know if you have any ideas.

Copy link
Collaborator

@pkalita-lbl pkalita-lbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still some room for clean up but I think this is good as a first pass

@sujaypatil96 sujaypatil96 merged commit dca5a70 into main Jun 25, 2024
2 checks passed
@PeopleMakeCulture PeopleMakeCulture deleted the issue-503 branch July 10, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Develop NMDC to NCBI export code
4 participants