Code for NMDC to NCBI export #518
Where did you get those checklists? I may want to use them in the future.
@turbomam it's the GitHub PR template in this repo: https://github.com/microbiomedata/nmdc-runtime/blob/main/.github/pull_request_template.md
CC: @pkalita-lbl if I could get your eyes on the code that I'm writing as part of this PR for a quick glance/review that would be super helpful! 🙏🏼
attribute_mappings, slot_range_mappings = load_mappings(
    self.nmdc_ncbi_attribute_mapping_file_url
)
I would try to not tie the process of obtaining the mapping directly into this class. Let this class be just about transformation. Let the client of the class worry about how to get the mapping and pass it (the mapping itself, not a URL) to this class.
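One way to read this suggestion, as a minimal sketch: the class receives already-loaded mappings in its constructor, and the caller decides where they come from. `NCBISubmissionTranslator`, `translate_attributes`, and the sample mapping values are hypothetical and not the PR's code; only the names `attribute_mappings` and `slot_range_mappings` come from the snippet under review.

```python
from typing import Dict


class NCBISubmissionTranslator:
    """Hypothetical stand-in for the class under review."""

    def __init__(
        self,
        attribute_mappings: Dict[str, str],
        slot_range_mappings: Dict[str, str],
    ) -> None:
        # The mappings arrive ready to use; this class never fetches them.
        self.attribute_mappings = attribute_mappings
        self.slot_range_mappings = slot_range_mappings

    def translate_attributes(self, biosample: Dict[str, str]) -> Dict[str, str]:
        """Rename NMDC slots to NCBI attribute names, skipping unmapped slots."""
        return {
            self.attribute_mappings[slot]: value
            for slot, value in biosample.items()
            if slot in self.attribute_mappings
        }


# The caller owns loading (from a URL, a file, or a test fixture).
# These mapping values are made up for illustration.
translator = NCBISubmissionTranslator(
    attribute_mappings={"depth": "depth", "env_broad_scale": "env_broad_scale"},
    slot_range_mappings={"depth": "QuantityValue"},
)
translated = translator.translate_attributes({"depth": "5 m", "id": "nmdc:bsm-1"})
```

With this shape, a unit test can hand the class a small in-memory mapping without any network access.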
from nmdc_runtime.site.export.nmdc_api_client import NMDCApiClient


class NCBISubmissionXML:
Some quibbles about naming. `NCBISubmissionXML` feels more representative of what this class produces. So what is the class itself? Something like `NCBISubmissionGenerator` or `NCBISubmissionTranslator`?
Also this class has a lot of methods named `set_*`. When I see a method with such a name I expect it to set some piece of internal class state and return nothing. Instead, some of these methods (`set_element`, `set_descriptor`) produce XML elements and return them. Others (`set_description`, `set_bioproject`, `set_biosample`) produce XML elements and append them to the XML root. I think this code might be a bit more readable if all those methods were named something like `build_*` or `generate_*` and always returned what they produced. Then the caller (`get_submission_xml`) would be responsible for taking the return values and using them as needed (appending to the root element).
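A rough sketch of that naming convention, using the standard library's `xml.etree.ElementTree`. The method signatures and the element names (`Submission`, `Description`, `Action`, and so on) are illustrative assumptions, not the PR's actual code and not the NCBI schema. Every `build_*` method returns what it built, and only `get_submission_xml` touches the root.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


class NCBISubmissionGenerator:
    """Hypothetical sketch; element names are illustrative only."""

    def build_element(
        self,
        tag: str,
        text: Optional[str] = None,
        attrib: Optional[Dict[str, str]] = None,
    ) -> ET.Element:
        # Build and return a single element; nothing is attached to a root here.
        element = ET.Element(tag, attrib or {})
        if text is not None:
            element.text = text
        return element

    def build_description(self, email: str, org: str) -> ET.Element:
        description = self.build_element("Description")
        description.append(self.build_element("Comment", f"Submitted by {org}"))
        description.append(self.build_element("Contact", attrib={"email": email}))
        return description

    def build_bioproject(self, title: str) -> ET.Element:
        action = self.build_element("Action")
        action.append(self.build_element("Title", title))
        return action

    def get_submission_xml(self, email: str, org: str, title: str) -> str:
        # The caller decides what to do with each returned element.
        root = ET.Element("Submission")
        root.append(self.build_description(email, org))
        root.append(self.build_bioproject(title))
        return ET.tostring(root, encoding="unicode")


submission_xml = NCBISubmissionGenerator().get_submission_xml(
    "someone@example.org", "NMDC", "Example study"
)
```

Because each builder returns its element, each one can be tested in isolation without constructing a whole submission.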
nmdc_runtime/site/export/ncbi_xml.py
Outdated
# ============= Uncomment the following code to validate the XML against NCBI XSDs ============ #
# submission_xsd_url = "https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/common/submission.xsd?view=co"
# submission_xsd_validation = validate_xml(submission_xml, submission_xsd_url)

# bioproject_xsd_url = "https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/common/bioproject.xsd?view=co"
# bioproject_xsd_validation = validate_xml(submission_xml, bioproject_xsd_url)

# biosample_xsd_url = "https://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/submit/public-docs/common/biosample.xsd?view=co"
# biosample_xsd_validation = validate_xml(submission_xml, biosample_xsd_url)
I wouldn't leave commented-out code like this in. But I would turn these into unit tests.
Add some logic to skip records, or raise an error, if biosample metadata already exists on `biosample_set`:
`'insdc_biosample_identifiers': {$exists: true, $not: {$size: 0}}`
The same would be true for `insdc_experiment_identifiers`, which can currently be on OmicsProcessing. In Berkeley they can be on DataGeneration or DataObject.
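A minimal sketch of the "skip" variant of this check, applied to records already loaded into memory. The helper names and the sample records are hypothetical. The in-memory test only approximates the MongoDB filter, under the assumption that the slot is list-valued.

```python
from typing import Any, Dict, Iterable, List, Tuple

# The MongoDB filter from the comment above, written as a Python dict.
ALREADY_SUBMITTED_FILTER = {
    "insdc_biosample_identifiers": {"$exists": True, "$not": {"$size": 0}}
}


def has_insdc_identifiers(
    record: Dict[str, Any], field: str = "insdc_biosample_identifiers"
) -> bool:
    """Return True if the field is present and holds a non-empty list."""
    value = record.get(field)
    return isinstance(value, list) and len(value) > 0


def split_by_submission_status(
    records: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate records still to be exported from those that already carry identifiers."""
    to_submit: List[Dict[str, Any]] = []
    already_submitted: List[Dict[str, Any]] = []
    for record in records:
        if has_insdc_identifiers(record):
            already_submitted.append(record)
        else:
            to_submit.append(record)
    return to_submit, already_submitted


# Made-up records for illustration.
biosamples = [
    {"id": "nmdc:bsm-1"},
    {"id": "nmdc:bsm-2", "insdc_biosample_identifiers": []},
    {"id": "nmdc:bsm-3", "insdc_biosample_identifiers": ["biosample:SAMN00000001"]},
]
to_submit, already_submitted = split_by_submission_status(biosamples)
```

Raising an error instead of skipping would be a one-line change in the loop. The same helper can be pointed at `insdc_experiment_identifiers` through the `field` argument.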
@pkalita-lbl I think this code is good to go for the most part. I've addressed most of your code review comments in this PR, but I haven't addressed all of them. I was hoping to get this PR merged because I want to keep it from growing any further and becoming more painful to review.
)
def ncbi_submission_xml_asset(context: OpExecutionContext, data: str):
    filename = "ncbi_submission.xml"
    file_path = os.path.join(context.instance.storage_directory(), filename)
I haven't seen this `storage_directory` function before. Is there any kind of documentation on it? I can't seem to find any.
Good catch again!
Some context on what I'm trying to do/experiment with here. A hard requirement from @aclum was that we should be able to download the content (in this case, the XML rendered through the Dagit interface) as a file on our local system.
`storage_directory()` is part of https://docs.dagster.io/deployment/dagster-instance#local-artifact-storage, which we can use to store files. Those files then need to be "served up" through Dagit somehow. I'm still trying to iron out the kinks with respect to that detail, but do let me know if you have any ideas.
There's still some room for clean up but I think this is good as a first pass
Description
Code that creates NCBI submission.xml using an NMDC slot-to-NCBI BioSample Attribute mapping file.
Fixes #503
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Tests are still to come.
Configuration Details: none
Checklist:
`black nmdc_runtime/`?)
`docs/` and in https://github.com/microbiomedata/NMDC_documentation/?)
`make up-test && make test-run`)