update scripts making nmdc json documents to exclude key if an empty array for sequencing workflows #24
Comments
Underlying cause was incorrect serialization of the output - we are now using the correct serialization from |
Example after correcting serialization: [
{
"data_object_set": [
{
"id": "nmdc:dobj-11-k7vny888",
"name": "9422.8.132674.GTTTCG.fastq.gz",
"description": "Raw sequencer read data",
"file_size_bytes": 2861414297,
"data_object_type": "Metagenome Raw Reads",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-019yes10",
"name": "nmdc_wfrqc-11-zma0ys31.1_filtered.fastq.gz",
"description": "Filtered Reads for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 2571324879,
"md5_checksum": "7bf778baef033d36f118f8591256d6ef",
"data_object_type": "Filtered Sequencing Reads",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrqc-11-zma0ys31.1/nmdc_wfrqc-11-zma0ys31.1_filtered.fastq.gz",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-hty12n62",
"name": "nmdc_wfrqc-11-zma0ys31.1_filterStats.txt",
"description": "Filtered Stats for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 290,
"md5_checksum": "b99ce8adc125c95f0bfdadf36a3f6848",
"data_object_type": "QC Statistics",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrqc-11-zma0ys31.1/nmdc_wfrqc-11-zma0ys31.1_filterStats.txt",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-gast3j11",
"name": "nmdc_wfmgas-11-3jvymb63.1_contigs.fna",
"description": "Assembled contigs fasta for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 91134523,
"md5_checksum": "b96c8e7796616a8eefe473bff2c62e52",
"data_object_type": "Assembly Contigs",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_contigs.fna",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-bkza5366",
"name": "nmdc_wfmgas-11-3jvymb63.1_scaffolds.fna",
"description": "Assembled scaffold fasta for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 90622585,
"md5_checksum": "6ca496a8b9b298278ad2b4010a7c8cb2",
"data_object_type": "Assembly Scaffolds",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_scaffolds.fna",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-v9xfxp70",
"name": "nmdc_wfmgas-11-3jvymb63.1_covstats.txt",
"description": "Metagenome Contig Coverage Stats for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 14431055,
"md5_checksum": "19782102f68575b03b7c12dd3d48e840",
"data_object_type": "Assembly Coverage Stats",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_covstats.txt",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-dz2mw103",
"name": "nmdc_wfmgas-11-3jvymb63.1_assembly.agp",
"description": "Assembled AGP file for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 14581247,
"md5_checksum": "419b294106e3fca4a06d18fd3c8e9181",
"data_object_type": "Assembly AGP",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_assembly.agp",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-75skzn36",
"name": "nmdc_wfmgas-11-3jvymb63.1_pairedMapped_sorted.bam",
"description": "Metagenome Alignment BAM file for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 0,
"md5_checksum": "d41d8cd98f00b204e9800998ecf8427e",
"data_object_type": "Assembly Coverage BAM",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_pairedMapped_sorted.bam",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-ppa5pg23",
"name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_report.tsv",
"description": "Gottcha2 TSV report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 13174,
"md5_checksum": "bc7c1bda004aab357c8f6cf5a42242f9",
"data_object_type": "GOTTCHA2 Classification Report",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_report.tsv",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-0yn4b055",
"name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_report_full.tsv",
"description": "Gottcha2 full TSV report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 1035818,
"md5_checksum": "9481434cadd0d6c154e2ec4c11ef0e04",
"data_object_type": "GOTTCHA2 Report Full",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_report_full.tsv",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-ty0z3p61",
"name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_krona.html",
"description": "Gottcha2 Krona HTML report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 262669,
"md5_checksum": "6b5bc6ce7f11c1336a5f85a98fc18541",
"data_object_type": "GOTTCHA2 Krona Plot",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_krona.html",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-e6h68y35",
"name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_classification.tsv",
"description": "Centrifuge classification TSV report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 2189843623,
"md5_checksum": "933c71bbc2f4a2e84d50f0d3864cf940",
"data_object_type": "Centrifuge Taxonomic Classification",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_classification.tsv",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-chgp8k25",
"name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_report.tsv",
"description": "Centrifuge TSV report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 260134,
"md5_checksum": "1a208e2519770ef50740ac39f1b9ba9a",
"data_object_type": "Centrifuge Classification Report",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_report.tsv",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-0wbjqw24",
"name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_krona.html",
"description": "Centrifuge Krona HTML report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 2343980,
"md5_checksum": "f112a3840464ae7a9cf4a3bf295edd5c",
"data_object_type": "Centrifuge Krona Plot",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_krona.html",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-xteq6n75",
"name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_classification.tsv",
"description": "Kraken classification TSV report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 1785563917,
"md5_checksum": "7ca01ea379f0baed96f87d1435925f95",
"data_object_type": "Kraken2 Taxonomic Classification",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_classification.tsv",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-1n5y1278",
"name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_report.tsv",
"description": "Kraken2 TSV report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 699896,
"md5_checksum": "c85f2f2b4a518c4adb23970448a5cb45",
"data_object_type": "Kraken2 Classification Report",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_report.tsv",
"type": "nmdc:DataObject"
},
{
"id": "nmdc:dobj-11-rtjb8n73",
"name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_krona.html",
"description": "Kraken2 Krona HTML report for nmdc:omprc-11-bn8jcq58",
"file_size_bytes": 4221977,
"md5_checksum": "94ee1bc2dc74830a21d5c3471d6cf223",
"data_object_type": "Kraken2 Krona Plot",
"url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_krona.html",
"type": "nmdc:DataObject"
}
],
"metagenome_assembly_set": [
{
"id": "nmdc:wfmgas-11-3jvymb63.1",
"name": "Metagenome Assembly Activity for nmdc:omprc-11-bn8jcq58",
"started_at_time": "2021-10-11T02:28:26Z",
"ended_at_time": "2021-10-11T04:56:04+00:00",
"was_informed_by": "nmdc:omprc-11-bn8jcq58",
"execution_resource": "NERSC-Cori",
"git_url": "https://github.com/microbiomedata/metaAssembly",
"has_input": [
"nmdc:dobj-11-019yes10"
],
"has_output": [
"nmdc:dobj-11-gast3j11",
"nmdc:dobj-11-bkza5366",
"nmdc:dobj-11-v9xfxp70",
"nmdc:dobj-11-dz2mw103",
"nmdc:dobj-11-75skzn36"
],
"type": "nmdc:MetagenomeAssembly",
"part_of": [
"nmdc:omprc-11-bn8jcq58"
],
"version": "v1.0.3",
"asm_score": 6.577,
"scaffolds": 169645,
"scaf_logsum": 215363,
"scaf_powsum": 24422,
"scaf_max": 68135,
"scaf_bp": 83496490,
"scaf_n50": 45550,
"scaf_n90": 141870,
"scaf_l50": 470,
"scaf_l90": 290,
"scaf_n_gt50k": 1,
"scaf_l_gt50k": 68135,
"scaf_pct_gt50k": 0.08160224,
"contigs": 169784,
"contig_bp": 83494920,
"ctg_n50": 45584,
"ctg_l50": 470,
"ctg_n90": 141996,
"ctg_l90": 290,
"ctg_logsum": 214373,
"ctg_powsum": 24284,
"ctg_max": 68135,
"gap_pct": 0.00188,
"gc_std": 0.11726,
"gc_avg": 0.46001
}
],
"omics_processing_set": [
{
"id": "nmdc:omprc-11-bn8jcq58",
"name": "Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T2_23-Sept-14",
"description": "Sterilized sand packs were incubated back in the ground and collected at time point T2.",
"has_input": [
"nmdc:bsm-11-qq8s6x03"
],
"add_date": "2015-05-28",
"gold_sequencing_project_identifiers": [
"gold:Gp0115663"
],
"has_output": [
"nmdc:dobj-11-k7vny888"
],
"mod_date": "2021-06-15",
"ncbi_project_name": "Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T2_23-Sept-14",
"omics_type": {
"has_raw_value": "Metagenome"
},
"part_of": [
"nmdc:sty-11-aygzgv51"
],
"principal_investigator": {
"has_raw_value": "James Stegen"
},
"processing_institution": "JGI",
"type": "nmdc:OmicsProcessing"
}
],
"read_qc_analysis_activity_set": [
{
"id": "nmdc:wfrqc-11-zma0ys31.1",
"name": "Read QC Activity for nmdc:omprc-11-bn8jcq58",
"started_at_time": "2021-10-11T02:28:26Z",
"ended_at_time": "2021-10-11T04:56:04+00:00",
"was_informed_by": "nmdc:omprc-11-bn8jcq58",
"execution_resource": "NERSC-Cori",
"git_url": "https://github.com/microbiomedata/ReadsQC",
"has_input": [
"nmdc:dobj-11-k7vny888"
],
"has_output": [
"nmdc:dobj-11-019yes10",
"nmdc:dobj-11-hty12n62"
],
"type": "nmdc:ReadQcAnalysisActivity",
"part_of": [
"nmdc:omprc-11-bn8jcq58"
],
"version": "v1.0.8",
"input_read_count": 32238374,
"output_read_count": 30774080,
"input_read_bases": 4867994474,
"output_read_bases": 4608772924
}
],
"read_based_taxonomy_analysis_activity_set": [
{
"id": "nmdc:wfrbt-11-e79d5x03.1",
"name": "Readbased Taxonomy Analysis Activity for nmdc:omprc-11-bn8jcq58",
"started_at_time": "2021-10-11T02:28:26Z",
"ended_at_time": "2021-10-11T04:56:04+00:00",
"was_informed_by": "nmdc:omprc-11-bn8jcq58",
"execution_resource": "NERSC-Cori",
"git_url": "https://github.com/microbiomedata/ReadbasedAnalysis",
"has_input": [
"nmdc:dobj-11-019yes10"
],
"has_output": [
"nmdc:dobj-11-ppa5pg23",
"nmdc:dobj-11-0yn4b055",
"nmdc:dobj-11-ty0z3p61",
"nmdc:dobj-11-e6h68y35",
"nmdc:dobj-11-chgp8k25",
"nmdc:dobj-11-0wbjqw24",
"nmdc:dobj-11-xteq6n75",
"nmdc:dobj-11-1n5y1278",
"nmdc:dobj-11-rtjb8n73"
],
"type": "nmdc:ReadBasedTaxonomyAnalysisActivity",
"part_of": [
"nmdc:omprc-11-bn8jcq58"
],
"version": "v1.0.5"
}
]
}
] |
@aclum We did not create a separate PR for this issue, but it is fixed, and I believe it can be closed |
looks good, this can be closed. |
I'm seeing records in mongo that are new that have null values. |
Closing per Micheal's recommendation in favor of #55 |
Is your feature request related to a problem? Please describe.
Best practice for mongo is to leave a key out if the list of values is empty. We've noticed documents in mongo which use keys whose values are an empty list.
Describe the solution you'd like
Update the JSON creation code to exclude a key if its value is an empty array. The following keys have empty arrays in mongo prod:
collection_name,path_to_empty_list
data_object_set,alternative_identifiers
mags_activity_set,mags_list
metagenome_annotation_activity_set,gold_analysis_project_identifiers
read_based_taxonomy_analysis_activity_set,part_of <- this key can be deprecated for this workflow execution activity; it is not being populated and is redundant with was_informed_by, which is what the data portal requires.
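
A minimal sketch of the requested behavior, assuming the workflow scripts build each record as a plain Python dict before writing JSON. The function name `drop_empty_lists` and the sample record are hypothetical and are not taken from the existing scripts; the sketch only illustrates omitting keys whose value is an empty array.

```python
import json
from typing import Any


def drop_empty_lists(value: Any) -> Any:
    """Return a copy of `value` in which dict keys holding an empty list are omitted.

    Hypothetical helper, not the project's actual serialization code.
    """
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = drop_empty_lists(item)
            # Skip the key entirely when its value is an empty list.
            if isinstance(item, list) and not item:
                continue
            cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        # Recurse into list elements; the list itself is kept.
        return [drop_empty_lists(item) for item in value]
    return value


# Illustrative record: `part_of` is empty and should not be serialized.
record = {
    "id": "nmdc:wfrbt-11-e79d5x03.1",
    "has_input": ["nmdc:dobj-11-019yes10"],
    "part_of": [],
    "type": "nmdc:ReadBasedTaxonomyAnalysisActivity",
}

cleaned_record = drop_empty_lists(record)
print(json.dumps(cleaned_record, indent=2))
```

The helper returns a new structure rather than mutating its input, so the original record stays available for comparison or logging.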
Describe alternatives you've considered
Leave data in mongo as is.
Acceptance Criteria
Json documents created by sequencing workflows have no keys where the value is an empty array.
Who will use this feature/enhancement? internal staff and possibly external users
When will they use it? when querying the API or using tools like Studio 3T, Compass, or pymongo
How will they use it? queries and data manipulation will be easier
How will they test it to make sure it's working? Eric can test this with code he wrote to check for empty arrays in microbiomedata/nmdc-schema#1306
Is the request achievable? During one sprint? yes, yes
What is your definition of done for this request? See acceptance criteria. Data could be cleaned up in mongo prod after the scripts are updated; deleting these keys with empty lists is out of scope for this ticket.
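
The acceptance criteria could be checked with something like the sketch below. It assumes the generated output is loaded as a dict that maps collection names to lists of records, as in the example JSON in this thread. The function `empty_list_paths` and the sample data are hypothetical; this is not the checker referenced in microbiomedata/nmdc-schema#1306.

```python
from typing import Any, Iterator


def empty_list_paths(value: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths of every key whose value is an empty list.

    Hypothetical helper; list indices are not included in the path, so
    the same path is yielded once per offending record.
    """
    if isinstance(value, dict):
        for key, item in value.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(item, list) and not item:
                yield path
            else:
                yield from empty_list_paths(item, path)
    elif isinstance(value, list):
        for item in value:
            yield from empty_list_paths(item, prefix)


# Illustrative data only; the second record id is made up.
database = {
    "data_object_set": [
        {"id": "nmdc:dobj-11-k7vny888", "alternative_identifiers": []},
    ],
    "mags_activity_set": [
        {"id": "nmdc:example-mags-record", "mags_list": []},
    ],
}

offending_paths = sorted(set(empty_list_paths(database)))
for path in offending_paths:
    print(path)
```

An empty result from `empty_list_paths` would indicate that a document meets the acceptance criteria.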