update scripts making nmdc json documents to exclude key if an empty array for sequencing workflows #24

aclum · 2023-11-10T19:14:24Z

Is your feature request related to a problem? Please describe.
Best practice for mongo is to leave a key out if the list of values is empty. We've noticed documents in mongo which use keys whose values are an empty list.

Describe the solution you'd like
Update the json creation code to exclude keys if the value is an empty array. The following keys have empty arrays in mongo prod:

collection_name,path_to_empty_list
data_object_set,alternative_identifiers
mags_activity_set,mags_list
metagenome_annotation_activity_set,gold_analysis_project_identifiers
read_based_taxonomy_analysis_activity_set,part_of <- this key can be deprecated for this workflow execution activity; it is not being populated and is redundant with was_informed_by. was_informed_by is what is required for the data portal.

Describe alternatives you've considered
Leave data in mongo as is.

Acceptance Criteria
Json documents created by sequencing workflows have no keys where the value is an empty array.

Who will use this feature/enhancement? internal staff and possibly external users
When will they use it? when querying the API or using tools like Studio 3T/Compass or pymongo
How will they use it? queries and data manipulation will be easier
How will they test it to make sure it's working? Eric can test this with code he wrote to check for empty arrays in microbiomedata/nmdc-schema#1306
Is the request achievable? During one sprint? yes, yes
What is your definition of done for this request? See acceptance criteria. Data would/could be cleaned up in mongo prod after the scripts are updated; deleting these keys with empty lists is out of scope for this ticket.
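
A minimal sketch of the requested behavior, assuming the documents are plain Python dicts at the point where they are written out. The helper name `drop_empty_arrays` is hypothetical and is not taken from the workflow scripts; it only illustrates omitting a key when its value is an empty list.

```python
from typing import Any


def drop_empty_arrays(value: Any) -> Any:
    """Return a copy of `value` with every dict key whose value is an empty list removed."""
    if isinstance(value, dict):
        cleaned = {}
        for key, item in value.items():
            item = drop_empty_arrays(item)
            if isinstance(item, list) and not item:
                continue  # omit the key instead of storing []
            cleaned[key] = item
        return cleaned
    if isinstance(value, list):
        # Recurse into list elements so nested documents are cleaned too.
        return [drop_empty_arrays(item) for item in value]
    return value


# Hypothetical document using one of the keys listed above.
doc = {
    "id": "nmdc:dobj-11-k7vny888",
    "alternative_identifiers": [],
    "type": "nmdc:DataObject",
}
cleaned_doc = drop_empty_arrays(doc)
print(cleaned_doc)
```

The function returns a new structure rather than mutating its input, so the original document stays available for comparison.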

mbthornton-lbl · 2023-12-06T21:37:04Z

Underlying cause was incorrect serialization of the output - we are now using the correct serialization from linkml_runtime.dumpers

mbthornton-lbl · 2023-12-06T21:38:58Z

Example after correcting serialization:

[
    {
        "data_object_set": [
            {
                "id": "nmdc:dobj-11-k7vny888",
                "name": "9422.8.132674.GTTTCG.fastq.gz",
                "description": "Raw sequencer read data",
                "file_size_bytes": 2861414297,
                "data_object_type": "Metagenome Raw Reads",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-019yes10",
                "name": "nmdc_wfrqc-11-zma0ys31.1_filtered.fastq.gz",
                "description": "Filtered Reads for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 2571324879,
                "md5_checksum": "7bf778baef033d36f118f8591256d6ef",
                "data_object_type": "Filtered Sequencing Reads",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrqc-11-zma0ys31.1/nmdc_wfrqc-11-zma0ys31.1_filtered.fastq.gz",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-hty12n62",
                "name": "nmdc_wfrqc-11-zma0ys31.1_filterStats.txt",
                "description": "Filtered Stats for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 290,
                "md5_checksum": "b99ce8adc125c95f0bfdadf36a3f6848",
                "data_object_type": "QC Statistics",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrqc-11-zma0ys31.1/nmdc_wfrqc-11-zma0ys31.1_filterStats.txt",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-gast3j11",
                "name": "nmdc_wfmgas-11-3jvymb63.1_contigs.fna",
                "description": "Assembled contigs fasta for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 91134523,
                "md5_checksum": "b96c8e7796616a8eefe473bff2c62e52",
                "data_object_type": "Assembly Contigs",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_contigs.fna",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-bkza5366",
                "name": "nmdc_wfmgas-11-3jvymb63.1_scaffolds.fna",
                "description": "Assembled scaffold fasta for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 90622585,
                "md5_checksum": "6ca496a8b9b298278ad2b4010a7c8cb2",
                "data_object_type": "Assembly Scaffolds",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_scaffolds.fna",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-v9xfxp70",
                "name": "nmdc_wfmgas-11-3jvymb63.1_covstats.txt",
                "description": "Metagenome Contig Coverage Stats for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 14431055,
                "md5_checksum": "19782102f68575b03b7c12dd3d48e840",
                "data_object_type": "Assembly Coverage Stats",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_covstats.txt",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-dz2mw103",
                "name": "nmdc_wfmgas-11-3jvymb63.1_assembly.agp",
                "description": "Assembled AGP file for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 14581247,
                "md5_checksum": "419b294106e3fca4a06d18fd3c8e9181",
                "data_object_type": "Assembly AGP",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_assembly.agp",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-75skzn36",
                "name": "nmdc_wfmgas-11-3jvymb63.1_pairedMapped_sorted.bam",
                "description": "Metagenome Alignment BAM file for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 0,
                "md5_checksum": "d41d8cd98f00b204e9800998ecf8427e",
                "data_object_type": "Assembly Coverage BAM",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfmgas-11-3jvymb63.1/nmdc_wfmgas-11-3jvymb63.1_pairedMapped_sorted.bam",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-ppa5pg23",
                "name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_report.tsv",
                "description": "Gottcha2 TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 13174,
                "md5_checksum": "bc7c1bda004aab357c8f6cf5a42242f9",
                "data_object_type": "GOTTCHA2 Classification Report",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_report.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-0yn4b055",
                "name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_report_full.tsv",
                "description": "Gottcha2 full TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 1035818,
                "md5_checksum": "9481434cadd0d6c154e2ec4c11ef0e04",
                "data_object_type": "GOTTCHA2 Report Full",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_report_full.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-ty0z3p61",
                "name": "nmdc_wfrbt-11-e79d5x03.1_gottcha2_krona.html",
                "description": "Gottcha2 Krona HTML report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 262669,
                "md5_checksum": "6b5bc6ce7f11c1336a5f85a98fc18541",
                "data_object_type": "GOTTCHA2 Krona Plot",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_gottcha2_krona.html",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-e6h68y35",
                "name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_classification.tsv",
                "description": "Centrifuge classification TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 2189843623,
                "md5_checksum": "933c71bbc2f4a2e84d50f0d3864cf940",
                "data_object_type": "Centrifuge Taxonomic Classification",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_classification.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-chgp8k25",
                "name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_report.tsv",
                "description": "Centrifuge TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 260134,
                "md5_checksum": "1a208e2519770ef50740ac39f1b9ba9a",
                "data_object_type": "Centrifuge Classification Report",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_report.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-0wbjqw24",
                "name": "nmdc_wfrbt-11-e79d5x03.1_centrifuge_krona.html",
                "description": "Centrifuge Krona HTML report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 2343980,
                "md5_checksum": "f112a3840464ae7a9cf4a3bf295edd5c",
                "data_object_type": "Centrifuge Krona Plot",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_centrifuge_krona.html",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-xteq6n75",
                "name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_classification.tsv",
                "description": "Kraken classification TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 1785563917,
                "md5_checksum": "7ca01ea379f0baed96f87d1435925f95",
                "data_object_type": "Kraken2 Taxonomic Classification",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_classification.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-1n5y1278",
                "name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_report.tsv",
                "description": "Kraken2 TSV report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 699896,
                "md5_checksum": "c85f2f2b4a518c4adb23970448a5cb45",
                "data_object_type": "Kraken2 Classification Report",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_report.tsv",
                "type": "nmdc:DataObject"
            },
            {
                "id": "nmdc:dobj-11-rtjb8n73",
                "name": "nmdc_wfrbt-11-e79d5x03.1_kraken2_krona.html",
                "description": "Kraken2 Krona HTML report for nmdc:omprc-11-bn8jcq58",
                "file_size_bytes": 4221977,
                "md5_checksum": "94ee1bc2dc74830a21d5c3471d6cf223",
                "data_object_type": "Kraken2 Krona Plot",
                "url": "https://data.microbiomedata.org/data/nmdc:omprc-11-bn8jcq58/nmdc:wfrbt-11-e79d5x03.1/nmdc_wfrbt-11-e79d5x03.1_kraken2_krona.html",
                "type": "nmdc:DataObject"
            }
        ],
        "metagenome_assembly_set": [
            {
                "id": "nmdc:wfmgas-11-3jvymb63.1",
                "name": "Metagenome Assembly Activity for nmdc:omprc-11-bn8jcq58",
                "started_at_time": "2021-10-11T02:28:26Z",
                "ended_at_time": "2021-10-11T04:56:04+00:00",
                "was_informed_by": "nmdc:omprc-11-bn8jcq58",
                "execution_resource": "NERSC-Cori",
                "git_url": "https://github.com/microbiomedata/metaAssembly",
                "has_input": [
                    "nmdc:dobj-11-019yes10"
                ],
                "has_output": [
                    "nmdc:dobj-11-gast3j11",
                    "nmdc:dobj-11-bkza5366",
                    "nmdc:dobj-11-v9xfxp70",
                    "nmdc:dobj-11-dz2mw103",
                    "nmdc:dobj-11-75skzn36"
                ],
                "type": "nmdc:MetagenomeAssembly",
                "part_of": [
                    "nmdc:omprc-11-bn8jcq58"
                ],
                "version": "v1.0.3",
                "asm_score": 6.577,
                "scaffolds": 169645,
                "scaf_logsum": 215363,
                "scaf_powsum": 24422,
                "scaf_max": 68135,
                "scaf_bp": 83496490,
                "scaf_n50": 45550,
                "scaf_n90": 141870,
                "scaf_l50": 470,
                "scaf_l90": 290,
                "scaf_n_gt50k": 1,
                "scaf_l_gt50k": 68135,
                "scaf_pct_gt50k": 0.08160224,
                "contigs": 169784,
                "contig_bp": 83494920,
                "ctg_n50": 45584,
                "ctg_l50": 470,
                "ctg_n90": 141996,
                "ctg_l90": 290,
                "ctg_logsum": 214373,
                "ctg_powsum": 24284,
                "ctg_max": 68135,
                "gap_pct": 0.00188,
                "gc_std": 0.11726,
                "gc_avg": 0.46001
            }
        ],
        "omics_processing_set": [
            {
                "id": "nmdc:omprc-11-bn8jcq58",
                "name": "Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T2_23-Sept-14",
                "description": "Sterilized sand packs were incubated back in the ground and collected at time point T2.",
                "has_input": [
                    "nmdc:bsm-11-qq8s6x03"
                ],
                "add_date": "2015-05-28",
                "gold_sequencing_project_identifiers": [
                    "gold:Gp0115663"
                ],
                "has_output": [
                    "nmdc:dobj-11-k7vny888"
                ],
                "mod_date": "2021-06-15",
                "ncbi_project_name": "Sand microcosm microbial communities from a hyporheic zone in Columbia River, Washington, USA - GW-RW T2_23-Sept-14",
                "omics_type": {
                    "has_raw_value": "Metagenome"
                },
                "part_of": [
                    "nmdc:sty-11-aygzgv51"
                ],
                "principal_investigator": {
                    "has_raw_value": "James Stegen"
                },
                "processing_institution": "JGI",
                "type": "nmdc:OmicsProcessing"
            }
        ],
        "read_qc_analysis_activity_set": [
            {
                "id": "nmdc:wfrqc-11-zma0ys31.1",
                "name": "Read QC Activity for nmdc:omprc-11-bn8jcq58",
                "started_at_time": "2021-10-11T02:28:26Z",
                "ended_at_time": "2021-10-11T04:56:04+00:00",
                "was_informed_by": "nmdc:omprc-11-bn8jcq58",
                "execution_resource": "NERSC-Cori",
                "git_url": "https://github.com/microbiomedata/ReadsQC",
                "has_input": [
                    "nmdc:dobj-11-k7vny888"
                ],
                "has_output": [
                    "nmdc:dobj-11-019yes10",
                    "nmdc:dobj-11-hty12n62"
                ],
                "type": "nmdc:ReadQcAnalysisActivity",
                "part_of": [
                    "nmdc:omprc-11-bn8jcq58"
                ],
                "version": "v1.0.8",
                "input_read_count": 32238374,
                "output_read_count": 30774080,
                "input_read_bases": 4867994474,
                "output_read_bases": 4608772924
            }
        ],
        "read_based_taxonomy_analysis_activity_set": [
            {
                "id": "nmdc:wfrbt-11-e79d5x03.1",
                "name": "Readbased Taxonomy Analysis Activity for nmdc:omprc-11-bn8jcq58",
                "started_at_time": "2021-10-11T02:28:26Z",
                "ended_at_time": "2021-10-11T04:56:04+00:00",
                "was_informed_by": "nmdc:omprc-11-bn8jcq58",
                "execution_resource": "NERSC-Cori",
                "git_url": "https://github.com/microbiomedata/ReadbasedAnalysis",
                "has_input": [
                    "nmdc:dobj-11-019yes10"
                ],
                "has_output": [
                    "nmdc:dobj-11-ppa5pg23",
                    "nmdc:dobj-11-0yn4b055",
                    "nmdc:dobj-11-ty0z3p61",
                    "nmdc:dobj-11-e6h68y35",
                    "nmdc:dobj-11-chgp8k25",
                    "nmdc:dobj-11-0wbjqw24",
                    "nmdc:dobj-11-xteq6n75",
                    "nmdc:dobj-11-1n5y1278",
                    "nmdc:dobj-11-rtjb8n73"
                ],
                "type": "nmdc:ReadBasedTaxonomyAnalysisActivity",
                "part_of": [
                    "nmdc:omprc-11-bn8jcq58"
                ],
                "version": "v1.0.5"
            }
        ]
    }
]

mbthornton-lbl · 2023-12-06T21:40:23Z

@aclum We did not create a separate PR for this issue, but it is fixed, and I believe it can be closed

aclum · 2023-12-06T22:58:00Z

looks good, this can be closed.

aclum · 2024-02-10T00:39:44Z

I'm seeing new records in mongo that have null values, e.g.:
{
"_id": {
"$oid": "65c46ec8bbacb81f5f775562"
},
"id": "nmdc:wfmgan-11-4sc85678.1",
"name": "Metagenome Annotation Analysis Activity for nmdc:wfmgan-11-4sc85678.1",
"started_at_time": "2024-02-07T22:56:21.682913+00:00",
"ended_at_time": "2024-02-08T06:03:41.175440+00:00",
"was_informed_by": "nmdc:omprc-11-9mvz7z22",
"used": null,
"execution_resource": "NERSC-Perlmutter",
"git_url": "https://github.com/microbiomedata/mg_annotation",
"has_input": [
"nmdc:dobj-11-5eb6v689"
],
"type": "nmdc:MetagenomeAnnotationActivity",
"has_output": [
"nmdc:dobj-11-y3f47w18",
"nmdc:dobj-11-e2r3ge57",
"nmdc:dobj-11-7k80qv75",
"nmdc:dobj-11-0z5rhk53",
"nmdc:dobj-11-feses595",
"nmdc:dobj-11-sb28nx57",
"nmdc:dobj-11-vn9pwz37",
"nmdc:dobj-11-9x2zaf16",
"nmdc:dobj-11-9prnyr33",
"nmdc:dobj-11-xx1tb938",
"nmdc:dobj-11-72e7f129",
"nmdc:dobj-11-y563v150",
"nmdc:dobj-11-yawesx56",
"nmdc:dobj-11-za36h087",
"nmdc:dobj-11-vjfcne42",
"nmdc:dobj-11-bsq83730",
"nmdc:dobj-11-x2m4k008",
"nmdc:dobj-11-jre1qx13",
"nmdc:dobj-11-9kpz9641",
"nmdc:dobj-11-k62fk420",
"nmdc:dobj-11-ac70cp72",
"nmdc:dobj-11-ntsb3x16",
"nmdc:dobj-11-pcckzg89"
],
"part_of": [
"nmdc:omprc-11-9mvz7z22"
],
"version": "v1.0.4",
"qc_status": null,
"qc_comment": null,
"has_failure_categorization": [],
"gold_analysis_project_identifiers": []
}
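
A record like the one above can be checked for offending keys with a short script. This is only a sketch under the assumption that the record has been loaded as a plain Python dict; `find_empty_paths` is a hypothetical helper, not the checker referenced in microbiomedata/nmdc-schema#1306.

```python
from typing import Any, Iterator


def find_empty_paths(value: Any, path: str = "") -> Iterator[str]:
    """Yield dotted paths to keys whose value is None or an empty list."""
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else key
            if item is None or (isinstance(item, list) and not item):
                yield child
            else:
                yield from find_empty_paths(item, child)
    elif isinstance(value, list):
        # List elements share the path of the key that holds the list.
        for item in value:
            yield from find_empty_paths(item, path)


# Abbreviated copy of the record above (has_output and other fields omitted).
record = {
    "id": "nmdc:wfmgan-11-4sc85678.1",
    "used": None,
    "has_input": ["nmdc:dobj-11-5eb6v689"],
    "qc_status": None,
    "qc_comment": None,
    "has_failure_categorization": [],
    "gold_analysis_project_identifiers": [],
}
empty_paths = list(find_empty_paths(record))
print(empty_paths)
```

For this record the script reports `used`, `qc_status`, `qc_comment`, `has_failure_categorization`, and `gold_analysis_project_identifiers`.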

aclum · 2024-02-12T18:53:41Z

Closing per Micheal's recommendation in favor of #55

aclum assigned Michal-Babins and mbthornton-lbl Nov 10, 2023

mbthornton-lbl added this to 2023 Squad Sprint 25: December 4 - December 15, 2023 Dec 6, 2023

mbthornton-lbl moved this to In Review in 2023 Squad Sprint 25: December 4 - December 15, 2023 Dec 6, 2023

mbthornton-lbl moved this from In Review to Done in 2023 Squad Sprint 25: December 4 - December 15, 2023 Dec 6, 2023

mbthornton-lbl added this to 2023 Squad Sprint 26: December 18 - December 29, 2023 Dec 18, 2023

mbthornton-lbl moved this to Done in 2023 Squad Sprint 26: December 18 - December 29, 2023 Dec 18, 2023

mbthornton-lbl closed this as completed Dec 18, 2023

aclum reopened this Feb 10, 2024

aclum mentioned this issue Feb 12, 2024

Inconsistent schema validation between /v1/workflows/activities and /metadata/json:validate API endpoints microbiomedata/nmdc-runtime#462

Closed

mbthornton-lbl mentioned this issue Feb 12, 2024

Null values and empty arrays being populated to Database by workflow automation #55

Closed

aclum closed this as completed Feb 12, 2024

aclum unassigned Michal-Babins Oct 1, 2024

aclum mentioned this issue Oct 1, 2024

import automation generating mongo keys for which values are None or an empty list #259

Closed
