issues with rerunning import automation #332

aclum · 2024-12-10T23:09:43Z

Observed undesirable behavior when rerunning import automation on the same input TSV:

It generates a new data object for the raw data and appends it to has_output on the data_generation_set record instead of checking whether one already exists (a filter would need to match on name).
Values added to has_output did not end up in data_object_set. cc @eecavanna @dwinston
It makes a new workflow execution record with a new ID blade instead of checking whether one already exists. We need to brainstorm how to make this more robust. Options include leveraging alternative_identifiers populated with https://microbiomedata.github.io/nmdc-schema/jgi_portal_analysis_project_identifiers/, which would require development in the staging code, or checking the md5sum on a subset of the data_object_types that should be unique.
We haven't imported new analysis onto an already existing record, but in that case the version should be incremented using the same blade instead of minting a new ID.
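One way to picture the "check if it exists already" behavior described above is the minimal sketch below. It is not the import automation's actual code: the function names, the in-memory list standing in for data_object_set, and the choice to match on md5_checksum plus data_object_type (one of the options floated above) are all assumptions made for illustration.

```python
from typing import Callable


def get_or_create_data_object(
    data_objects: list[dict],
    candidate: dict,
    mint_id: Callable[[], str],
) -> dict:
    """Return the stored data object matching the candidate, or create one.

    Hypothetical helper: `data_objects` is an in-memory stand-in for
    data_object_set, and `mint_id` stands in for whatever mints new IDs.
    """
    for existing in data_objects:
        # Assumed match rule: same checksum and same data object type.
        if (
            existing["md5_checksum"] == candidate["md5_checksum"]
            and existing["data_object_type"] == candidate["data_object_type"]
        ):
            return existing  # rerun case: reuse the record, mint nothing
    created = {**candidate, "id": mint_id()}
    data_objects.append(created)
    return created


def link_output(data_generation: dict, data_object_id: str) -> None:
    """Append an ID to has_output only if it is not already listed."""
    outputs = data_generation.setdefault("has_output", [])
    if data_object_id not in outputs:
        outputs.append(data_object_id)
```

With both checks in place, rerunning on the same input would leave data_object_set and has_output unchanged after the first run.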


aclum · 2024-12-10T23:14:50Z

related to #333

aclum · 2024-12-12T18:37:42Z

Example: this created 12 data objects with the same name when only one data object was expected:
{
"_id": {
"$oid": "65f4a59a5406a79013c683f2"
},
"add_date": "2022-03-25",
"gold_sequencing_project_identifiers": [
"gold:Gp0618275"
],
"has_input": [
"nmdc:bsm-11-6053t429"
],
"id": "nmdc:omprc-12-hgksne68",
"mod_date": "2024-01-07",
"name": "Soil microbial communities from grassland in Ellensburg, Washington, USA - FTA3_TOP",
"ncbi_project_name": "Soil microbial communities from grassland in Ellensburg, Washington, USA - FTA3_TOP",
"principal_investigator": {
"email": "[email protected]",
"has_raw_value": "Nancy Hess",
"name": "Nancy Hess",
"type": "nmdc:PersonValue"
},
"processing_institution": "JGI",
"type": "nmdc:NucleotideSequencing",
"analyte_category": "metagenome",
"associated_studies": [
"nmdc:sty-11-28tm5d36"
],
"instrument_used": [
"nmdc:inst-14-xx07be40"
],
"has_output": [
"nmdc:dobj-11-ngnp0x38",
"nmdc:dobj-11-p9ppyw63",
"nmdc:dobj-11-mt7w5d20",
"nmdc:dobj-11-se8j5y46",
"nmdc:dobj-11-c1vb9b95",
"nmdc:dobj-11-077b5f80",
"nmdc:dobj-11-5wqq9253",
"nmdc:dobj-11-hcdssz60",
"nmdc:dobj-11-dsce6x62",
"nmdc:dobj-11-9gb9sv11",
"nmdc:dobj-11-259sqj14",
"nmdc:dobj-11-ybxtmk16"
],
"data_object_set": [
{
"_id": {
"$oid": "6758976c9fbf1cdbb352c267"
},
"id": "nmdc:dobj-11-mt7w5d20",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758b69b9fbf1cdbb35432de"
},
"id": "nmdc:dobj-11-9gb9sv11",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758af599fbf1cdbb353d29a"
},
"id": "nmdc:dobj-11-5wqq9253",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758a7a89fbf1cdbb3537d8b"
},
"id": "nmdc:dobj-11-se8j5y46",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758a81e9fbf1cdbb353825c"
},
"id": "nmdc:dobj-11-c1vb9b95",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758ae819fbf1cdbb353c805"
},
"id": "nmdc:dobj-11-077b5f80",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "67588ff29fbf1cdbb3526767"
},
"id": "nmdc:dobj-11-p9ppyw63",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758b2e69fbf1cdbb35401e0"
},
"id": "nmdc:dobj-11-hcdssz60",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "675884249fbf1cdbb351d16e"
},
"id": "nmdc:dobj-11-ngnp0x38",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758b4669fbf1cdbb3541504"
},
"id": "nmdc:dobj-11-dsce6x62",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758b7119fbf1cdbb35438a0"
},
"id": "nmdc:dobj-11-259sqj14",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
},
{
"_id": {
"$oid": "6758b7fd9fbf1cdbb354442b"
},
"id": "nmdc:dobj-11-ybxtmk16",
"type": "nmdc:DataObject",
"name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
"description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
"data_object_type": "Metagenome Raw Reads",
"file_size_bytes": {
"$numberLong": "32990463167"
},
"md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
}
]
}
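The duplication in the record above can be confirmed with a short check. This is a sketch only: the function name is hypothetical, and it assumes the record has already been loaded into a Python dict with data_object_set embedded as shown.

```python
from collections import defaultdict


def find_duplicate_data_objects(record: dict) -> dict[tuple[str, str], list[str]]:
    """Group embedded data objects by (name, md5_checksum).

    Returns only the groups with more than one ID, i.e. the duplicates.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for dobj in record.get("data_object_set", []):
        groups[(dobj["name"], dobj["md5_checksum"])].append(dobj["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Applied to the record above, every entry shares the same name and md5_checksum, so all 12 IDs would land in a single group.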

aclum · 2024-12-12T23:49:26Z

At least for now, dumping the calls to a file instead of calling the API will help. I'm still not sure how we got so many data objects.
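A rough sketch of what dumping the calls to a file could look like follows. The function name and the shape of each call (a plain dict) are assumptions for illustration, not the import automation's real interface.

```python
import json
from pathlib import Path


def dump_planned_calls(calls: list[dict], path: Path) -> int:
    """Write each planned API call as one JSON line instead of sending it.

    Returns the number of calls written so a rerun can be compared
    against a previous dump before anything reaches the API.
    """
    with path.open("w", encoding="utf-8") as handle:
        for call in calls:
            handle.write(json.dumps(call, sort_keys=True) + "\n")
    return len(calls)
```

One JSON object per line keeps the dump easy to diff between runs, which may help track down where the extra data objects come from.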

aclum assigned mbthornton-lbl Dec 10, 2024

aclum mentioned this issue Dec 10, 2024

Ingest 1000 soils data generated at JGI microbiomedata/issues#632 (Open, 5 tasks)

aclum added import priority-high labels Dec 12, 2024

mbthornton-lbl added this to 2024 - Sprint 52 - December 16 - 27, 2024 Dec 17, 2024

mbthornton-lbl moved this to In Progress in 2024 - Sprint 52 - December 16 - 27, 2024 Dec 17, 2024

mbthornton-lbl added this to 2024 > 2025 - Sprint 53 - December 30, 2024 - January 10, 2025 Dec 21, 2024

mbthornton-lbl moved this to In Progress in 2024 > 2025 - Sprint 53 - December 30, 2024 - January 10, 2025 Dec 21, 2024

mbthornton-lbl removed this from 2024 - Sprint 52 - December 16 - 27, 2024 Dec 21, 2024
