issues with rerunning import automation #332

Open
aclum opened this issue Dec 10, 2024 · 3 comments

Comments

@aclum
Contributor

aclum commented Dec 10, 2024

Observed undesirable behavior when rerunning import automation on the same input TSV:

  1. It generates a new data object for the raw data and appends it to has_output on the data_generation_set record instead of checking whether one already exists (a filter would need to match on name).
  2. The values added to has_output did not end up in data_object_set. cc @eecavanna @dwinston
  3. It makes a new workflow execution record with a new ID blade instead of checking whether one already exists. We need to brainstorm how to make this more robust. Options include leveraging alternative_identifiers populated with https://microbiomedata.github.io/nmdc-schema/jgi_portal_analysis_project_identifiers/ (which would require development in the staging code) or checking the md5sum on a subset of the data_object_types, which should be unique.
  4. We haven't yet imported new analysis onto an already existing record, but in that case the version should be incremented using the same blade instead of minting a new ID.
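
The existence check described in item 1 could look roughly like the sketch below. It is not the import automation's actual code: `find_existing_data_object`, `get_or_mint_data_object`, and the `mint_id` callable are hypothetical names, and a plain in-memory list stands in for a lookup against data_object_set. Matching on both name and md5_checksum is an assumption that combines the two keys suggested above.

```python
from typing import Callable, Optional


def find_existing_data_object(data_objects: list[dict], name: str, md5: str) -> Optional[dict]:
    """Return the first stored data object with the same name and md5_checksum, if any."""
    for obj in data_objects:
        if obj.get("name") == name and obj.get("md5_checksum") == md5:
            return obj
    return None


def get_or_mint_data_object(
    data_objects: list[dict],
    record: dict,
    candidate: dict,
    mint_id: Callable[[], str],
) -> str:
    """Reuse a matching data object if one exists; otherwise mint one.

    `mint_id` is a hypothetical stand-in for whatever mints a new NMDC ID.
    The data object's ID is added to the record's has_output only once.
    """
    existing = find_existing_data_object(
        data_objects, candidate["name"], candidate["md5_checksum"]
    )
    if existing is None:
        # No match: this is the only path that mints a new ID.
        existing = {**candidate, "id": mint_id()}
        data_objects.append(existing)

    outputs = record.setdefault("has_output", [])
    if existing["id"] not in outputs:
        outputs.append(existing["id"])
    return existing["id"]
```

With a check like this, running the import twice on the same input would leave one data object and one has_output entry. The same lookup-before-mint pattern would apply to workflow execution records in item 3, keyed on whichever identifier is chosen.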
@aclum
Contributor Author

aclum commented Dec 10, 2024

related to #333

@aclum
Contributor Author

aclum commented Dec 12, 2024

Example: this created 12 data objects with the same name when only one data object is expected:
{
  "_id": {
    "$oid": "65f4a59a5406a79013c683f2"
  },
  "add_date": "2022-03-25",
  "gold_sequencing_project_identifiers": [
    "gold:Gp0618275"
  ],
  "has_input": [
    "nmdc:bsm-11-6053t429"
  ],
  "id": "nmdc:omprc-12-hgksne68",
  "mod_date": "2024-01-07",
  "name": "Soil microbial communities from grassland in Ellensburg, Washington, USA - FTA3_TOP",
  "ncbi_project_name": "Soil microbial communities from grassland in Ellensburg, Washington, USA - FTA3_TOP",
  "principal_investigator": {
    "email": "[email protected]",
    "has_raw_value": "Nancy Hess",
    "name": "Nancy Hess",
    "type": "nmdc:PersonValue"
  },
  "processing_institution": "JGI",
  "type": "nmdc:NucleotideSequencing",
  "analyte_category": "metagenome",
  "associated_studies": [
    "nmdc:sty-11-28tm5d36"
  ],
  "instrument_used": [
    "nmdc:inst-14-xx07be40"
  ],
  "has_output": [
    "nmdc:dobj-11-ngnp0x38",
    "nmdc:dobj-11-p9ppyw63",
    "nmdc:dobj-11-mt7w5d20",
    "nmdc:dobj-11-se8j5y46",
    "nmdc:dobj-11-c1vb9b95",
    "nmdc:dobj-11-077b5f80",
    "nmdc:dobj-11-5wqq9253",
    "nmdc:dobj-11-hcdssz60",
    "nmdc:dobj-11-dsce6x62",
    "nmdc:dobj-11-9gb9sv11",
    "nmdc:dobj-11-259sqj14",
    "nmdc:dobj-11-ybxtmk16"
  ],
  "data_object_set": [
    {
      "_id": {
        "$oid": "6758976c9fbf1cdbb352c267"
      },
      "id": "nmdc:dobj-11-mt7w5d20",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758b69b9fbf1cdbb35432de"
      },
      "id": "nmdc:dobj-11-9gb9sv11",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758af599fbf1cdbb353d29a"
      },
      "id": "nmdc:dobj-11-5wqq9253",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758a7a89fbf1cdbb3537d8b"
      },
      "id": "nmdc:dobj-11-se8j5y46",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758a81e9fbf1cdbb353825c"
      },
      "id": "nmdc:dobj-11-c1vb9b95",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758ae819fbf1cdbb353c805"
      },
      "id": "nmdc:dobj-11-077b5f80",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "67588ff29fbf1cdbb3526767"
      },
      "id": "nmdc:dobj-11-p9ppyw63",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758b2e69fbf1cdbb35401e0"
      },
      "id": "nmdc:dobj-11-hcdssz60",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "675884249fbf1cdbb351d16e"
      },
      "id": "nmdc:dobj-11-ngnp0x38",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758b4669fbf1cdbb3541504"
      },
      "id": "nmdc:dobj-11-dsce6x62",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758b7119fbf1cdbb35438a0"
      },
      "id": "nmdc:dobj-11-259sqj14",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    },
    {
      "_id": {
        "$oid": "6758b7fd9fbf1cdbb354442b"
      },
      "id": "nmdc:dobj-11-ybxtmk16",
      "type": "nmdc:DataObject",
      "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
      "description": "Metagenome Raw Reads for nmdc:omprc-12-hgksne68",
      "data_object_type": "Metagenome Raw Reads",
      "file_size_bytes": {
        "$numberLong": "32990463167"
      },
      "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c"
    }
  ]
}
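
Duplicates like the ones in the record above can be found by grouping data objects on name and md5_checksum. The following is a minimal sketch under that assumption: `find_duplicate_data_objects` is a hypothetical helper, the first two sample entries reuse IDs from the example, and the third entry is made up to show a non-duplicate.

```python
from collections import defaultdict


def find_duplicate_data_objects(data_objects: list[dict]) -> dict[tuple, list[str]]:
    """Group data object IDs by (name, md5_checksum); keep groups with more than one ID."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for obj in data_objects:
        key = (obj.get("name"), obj.get("md5_checksum"))
        groups[key].append(obj["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Two entries taken from the example above, plus one made-up entry.
sample = [
    {
        "id": "nmdc:dobj-11-mt7w5d20",
        "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
        "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c",
    },
    {
        "id": "nmdc:dobj-11-9gb9sv11",
        "name": "52710.1.424012.TACACGCT-TACACGCT.fastq.gz",
        "md5_checksum": "076e1ce8b016543dc97f4bdad90a376c",
    },
    {
        "id": "nmdc:dobj-00-hypothetical",  # not a real ID, for illustration only
        "name": "some_other_file.fastq.gz",
        "md5_checksum": "00000000000000000000000000000000",
    },
]

for (name, md5), ids in find_duplicate_data_objects(sample).items():
    print(f"{len(ids)} data objects share {name} ({md5}): {ids}")
```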

@aclum
Contributor Author

aclum commented Dec 12, 2024

At least for now, dumping the calls to a file instead of calling the API will help. I'm still not sure how we ended up with so many data objects.
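
One way to dump the calls to a file is a dry-run switch around the submission step. This is only a sketch of the idea, not how the import automation is structured: `dump_call`, `submit`, and the `api_call` callable are hypothetical names, and the payload shape is assumed to be a JSON-serializable dict.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def dump_call(payload: dict, out_dir: Path, label: str) -> Path:
    """Write the payload that would have been sent to the API to <out_dir>/<label>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{label}.json"
    path.write_text(json.dumps(payload, indent=2, sort_keys=True))
    return path


def submit(
    payload: dict,
    out_dir: Path,
    label: str,
    api_call: Optional[Callable[[dict], object]] = None,
    dry_run: bool = True,
):
    """Dump the payload to a file in dry-run mode; otherwise pass it to api_call.

    `api_call` is a hypothetical stand-in for the real API client function.
    Dry-run is the default, so nothing is sent unless explicitly requested.
    """
    if dry_run or api_call is None:
        return dump_call(payload, out_dir, label)
    return api_call(payload)
```

The dumped files can then be reviewed for duplicates before anything is sent.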
