Identify Mongo documents containing values that are empty lists (`[]`)
#1306
Comments
Transferring to nmdc-schema per conversations with Mark. The suggestion was to set a minimum cardinality in the schema. This needs to be coordinated with the workflows so that the code that generates submissions gets updated. @corilo @Michal-Babins @mbthornton-lbl
@eecavanna offered to run a check in Python to see which slots are populated with empty lists so we can reach out to the folks that write these documents.
This API URL shows an example of a record with an asserted empty list:
```json
{
  "id": "emsl:739472",
  "name": "Brodie_158_MeOH_R3_23Mar19_HESI_Neg",
  "description": "High resolution MS spectra only",
  "has_input": [
    "igsn:IEWFS000S"
  ],
  "has_output": [
    "emsl:output_739472"
  ],
  "part_of": [
    "gold:Gs0135149"
  ],
  "instrument_name": "21T_Agilent",
  "omics_type": {
    "has_raw_value": "Organic Matter Characterization"
  },
  "processing_institution": "EMSL",
  "type": "nmdc:OmicsProcessing",
  "gold_sequencing_project_identifiers": []
}
```
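For context, here is a minimal pymongo sketch (not part of the original comment) of how records like the one above could be located; the connection string and database name are assumptions:

```python
from pymongo import MongoClient

# A sketch only: the connection string and database name below are assumptions.
client = MongoClient("mongodb://localhost:27017")
db = client["nmdc"]

# Find omics_processing_set documents whose gold_sequencing_project_identifiers
# value is an empty array, using MongoDB's $size operator.
cursor = db["omics_processing_set"].find(
    {"gold_sequencing_project_identifiers": {"$size": 0}},
    {"id": 1, "gold_sequencing_project_identifiers": 1},
)
for document in cursor:
    print(document["id"])
```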
That doesn't make it into the YAML output of the … Therefore, the nmdc-schema repo isn't ready to check for this sort of thing right now.
@turbomam does this mean we should convert this back to an nmdc-runtime issue? Can the PyPI package check this?
I want to clarify the requirements: generate a list consisting of the …

Example output:

Are there any collections you'd be OK with the script ignoring? (The more data there is to process, the longer the script will take to run... but it might not be a difference any of us notices.)
I wouldn't object
I don't think it would help, at least with the way I was trying to check. I was starting by using the nmdc-schema … If we want to use LinkML validation to check for empty lists, I think it would be more helpful for somebody else to write a different dumper that includes the … But we/I should check with other LinkML experts like @cmungall or @pkalita-lbl to see if they have any insights into LinkML's ability to recognize empty lists.
At the very minimum, don't bother checking any collection whose name isn't a Database slot. pure-export has code that addresses the selection of dump-worthy collections, but …
@aclum, is this discussion relevant to @eecavanna's question?
From what I've seen, this issue is limited to external identifier slots.
Thanks. I can use this snippet to determine the collections that are in both the schema and the database:

```python
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict

# ...

# Make a list of names of the collections that are in the schema.
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()
collection_names_in_schema = nmdc_jsonschema["$defs"]["Database"]["properties"].keys()

# Make a list of names of the collections that are in the database.
# Note: `db` is a pymongo reference to the nmdc database.
collection_names_in_database: list[str] = db.list_collection_names()

# Make a list of the collection names that are in both of those lists.
collection_names_to_scan = list(set(collection_names_in_schema).intersection(set(collection_names_in_database)))
```

Here's the list of collections I came up with using that snippet:

Collections to scan (23 collections):
study_set
biosample_set
metagenome_sequencing_activity_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set
activity_set
processed_sample_set
metagenome_assembly_set
extraction_set
metagenome_annotation_activity_set
nom_analysis_activity_set
metatranscriptome_activity_set
omics_processing_set
material_sample_set
pooling_set
metabolomics_analysis_activity_set
functional_annotation_agg
mags_activity_set
data_object_set
metaproteomics_analysis_activity_set
library_preparation_set
collecting_biosamples_from_site_set
field_research_site_set

Here are the numbers of documents in each of those collections (as of right now):

study_set (19 documents)
biosample_set (7594 documents)
metagenome_sequencing_activity_set (631 documents)
read_based_taxonomy_analysis_activity_set (3053 documents)
read_qc_analysis_activity_set (3114 documents)
activity_set (0 documents)
processed_sample_set (5750 documents)
metagenome_assembly_set (2940 documents)
extraction_set (2127 documents)
metagenome_annotation_activity_set (2645 documents)
nom_analysis_activity_set (1985 documents)
metatranscriptome_activity_set (55 documents)
omics_processing_set (6214 documents)
material_sample_set (0 documents)
pooling_set (1491 documents)
metabolomics_analysis_activity_set (209 documents)
functional_annotation_agg (11822821 documents)
mags_activity_set (2645 documents)
data_object_set (138120 documents)
metaproteomics_analysis_activity_set (52 documents)
library_preparation_set (2132 documents)
collecting_biosamples_from_site_set (0 documents)
field_research_site_set (110 documents)
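For reference, a minimal sketch (an assumption about how those counts could be reproduced, not code from the thread) that reuses the `db` handle and `collection_names_to_scan` list from the snippet above:

```python
# Count the documents in each collection that is both in the schema and in the database.
# Assumes `db` and `collection_names_to_scan` are defined as in the snippet above.
for collection_name in collection_names_to_scan:
    num_documents = db[collection_name].count_documents({})
    print(f"{collection_name} ({num_documents} documents)")
```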
That's awesome, @eecavanna. Could you please enhance your report by giving one example of an empty set from each collection? Ideally the enhanced report would list the …
I have a question that will influence the complexity of the search algorithm I use. Are the empty lists you want to find always in top-level slots, or are they ever in lower-level/nested slots? Here's an example JSON object to illustrate what I mean by "top-level" slots versus "lower-level/nested" slots:

```jsonc
{
  "id": "foo:123",
  "list_e": [],                                                  // <-- empty list in top-level slot
  "obj_e": {},
  "list_f": [ "bar", { "name": "Fred", "age": 123 }, [], 789 ],  // <-- empty list within lower-level/nested slot
  "obj_f": { "baz": [] }                                         // <-- empty list in lower-level/nested slot
}
```

In that example JSON object, `list_e` has an empty list in a top-level slot, while `list_f` and `obj_f` contain empty lists in lower-level/nested slots.
If I only check top-level slots, the algorithm won't involve recursion. If I also check lower-level/nested slots, the algorithm will involve recursion.
Yes, I'll include that info. I want to clarify that the lists of collections I posted above show all collections that are in both the schema and the database. It was not a report of collections having documents that contain empty lists. I posted that to share how much data the Python script would be searching through, after filtering out the irrelevant collections.
Assuming it is the former (i.e. always in top-level slots): I have generated a report. It is a 7 MB CSV file with 108,850 rows of data in it. Here's a screenshot of the top of the file, to show its structure:

I spot-checked two rows from the report and found they did, indeed, refer to top-level slots whose values were empty lists. Here's an example:

@turbomam and @aclum, I will send you the 7 MB report via Slack.
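For reference, a minimal sketch (an assumption, not the actual report script) of such a top-level-only scan, reusing the `db` handle and `collection_names_to_scan` list from the earlier snippet; the output filename and CSV column names are illustrative:

```python
import csv

# Scan only top-level slots: for each document, record every field whose value is an empty list.
# Assumes `db` and `collection_names_to_scan` are defined as in the earlier snippet.
with open("empty_list_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["collection_name", "document_id", "slot_name"])  # illustrative column names
    for collection_name in collection_names_to_scan:
        for document in db[collection_name].find():
            for slot_name, value in document.items():
                if value == []:
                    writer.writerow([collection_name, document.get("id"), slot_name])
```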
We need to check nested slots, too. This might be made more efficient by making a list of multi-valued slots in advance. You could do that by making a SchemaView and iterating over all slots, checking whether each slot is multivalued.
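Here is a minimal sketch of that SchemaView approach (an assumption about the details; the schema file path is hypothetical):

```python
from linkml_runtime import SchemaView

# Build a SchemaView over the NMDC schema and collect the names of all multivalued slots.
schema_view = SchemaView("nmdc_schema/nmdc_materialized_patterns.yaml")  # hypothetical path
multivalued_slot_names = [
    slot_name
    for slot_name, slot in schema_view.all_slots().items()
    if slot.multivalued
]
print(f"{len(multivalued_slot_names)} multivalued slots")
```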
I updated the report so it checks nested slots, too (to unlimited depth). I also updated it to include the path to the empty slot. For example, for this document...

`{ a: { b: [ {}, { c: [] }, "foo" ] } }`

...the path to the empty list would be reported as `a.b[1].c`.

The resulting report (CSV file) was 80 MB and contained 709,679 rows, each of which contained a path to an empty list. I spot-checked two rows from the report and found they did, indeed, contain paths to empty lists. Here's an example:

@turbomam and @aclum, I will send you a 2.5 MB ZIP of the CSV file, via Slack.
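A minimal sketch (an assumption, not the actual report script) of a recursive walk that yields dotted paths like `a.b[1].c` for every empty list in a document:

```python
def find_empty_list_paths(value, path=""):
    """Recursively yield dotted paths to every empty list nested within `value`."""
    if value == []:
        yield path
    elif isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from find_empty_list_paths(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_empty_list_paths(child, f"{path}[{index}]")

# Example from the comment above:
document = {"a": {"b": [{}, {"c": []}, "foo"]}}
print(list(find_empty_list_paths(document)))  # ['a.b[1].c']
```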
Unique list of (collection, slot) pairs: @eecavanna generated the following Markdown table from the above CSV:
Once the scripts that generate JSON for ingest into Mongo prod (the referenced issues) are updated, we can make a plan to have the runtime API report that records with empty lists are invalid. Putting this in the backlog until then.
Processes that submit to Mongo continue to do this. The latest I've found this week are in ETL code; #373 has been in the backlog, and microbiomedata/nmdc_automation#259.
When doing the re-IDing work, I noticed in several places that we have documents with empty lists.
I.e., many omics_processing_set records have gold_sequencing_project_identifiers with a list size of zero.
Should the API reject records with empty lists or continue to accept them?
@dwinston @shreddd @turbomam