Identify Mongo documents containing values that are empty lists ([]) #1306

Open
2 of 3 tasks
aclum opened this issue Nov 7, 2023 · 21 comments
Assignees
Labels
backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. X SMALL Less than 8 hours, less than 1 day

Comments

@aclum
Contributor

aclum commented Nov 7, 2023

While doing the re-ID work, I noticed that documents in several places contain empty lists.
For example, many omics_processing_set records have gold_sequencing_project_identifiers with a list size of zero.
Should the API reject records with empty lists or continue to accept them?

@dwinston @shreddd @turbomam

tasks

  • identify which collections/keys have empty lists
  • create tickets to get json generation code updated
  • create and implement a plan to have runtime API reject records with empty lists.
@aclum aclum transferred this issue from microbiomedata/nmdc-runtime Nov 7, 2023
@aclum
Contributor Author

aclum commented Nov 7, 2023

Transferring to nmdc-schema per conversations with Mark. Suggestion was to set min cardinality in the schema. This needs to be coordinated with workflows so code that generates submissions gets updated. @corilo @Michal-Babins @mbthornton-lbl

@aclum
Contributor Author

aclum commented Nov 7, 2023

@eecavanna offered to run a check in python to see what slots are populated with empty lists so we can reach out to the folks that write these documents.

@turbomam
Member

turbomam commented Nov 7, 2023

This API URL shows an example of a record with an asserted empty gold_sequencing_project_identifiers list:

https://api.microbiomedata.org/nmdcschema/ids/emsl%3A739472

@turbomam
Member

turbomam commented Nov 7, 2023

{
  "id": "emsl:739472",
  "name": "Brodie_158_MeOH_R3_23Mar19_HESI_Neg",
  "description": "High resolution MS spectra only",
  "has_input": [
    "igsn:IEWFS000S"
  ],
  "has_output": [
    "emsl:output_739472"
  ],
  "part_of": [
    "gold:Gs0135149"
  ],
  "instrument_name": "21T_Agilent",
  "omics_type": {
    "has_raw_value": "Organic Matter Characterization"
  },
  "processing_institution": "EMSL",
  "type": "nmdc:OmicsProcessing",
  "gold_sequencing_project_identifiers": []
}

@turbomam
Member

turbomam commented Nov 7, 2023

That doesn't make it into the YAML output of the pure-export script in its current state. I think the YAML serializer refuses to write keys with empty values, at least in its default configuration.

Therefore, the nmdc-schema repo isn't ready to check for this sort of thing right now.

@aclum
Contributor Author

aclum commented Nov 7, 2023

@turbomam does this mean we should convert this back to an nmdc-runtime issue? Can the PyPI package check this?

@eecavanna
Collaborator

eecavanna commented Nov 8, 2023

@eecavanna offered to run a check in python to see what slots are populated with empty lists so we can reach out to the folks that write these documents.

I want to clarify the requirements: Generate a list consisting of the id of every document—from every collection—that has a field (any field, at any level of nesting) whose value is an empty list. Is that correct?

Example output:

collection  id
foo_set     foo:1234
foo_set     abc123
bar_set     123
...

Are there any collections you'd be OK with the script ignoring? (The more data there is to process, the longer the script will take to run, but it might not be a difference any of us notices.)

@turbomam
Member

turbomam commented Nov 8, 2023

does this mean we should convert this back to an nmdc-runtime issue?

I wouldn't object

Can the pypi package check this?

I don't think it would help, at least with the way I was trying to check. I was starting by using the nmdc-schema pure-export command to dump MongoDB contents. That apparently refuses to write False-like values, such as empty lists. The advantage of pure-export is that it wraps the MongoDB contents in the corresponding Database slots.

If we want to use LinkML validation to check for empty lists, I think it would be more helpful for somebody else to write a different dumper that includes the Database slot wrapping.

But we/I should check with other LinkML experts like @cmungall or @pkalita-lbl to see if they have any insights into LinkML's ability to recognize empty lists.
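
To illustrate why an export step that drops False-like values cannot be used to detect empty lists, here is a minimal sketch. The function `drop_falsy_values` is hypothetical; it is not the actual pure-export or YAML serializer code and only mimics the behavior described above.

```python
def drop_falsy_values(document: dict) -> dict:
    """Return a copy of the document without keys whose values are falsy.

    Hypothetical stand-in for a serializer that omits empty values.
    """
    return {key: value for key, value in document.items() if value}


# A trimmed copy of the record shown earlier in this thread.
record = {
    "id": "emsl:739472",
    "type": "nmdc:OmicsProcessing",
    "gold_sequencing_project_identifiers": [],
}

exported = drop_falsy_values(record)

# The empty list is gone, so a validator run on `exported` cannot flag it.
print("gold_sequencing_project_identifiers" in exported)
```

Any check for empty lists would therefore need to run against the raw MongoDB documents, or against a dump that preserves empty values.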

@turbomam
Member

turbomam commented Nov 8, 2023

Are there any collections you'd be OK with the script ignoring

At the very minimum, don't bother checking any collection whose name isn't a Database slot.

pure-export has code that addresses the selection of dump-worthy collections, but

@aclum is this discussion relevant to @eecavanna's question?:

@aclum
Contributor Author

aclum commented Nov 8, 2023

From what I've seen this issue is limited to external identifier slots
Yes, this should be run against prod.
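
If the problem really is limited to a known set of top-level identifier slots, a MongoDB filter may be enough to find the affected documents. The sketch below builds such a filter with the `$size` operator. The helper name `empty_list_filter` is an assumption for illustration, and the commented-out query assumes `db` is a pymongo reference to the nmdc database.

```python
def empty_list_filter(slot_names: list[str]) -> dict:
    """Build a MongoDB filter matching documents in which at least one
    of the given top-level slots holds an array of length zero."""
    if not slot_names:
        # MongoDB rejects an empty `$or` array, so fail early.
        raise ValueError("at least one slot name is required")
    return {"$or": [{name: {"$size": 0}} for name in slot_names]}


query = empty_list_filter(
    ["gold_sequencing_project_identifiers", "alternative_identifiers"]
)
print(query)

# Possible usage with pymongo (not executed here):
# for doc in db["omics_processing_set"].find(query, {"id": 1}):
#     print(doc["id"])
```

This approach only covers slots that are named in advance; it would not discover empty lists in unexpected or nested slots.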

@eecavanna
Collaborator

eecavanna commented Nov 9, 2023

At the very minimum, don't bother checking any collection whose name isn't a Database slot.

Thanks. I can use this snippet to determine the collections that are in both the schema and the database:

from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict

# ...

# Make a list of the names of the collections that are in the schema.
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()
collection_names_in_schema: list[str] = list(nmdc_jsonschema["$defs"]["Database"]["properties"].keys())

# Make a list of the names of the collections that are in the database.
# Note: `db` is a pymongo reference to the nmdc database (its creation is elided above).
collection_names_in_database: list[str] = db.list_collection_names()

# Make a list of the collection names that are in both of those lists.
collection_names_to_scan: list[str] = list(set(collection_names_in_schema).intersection(set(collection_names_in_database)))

Here's the list of collections I came up with using that snippet:

Collections to scan (23 collections):
study_set
biosample_set
metagenome_sequencing_activity_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set
activity_set
processed_sample_set
metagenome_assembly_set
extraction_set
metagenome_annotation_activity_set
nom_analysis_activity_set
metatranscriptome_activity_set
omics_processing_set
material_sample_set
pooling_set
metabolomics_analysis_activity_set
functional_annotation_agg
mags_activity_set
data_object_set
metaproteomics_analysis_activity_set
library_preparation_set
collecting_biosamples_from_site_set
field_research_site_set

Here are the numbers of documents in each of those collections (as of right now):

study_set (19 documents)
biosample_set (7594 documents)
metagenome_sequencing_activity_set (631 documents)
read_based_taxonomy_analysis_activity_set (3053 documents)
read_qc_analysis_activity_set (3114 documents)
activity_set (0 documents)
processed_sample_set (5750 documents)
metagenome_assembly_set (2940 documents)
extraction_set (2127 documents)
metagenome_annotation_activity_set (2645 documents)
nom_analysis_activity_set (1985 documents)
metatranscriptome_activity_set (55 documents)
omics_processing_set (6214 documents)
material_sample_set (0 documents)
pooling_set (1491 documents)
metabolomics_analysis_activity_set (209 documents)
functional_annotation_agg (11822821 documents)
mags_activity_set (2645 documents)
data_object_set (138120 documents)
metaproteomics_analysis_activity_set (52 documents)
library_preparation_set (2132 documents)
collecting_biosamples_from_site_set (0 documents)
field_research_site_set (110 documents)
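
Per-collection counts like the ones above could be produced along these lines. This is a sketch, not necessarily how the numbers in this comment were obtained; the function name is hypothetical, and `db` is assumed to be a pymongo database reference (anything that supports `db[name].count_documents({})` will work).

```python
def count_documents_per_collection(db, collection_names) -> dict[str, int]:
    """Return a mapping from collection name to its number of documents.

    `db` is assumed to behave like a pymongo Database: indexing it by
    name yields an object with a `count_documents(filter)` method.
    """
    return {name: db[name].count_documents({}) for name in collection_names}


# Possible usage (not executed here):
# counts = count_documents_per_collection(db, collection_names_to_scan)
# for name, count in counts.items():
#     print(f"{name} ({count} documents)")
```

For very large collections, pymongo's `estimated_document_count()` may be a faster alternative when an exact count is not needed.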

@turbomam
Member

turbomam commented Nov 9, 2023

That's awesome, @eecavanna . Could you please enhance your report by giving one example of an empty set from each collection? Ideally the enhanced report would list the id of the entity owning an empty list, and the slot that links that entity to the empty list, like

nmdc:sty-99-123456; has_journal_retractions

@eecavanna
Collaborator

eecavanna commented Nov 9, 2023

I have a question that will influence the complexity of the search algorithm I use.

Are the empty lists you want to find, always in top-level slots; or are they ever in lower-level/nested slots?

Here's an example JSON object to illustrate what I mean by "top-level" slots versus "lower-level/nested" slots:

{
  "id": "foo:123",
  "list_e": [],  // <-- empty list in top-level slot
  "obj_e": {},
  "list_f": [ "bar", { "name": "Fred", "age": 123 }, [], 789 ],  // <-- empty list within lower-level/nested slot
  "obj_f": { "baz": [] }  // <-- empty list in lower-level/nested slot
}

In that example JSON object:

  • The only top-level slots are: id, list_e, obj_e, list_f, and obj_f
  • Some lower-level/nested slots are: list_f[1].name and obj_f.baz

If I only check top-level slots, the algorithm won't involve recursion. If I also check lower-level/nested slots, the algorithm will involve recursion.
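
The difference between the two approaches can be sketched as follows, using the example object above. Both function names are hypothetical and chosen only for illustration; this is not the script used to generate the reports in this thread.

```python
from typing import Any


def top_level_empty_list_slots(document: dict) -> list[str]:
    """Return the names of top-level slots whose value is an empty list.
    No recursion is needed."""
    return [
        key
        for key, value in document.items()
        if isinstance(value, list) and len(value) == 0
    ]


def contains_empty_list(value: Any) -> bool:
    """Return True if the value is, or contains at any depth, an empty list.
    Recurses into both lists and dictionaries."""
    if isinstance(value, list):
        return len(value) == 0 or any(contains_empty_list(item) for item in value)
    if isinstance(value, dict):
        return any(contains_empty_list(child) for child in value.values())
    return False


example = {
    "id": "foo:123",
    "list_e": [],
    "obj_e": {},
    "list_f": ["bar", {"name": "Fred", "age": 123}, [], 789],
    "obj_f": {"baz": []},
}

print(top_level_empty_list_slots(example))  # ['list_e']
print(contains_empty_list(example["obj_f"]))  # True
```

The top-level check misses the empty lists inside list_f and obj_f, which is why the recursive variant is needed if nested slots matter.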

@eecavanna
Collaborator

eecavanna commented Nov 9, 2023

Could you please enhance your report by giving one example of an empty set from each collection?

Yes, I'll include that info.

I want to clarify that the lists of collections I posted above show all collections that are in both the schema and the database. It was not a report of collections having documents that contain empty lists. I posted that to share how much data the Python script would be searching through, after filtering out the irrelevant collections.

@eecavanna eecavanna changed the title how to deal with empty lists Identify Mongo documents containing values that are empty lists ([]) Nov 9, 2023
@eecavanna eecavanna added the X SMALL Less than 8 hours, less than 1 day label Nov 9, 2023
@eecavanna
Collaborator

Are the empty lists you want to find, always in top-level slots; or are they ever in lower-level/nested slots?

Assuming it is the former (i.e. always in top-level slots): I have generated a report. It is a 7 MB CSV file with 108,850 rows of data in it. Here's a screenshot of the top of the file, to show its structure:

image

I spot checked two rows from the report and found they did, indeed, refer to top-level slots whose values were empty lists. Here's an example:

image

@turbomam and @aclum, I will send you the 7 MB report via Slack.

@turbomam
Member

turbomam commented Nov 9, 2023

Are the empty lists you want to find, always in top-level slots; or are they ever in lower-level/nested slots

We need to check nested slots, too. This might be made more efficient by making a list of multivalued slots in advance. You could do that by creating a SchemaView and iterating over all slots, checking whether multivalued is True for each.
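
As one possible way to build such a list, the sketch below swaps in a different technique from the SchemaView approach suggested above: it scans the JSON Schema dictionary (the kind returned by `get_nmdc_jsonschema_dict()`, used elsewhere in this thread) for properties typed as arrays. It assumes that multivalued slots are rendered with `"type": "array"`, which may not hold for every slot (for example, slots inlined as dictionaries or expressed through `anyOf`). The function name and the toy schema are hypothetical.

```python
def array_valued_properties(jsonschema: dict) -> dict[str, list[str]]:
    """Map each class name under `$defs` to the sorted names of its
    properties whose declared type is (or includes) "array"."""
    result: dict[str, list[str]] = {}
    for class_name, definition in jsonschema.get("$defs", {}).items():
        names = []
        for prop_name, prop in definition.get("properties", {}).items():
            declared = prop.get("type")
            # `type` may be a single string or a list of strings.
            types = declared if isinstance(declared, list) else [declared]
            if "array" in types:
                names.append(prop_name)
        if names:
            result[class_name] = sorted(names)
    return result


# Toy schema for illustration only; not the real nmdc schema.
toy_schema = {
    "$defs": {
        "Study": {
            "properties": {
                "id": {"type": "string"},
                "gold_study_identifiers": {"type": "array"},
            }
        }
    }
}
print(array_valued_properties(toy_schema))  # {'Study': ['gold_study_identifiers']}
```

Restricting the scan to these slot names could reduce the work per document, at the cost of missing empty lists in fields the schema does not describe.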

@turbomam
Member

turbomam commented Nov 9, 2023

Count of document_id, by collection_name and own_key_pointing_to_empty_list

collection_name                            own_key_pointing_to_empty_list       count
biosample_set                              gold_biosample_identifiers           4007
data_object_set                            alternative_identifiers              96203
mags_activity_set                          mags_list                            2269
metabolomics_analysis_activity_set         has_metabolite_quantifications       1
metagenome_annotation_activity_set         gold_analysis_project_identifiers    2033
omics_processing_set                       gold_sequencing_project_identifiers  3815
read_based_taxonomy_analysis_activity_set  part_of                              518
study_set                                  gold_study_identifiers               4
Total Result                               (all keys)                           108850

@eecavanna
Collaborator

I updated the report so it checks nested slots, too (at any depth). I also updated it to include the path to each empty list. For example:

For this document...

{ a: { b: [ {}, { c: [] }, "foo" ] } }

...the path to the empty list would be reported as...

a.b[1].c

The resulting report (CSV file) was 80 MB and contained 709,679 rows, each of which contained a path to an empty list.

I spot checked two rows from the report and found they did, indeed, contain paths to empty lists. Here's an example:

image

image

@turbomam and @aclum, I will send you a 2.5 MB ZIP of the CSV file, via Slack.
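
The path notation described above (for example, a.b[1].c) can be produced by a recursive walk over each document. The following is a minimal sketch of that idea, not the actual script behind the report; the function name `find_empty_list_paths` is hypothetical.

```python
from typing import Any, Iterator


def find_empty_list_paths(value: Any, path: str = "") -> Iterator[str]:
    """Yield the path to every empty list found in `value`, at any depth.

    Dictionary keys are joined with "." and list indexes are written
    as "[n]", e.g. "a.b[1].c".
    """
    if isinstance(value, list):
        if not value:
            yield path
        for index, item in enumerate(value):
            yield from find_empty_list_paths(item, f"{path}[{index}]")
    elif isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            yield from find_empty_list_paths(child, child_path)


document = {"a": {"b": [{}, {"c": []}, "foo"]}}
print(list(find_empty_list_paths(document)))  # ['a.b[1].c']
```

Because the function is a generator, one document can contribute several rows to the report, one per empty list.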

@aclum
Contributor Author

aclum commented Nov 10, 2023

Unique list of collection, slot
collection_name,path_to_empty_list
biosample_set,gold_biosample_identifiers
data_object_set,alternative_identifiers
mags_activity_set,mags_list
metabolomics_analysis_activity_set,has_metabolite_quantifications
metabolomics_analysis_activity_set,has_metabolite_quantifications[].alternative_identifiers
metagenome_annotation_activity_set,gold_analysis_project_identifiers
metaproteomics_analysis_activity_set,has_peptide_quantifications[].all_proteins
omics_processing_set,gold_sequencing_project_identifiers
read_based_taxonomy_analysis_activity_set,part_of
study_set,gold_study_identifiers


@eecavanna generated the following Markdown table from the above CSV string:

Unique list of collection                  slot
(collection_name)                          (path_to_empty_list)
biosample_set                              gold_biosample_identifiers
data_object_set                            alternative_identifiers
mags_activity_set                          mags_list
metabolomics_analysis_activity_set         has_metabolite_quantifications
metabolomics_analysis_activity_set         has_metabolite_quantifications[].alternative_identifiers
metagenome_annotation_activity_set         gold_analysis_project_identifiers
metaproteomics_analysis_activity_set       has_peptide_quantifications[].all_proteins
omics_processing_set                       gold_sequencing_project_identifiers
read_based_taxonomy_analysis_activity_set  part_of
study_set                                  gold_study_identifiers
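
A unique list like the one above could be derived from the full report by replacing concrete list indexes (such as [3]) with [] and de-duplicating the resulting (collection, path) pairs. The sketch below shows one way to do that; the function names are hypothetical, and the sample rows use made-up indexes purely for illustration.

```python
import re


def normalize_path(path: str) -> str:
    """Replace every numeric list index in a path with "[]"."""
    return re.sub(r"\[\d+\]", "[]", path)


def unique_collection_paths(rows: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the sorted, de-duplicated (collection, normalized path) pairs."""
    return sorted({(collection, normalize_path(path)) for collection, path in rows})


# Illustrative rows only; the indexes are invented.
rows = [
    ("metabolomics_analysis_activity_set", "has_metabolite_quantifications[0].alternative_identifiers"),
    ("metabolomics_analysis_activity_set", "has_metabolite_quantifications[7].alternative_identifiers"),
    ("study_set", "gold_study_identifiers"),
]

for collection, path in unique_collection_paths(rows):
    print(f"{collection},{path}")
```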

@aclum aclum added the backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. label Nov 10, 2023
@aclum
Contributor Author

aclum commented Nov 10, 2023

Once the scripts that generate JSON for ingest into Mongo prod (see the referenced issues) are updated, we can make a plan to have the runtime API report records with empty lists as invalid. Putting this in the backlog until then.

@aclum
Contributor Author

aclum commented Oct 1, 2024

Processes that submit to Mongo continue to do this. The latest ones I've found this week are ETL code (#373 has been in the backlog) and microbiomedata/nmdc_automation#259.


3 participants