Identify Mongo documents containing values that are empty lists (`[]`)
#1306
Comments
Transferring to nmdc-schema per conversations with Mark. The suggestion was to set a minimum cardinality in the schema. This needs to be coordinated with the workflows so that the code that generates submissions gets updated. @corilo @Michal-Babins @mbthornton-lbl
@eecavanna offered to run a check in Python to see which slots are populated with empty lists so we can reach out to the folks that write these documents.
This API URL shows an example of a record with an asserted empty list:
```json
{
  "id": "emsl:739472",
  "name": "Brodie_158_MeOH_R3_23Mar19_HESI_Neg",
  "description": "High resolution MS spectra only",
  "has_input": [
    "igsn:IEWFS000S"
  ],
  "has_output": [
    "emsl:output_739472"
  ],
  "part_of": [
    "gold:Gs0135149"
  ],
  "instrument_name": "21T_Agilent",
  "omics_type": {
    "has_raw_value": "Organic Matter Characterization"
  },
  "processing_institution": "EMSL",
  "type": "nmdc:OmicsProcessing",
  "gold_sequencing_project_identifiers": []
}
```
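For context, here is a minimal pymongo sketch (not part of the original comment) of how records like the one above could be located; the connection string and database name are assumptions:

```python
from pymongo import MongoClient

# A sketch only: the connection string and database name below are assumptions.
client = MongoClient("mongodb://localhost:27017")
db = client["nmdc"]

# Find omics_processing_set documents whose gold_sequencing_project_identifiers
# value is an empty array, using MongoDB's $size operator.
cursor = db["omics_processing_set"].find(
    {"gold_sequencing_project_identifiers": {"$size": 0}},
    {"id": 1, "gold_sequencing_project_identifiers": 1},
)
for document in cursor:
    print(document["id"])
```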
That doesn't make it into the YAML output of the … Therefore, the nmdc-schema repo isn't ready to check for this sort of thing right now.
@turbomam does this mean we should convert this back to an nmdc-runtime issue? Can the PyPI package check this?
I want to clarify the requirements: generate a list consisting of the …

Example output:

Are there any collections you'd be OK with the script ignoring? (The more data there is to process, the longer the script will take to run... but it might not be a difference any of us notices.)
I wouldn't object
I don't think it would help, at least with the way I was trying to check. I was starting by using the nmdc-schema … If we want to use LinkML validation to check for empty lists, I think it would be more helpful for somebody else to write a different dumper that includes the … But we/I should check with other LinkML experts like @cmungall or @pkalita-lbl to see if they have any insights into LinkML's ability to recognize empty lists.
At the very minimum, don't bother checking any collection whose name isn't a Database slot. pure-export has code that addresses the selection of dump-worthy collections, but …
@aclum, is this discussion relevant to @eecavanna's question?
From what I've seen, this issue is limited to external identifier slots.
Thanks. I can use this snippet to determine the collections that are in both the schema and the database:

```python
from nmdc_schema.nmdc_data import get_nmdc_jsonschema_dict

# ...

# Make a list of names of the collections that are in the schema.
nmdc_jsonschema: dict = get_nmdc_jsonschema_dict()
collection_names_in_schema = nmdc_jsonschema["$defs"]["Database"]["properties"].keys()

# Make a list of names of the collections that are in the database.
# Note: `db` is a pymongo reference to the nmdc database.
collection_names_in_database: list[str] = db.list_collection_names()

# Make a list of the collection names that are in both of those lists.
collection_names_to_scan = list(set(collection_names_in_schema).intersection(set(collection_names_in_database)))
```

Here's the list of collections I came up with using that snippet:

Collections to scan (23 collections):
study_set
biosample_set
metagenome_sequencing_activity_set
read_based_taxonomy_analysis_activity_set
read_qc_analysis_activity_set
activity_set
processed_sample_set
metagenome_assembly_set
extraction_set
metagenome_annotation_activity_set
nom_analysis_activity_set
metatranscriptome_activity_set
omics_processing_set
material_sample_set
pooling_set
metabolomics_analysis_activity_set
functional_annotation_agg
mags_activity_set
data_object_set
metaproteomics_analysis_activity_set
library_preparation_set
collecting_biosamples_from_site_set
field_research_site_set

Here are the numbers of documents in each of those collections (as of right now):

study_set (19 documents)
biosample_set (7594 documents)
metagenome_sequencing_activity_set (631 documents)
read_based_taxonomy_analysis_activity_set (3053 documents)
read_qc_analysis_activity_set (3114 documents)
activity_set (0 documents)
processed_sample_set (5750 documents)
metagenome_assembly_set (2940 documents)
extraction_set (2127 documents)
metagenome_annotation_activity_set (2645 documents)
nom_analysis_activity_set (1985 documents)
metatranscriptome_activity_set (55 documents)
omics_processing_set (6214 documents)
material_sample_set (0 documents)
pooling_set (1491 documents)
metabolomics_analysis_activity_set (209 documents)
functional_annotation_agg (11822821 documents)
mags_activity_set (2645 documents)
data_object_set (138120 documents)
metaproteomics_analysis_activity_set (52 documents)
library_preparation_set (2132 documents)
collecting_biosamples_from_site_set (0 documents)
field_research_site_set (110 documents)
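For reference, a minimal sketch (an assumption about how those counts could be reproduced, not code from the thread) that reuses the `db` handle and `collection_names_to_scan` list from the snippet above:

```python
# Count the documents in each collection that is both in the schema and in the database.
# Assumes `db` and `collection_names_to_scan` are defined as in the snippet above.
for collection_name in collection_names_to_scan:
    num_documents = db[collection_name].count_documents({})
    print(f"{collection_name} ({num_documents} documents)")
```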
That's awesome, @eecavanna. Could you please enhance your report by giving one example of an empty set from each collection? Ideally the enhanced report would list the …
I have a question that will influence the complexity of the search algorithm I use. Are the empty lists you want to find always in top-level slots, or are they ever in lower-level/nested slots? Here's an example JSON object to illustrate what I mean by "top-level" slots versus "lower-level/nested" slots:

```jsonc
{
  "id": "foo:123",
  "list_e": [],                                                  // <-- empty list in top-level slot
  "obj_e": {},
  "list_f": [ "bar", { "name": "Fred", "age": 123 }, [], 789 ],  // <-- empty list within lower-level/nested slot
  "obj_f": { "baz": [] }                                         // <-- empty list in lower-level/nested slot
}
```

In that example JSON object, `list_e` has an empty list in a top-level slot, while `list_f` and `obj_f` contain empty lists in lower-level/nested slots.
If I only check top-level slots, the algorithm won't involve recursion. If I also check lower-level/nested slots, the algorithm will involve recursion.
Yes, I'll include that info. I want to clarify that the lists of collections I posted above show all collections that are in both the schema and the database. It was not a report of collections having documents that contain empty lists. I posted that to share how much data the Python script would be searching through, after filtering out the irrelevant collections.
Assuming it is the former (i.e. always in top-level slots): I have generated a report. It is a 7 MB CSV file with 108,850 rows of data in it. Here's a screenshot of the top of the file, to show its structure:

I spot-checked two rows from the report and found they did, indeed, refer to top-level slots whose values were empty lists. Here's an example:

@turbomam and @aclum, I will send you the 7 MB report via Slack.
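For reference, a minimal sketch (an assumption, not the actual report script) of such a top-level-only scan, reusing the `db` handle and `collection_names_to_scan` list from the earlier snippet; the output filename and CSV column names are illustrative:

```python
import csv

# Scan only top-level slots: for each document, record every field whose value is an empty list.
# Assumes `db` and `collection_names_to_scan` are defined as in the earlier snippet.
with open("empty_list_report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["collection_name", "document_id", "slot_name"])  # illustrative column names
    for collection_name in collection_names_to_scan:
        for document in db[collection_name].find():
            for slot_name, value in document.items():
                if value == []:
                    writer.writerow([collection_name, document.get("id"), slot_name])
```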
We need to check nested slots, too. This might be made more efficient by making a list of multi-valued slots in advance. You could do that by making a SchemaView and iterating over all slots, checking whether each slot is multivalued.
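Here is a minimal sketch of that SchemaView approach (an assumption about the details; the schema file path is hypothetical):

```python
from linkml_runtime import SchemaView

# Build a SchemaView over the NMDC schema and collect the names of all multivalued slots.
schema_view = SchemaView("nmdc_schema/nmdc_materialized_patterns.yaml")  # hypothetical path
multivalued_slot_names = [
    slot_name
    for slot_name, slot in schema_view.all_slots().items()
    if slot.multivalued
]
print(f"{len(multivalued_slot_names)} multivalued slots")
```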
I updated the report so it checks nested slots, too (to unlimited depth). I also updated it to include the path to the empty slot. For example, for this document...

`{ a: { b: [ {}, { c: [] }, "foo" ] } }`

...the path to the empty list would be reported as `a.b[1].c`.

The resulting report (CSV file) was 80 MB and contained 709,679 rows, each of which contained a path to an empty list. I spot-checked two rows from the report and found they did, indeed, contain paths to empty lists. Here's an example:

@turbomam and @aclum, I will send you a 2.5 MB ZIP of the CSV file, via Slack.
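A minimal sketch (an assumption, not the actual report script) of a recursive walk that yields dotted paths like `a.b[1].c` for every empty list in a document:

```python
def find_empty_list_paths(value, path=""):
    """Recursively yield dotted paths to every empty list nested within `value`."""
    if value == []:
        yield path
    elif isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from find_empty_list_paths(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_empty_list_paths(child, f"{path}[{index}]")

# Example from the comment above:
document = {"a": {"b": [{}, {"c": []}, "foo"]}}
print(list(find_empty_list_paths(document)))  # ['a.b[1].c']
```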
Unique list of (collection, slot) pairs: @eecavanna generated the following Markdown table from the above CSV:
Once the scripts that generate JSON for ingest into Mongo prod (the referenced issues) are updated, we can make a plan to have the runtime API report that records with empty lists are invalid. Putting this in the backlog until then.
Processes that submit to Mongo continue to do this. The latest I've found this week are in ETL code; #373 has been in the backlog, and microbiomedata/nmdc_automation#259.
When doing the re-IDing work, I noticed in several places that we have documents with empty lists.
I.e., many omics_processing_set records have gold_sequencing_project_identifiers with a list size of zero.
Should the API reject records with empty lists or continue to accept them?
@dwinston @shreddd @turbomam