Failure to grab compound name from metadata dictionary #249

wjcranda · 2024-07-30T06:35:08Z

Version 1.5.0 negative mode query failed at a specific spectrum in an .mgf file. Spectra before and after this spectrum ran with no issue.

Running the same .mgf file through a previous version (1.2.4) with older Zenodo models, all spectra in the .mgf file were queried successfully. Specifically, this spectrum had an analog match (with a low score).

It seems to me that new library additions in the latest Zenodo files may have issues: this specific spectrum appears to have matched a library compound whose metadata is too incomplete to write the output.

Here is the full error message:

KeyError Traceback (most recent call last)
File :55

File :28, in run_ms2query(ion_mode)

File ~\AppData\Roaming\Python\Python310\site-packages\ms2query\run_ms2query.py:97, in run_complete_folder(ms2library, folder_with_spectra, results_folder, settings)
95 if os.path.isfile(file_path):
96 if os.path.splitext(file_name)[1].lower() in {".mzml", ".json", ".mgf", ".msp", ".mzxml", ".usi", ".pickle"}:
---> 97 run_ms2query_single_file(spectrum_file_name=file_name,
98 folder_with_spectra=folder_with_spectra,
99 results_folder=results_folder,
100 ms2library=ms2library, settings=settings)
101 folder_contained_spectrum_file = True
102 else:

File ~\AppData\Roaming\Python\Python310\site-packages\ms2query\run_ms2query.py:143, in run_ms2query_single_file(spectrum_file_name, folder_with_spectra, results_folder, ms2library, settings)
139 spectra = load_matchms_spectrum_objects_from_file(os.path.join(folder_with_spectra, spectrum_file_name))
140 analogs_results_file_name = return_non_existing_file_name(
141 os.path.join(results_folder,
142 os.path.splitext(spectrum_file_name)[0] + ".csv"))
--> 143 ms2library.analog_search_store_in_csv(spectra,
144 analogs_results_file_name,
145 settings)
146 print(f"Results stored in {analogs_results_file_name}")

File ~\AppData\Roaming\Python\Python310\site-packages\ms2query\ms2library.py:211, in MS2Library.analog_search_store_in_csv(self, query_spectra, results_csv_file_location, settings)
205 csv_file.write(",".join(
206 column_names_for_output(True, add_class_annotations, settings.additional_metadata_columns,
207 settings.additional_ms2query_score_columns)) + "\n")
209 results_df_generator = self.analog_search_yield_df(query_spectra, settings)
--> 211 for results_df in results_df_generator:
212 results_df.to_csv(results_csv_file_location, mode="a", header=False, float_format="%.4f", index=False)

File ~\AppData\Roaming\Python\Python310\site-packages\ms2query\ms2library.py:173, in MS2Library.analog_search_yield_df(self, query_spectra, settings, progress_bar)
171 else:
172 results_table = get_ms2query_model_prediction_single_spectrum(results_table, self.ms2query_model)
--> 173 results_df = results_table.export_to_dataframe(
174 settings.nr_of_top_analogs_to_save,
175 settings.minimal_ms2query_metascore,
176 additional_metadata_columns=settings.additional_metadata_columns,
177 additional_ms2query_score_columns=settings.additional_ms2query_score_columns)
178 yield results_df

File ~\AppData\Roaming\Python\Python310\site-packages\ms2query\results_table.py:136, in ResultsTable.export_to_dataframe(self, nr_of_top_spectra, minimal_ms2query_score, additional_metadata_columns, additional_ms2query_score_columns)
134 # For each analog the compound name is selected from sqlite
135 metadata_dict = self.sqlite_library.get_metadata_from_sqlite(list(selected_analogs["spectrum_ids"]))
--> 136 compound_name_list = [metadata_dict[analog_spectrum_id]["compound_name"]
137 for analog_spectrum_id
138 in list(selected_analogs["spectrum_ids"])]
139 smiles_list = [metadata_dict[analog_spectrum_id]["smiles"]
140 for analog_spectrum_id
141 in list(selected_analogs["spectrum_ids"])]
143 # Add inchikey and ms2query model prediction to results df
144 # results_df = selected_analogs.loc[:, ["spectrum_ids", "ms2query_model_prediction", "inchikey"]]

File ~\AppData\Roaming\Python\Python310\site-packages\ms2query\results_table.py:136, in (.0)
134 # For each analog the compound name is selected from sqlite
135 metadata_dict = self.sqlite_library.get_metadata_from_sqlite(list(selected_analogs["spectrum_ids"]))
--> 136 compound_name_list = [metadata_dict[analog_spectrum_id]["compound_name"]
137 for analog_spectrum_id
138 in list(selected_analogs["spectrum_ids"])]
139 smiles_list = [metadata_dict[analog_spectrum_id]["smiles"]
140 for analog_spectrum_id
141 in list(selected_analogs["spectrum_ids"])]
143 # Add inchikey and ms2query model prediction to results df
144 # results_df = selected_analogs.loc[:, ["spectrum_ids", "ms2query_model_prediction", "inchikey"]]

KeyError: 'compound_name'
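
The last frame of the traceback reduces to a plain dictionary lookup. Below is a minimal sketch of that failure, assuming `metadata_dict` maps spectrum IDs to per-spectrum metadata dictionaries. The IDs, names, and the helper `collect_compound_names` are made up for illustration and are not part of MS2Query.

```python
# Hypothetical metadata: the second entry has no "compound_name" key,
# which is what the traceback suggests happens for some library spectra.
metadata_dict = {
    "spec_1": {"compound_name": "ethanol", "smiles": "CCO"},
    "spec_2": {"smiles": "CO"},
}


def collect_compound_names(metadata, spectrum_ids):
    """Mirror the failing list comprehension: direct indexing, no fallback."""
    return [metadata[spectrum_id]["compound_name"] for spectrum_id in spectrum_ids]


try:
    collect_compound_names(metadata_dict, ["spec_1", "spec_2"])
except KeyError as error:
    # A single entry without the key is enough to abort the whole export.
    print(f"KeyError: {error}")
```

One incomplete library entry among the selected analogs is enough to stop the run, which would explain why spectra before and after the failing one were processed normally.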

wjcranda · 2024-07-31T16:58:19Z

I have created a temporary fix that places 'no name' in the compound name column. This lets the queries continue instead of erroring out when they reach a library match that is missing compound name metadata, and it allowed both the positive and negative mode scripts to complete.

The fact that this fix worked suggests that the 'compound name' metadata is simply missing for some of the library entries. However, it is pure chance that my spectra matched any of the library hits that are missing it.
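
A workaround of this kind can be sketched with `dict.get` and a default value. This is only an illustration under assumed data, not necessarily the change in the attached file or in the eventual upstream fix. The function name `compound_names_with_fallback` and the example entries are hypothetical.

```python
def compound_names_with_fallback(metadata, spectrum_ids, placeholder="no name"):
    """Return one name per spectrum ID, using a placeholder when the name is absent."""
    return [
        # .get() returns the placeholder instead of raising KeyError.
        metadata.get(spectrum_id, {}).get("compound_name", placeholder)
        for spectrum_id in spectrum_ids
    ]


# Hypothetical metadata with one entry missing its compound name.
example_metadata = {
    "spec_1": {"compound_name": "ethanol", "smiles": "CCO"},
    "spec_2": {"smiles": "CO"},
}

print(compound_names_with_fallback(example_metadata, ["spec_1", "spec_2"]))
```

The output list keeps the same length and order as the input IDs, so it can still be assigned as a column next to the other result fields.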

If you would like the specific smiles and other info of the library matches that do not have a compound name, I can give you my outputs as well.

results_table.txt

niekdejonge · 2024-08-06T16:28:29Z

Thanks for the clear issue and thanks for helping out others while all developers were on holiday :) Missing compound names should indeed not result in breaking changes. I have not seen this issue before and do not have a solution right away, but I will look into it.

wjcranda · 2024-08-06T17:07:10Z

No problem! I really enjoy using these tools in my research. Let me know if you need more information from my side.

rafaypir · 2024-08-07T09:44:36Z

Hi, I'm also facing a similar issue: a failure to grab the compound name from the metadata dictionary. Additionally, I found some errors in the resulting file.

Example: cf_kingdom is not supposed to contain ['UDFMLCREIMSEIU', '', '', '', '', ''] in the column. Another column, cf_direct_parent, also contains incorrect information.

I am attaching the incomplete csv file for reference.

cleaned_query_spectra.csv
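
Cells like the one quoted for cf_kingdom can be located in a results file with a small check. The sketch below uses only the standard library and assumes that a malformed cell looks like a stringified list, as in the reported example. The sample rows, the `query_spectrum_nr` column, and the helper `find_suspicious_cells` are hypothetical; only the two `cf_` column names come from the report.

```python
import csv
import io

# Hypothetical excerpt of a results file with one malformed cf_kingdom cell.
SAMPLE_CSV = """query_spectrum_nr,cf_kingdom,cf_direct_parent
1,Organic compounds,Flavones
2,"['UDFMLCREIMSEIU', '', '', '', '', '']",
"""


def find_suspicious_cells(csv_text, columns=("cf_kingdom", "cf_direct_parent")):
    """Return (row_number, column, value) for cells that look like a stringified list."""
    suspicious = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        for column in columns:
            value = (row.get(column) or "").strip()
            # A class annotation should be plain text, not "[...]".
            if value.startswith("[") and value.endswith("]"):
                suspicious.append((row_number, column, value))
    return suspicious


for hit in find_suspicious_cells(SAMPLE_CSV):
    print(hit)
```

For a real file, the text can be read with `open(path, newline="").read()` and passed to the same function.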

niekdejonge · 2024-08-07T09:47:18Z

#251 should fix this issue. I am surprised that this bug did not appear earlier, but it was an easy fix. After fixing @rafaypir's issue I will do a new release.

niekdejonge · 2024-08-07T10:56:49Z

#252 fixes the origin of the issue. New sqlite files still need to be generated and uploaded to zenodo, since the mistake was added to the sqlite files.

MarJakubec · 2024-08-09T09:08:31Z

@niekdejonge thank you for the quick fix. Do you know when the new libraries (sqlite files) will be generated? For now I need to use files all the way from 2022 to run ms2query (version 2).

niekdejonge · 2024-08-09T10:03:51Z

I did start the run for generating them, but for positive mode the run crashed overnight due to some server issues. I will do a new release today that has the compound name issue fixed, so you can run the newest version of MS2Query with the newest libraries again. It still has the issue @rafaypir noted, but luckily this is not critical for using MS2Query.

niekdejonge · 2024-08-09T10:12:08Z

A new release has been made. To use it, install (or update to) ms2query 1.5.2; you can then use the latest model and library. The class annotations issue still happens, but this will be fixed in a later version.

niekdejonge · 2024-08-20T12:37:46Z

With version 1.5.3 the issue mentioned by @rafaypir should also be fixed. Make sure you redownload the library files (only the .sqlite file has been updated).
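
To confirm which release is installed before rerunning, the standard library's `importlib.metadata` can be queried. This is a rough sketch that assumes plain numeric version strings such as "1.5.3" (pre-release suffixes are not handled). The helpers `parse_release`, `is_at_least`, and `describe_install` are hypothetical names, not part of MS2Query.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version_string.split("."))


def is_at_least(installed, required="1.5.3"):
    """Compare two plain numeric version strings component by component."""
    return parse_release(installed) >= parse_release(required)


def describe_install(package="ms2query", required="1.5.3"):
    """Report whether the installed package meets the required release."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed in this environment"
    if is_at_least(installed, required):
        return f"{package} {installed} is at least {required}"
    return f"{package} {installed} is older than {required}"


print(describe_install())
```

The check covers only the package version. It cannot tell whether the downloaded .sqlite library file is the updated one.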

niekdejonge self-assigned this Aug 6, 2024

niekdejonge mentioned this issue Aug 7, 2024

Handle compound_name is not given in metadata. #251

Merged

niekdejonge closed this as completed Aug 20, 2024
