Refactors Scores and add querying/sorting options #153
@sverhoeven Sorry for the back and forth (draft-not-draft...). Took me some time to find the bugs/issues that were causing weird behavior (they were in Spectrum), but that should be fixed now. Just added a missing unit test and now it should be ready for review.
Overall looks good, docstrings render OK, additional tests result in nice coverage.
I like the approach of having a BaseSimilarity.sort() method. (I might have suggested it before, but good to see that it was also implementable in an unobtrusive way.)
Calculating scores between 2 spectra fails to return a structured array.
I tried:
In [1]: import numpy as np
   ...: from matchms import calculate_scores
   ...: from matchms import Spectrum
   ...: from matchms.similarity import CosineGreedy
   ...:
   ...: spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
   ...:                       intensities=np.array([0.7, 0.2, 0.1]),
   ...:                       metadata={'id': 'spectrum1'})
   ...: spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
   ...:                       intensities=np.array([0.4, 0.2, 0.1]),
   ...:                       metadata={'id': 'spectrum2'})
In [2]: similarity_measure = CosineGreedy()
In [3]: scores = calculate_scores([spectrum_1],[spectrum_2], similarity_measure)
In [6]: scores._scores
Out[6]: array([[(0.831479419283098, 1)]], dtype=object)
In [7]: list(scores)
Out[7]:
[(<matchms.Spectrum.Spectrum at 0x7fb29ac050d0>,
<matchms.Spectrum.Spectrum at 0x7fb29ac05070>,
0.831479419283098,
1)]
Instead of the plain tuple (0.831479419283098, 1), I expected to get back a dictionary or structured array.
Can you add a test for this use case?
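For reference, a minimal sketch of the difference between the object array shown above and a structured array, using plain numpy only. The field names score and matches are assumptions for illustration, not necessarily what matchms uses.

```python
import numpy as np

# Object array holding one tuple, like the output above.
as_object = np.empty((1, 1), dtype=object)
as_object[0, 0] = (0.831479419283098, 1)

# Structured array with named fields (field names are assumed here).
score_dtype = [("score", np.float64), ("matches", np.int64)]
as_structured = np.array([[(0.831479419283098, 1)]], dtype=score_dtype)

print(as_object.dtype.names)        # no named fields on an object array
print(as_structured.dtype.names)    # named fields on the structured array
print(as_structured["score"][0, 0], as_structured["matches"][0, 0])
```

With named fields, each component can be selected by name instead of by tuple position.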
matchms/similarity/BaseSimilarity.py
Outdated
    """
    if scores.dtype.names is None:
        return scores.argsort()[::-1]
    return scores["score"].argsort()[::-1]
I don't think the base class should look at the incoming dtype. It should sort according to the score_datatype, so it should consist only of line 81. Line 82 is specific to the Cosine classes and should not be part of the base.
Classes which override score_datatype should have their own sort() if needed. The sort() could be implemented in an intermediate abstract class or in the class itself.
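A rough sketch of the split suggested here, assuming a base class that only sorts by its own score_datatype and a subclass that overrides sort() for a two-component score. All class names and the field names are hypothetical stand-ins, not the matchms implementation.

```python
import numpy as np


class SketchBaseSimilarity:
    """Hypothetical stand-in for the base class; not the matchms code."""
    score_datatype = np.float64

    def sort(self, scores: np.ndarray) -> np.ndarray:
        # Return indices that order the scores from high to low.
        return scores.argsort()[::-1]


class SketchCosineLike(SketchBaseSimilarity):
    """Hypothetical subclass with a two-component score."""
    score_datatype = [("score", np.float64), ("matches", np.int64)]

    def sort(self, scores: np.ndarray) -> np.ndarray:
        # Only this class knows that its dtype has a "score" field.
        return scores["score"].argsort()[::-1]


plain = np.array([0.2, 0.9, 0.5], dtype=SketchBaseSimilarity.score_datatype)
paired = np.array([(0.2, 3), (0.9, 1), (0.5, 2)],
                  dtype=SketchCosineLike.score_datatype)

print(SketchBaseSimilarity().sort(plain))   # indices, highest score first
print(SketchCosineLike().sort(paired))      # indices, highest score first
```

The base class never inspects the dtype of the incoming array; knowledge about named fields stays in the class that defines them.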
I reduced it to line 81 as suggested. In fact, that even works for all scores we have implemented so far (without needing to define their own sort() method).
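One likely reason a single argsort() is enough: numpy compares elements of a structured array field by field in dtype order, so the first field decides the ranking and later fields only break ties. A small sketch, assuming the score is the first field (the field names are illustrative):

```python
import numpy as np

dtype = [("score", np.float64), ("matches", np.int64)]
scores = np.array([(0.5, 2), (0.9, 1), (0.5, 7), (0.1, 9)], dtype=dtype)

# argsort() on the whole structured array compares "score" first and
# falls back to "matches" only when two scores are equal.
order = scores.argsort()[::-1]
print(order)
print(scores[order])
```

This only holds as long as the main score is the first field of the dtype; a score type with a different field order would still need its own sort().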
Co-authored-by: Stefan Verhoeven <[email protected]>
Kudos, SonarCloud Quality Gate passed!
Thanks a lot for the review, Stefan, that was very helpful! I believe I have addressed your comments and will merge this PR now.
Here I changed quite a few things to make the API nicer to work with:
- Scores.scores: a structured array (only when the score contains more than one component, e.g. score + matches).
- scores_by_reference and scores_by_query: sorting method (implemented via BaseSimilarity).
- Spectrum(): the getter for metadata entries did not use copy().
- Spectrum(): the __eq__ method compared metadata dictionaries simply with ==. However, that fails when numpy arrays are added to the metadata (which we do in the add_fingerprints filter).
In #152 we had also discussed adding Scores.top_scores_by_query(query, n=np.Inf) and Scores.top_scores_by_reference(reference, n=np.Inf), but this is not done here yet. For time reasons I would postpone it to another round of additions.
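The __eq__ problem with numpy arrays in metadata can be reproduced with plain dictionaries. The sketch below shows the failure and one possible way around it; the helper name metadata_equal is hypothetical and not taken from the matchms code.

```python
import numpy as np


def metadata_equal(left: dict, right: dict) -> bool:
    """Compare two metadata dicts whose values may be numpy arrays."""
    if left.keys() != right.keys():
        return False
    for key, value in left.items():
        other = right[key]
        if isinstance(value, np.ndarray) or isinstance(other, np.ndarray):
            # Element-wise comparison collapsed into a single bool.
            if not np.array_equal(value, other):
                return False
        elif value != other:
            return False
    return True


meta_a = {"id": "spectrum1", "fingerprint": np.array([1, 0, 1])}
meta_b = {"id": "spectrum1", "fingerprint": np.array([1, 0, 1])}

try:
    meta_a == meta_b  # plain dict comparison needs a bool per value
except ValueError as error:
    print("== failed:", error)

print(metadata_equal(meta_a, meta_b))
```

The plain comparison fails because == on two arrays yields an array, which has no single truth value.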