-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
59 cosine hungarian #40
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition. Good tests!
def get_matching_pairs(): | ||
"""Get pairs of peaks that match within the given tolerance.""" | ||
matching_pairs = collect_peak_pairs(spec1, spec2, self.tolerance, shift=0.0) | ||
matching_pairs = sorted(matching_pairs, key=lambda x: x[2], reverse=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't reuse variable name is posible. In this case you can just do:
return sorted(...)
This saves 1 line and you don't reuse the name. It's even better performance although I'm sure its negligible ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thanks. Done.
|
||
def calc_score(): | ||
"""Calculate cosine similarity score.""" | ||
used_matches = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These first 5 lines can all move into the True branch of the if statement.
Also, quite a long function. It looks as if you could easily extract a function or 2 to reduce function size, variable in scope, and help the reader a bit with function names.
One obvious one would be:
def normalize_score(score, spec1, spec2):
return score/max(numpy.sum(spec1[:, 1]**2), numpy.sum(spec2[:, 1]**2))
Gets rid of 1 comment, helps readability a bit I'd say if you don't mind 1 liner functions (which I don't).
You could also extract the part that the comment refers to and call it 'solve_hungarian' or something like that.
Another option for function extraction I think would be everything in the True branch of 'if len(matching_pairs) > 0:'. The if statement would then read something like:
score, n_used_matches = calc_score_with_matches(matching_pairs) if len(matching_pairs) > 0 else (0,0)
Maybe that last option is a bit too complicated but I think it's ok. The line would be a bit hard to fit within 79 characters though which would damage readability a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to restructure the score calculation along your suggestions (I picked slightly different sub-functions though). Hope that makes it look better.
@@ -0,0 +1,97 @@ | |||
import numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you some how clearify how you got to the 'expected' values in all the tests? Did you calculate that manually or something like that? I'm just thinking what one would think if any of these tests start to fail at some point and you are not there to explain what you meant.
If the calculation is simple and clear, you may want to include it in the test as opposed to these 'magic' ground truth values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the test cases to simpler ones where I now compare the actual scores with expected values that are derived within the same test (same as for you similar comment in #38 ).
Kudos, SonarCloud Quality Gate passed! 0 Bugs |
Refs #6 |
Thanks @cwmeijer for the review! |
Originally here matchms/matchms-backup#246
So far we implemented a greedy version of the cosine score (which is often used because it is faster and gives the same results in 99,9% (that's a guess) of the actual cases.
The Hungarian algorithm would be the fully correct solution of the optimization problem, so it is nice to have it as well.
Here an example where the greedy scores would fail:
CosineGreedy: (0.5524861878453039, 1)
CosineGreedyNumba: (0.5524861878453039, 1)
CosineHungarian: (0.994475138121547, 2)