59 cosine hungarian #40

Merged
merged 20 commits into from
Jun 2, 2020

Member

@jspaaks jspaaks commented May 20, 2020

Originally opened as matchms/matchms-backup#246.

So far we have implemented a greedy version of the cosine score, which is often used because it is faster and gives the same result in 99.9% (that's a guess) of actual cases.

The Hungarian algorithm solves the underlying assignment problem exactly, so it is nice to have it as well.

Here is an example where the greedy scores fail:

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy, CosineGreedyNumba, CosineHungarian

# Two spectra whose peaks can be paired in more than one way within the tolerance
test_spectrum1 = Spectrum(mz=np.array([100.005, 100.016]),
                          intensities=np.array([1.0, 0.9]),
                          metadata={})

test_spectrum2 = Spectrum(mz=np.array([100.005, 100.01]),
                          intensities=np.array([0.9, 1.0]),
                          metadata={})

similarity_measure = CosineGreedy(tolerance=0.01)
print("CosineGreedy:", similarity_measure(test_spectrum1, test_spectrum2))

similarity_measure = CosineGreedyNumba(tolerance=0.01)
print("CosineGreedyNumba:", similarity_measure(test_spectrum1, test_spectrum2))

similarity_measure = CosineHungarian(tolerance=0.01)
print("CosineHungarian:", similarity_measure(test_spectrum1, test_spectrum2))

CosineGreedy: (0.5524861878453039, 1)
CosineGreedyNumba: (0.5524861878453039, 1)
CosineHungarian: (0.994475138121547, 2)
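To make the difference visible without matchms, here is a minimal standard-library sketch of the same two-peak example. It is not the matchms implementation: the function names are made up for illustration, an exhaustive search over all one-to-one assignments stands in for the Hungarian algorithm (only feasible for tiny spectra), and the normalization is assumed to be the product of the two intensity vector norms.

```python
from itertools import permutations
from math import sqrt

# Peaks as (m/z, intensity) tuples, copied from the example above.
SPEC1 = [(100.005, 1.0), (100.016, 0.9)]
SPEC2 = [(100.005, 0.9), (100.01, 1.0)]
TOLERANCE = 0.01


def find_pairs(spec_a, spec_b, tolerance):
    """Map (i, j) -> intensity product for every peak pair within tolerance."""
    return {
        (i, j): int_a * int_b
        for i, (mz_a, int_a) in enumerate(spec_a)
        for j, (mz_b, int_b) in enumerate(spec_b)
        if abs(mz_a - mz_b) <= tolerance
    }


def greedy_assignment(pairs):
    """Take the heaviest pair first, then skip pairs that reuse a peak."""
    used_a, used_b, total, count = set(), set(), 0.0, 0
    for (i, j), weight in sorted(pairs.items(), key=lambda kv: kv[1], reverse=True):
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += weight
            count += 1
    return total, count


def optimal_assignment(pairs, n_a, n_b):
    """Try every one-to-one assignment (assumes n_a <= n_b and tiny spectra)."""
    best = (0.0, 0)
    for perm in permutations(range(n_b), n_a):
        weights = [pairs[(i, j)] for i, j in enumerate(perm) if (i, j) in pairs]
        best = max(best, (sum(weights), len(weights)))
    return best


def cosine_norm(spec_a, spec_b):
    """Assumed normalization: product of the two intensity vector norms."""
    return sqrt(sum(i ** 2 for _, i in spec_a)) * sqrt(sum(i ** 2 for _, i in spec_b))


pairs = find_pairs(SPEC1, SPEC2, TOLERANCE)
norm = cosine_norm(SPEC1, SPEC2)
greedy_total, greedy_count = greedy_assignment(pairs)
optimal_total, optimal_count = optimal_assignment(pairs, len(SPEC1), len(SPEC2))
greedy_score = greedy_total / norm
optimal_score = optimal_total / norm
print("greedy: ", greedy_score, greedy_count)
print("optimal:", optimal_score, optimal_count)
```

The greedy pass grabs the single heaviest pair (1.0 x 1.0) and thereby blocks both 0.9 x 0.9 pairs, while the exhaustive search keeps the two lighter pairs, whose sum is larger. For realistic spectrum sizes, `scipy.optimize.linear_sum_assignment` is the usual way to solve the same assignment problem.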

@cwmeijer cwmeijer self-requested a review May 29, 2020 08:51
Collaborator

@cwmeijer cwmeijer left a comment

Nice addition. Good tests!

matchms/similarity/CosineHungarian.py (outdated)
matchms/similarity/CosineHungarian.py
def get_matching_pairs():
    """Get pairs of peaks that match within the given tolerance."""
    matching_pairs = collect_peak_pairs(spec1, spec2, self.tolerance, shift=0.0)
    matching_pairs = sorted(matching_pairs, key=lambda x: x[2], reverse=True)
Collaborator

Don't reuse variable names if possible. In this case you can just do:
return sorted(...)
This saves one line and you don't reuse the name. It even performs slightly better, although I'm sure the difference is negligible ;-)

Collaborator

@florian-huber florian-huber Jun 1, 2020

Good point, thanks. Done.


def calc_score():
    """Calculate cosine similarity score."""
    used_matches = []
Collaborator

These first 5 lines can all move into the True branch of the if statement.

Also, this is quite a long function. It looks as if you could easily extract a function or two to reduce the function size and the number of variables in scope, and to help the reader a bit with function names.
One obvious candidate would be:
def normalize_score(score, spec1, spec2):
    return score / max(numpy.sum(spec1[:, 1]**2), numpy.sum(spec2[:, 1]**2))

That gets rid of one comment and helps readability a bit, I'd say, if you don't mind one-liner functions (which I don't).

You could also extract the part that the comment refers to and call it 'solve_hungarian' or something like that.

Another option for function extraction would be everything in the True branch of 'if len(matching_pairs) > 0:'. The if statement would then read something like:
score, n_used_matches = calc_score_with_matches(matching_pairs) if len(matching_pairs) > 0 else (0,0)
Maybe that last option is a bit too complicated, but I think it's OK. The line would be hard to fit within 79 characters, though, which would hurt readability a bit.
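One way the extraction suggested here could look is sketched below. It is not the code that was merged: `calc_score_with_matches` is only the hypothetical name used in this comment, plain lists of intensities stand in for the numpy arrays, and the helper assumes the pairs it receives are already one-to-one (the greedy or Hungarian assignment has happened earlier). Using an early return instead of a conditional expression sidesteps the 79-character problem.

```python
def normalize_score(score, intensities1, intensities2):
    """Divide a raw score by the larger sum of squared intensities."""
    return score / max(sum(x ** 2 for x in intensities1),
                       sum(x ** 2 for x in intensities2))


def calc_score_with_matches(matching_pairs, intensities1, intensities2):
    """Hypothetical helper: score a list of (idx1, idx2, weight) pairs.

    Assumes the pairs are already one-to-one, so the assignment step
    has been done before this call.
    """
    raw_score = sum(weight for _, _, weight in matching_pairs)
    score = normalize_score(raw_score, intensities1, intensities2)
    return score, len(matching_pairs)


def calc_score(matching_pairs, intensities1, intensities2):
    """Return (score, number of used matches); (0.0, 0) if nothing matched."""
    if not matching_pairs:
        return 0.0, 0
    return calc_score_with_matches(matching_pairs, intensities1, intensities2)
```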

Collaborator

I tried to restructure the score calculation along your suggestions (I picked slightly different sub-functions though). Hope that makes it look better.

@@ -0,0 +1,97 @@
import numpy
Collaborator

Could you somehow clarify how you arrived at the 'expected' values in all the tests? Did you calculate them manually or something like that? I'm just wondering what someone would think if any of these tests started to fail at some point and you were not there to explain what you meant.
If the calculation is simple and clear, you may want to include it in the test instead of these 'magic' ground truth values.

Collaborator

I changed the test cases to simpler ones where I now compare the actual scores with expected values that are derived within the same test (same as for your similar comment in #38).
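As an illustration of that pattern (not the tests from this PR), a test can spell out the arithmetic behind the expected value instead of hard-coding a number. In this sketch `cosine_aligned` is a made-up stand-in for the code under test, and it assumes the peaks are already paired by index.

```python
from math import isclose, sqrt


def cosine_aligned(intensities1, intensities2):
    """Stand-in for the code under test: cosine score of index-aligned peaks."""
    dot = sum(a * b for a, b in zip(intensities1, intensities2))
    norm1 = sqrt(sum(a ** 2 for a in intensities1))
    norm2 = sqrt(sum(b ** 2 for b in intensities2))
    return dot / (norm1 * norm2)


def test_cosine_aligned_two_peaks():
    intensities1 = [1.0, 0.5]
    intensities2 = [0.5, 1.0]
    # Derive the expected value step by step so a failing test explains itself.
    expected_dot = 1.0 * 0.5 + 0.5 * 1.0
    expected_norm = sqrt(1.0 ** 2 + 0.5 ** 2) * sqrt(0.5 ** 2 + 1.0 ** 2)
    expected = expected_dot / expected_norm
    assert isclose(cosine_aligned(intensities1, intensities2), expected)


test_cosine_aligned_two_peaks()
```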

@florian-huber florian-huber requested a review from cwmeijer June 2, 2020 06:44

sonarqubecloud bot commented Jun 2, 2020

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A), Security Hotspots to review: 0
Code Smells: 0 (rating A)

Coverage: 100.0%
Duplication: 0.0%

Collaborator

fdiblen commented Jun 2, 2020

Refs #6

@florian-huber
Collaborator

Thanks @cwmeijer for the review!

@florian-huber florian-huber merged commit 506f035 into master Jun 2, 2020
@florian-huber florian-huber deleted the 59-cosine-hungarian branch June 2, 2020 14:06