59 cosine hungarian #40

jspaaks · 2020-05-20T15:17:36Z

Originally here matchms/matchms-backup#246

So far we implemented a greedy version of the cosine score, which is often used because it is faster and gives the same results in 99.9% (that's a guess) of actual cases.

The Hungarian algorithm would be the fully correct solution of the optimization problem, so it is nice to have it as well.

Here is an example where the greedy scores fail:

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy, CosineGreedyNumba, CosineHungarian

test_spectrum1 = Spectrum(mz=np.array([100.005, 100.016]),
                         intensities=np.array([1.0, 0.9]),
                         metadata={})

test_spectrum2 = Spectrum(mz=np.array([100.005, 100.01]),
                         intensities=np.array([0.9, 1.0]),
                         metadata={})

similarity_measure = CosineGreedy(tolerance=0.01)
print("CosineGreedy:", similarity_measure(test_spectrum1, test_spectrum2))

similarity_measure = CosineGreedyNumba(tolerance=0.01)
print("CosineGreedyNumba:", similarity_measure(test_spectrum1, test_spectrum2))

similarity_measure = CosineHungarian(tolerance=0.01)
print("CosineHungarian:", similarity_measure(test_spectrum1, test_spectrum2))

CosineGreedy: (0.5524861878453039, 1)
CosineGreedyNumba: (0.5524861878453039, 1)
CosineHungarian: (0.994475138121547, 2)
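
The numbers above can be reproduced outside matchms with a short sketch. This is not the matchms implementation: it assumes the score is the summed intensity products of the matched pairs divided by the product of the two spectra's intensity norms, and it uses `scipy.optimize.linear_sum_assignment` as the Hungarian-style solver. The helper names `greedy_score` and `hungarian_score` are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

mz1, int1 = np.array([100.005, 100.016]), np.array([1.0, 0.9])
mz2, int2 = np.array([100.005, 100.01]), np.array([0.9, 1.0])
tolerance = 0.01

# Intensity product for every peak pair within tolerance, 0 elsewhere.
within = np.abs(mz1[:, None] - mz2[None, :]) <= tolerance
products = np.where(within, np.outer(int1, int2), 0.0)

# Assumed normalization: product of the two intensity vector norms.
norm = np.sqrt(np.sum(int1 ** 2)) * np.sqrt(np.sum(int2 ** 2))


def greedy_score(products):
    """Repeatedly take the largest remaining product, then block its peaks."""
    remaining = products.copy()
    total, n_matches = 0.0, 0
    while remaining.max() > 0:
        i, j = np.unravel_index(np.argmax(remaining), remaining.shape)
        total += remaining[i, j]
        n_matches += 1
        remaining[i, :] = 0  # peak i of spectrum 1 is used up
        remaining[:, j] = 0  # peak j of spectrum 2 is used up
    return total, n_matches


def hungarian_score(products):
    """Pick the one-to-one assignment with the largest summed product."""
    rows, cols = linear_sum_assignment(products, maximize=True)
    matched = products[rows, cols]
    return matched.sum(), int(np.count_nonzero(matched))


greedy_total, greedy_n = greedy_score(products)
hungarian_total, hungarian_n = hungarian_score(products)
print("greedy:", round(greedy_total / norm, 4), greedy_n)           # 0.5525 1
print("hungarian:", round(hungarian_total / norm, 4), hungarian_n)  # 0.9945 2
```

The greedy pass grabs the single best pair (1.0 x 1.0), which blocks both remaining pairs; the optimal assignment instead takes the two 0.9 pairs, giving 1.8 / 1.81 rather than 1.0 / 1.81.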

cwmeijer

Nice addition. Good tests!


cwmeijer · 2020-05-29T09:36:06Z

matchms/similarity/CosineHungarian.py

+        def get_matching_pairs():
+            """Get pairs of peaks that match within the given tolerance."""
+            matching_pairs = collect_peak_pairs(spec1, spec2, self.tolerance, shift=0.0)
+            matching_pairs = sorted(matching_pairs, key=lambda x: x[2], reverse=True)


Don't reuse variable names if possible. In this case you can just do:
return sorted(...)
This saves one line and you don't reuse the name. It even performs slightly better, although I'm sure that's negligible ;-)

Good point, thanks. Done.
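
For readers following along, the suggested shape might look like the sketch below. `collect_peak_pairs_stub` is a hypothetical stand-in for the real helper, and the assumption that each tuple is (index in spectrum 1, index in spectrum 2, intensity product) is inferred from the `x[2]` sort key in the diff, not confirmed by it.

```python
def collect_peak_pairs_stub():
    """Hypothetical stand-in: (idx1, idx2, intensity product) tuples."""
    return [(0, 0, 0.9), (0, 1, 1.0), (1, 1, 0.9)]


def get_matching_pairs():
    """Get pairs of peaks that match, largest intensity product first."""
    # Return the sorted result directly instead of rebinding a local name.
    return sorted(collect_peak_pairs_stub(), key=lambda x: x[2], reverse=True)


print(get_matching_pairs())
```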

cwmeijer · 2020-05-29T09:57:30Z

matchms/similarity/CosineHungarian.py

+
+        def calc_score():
+            """Calculate cosine similarity score."""
+            used_matches = []


These first 5 lines can all move into the True branch of the if statement.

Also, quite a long function. It looks as if you could easily extract a function or two to reduce function size and the number of variables in scope, and to help the reader a bit with function names.
One obvious one would be:
def normalize_score(score, spec1, spec2):
    return score / max(numpy.sum(spec1[:, 1]**2), numpy.sum(spec2[:, 1]**2))

Gets rid of 1 comment, helps readability a bit I'd say if you don't mind 1 liner functions (which I don't).

You could also extract the part that the comment refers to and call it 'solve_hungarian' or something like that.

Another option for function extraction I think would be everything in the True branch of 'if len(matching_pairs) > 0:'. The if statement would then read something like:
score, n_used_matches = calc_score_with_matches(matching_pairs) if len(matching_pairs) > 0 else (0,0)
Maybe that last option is a bit too complicated but I think it's ok. The line would be a bit hard to fit within 79 characters though which would damage readability a bit.

I tried to restructure the score calculation along your suggestions (I picked slightly different sub-functions though). Hope that makes it look better.
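
A rough sketch of how the extraction suggested in the review could fit together is given here. It is not the code that was merged (the author notes picking slightly different sub-functions). It assumes spectra are 2D arrays with m/z in column 0 and intensity in column 1, that matching pairs are (idx1, idx2, product) tuples, and it uses `scipy.optimize.linear_sum_assignment` for the assignment step. `normalize_score` follows the reviewer's one-liner as written; `solve_hungarian` and `calc_score_with_matches` are the names floated in the comment, with signatures invented for this example.

```python
import numpy
from scipy.optimize import linear_sum_assignment


def normalize_score(score, spec1, spec2):
    """Normalize as in the review comment: larger sum of squared intensities."""
    return score / max(numpy.sum(spec1[:, 1] ** 2), numpy.sum(spec2[:, 1] ** 2))


def solve_hungarian(matching_pairs, n_peaks1, n_peaks2):
    """Select the one-to-one subset of pairs with the largest summed product."""
    gains = numpy.zeros((n_peaks1, n_peaks2))
    for idx1, idx2, product in matching_pairs:
        gains[idx1, idx2] = product
    rows, cols = linear_sum_assignment(gains, maximize=True)
    # Drop assignments that do not correspond to a real matching pair.
    return [(r, c, gains[r, c]) for r, c in zip(rows, cols) if gains[r, c] > 0]


def calc_score_with_matches(matching_pairs, spec1, spec2):
    """Score and number of used matches for a non-empty list of pairs."""
    used_matches = solve_hungarian(matching_pairs, len(spec1), len(spec2))
    score = sum(product for _, _, product in used_matches)
    return normalize_score(score, spec1, spec2), len(used_matches)


def calc_score(matching_pairs, spec1, spec2):
    """Calculate cosine similarity score, or (0, 0) when nothing matches."""
    if len(matching_pairs) > 0:
        return calc_score_with_matches(matching_pairs, spec1, spec2)
    return 0.0, 0


spec1 = numpy.array([[100.005, 1.0], [100.016, 0.9]])
spec2 = numpy.array([[100.005, 0.9], [100.01, 1.0]])
pairs = [(0, 0, 0.9), (0, 1, 1.0), (1, 1, 0.9)]
print(calc_score(pairs, spec1, spec2))
```

Splitting the early return into `calc_score` keeps the conditional on two short lines, which sidesteps the 79-character concern raised about the single-line conditional expression.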

cwmeijer · 2020-05-29T11:31:32Z

tests/test_cosine_hungarian.py

@@ -0,0 +1,97 @@
+import numpy


Could you somehow clarify how you got to the 'expected' values in all the tests? Did you calculate them manually or something like that? I'm just thinking about what one would think if any of these tests started to fail at some point and you were not there to explain what you meant.
If the calculation is simple and clear, you may want to include it in the test as opposed to these 'magic' ground truth values.

I changed the test cases to simpler ones where I now compare the actual scores with expected values that are derived within the same test (same as for your similar comment in #38).
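
As an illustration of that pattern (not the actual test file), a test can compute its expected value from the same intensities it feeds in, so a future failure is traceable without magic numbers. `cosine_hungarian_stub` below is a hypothetical stand-in for the scorer under test, and the normalization by the product of intensity norms is an assumption.

```python
import numpy
from scipy.optimize import linear_sum_assignment


def cosine_hungarian_stub(mz1, int1, mz2, int2, tolerance):
    """Hypothetical stand-in for the scorer under test."""
    within = numpy.abs(mz1[:, None] - mz2[None, :]) <= tolerance
    gains = numpy.where(within, numpy.outer(int1, int2), 0.0)
    rows, cols = linear_sum_assignment(gains, maximize=True)
    matched = gains[rows, cols]
    norm = numpy.sqrt(numpy.sum(int1 ** 2) * numpy.sum(int2 ** 2))
    return matched.sum() / norm, int(numpy.count_nonzero(matched))


def test_cosine_hungarian_two_of_three_peaks_match():
    mz1 = numpy.array([100.0, 200.0, 300.0])
    int1 = numpy.array([0.1, 0.5, 1.0])
    mz2 = numpy.array([100.0, 200.0, 350.0])  # last peak is out of tolerance
    int2 = numpy.array([0.2, 0.6, 1.0])

    score, n_matches = cosine_hungarian_stub(mz1, int1, mz2, int2, tolerance=0.1)

    # Expected value derived here: only the first two peaks can be paired.
    expected_matches = 2
    expected_score = (0.1 * 0.2 + 0.5 * 0.6) / numpy.sqrt(
        numpy.sum(int1 ** 2) * numpy.sum(int2 ** 2))

    assert n_matches == expected_matches
    assert numpy.isclose(score, expected_score)


test_cosine_hungarian_two_of_three_peaks_match()
```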

sonarqubecloud · 2020-06-02T12:55:54Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
0 Code Smells

100.0% Coverage
0.0% Duplication

fdiblen · 2020-06-02T14:04:00Z

Refs #6

florian-huber · 2020-06-02T14:05:41Z

Thanks @cwmeijer for the review!

florian-huber added 2 commits May 20, 2020 12:59

add CosineHungarian similarity

fd8d3f8

add numba

aca8038

jspaaks mentioned this pull request May 20, 2020

Adding CosineHungarian similarity score matchms/matchms-backup#246

Closed

florian-huber added 10 commits May 20, 2020 22:02

remove inside function peak normalization

add7aae

add numba to meta.yaml

10fdaaf

Merge branch 'master' into 59-cosine-hungarian

d9a7b16

added numba to environment-dev.yaml

75dd7d1

linting

7db9dbe

move collect_peak_pairs to separate file

e40d72d

Merge branch 'master' into 59-cosine-hungarian

3ed6add

extend docstring

06245ac

add unit tests

0af6052

test linting

8314e00

cwmeijer self-requested a review May 29, 2020 08:51

cwmeijer requested changes May 29, 2020


florian-huber added 4 commits May 31, 2020 17:53

make test cases clearer

7962d5d

restructured cosine score calculation

466585a

fix small bug

ab5165c

add test case for higher coverage

91b030e

florian-huber requested a review from cwmeijer June 2, 2020 06:44

cwmeijer approved these changes Jun 2, 2020


florian-huber added 4 commits June 2, 2020 11:35

Merge branch 'master' into 59-cosine-hungarian

530f0a7

fix init

b0305af

add addition to changelog

b8da8ac

Merge branch 'master' into 59-cosine-hungarian

c56784f

fdiblen mentioned this pull request Jun 2, 2020

Move cosine_score_hungarian to its own function file under /similarity/ #6

Closed

1 task

florian-huber merged commit 506f035 into master Jun 2, 2020

florian-huber deleted the 59-cosine-hungarian branch June 2, 2020 14:06
