Re-design of Scores and similarity classes #135
Comments
I worked on a new implementation in PR #133 but kept feeling that we might be doing things a bit more complex than necessary.
Currently we offer several ways to compute this. I think it would be better to split consistently between a single score and an array of scores. And I do not see any advantage for users in being offered an additional naive implementation if a more optimized parallel implementation is in place.

class BaseSimilarityFunction:
    """Similarity function base class.

    When building a custom similarity measure, inherit from this class and implement
    the desired methods.
    """
    # Set key characteristics as class attributes
    is_commutative = True

    def compute_score(self, reference: SpectrumType, query: SpectrumType) -> float:
        """Required: Method to calculate the similarity for one input pair.

        Parameters
        ----------
        reference
            Single reference spectrum.
        query
            Single query spectrum.
        """
        raise NotImplementedError

    def compute_score_matrix(self, references: List[SpectrumType], queries: List[SpectrumType],
                             is_symmetric: bool = False) -> numpy.ndarray:
        """Optional: Provide optimized method to calculate a numpy.array of similarity scores
        for given reference and query spectrums. If no method is added here, the following naive
        implementation (i.e. a double for-loop) is used.

        Parameters
        ----------
        references
            List of reference objects
        queries
            List of query objects
        is_symmetric
            Set to True when *references* and *queries* are identical (as for instance for an all-vs-all
            comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about
            2x faster.
        """
        n_rows = len(references)
        n_cols = len(queries)
        scores = numpy.empty([n_rows, n_cols], dtype="object")
        for i_ref, reference in enumerate(references[:n_rows]):
            if is_symmetric and self.is_commutative:
                # Compute the upper triangle only and mirror it
                for i_query, query in enumerate(queries[i_ref:n_cols], start=i_ref):
                    scores[i_ref][i_query] = self.compute_score(reference, query)
                    scores[i_query][i_ref] = scores[i_ref][i_query]
            else:
                for i_query, query in enumerate(queries[:n_cols]):
                    scores[i_ref][i_query] = self.compute_score(reference, query)
        return scores

class Scores:
    ...
    def calculate(self) -> "Scores":
        """
        Calculate the similarity between all reference objects vs. all query objects using
        the most suitable available implementation of the given similarity_function.
        """
        if self.n_rows == self.n_cols == 1:
            self._scores = self.similarity_function.compute_score(self.references, self.queries)
        else:
            self._scores = self.similarity_function.compute_score_matrix(self.references,
                                                                         self.queries,
                                                                         is_symmetric=self.is_symmetric)
        return self
    ...

We wouldn't need to check whether the method has a parallel implementation. It will always have one, either naive or more optimized.
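To make the fallback idea above concrete, here is a minimal, self-contained sketch. It is not the matchms implementation: plain floats stand in for spectra, and the two toy measures (`AbsDiffNaive`, `AbsDiffVectorized`) are hypothetical names invented for illustration. The first subclass only implements the required pair method and inherits the naive double loop; the second also overrides the matrix method with a vectorized version.

```python
from typing import List

import numpy


class BaseSimilarityFunction:
    """Base class: subclasses must implement compute_score()."""
    is_commutative = True

    def compute_score(self, reference: float, query: float) -> float:
        raise NotImplementedError

    def compute_score_matrix(self, references: List[float], queries: List[float],
                             is_symmetric: bool = False) -> numpy.ndarray:
        """Naive fallback: call compute_score() for every required pair."""
        use_symmetry = is_symmetric and self.is_commutative
        scores = numpy.empty((len(references), len(queries)), dtype=float)
        for i, reference in enumerate(references):
            # With symmetry, only the upper triangle is computed and then mirrored.
            start = i if use_symmetry else 0
            for j in range(start, len(queries)):
                scores[i, j] = self.compute_score(reference, queries[j])
                if use_symmetry:
                    scores[j, i] = scores[i, j]
        return scores


class AbsDiffNaive(BaseSimilarityFunction):
    """Toy measure (hypothetical): implements only the required pair method."""

    def __init__(self):
        self.n_calls = 0  # counts pair evaluations, for illustration only

    def compute_score(self, reference: float, query: float) -> float:
        self.n_calls += 1
        return 1.0 / (1.0 + abs(reference - query))


class AbsDiffVectorized(AbsDiffNaive):
    """Same toy measure, with an optimized matrix method based on broadcasting."""

    def compute_score_matrix(self, references, queries, is_symmetric=False):
        refs = numpy.asarray(references, dtype=float).reshape(-1, 1)
        qrys = numpy.asarray(queries, dtype=float).reshape(1, -1)
        return 1.0 / (1.0 + numpy.abs(refs - qrys))


values = [1.0, 2.5, 4.0]
naive = AbsDiffNaive()
naive_scores = naive.compute_score_matrix(values, values, is_symmetric=True)
fast_scores = AbsDiffVectorized().compute_score_matrix(values, values)
```

In this sketch the caller never has to ask which implementation exists: both objects answer `compute_score_matrix()`, and the symmetric naive run evaluates only the pairs on and above the diagonal.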
Started prototyping/implementing this in PR #139
We could drop of

class BaseSimilarityFunction:
    """Similarity function base class.

    When building a custom similarity measure, inherit from this class and implement
    the desired methods.
    """
    # Set key characteristics as class attributes
    is_commutative = True

    def pair(self, reference: SpectrumType, query: SpectrumType) -> float:
        """Required: Method to calculate the similarity for one input pair.

        Parameters
        ----------
        reference
            Single reference spectrum.
        query
            Single query spectrum.
        """
        raise NotImplementedError

    def matrix(self, references: List[SpectrumType], queries: List[SpectrumType],
               is_symmetric: bool = False) -> "Scores":
        """Method to calculate a Scores object of similarity scores
        for given reference and query spectrums. If method is not overridden in subclass, a naive
        implementation (i.e. a double for-loop) is used.

        Parameters
        ----------
        references
            List of reference objects
        queries
            List of query objects
        is_symmetric
            Set to True when *references* and *queries* are identical (as for instance for an all-vs-all
            comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about
            2x faster.
        """
        n_rows = len(references)
        n_cols = len(queries)
        scores = numpy.empty([n_rows, n_cols], dtype="object")
        for i_ref, reference in enumerate(references[:n_rows]):
            if is_symmetric and self.is_commutative:
                # Compute the upper triangle only and mirror it
                for i_query, query in enumerate(queries[i_ref:n_cols], start=i_ref):
                    scores[i_ref][i_query] = self.pair(reference, query)
                    scores[i_query][i_ref] = scores[i_ref][i_query]
            else:
                for i_query, query in enumerate(queries[:n_cols]):
                    scores[i_ref][i_query] = self.pair(reference, query)
        return Scores(references, queries, scores)

class Scores:
    def __init__(self, references: ReferencesType, queries: QueriesType, scores: numpy.ndarray):
        self.n_rows = len(references)
        self.n_cols = len(queries)
        self.references = numpy.asarray(references).reshape(self.n_rows, 1)
        self.queries = numpy.asarray(queries).reshape(1, self.n_cols)
        self._scores = scores
        self._index = 0

    def __iter__(self):
        ...

    def __next__(self):
        ...

scores = CosineGreedy(tolerance=0.05).matrix(references, queries)

If you want to keep calculate_scores() use

def calculate_scores(references: ReferencesType, queries: QueriesType, sim: BaseSimilarityFunction):
    return sim.matrix(references, queries)

scores = calculate_scores(references, queries, CosineGreedy(tolerance=0.05))
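The `__iter__` and `__next__` bodies are left open in the proposal above. One possible way to fill them in is sketched here, purely as an assumption: iteration yields `(reference, query, score)` triples row by row, and references and queries are kept as plain lists rather than reshaped arrays. The proposal does not specify any of this.

```python
import numpy


class Scores:
    """Container pairing a score matrix with its references and queries (sketch)."""

    def __init__(self, references, queries, scores: numpy.ndarray):
        self.n_rows = len(references)
        self.n_cols = len(queries)
        if scores.shape != (self.n_rows, self.n_cols):
            raise ValueError("scores must have shape (len(references), len(queries))")
        self.references = list(references)
        self.queries = list(queries)
        self._scores = scores
        self._index = 0

    def __iter__(self):
        # Restart from the first element each time iteration begins.
        self._index = 0
        return self

    def __next__(self):
        if self._index >= self.n_rows * self.n_cols:
            raise StopIteration
        # Translate the flat index into (row, column), walking row by row.
        row, col = divmod(self._index, self.n_cols)
        self._index += 1
        return self.references[row], self.queries[col], self._scores[row, col]


# Strings stand in for spectra; the score values are arbitrary.
scores = Scores(["r0", "r1"], ["q0", "q1", "q2"], numpy.arange(6.0).reshape(2, 3))
triples = list(scores)
```

Because the object is its own iterator, two loops over the same `Scores` instance cannot be interleaved. Returning a fresh generator from `__iter__` would avoid that, at the cost of dropping `__next__`.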
I am a bit undecided here.
Second thought on it: I think I would prefer not adding Maybe it would be good enough to make it
OK, I agree that returning a numpy array is beneficial, but the base class could have two calculate methods: one that returns a numpy array and one that returns Scores. With this new setup, the spec2vec class would need to be a sub-class of the base class. You then get the method that calculates Scores for free by sub-classing; you just override the pair() and matrix() methods in the spec2vec class. Or don't you want the spec2vec class to be a sub-class of BaseSimilarityFunction?
Yep, I like making Scores._calculate() private; that will shrink the public API and reduce confusion about which calculate to use.
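The two-method idea can be sketched as follows. This is only an illustration under stated assumptions: `scores()` is a hypothetical name for the method that wraps the array in a Scores object, `LengthRatio` is a toy stand-in for a measure such as spec2vec, and non-empty strings stand in for spectra.

```python
import numpy


class Scores:
    """Minimal stand-in for the Scores container."""

    def __init__(self, references, queries, scores):
        self.references = references
        self.queries = queries
        self.scores = scores


class BaseSimilarityFunction:
    def pair(self, reference, query) -> float:
        raise NotImplementedError

    def matrix(self, references, queries) -> numpy.ndarray:
        """Return plain numpy scores; naive unless a subclass overrides it."""
        return numpy.array([[self.pair(r, q) for q in queries] for r in references],
                           dtype=float)

    def scores(self, references, queries) -> Scores:
        """Hypothetical convenience method, inherited unchanged by subclasses."""
        return Scores(references, queries, self.matrix(references, queries))


class LengthRatio(BaseSimilarityFunction):
    """Toy measure: only pair() is implemented, everything else is inherited."""

    def pair(self, reference, query) -> float:
        return min(len(reference), len(query)) / max(len(reference), len(query))


measure = LengthRatio()
as_array = measure.matrix(["ab", "abcd"], ["abcd"])
as_scores = measure.scores(["ab", "abcd"], ["abcd"])
```

A subclass that later adds a faster `matrix()` would automatically speed up `scores()` as well, since the wrapper only delegates.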
I agree here, I think we should make It seems to come down to two options.
What for me still speaks for (2) is that we can keep all the logic for choosing the right function to call in
Mhhh, pylint does not like the
@sverhoeven I have now implemented one version in PR #139 to see how it goes.
2nd point: Maybe we can mark Scores.calculate() as deprecated (using something like https://pypi.org/project/deprecation/) so that we can drop it in the next major release.
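As a rough sketch of what such a deprecation could look like: the package linked above provides a decorator for this, but to stay dependency-free the example below uses the standard library `warnings` module instead. The `Scores` class here is a toy stand-in, and the split into a private `_calculate()` worker plus a deprecated public wrapper is an assumption based on the discussion, not the actual matchms code.

```python
import warnings


class Scores:
    """Toy stand-in; the real class holds references, queries and a similarity function."""

    def __init__(self, values):
        self._values = values
        self._scores = None

    def _calculate(self):
        # Private worker that the rest of the code base would keep using.
        self._scores = list(self._values)
        return self

    def calculate(self):
        """Deprecated public entry point, kept for one more release cycle."""
        warnings.warn(
            "Scores.calculate() is deprecated and may be removed in the next major release.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not to this method
        )
        return self._calculate()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    result = Scores([0.1, 0.9]).calculate()
```

Existing callers keep working, but see a `DeprecationWarning` that points at their own call site.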
This has been addressed (and is now being worked on further in #153).
There was already a lengthy brainstorming session in #59, but that got quite long, so this issue is meant to pick it up.
Although it started with the simple idea of adding an is_symmetric option, it then boiled down to the fact that the current implementation has some redundancies, both in how scores can be calculated and in how they are implemented. I believe that this can make things unnecessarily complex for a user. After all, a user should be mostly concerned with choosing a similarity measure (e.g. CosineGreedy) and setting its parameters (e.g. tolerance=0.05), and then run the computation with a simple-to-understand command, such as one of the ways matchms currently offers.