-
Notifications
You must be signed in to change notification settings - Fork 663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Capped distances to evaluate the indices of pairs within a fixed radius #1941
Introduce Capped distances to evaluate the indices of pairs within a fixed radius #1941
Conversation
If all your capped distance methods have the same signature, you can have them in a dictionary which would avoid a long distance_methods = {
'bruteforce': _capped_brutefore,
'pkdt': _capped_pkdt,
'octree': _octree,
}
if method is None:
distance_function = _determine_method(...)
elif method.lower() in distance_methods:
distance_function = distance_methods[method.lower()]
else:
raise ValueError('The is no known method called "{}".'.format(method)) An other benefit is that the method argument can take a callable with the same signature which makes your function extremely easy to play with: if I find an awesome method in an other library, I can write a small wrapper function that use the agreed signature, and I can use my alien function as a method without modifying MDAnalysis code. |
package/MDAnalysis/lib/distances.py
Outdated
|
||
for i, coords in enumerate(reference): | ||
dist = distance_array(coords[None, :], configuration, box=box)[0] | ||
idx = np.where((dist <= max_cutoff) & (dist > min_cutoff))[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Min cutoff here means that it will never return a distance of 0, which sometimes it should. I think it'd be better to have min_cutoff=None
as default and only apply this condition if it is defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be None, or we can also put a 0 as a default and dist >= min_cutoff
, which will catch the distance of 0 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would also work. it's also a needless comparison too though (performance)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of floating point precision, 0 could very much be -1e-16 which would be lesser than 0.
_check_array(reference, 'reference') | ||
_check_array(configuration, 'configuration') | ||
|
||
for i, coords in enumerate(reference): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this way of iterating over one coordinate set is probably slower than distance_array(ref, conf)
. I think guess_bonds
does it this way to avoid memory problems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly. The distance_array
cannot go beyond 5k particles. So unless we are explicitly limiting the brute force algorithm to 5k, shouldn't the more robust way be a part of the calculation. I agree once we put in the rules, the iteration over each particle would be about half as fast as the distance_array and should be replaced.
package/MDAnalysis/lib/distances.py
Outdated
for j in idx: | ||
pairs.append((i,j)) | ||
|
||
return pairs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions need to also return the distance values too, it's important
Something which might be interesting, which follows @jbarnoud 's idea of a dict is to register distance calculation methods in a similar way to how we do Readers. So something like: @distance_method('brute_force')
def brute_force():
pass
@distance_method('pkd_tree')
def pkd_tree():
pass Where having the same signature becomes essential. This would let new methods be dropped in without having to mess inside MDA. Not sure how this would work with the guessing methods.... |
The guessing method an only work on the methods that are known as the method is written. The only risk there is if a distance method name is overwritten by the decorator and the guessing method refer to the distance method by shortcut name rather than by callback. If the guessing method refers to the functions directly, I see no problem with the decorator except maybe the shadow of over-engineering flying nearby. |
It is little tricky to work with decorators. I will propose what I understood based on the comments. I can change accordingly in the files. I am also not sure whether it is sensible to place every capped functions in the distances.py or should they be seperated. def distance_method(*outer_args, **outer_kwargs):
def decorator(function):
def new_function(*args, **kwargs):
pairs, distance = function(*args, **kwargs)
return pairs, distance
return new_function
return decorator The brute force for capped distances can be directly wrapped using the decorator as @distance_method('brute_force')
def capped_bruteforce(reference, configuration, max_cutoff, min_cutoff = 0, box = None):
pairs, distance = [], []
if reference.shape == (3,):
reference = reference[None, :]
if configuration.shape == (3,):
configuration = configuration[None, :]
_check_array(reference, 'reference')
_check_array(configuration, 'configuration')
for i, coords in enumerate(reference):
dist = distance_array(coords[None, :], configuration, box=box)[0]
idx = np.where((dist <= max_cutoff) & (dist >= min_cutoff))[0]
for j in idx:
pairs.append((i,j))
distance.append(dist[j])
return pairs, distance
@distance_method('pkd_tree')
def pkd_tree():
pass and then the higher level function to calculate the pairs and distances, which will be called by the user. def capped_distance(reference, configuration, max_cutoff, min_cutoff = 0, box = None, method = None):
if box is not None:
boxtype = _box_check(box)
# Convert [A,B,C,alpha,beta,gamma] to [[A],[B],[C]]
if (boxtype == 'tri_box'):
box = triclinic_vectors(box)
if (boxtype == 'tri_vecs_bad'):
box = triclinic_vectors(triclinic_box(box[0], box[1], box[2]))
distance_methods = {
'brute_force': capped_bruteforce,
'pkdtree': capped_pkdtree
}
if method is None:
distance_function = distance_methods[_determine_method(reference, configuration, max_cutoff, box = box)]
elif method.lower() in distance_methods:
distance_function = distance_methods[method.lower()]
else:
raise ValueError('The is no known method called "{}".'.format(method))
pairs, distance = distance_function(reference, configuration, max_cutoff, min_cutoff = 0, box = box)
return pairs, distance |
|
package/MDAnalysis/lib/distances.py
Outdated
@@ -389,6 +389,118 @@ def self_distance_array(reference, box=None, result=None, backend="serial"): | |||
|
|||
return distances | |||
|
|||
def capped_distance(reference, configuration, max_cutoff, min_cutoff = 0, box = None, method = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the spaces for the kwargs don't follow pep8 here. something like flake8 etc can sort this out
@richardjgowers Did you mean to reply to someone else? 😄 |
package/MDAnalysis/lib/distances.py
Outdated
max_cutoff : float32 | ||
The maximum cutoff distance. Only the pair of indices which | ||
have smaller distance than this cutoff distance will be returned. | ||
min_cutoff : float32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be documented as optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking better but needs some fixes
package/MDAnalysis/lib/distances.py
Outdated
|
||
Returns | ||
------- | ||
indices : list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be returned as numpy arrays
package/MDAnalysis/lib/distances.py
Outdated
provided as ``cutoff`` in arguments. | ||
distances : list | ||
Distances corresponding to each pair of indices. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simple usage example here
package/MDAnalysis/lib/distances.py
Outdated
configuration : numpy.array of type numpy.float32 | ||
All the attributes are similar to reference array. Essentially | ||
second group of atoms. | ||
max_cutoff : float32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just a float, it doesn't have to be float32
package/MDAnalysis/lib/distances.py
Outdated
|
||
Parameters | ||
----------- | ||
reference : numpy.array of type numpy.float32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two arrays don't need to be float32, it's just bruteforce which needs that. You can take care of the types in _bruteforce
where it's necessary
package/MDAnalysis/lib/distances.py
Outdated
|
||
Note | ||
----- | ||
Currently only supports brute force. Similar methods |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seealso distance_array
package/MDAnalysis/lib/distances.py
Outdated
@@ -390,6 +390,131 @@ def self_distance_array(reference, box=None, result=None, backend="serial"): | |||
return distances | |||
|
|||
|
|||
def capped_distance(reference, configuration, max_cutoff, min_cutoff=None, box=None, method=None): | |||
"""Returns the list of indices of reference and configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a clearer one line description
package/MDAnalysis/lib/distances.py
Outdated
An optional keyword for the method is provided to allow usage of other | ||
implemented methods. Currently pkdtree and bruteforce | ||
can be used. Depending on the number of particles | ||
and distribution, the times may vary significantly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not clear what's meant by the times may vary
. Make it clear that there's some automatic guessing of the best method, but users can also override this if they think they know better
|
||
def test_capped_distance_withpbc(): | ||
box = np.array([1., 1., 1., 90., 90., 90.], dtype=np.float32) | ||
point1 = np.array([0.1, 0.1, 0.1], dtype=np.float32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these tests aren't suitable. the coordinates are nearly always going to be arrays of coords (ie shape (n, 3)
) not single points. If you need to, generate ~100 random coords with a fixed seed, do it the slow way with distance_array
then check that this new method agrees. Use a seed which ensures that the pbc option makes a difference and the results have meaningful size (ie not always 0/all)
|
||
pairs, distances = mda.lib.distances.capped_distance(point1, point2, max_cutoff=0.2) | ||
|
||
assert_equal(len(pairs), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is worth testing the corner case of no results though
|
||
assert_equal(len(pairs), 0) | ||
|
||
def test_capped_distance_type(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't be required/forced
package/MDAnalysis/lib/distances.py
Outdated
|
||
Note | ||
----- | ||
Currently only supports brute force (see also distance_array). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the ..seealso:: :func:distance_array
(or something like that..) is a documentation markup that creates a link to the other function, sorry I wasn't clearer
package/MDAnalysis/lib/distances.py
Outdated
""" | ||
pairs, distance = [], [] | ||
# Check whether reference and configuration are numpy arrays | ||
if not isinstance(reference, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.asarray
does all the required checking in one line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More tests required!
|
||
|
||
def test_capped_distance_checkbrute(): | ||
box = np.array([1., 1., 1., 90., 90., 90.], dtype=np.float32) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's still more tests required here, eg checking array of points vs array of points, checking triclinic vs orthogonal boxes (can use a pytest.parametrize
for this)
if we're using np.random
we also need to seed the random generator so that the exact same test is ran each time. (otherwise maybe there's a 1 in 1,000,00 chance that the test fails with certain random conditions)
Where should we put the individual capped distances functions for their respective methods. For instance capped_pkdt should be placed in distances.py or in pkdtree.py. |
@ayushsuhane Both options could be argued for. I'll go with 'distances.py' to keep the feature contained. |
52743f1
to
cd3d29c
Compare
How can I get around the circular import i.e. imports of _check_box in pkdtree and importing |
@ayushsuhane you can import inside the function if it resolves a circular import |
These functions are all for any arbitrary set of reference and configuration particles. For PKDTree and/or other methods, how can I check whether the preprocessing data structure (tree with configuration particles is built or not) without using it specifically as an argument. |
I think I see your problem, but I am not completely sure. Could you be more specific? |
So, for instance, the desired attribute is to evaluate the bonds between particles. Every atom has a different radius. Ideally, neighbors of every particles should be first evaluated and then every neighbor should be checked against the sum of radius of particles. Since, the capped distance is written for arbitrary set of reference and configuration of particles, guessing bonds would require repeated calling of capped function with a one query at a time. |
Good, I understood the correct problem from your previous explanation. Assuming that you were not limited in the signatures, how would you solve the problem? |
package/MDAnalysis/lib/distances.py
Outdated
if (boxtype == 'tri_vecs_bad'): | ||
box = triclinic_vectors(triclinic_box(box[0], box[1], box[2])) | ||
if method is None: | ||
method = _determine_method(reference, configuration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could actually just return a function from _determine_method
rather than a string which we later compare. So
method = _determine_method()
pairs, dist = method(reference, configuration, etc)
This is preferable because they you don't have to type out the signature over and over again for each different method.
wrt specifying method, you would then have to change _determine_method
to accept a hint:
def _determine_method(ref, conf, max_cutoff, box=box, method=None):
if method is not None:
if method == 'bruteforce':
return _bruteforce_method
elif: etc etc
package/MDAnalysis/lib/distances.py
Outdated
The shape of the reference array should be ``(N, 3)`` | ||
or a (3, ) coordinate. | ||
configuration : array | ||
Similar to reference array and second group of coordinates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doc line doesn't seem to make sense?
package/MDAnalysis/lib/distances.py
Outdated
return np.array(pairs), np.array(dist) | ||
|
||
|
||
def _determine_method(reference, configuration, cutoff, box): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add min_cutoff to this now, in case it becomes useful later
package/MDAnalysis/lib/distances.py
Outdated
.. SeeAlso:: :func:'MDAnalysis.lib.distances.distance_array' | ||
.. SeeAlso:: :func:'MDAnalysis.lib.pkdtree.PeriodicKDTree' | ||
|
||
Similar methods can be defined and can be directly linked to this function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this level of documentation is made for users, so this line isn't useful and maybe confusing
The added rules in _determine_method are obtained from here |
package/MDAnalysis/lib/distances.py
Outdated
pairs, dist = method(reference, configuration, max_cutoff, | ||
min_cutoff=min_cutoff, box=box) | ||
|
||
return np.array(pairs), np.array(dist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be np.asarray
in case the methods already return an array
@richardjgowers Did you get a chance to look into pkdtree and kdtree? Also, should the self_capped_distance needs to be added in this PR or maybe around selection using capped distance OR should it be left for a different PR, since it is more of an application of the library function? |
Incase we want it here, I quickly changed the method definition in AroundSelection with capped_distance def _apply_KDTree(self, group):
self.method = 'pkdtree'
sel = self.sel.apply(group)
# All atoms in group that aren't in sel
sys = group[~np.in1d(group.indices, sel.indices)]
box = self.validate_dimensions(group.dimensions)
indx, dist = distances.capped_distance(sel.positions, sys.positions ,max_cutoff=self.cutoff, box=box, method=self.method)
return sys[np.unique(indx[:,1])] I ran the tests after this modification, and it is not breaking any of the tests. I was wondering how to deal with the general definition in the if flags['use_KDTree_routines'] in (True, 'fast', 'always'):
self.apply = self._apply_KDTree
else:
self.apply = self._apply_distmat
self.periodic = flags['use_periodic_selections']
# KDTree doesn't support periodic
if self.periodic:
self.apply = self._apply_distmat One possibility is to replace the 'if flags ..' condition directly with |
@@ -43,14 +43,6 @@ def test_transform_StoR_pass(coord_dtype): | |||
assert_equal(original_r, test_r) | |||
|
|||
|
|||
def test_transform_StoR_fail(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is included in test_transform_StoR_pass
using parametrize. I guess someone modified it recently which got reflected here after I performed a pull from the develop branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nearly done
method_1 = ('bruteforce', 'pkdtree') | ||
|
||
np.random.seed(90003) | ||
points = (np.random.uniform(low=0, high=1.0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two lines should be inside the test function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and now deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, didn't see it. Sorry about that.
size=(100, 3))*(boxes_1[0][:3])).astype(np.float32) | ||
|
||
|
||
@pytest.mark.parametrize('box, query , method', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you stack the decorators, you don't need the product call, it does it itself, ie
@pytest.mark.parametrize('box', ...)
@pytest.mark.parametrize('query', ...)
@pytest.mark.parametrize('method', ...)
def test_thing(box, query, method):
package/MDAnalysis/lib/distances.py
Outdated
@@ -1,4 +1,4 @@ | |||
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; -*- | |||
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this space might even break the emacs format line thing here?
Ok looks good, but needs documenting in the CHANGELOG then we're done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs tests for min_cutoff
Codecov Report
@@ Coverage Diff @@
## develop #1941 +/- ##
===========================================
+ Coverage 88.42% 88.43% +<.01%
===========================================
Files 142 142
Lines 17021 17096 +75
Branches 2595 2617 +22
===========================================
+ Hits 15051 15119 +68
- Misses 1377 1381 +4
- Partials 593 596 +3
Continue to review full report at Codecov.
|
@ayushsuhane the code report thing above has a link to distances.py showing which lines aren't covered by tests. Give it a look and see if you can add some more tests that cover those lines, eg min_distance isn't covered, maybe it doesn't work? |
@ayushsuhane awesome, PR merged! |
As discussed with @richardjgowers and @jbarnoud , the main idea is to create a functionality which opens up the method to evaluate the distances, as an optional keyword, whereas the default way would be to switch to the best method automatically based on the performance rules obtained from the benchmarks. Here, an initial functionality of capped distances is included using brute force method. The rules could be as simple as
if len(selection) > 1000
or more complicated based on the cutoff radius.PR Checklist