Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Capped distances to evaluate the indices of pairs within a fixed radius #1941

Merged
merged 21 commits into from
Jul 12, 2018

Conversation

ayushsuhane
Copy link
Contributor

@ayushsuhane ayushsuhane commented Jun 15, 2018

As discussed with @richardjgowers and @jbarnoud , the main idea is to create a functionality which opens up the method to evaluate the distances, as an optional keyword, whereas the default way would be to switch to the best method automatically based on the performance rules obtained from the benchmarks. Here, an initial functionality of capped distances is included using brute force method. The rules could be as simple as if len(selection) > 1000 or more complicated based on the cutoff radius.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@coveralls
Copy link

coveralls commented Jun 15, 2018

Coverage Status

Coverage decreased (-0.07%) to 89.884% when pulling 1068578 on ayushsuhane:cappedfeature into 1a1b2f1 on MDAnalysis:develop.

@jbarnoud
Copy link
Contributor

If all your capped distance methods have the same signature, you can have them in a dictionary which would avoid a long if...elif...elif...:

distance_methods = {
    'bruteforce': _capped_brutefore,
    'pkdt': _capped_pkdt,
    'octree': _octree,
}
if method is None:
    distance_function = _determine_method(...)
elif method.lower() in distance_methods:
    distance_function = distance_methods[method.lower()]
else:
    raise ValueError('The is no known method called "{}".'.format(method))

An other benefit is that the method argument can take a callable with the same signature which makes your function extremely easy to play with: if I find an awesome method in an other library, I can write a small wrapper function that use the agreed signature, and I can use my alien function as a method without modifying MDAnalysis code.


for i, coords in enumerate(reference):
dist = distance_array(coords[None, :], configuration, box=box)[0]
idx = np.where((dist <= max_cutoff) & (dist > min_cutoff))[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Min cutoff here means that it will never return a distance of 0, which sometimes it should. I think it'd be better to have min_cutoff=None as default and only apply this condition if it is defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be None, or we can also put a 0 as a default and dist >= min_cutoff, which will catch the distance of 0 as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would also work. it's also a needless comparison too though (performance)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of floating point precision, 0 could very much be -1e-16 which would be lesser than 0.

_check_array(reference, 'reference')
_check_array(configuration, 'configuration')

for i, coords in enumerate(reference):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this way of iterating over one coordinate set is probably slower than distance_array(ref, conf). I think guess_bonds does it this way to avoid memory problems

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly. The distance_array cannot go beyond 5k particles. So unless we are explicitly limiting the brute force algorithm to 5k, shouldn't the more robust way be a part of the calculation. I agree once we put in the rules, the iteration over each particle would be about half as fast as the distance_array and should be replaced.

for j in idx:
pairs.append((i,j))

return pairs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions need to also return the distance values too, it's important

@richardjgowers
Copy link
Member

Something which might be interesting, which follows @jbarnoud 's idea of a dict is to register distance calculation methods in a similar way to how we do Readers. So something like:

@distance_method('brute_force')
def brute_force():
    pass

@distance_method('pkd_tree')
def pkd_tree():
    pass

Where having the same signature becomes essential. This would let new methods be dropped in without having to mess inside MDA. Not sure how this would work with the guessing methods....

@jbarnoud
Copy link
Contributor

Not sure how this would work with the guessing methods....

The guessing method an only work on the methods that are known as the method is written. The only risk there is if a distance method name is overwritten by the decorator and the guessing method refer to the distance method by shortcut name rather than by callback. If the guessing method refers to the functions directly, I see no problem with the decorator except maybe the shadow of over-engineering flying nearby.

@ayushsuhane
Copy link
Contributor Author

ayushsuhane commented Jun 17, 2018

It is little tricky to work with decorators. I will propose what I understood based on the comments. I can change accordingly in the files. I am also not sure whether it is sensible to place every capped functions in the distances.py or should they be seperated.
First, we need to define the decorator with arguments as:

def distance_method(*outer_args, **outer_kwargs):
    def decorator(function):
        def new_function(*args, **kwargs):
            pairs, distance = function(*args, **kwargs)
            return pairs, distance
        return new_function
    return decorator

The brute force for capped distances can be directly wrapped using the decorator as

@distance_method('brute_force')
def capped_bruteforce(reference, configuration, max_cutoff, min_cutoff = 0, box = None):
    pairs, distance = [], []
    if reference.shape == (3,):
        reference = reference[None, :]
    if configuration.shape == (3,):
        configuration = configuration[None, :]

    _check_array(reference, 'reference')
    _check_array(configuration, 'configuration')
    for i, coords in enumerate(reference):
        dist = distance_array(coords[None, :], configuration, box=box)[0]
        idx = np.where((dist <= max_cutoff) & (dist >= min_cutoff))[0]
        for j in idx:
            pairs.append((i,j))
            distance.append(dist[j])
    return pairs, distance

@distance_method('pkd_tree')
def pkd_tree():
    pass

and then the higher level function to calculate the pairs and distances, which will be called by the user.

def capped_distance(reference, configuration, max_cutoff, min_cutoff = 0, box = None, method = None):
    if box is not None:
        boxtype = _box_check(box)
        # Convert [A,B,C,alpha,beta,gamma] to [[A],[B],[C]]
        if (boxtype == 'tri_box'):
            box = triclinic_vectors(box)
        if (boxtype == 'tri_vecs_bad'):
            box = triclinic_vectors(triclinic_box(box[0], box[1], box[2]))
    
    distance_methods = {
    'brute_force': capped_bruteforce,
        'pkdtree': capped_pkdtree
    }
    
    if method is None:
        distance_function = distance_methods[_determine_method(reference, configuration, max_cutoff, box = box)]
    elif method.lower() in distance_methods:
        distance_function = distance_methods[method.lower()]
    else:
        raise ValueError('The is no known method called "{}".'.format(method))
    
    pairs, distance = distance_function(reference, configuration, max_cutoff, min_cutoff = 0,  box = box)
    return pairs, distance

@richardjgowers
Copy link
Member

richardjgowers commented Jun 18, 2018

@utkbansal @ayushsuhane on second thought, I think the decorators are just overcomplicating it for now, we can leave them out

@@ -389,6 +389,118 @@ def self_distance_array(reference, box=None, result=None, backend="serial"):

return distances

def capped_distance(reference, configuration, max_cutoff, min_cutoff = 0, box = None, method = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the spaces for the kwargs don't follow pep8 here. something like flake8 etc can sort this out

@utkbansal
Copy link
Member

@richardjgowers Did you mean to reply to someone else? 😄

max_cutoff : float32
The maximum cutoff distance. Only the pair of indices which
have smaller distance than this cutoff distance will be returned.
min_cutoff : float32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be documented as optional

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking better but needs some fixes


Returns
-------
indices : list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be returned as numpy arrays

provided as ``cutoff`` in arguments.
distances : list
Distances corresponding to each pair of indices.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simple usage example here

configuration : numpy.array of type numpy.float32
All the attributes are similar to reference array. Essentially
second group of atoms.
max_cutoff : float32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just a float, it doesn't have to be float32


Parameters
-----------
reference : numpy.array of type numpy.float32
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two arrays don't need to be float32, it's just bruteforce which needs that. You can take care of the types in _bruteforce where it's necessary


Note
-----
Currently only supports brute force. Similar methods
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seealso distance_array

@@ -390,6 +390,131 @@ def self_distance_array(reference, box=None, result=None, backend="serial"):
return distances


def capped_distance(reference, configuration, max_cutoff, min_cutoff=None, box=None, method=None):
"""Returns the list of indices of reference and configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a clearer one line description

An optional keyword for the method is provided to allow usage of other
implemented methods. Currently pkdtree and bruteforce
can be used. Depending on the number of particles
and distribution, the times may vary significantly.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not clear what's meant by the times may vary. Make it clear that there's some automatic guessing of the best method, but users can also override this if they think they know better


def test_capped_distance_withpbc():
box = np.array([1., 1., 1., 90., 90., 90.], dtype=np.float32)
point1 = np.array([0.1, 0.1, 0.1], dtype=np.float32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests aren't suitable. the coordinates are nearly always going to be arrays of coords (ie shape (n, 3)) not single points. If you need to, generate ~100 random coords with a fixed seed, do it the slow way with distance_array then check that this new method agrees. Use a seed which ensures that the pbc option makes a difference and the results have meaningful size (ie not always 0/all)


pairs, distances = mda.lib.distances.capped_distance(point1, point2, max_cutoff=0.2)

assert_equal(len(pairs), 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is worth testing the corner case of no results though


assert_equal(len(pairs), 0)

def test_capped_distance_type():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't be required/forced


Note
-----
Currently only supports brute force (see also distance_array).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ..seealso:: :func:distance_array (or something like that..) is a documentation markup that creates a link to the other function, sorry I wasn't clearer

"""
pairs, distance = [], []
# Check whether reference and configuration are numpy arrays
if not isinstance(reference, np.ndarray):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.asarray does all the required checking in one line

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests required!



def test_capped_distance_checkbrute():
box = np.array([1., 1., 1., 90., 90., 90.], dtype=np.float32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's still more tests required here, eg checking array of points vs array of points, checking triclinic vs orthogonal boxes (can use a pytest.parametrize for this)

if we're using np.random we also need to seed the random generator so that the exact same test is ran each time. (otherwise maybe there's a 1 in 1,000,00 chance that the test fails with certain random conditions)

@ayushsuhane
Copy link
Contributor Author

Where should we put the individual capped distances functions for their respective methods. For instance capped_pkdt should be placed in distances.py or in pkdtree.py.

@jbarnoud
Copy link
Contributor

@ayushsuhane Both options could be argued for. I'll go with 'distances.py' to keep the feature contained.

@ayushsuhane
Copy link
Contributor Author

How can I get around the circular import i.e. imports of _check_box in pkdtree and importing PeriodicKDTree in pdkt_capped function?

@richardjgowers
Copy link
Member

@ayushsuhane you can import inside the function if it resolves a circular import

@ayushsuhane
Copy link
Contributor Author

These functions are all for any arbitrary set of reference and configuration particles. For PKDTree and/or other methods, how can I check whether the preprocessing data structure (tree with configuration particles is built or not) without using it specifically as an argument.

@jbarnoud
Copy link
Contributor

I think I see your problem, but I am not completely sure. Could you be more specific?

@ayushsuhane
Copy link
Contributor Author

So, for instance, the desired attribute is to evaluate the bonds between particles. Every atom has a different radius. Ideally, neighbors of every particles should be first evaluated and then every neighbor should be checked against the sum of radius of particles. Since, the capped distance is written for arbitrary set of reference and configuration of particles, guessing bonds would require repeated calling of capped function with a one query at a time.
While doing so, we don't want to spend time in building the tree again but just checking if a tree exists should be enough. Additionally, since another requirement for the function is to be method independent, passing any arguments would destroy the purpose.
One way could be to define a class object and create the tree in __init__ method, but I am not sure whether that would serve the purpose or if there is any traditional way to handle this.

@jbarnoud
Copy link
Contributor

Good, I understood the correct problem from your previous explanation. Assuming that you were not limited in the signatures, how would you solve the problem?

if (boxtype == 'tri_vecs_bad'):
box = triclinic_vectors(triclinic_box(box[0], box[1], box[2]))
if method is None:
method = _determine_method(reference, configuration,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could actually just return a function from _determine_method rather than a string which we later compare. So

method = _determine_method()

pairs, dist = method(reference, configuration, etc)

This is preferable because they you don't have to type out the signature over and over again for each different method.

wrt specifying method, you would then have to change _determine_method to accept a hint:

def _determine_method(ref, conf, max_cutoff, box=box, method=None):
     if method is not None:
        if method == 'bruteforce':
            return _bruteforce_method
        elif: etc etc    

The shape of the reference array should be ``(N, 3)``
or a (3, ) coordinate.
configuration : array
Similar to reference array and second group of coordinates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doc line doesn't seem to make sense?

return np.array(pairs), np.array(dist)


def _determine_method(reference, configuration, cutoff, box):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add min_cutoff to this now, in case it becomes useful later

.. SeeAlso:: :func:'MDAnalysis.lib.distances.distance_array'
.. SeeAlso:: :func:'MDAnalysis.lib.pkdtree.PeriodicKDTree'

Similar methods can be defined and can be directly linked to this function.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this level of documentation is made for users, so this line isn't useful and maybe confusing

@ayushsuhane
Copy link
Contributor Author

The added rules in _determine_method are obtained from here
https://github.com/ayushsuhane/Benchmarks_Distance/blob/master/Notebooks/Capped_Distance.ipynb.
To list them down, the method brute force is chosen if any of the reference or configuration has a length smaller than 5000. Additionally, if the cutoff distance is larger than 10% of the box size, brute force is the preferred method. PKDTree is used in all other cases.
The current tests check whether method = 'pkdtree' and method = 'bruteforce' work independently. Is there a need to have a test case which checks whether method is selected automatically?

pairs, dist = method(reference, configuration, max_cutoff,
min_cutoff=min_cutoff, box=box)

return np.array(pairs), np.array(dist)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be np.asarray in case the methods already return an array

@ayushsuhane
Copy link
Contributor Author

@richardjgowers Did you get a chance to look into pkdtree and kdtree?

Also, should the self_capped_distance needs to be added in this PR or maybe around selection using capped distance OR should it be left for a different PR, since it is more of an application of the library function?

@ayushsuhane
Copy link
Contributor Author

ayushsuhane commented Jun 30, 2018

Incase we want it here, I quickly changed the method definition in AroundSelection with capped_distance

 def _apply_KDTree(self, group):
        self.method = 'pkdtree'
        sel = self.sel.apply(group)
        # All atoms in group that aren't in sel
        sys = group[~np.in1d(group.indices, sel.indices)]
        box = self.validate_dimensions(group.dimensions)
        indx, dist = distances.capped_distance(sel.positions, sys.positions ,max_cutoff=self.cutoff, box=box, method=self.method)
        return sys[np.unique(indx[:,1])]

I ran the tests after this modification, and it is not breaking any of the tests. I was wondering how to deal with the general definition in the __init__ function of DistanceSelection class

if flags['use_KDTree_routines'] in (True, 'fast', 'always'):
            self.apply = self._apply_KDTree
        else:
            self.apply = self._apply_distmat

        self.periodic = flags['use_periodic_selections']
        # KDTree doesn't support periodic
        if self.periodic:
            self.apply = self._apply_distmat

One possibility is to replace the 'if flags ..' condition directly with self.apply = self._capped, and define the respective distance treatements in these functions, but all the modifications needs to be done in a single PR.

@@ -43,14 +43,6 @@ def test_transform_StoR_pass(coord_dtype):
assert_equal(original_r, test_r)


def test_transform_StoR_fail():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is included in test_transform_StoR_pass using parametrize. I guess someone modified it recently which got reflected here after I performed a pull from the develop branch.

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nearly done

method_1 = ('bruteforce', 'pkdtree')

np.random.seed(90003)
points = (np.random.uniform(low=0, high=1.0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two lines should be inside the test function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and now deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, didn't see it. Sorry about that.

size=(100, 3))*(boxes_1[0][:3])).astype(np.float32)


@pytest.mark.parametrize('box, query , method',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you stack the decorators, you don't need the product call, it does it itself, ie

@pytest.mark.parametrize('box', ...)
@pytest.mark.parametrize('query', ...)
@pytest.mark.parametrize('method', ...)
def test_thing(box, query, method):

@@ -1,4 +1,4 @@
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; -*-
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; -*-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this space might even break the emacs format line thing here?

@richardjgowers
Copy link
Member

Ok looks good, but needs documenting in the CHANGELOG then we're done

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs tests for min_cutoff

@codecov
Copy link

codecov bot commented Jul 11, 2018

Codecov Report

Merging #1941 into develop will increase coverage by <.01%.
The diff coverage is 88%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1941      +/-   ##
===========================================
+ Coverage    88.42%   88.43%   +<.01%     
===========================================
  Files          142      142              
  Lines        17021    17096      +75     
  Branches      2595     2617      +22     
===========================================
+ Hits         15051    15119      +68     
- Misses        1377     1381       +4     
- Partials       593      596       +3
Impacted Files Coverage Δ
package/MDAnalysis/lib/distances.py 87.17% <88%> (+1.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1a1b2f1...9d3a2fc. Read the comment docs.

@richardjgowers
Copy link
Member

@ayushsuhane the code report thing above has a link to distances.py showing which lines aren't covered by tests. Give it a look and see if you can add some more tests that cover those lines, eg min_distance isn't covered, maybe it doesn't work?

@richardjgowers richardjgowers merged commit 4be325c into MDAnalysis:develop Jul 12, 2018
@richardjgowers
Copy link
Member

@ayushsuhane awesome, PR merged!

@ayushsuhane ayushsuhane deleted the cappedfeature branch July 15, 2018 21:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants