Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diffusion Map Refactor/Implementation #863

Closed
wants to merge 49 commits into from
Closed
Show file tree
Hide file tree
Changes from 43 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
c511ead
new analysis feature: diffusion map
Mar 2, 2016
3be892e
description of diffusion map
Mar 2, 2016
3ce705a
diffusionmap minor fix
Mar 2, 2016
328c27a
test for diffusionmap
Mar 2, 2016
d2d4593
diffusionmap faster test
Mar 2, 2016
ed5f4c5
included notes
Mar 11, 2016
2947c0f
initial work for refactor
jdetle May 24, 2016
7e94a58
functional refactor
jdetle May 24, 2016
ab7f93c
updated changes
jdetle May 24, 2016
43144e0
iteration error
jdetle May 29, 2016
18c84ee
refactor of @euhruska diffusion map
jdetle May 29, 2016
0da96fa
removed deco import
jdetle May 29, 2016
2c93f2d
added tests, fixed logger issue
jdetle May 30, 2016
49578a0
style fixes, protection fixes, added docs
jdetle Jun 3, 2016
7ec486a
fix of some broken areas
jdetle Jun 3, 2016
f350eaa
small docs fix
jdetle Jun 4, 2016
245d2e8
Constant epsilon tests
jdetle Jun 6, 2016
17fced2
updated CHANGELOG
jdetle Jun 6, 2016
14b08e4
fixed selections and updated tests accordingly
jdetle Jun 8, 2016
cfcbfe0
switched order of imports [skip ci]
jdetle Jun 8, 2016
07a83d1
Separated calculation of distance matrix from diffusion
jdetle Jun 10, 2016
6faec56
Working refactor
jdetle Jun 10, 2016
48c27a9
Change of direction in API, more work to be done implementing everything
jdetle Jun 16, 2016
fe08866
added timescaling, fixed some parts
jdetle Jun 16, 2016
cafeb68
Automodule added, Changed some object oriented stuff around
jdetle Jun 16, 2016
bb394ec
Changed things around to get embedding working
jdetle Jun 16, 2016
074a950
Docs fixes
jdetle Jun 17, 2016
68f2928
Added tests, found bugs, squashed bugs.
jdetle Jun 17, 2016
98e899b
Eliminated square of distance, random mistake
jdetle Jun 17, 2016
b893244
Now witness the firepower of this fully armed and
jdetle Jun 18, 2016
7b12ece
Satisfying the linter
jdetle Jun 18, 2016
e5a3487
Updated docs
jdetle Jun 18, 2016
cbf685d
Fixed bug in _single_frame()
jdetle Jun 19, 2016
e0ed7b4
Eliminated spectral gap function... for now
jdetle Jun 21, 2016
303b816
Fixed matrix multiplication bug
jdetle Jun 23, 2016
01dc0d5
Better single_frame loop
jdetle Jun 27, 2016
2f2db09
Refactor of code to stop DiffusionMap inheriting from AnalysisBase
jdetle Jun 29, 2016
34b1e7c
Removed epsilon class from pull request, now must supply a constant a…
jdetle Jun 29, 2016
7c56863
Added exception to be thrown for large distance matrices.
jdetle Jun 29, 2016
bd40f38
Updated diffusion map after some test-driven development, added error…
jdetle Jun 29, 2016
420edd0
Fixed some docs [skip ci]
jdetle Jun 30, 2016
c058c6c
Change to warning
jdetle Jun 30, 2016
b6c309a
Updated CHANGELOG
jdetle Jun 30, 2016
8ba1a58
Fixes as suggested to be more pythonic and elegant
jdetle Jul 1, 2016
aa22772
Fixes to docs, Transform function.
jdetle Jul 3, 2016
97c3105
Removed anisotropic kernel creation. Figured out that I was
jdetle Jul 3, 2016
b3b2ce9
Small style change
jdetle Jul 5, 2016
c5d1958
Updated docs to remove references to anisotropy, update tutorial [ski…
jdetle Jul 5, 2016
d7cd454
Added link to jupyter notebook [skip ci]
jdetle Jul 6, 2016
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions package/CHANGELOG
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ The rules for this file:
* release numbers follow "Semantic Versioning" http://semver.org

------------------------------------------------------------------------------
??/??/16 kain88-de,jdetle, fiona-naughton, richardjgowers
??/??/16 kain88-de, fiona-naughton, richardjgowers, euhruska, jdetle

* 0.15.1

Expand All @@ -27,7 +27,9 @@ Changes
* Added protected variable _frame_index to to keep track of frame iteration
number in AnalysisBase
* Added new AlignTraj class for alignment that follows the new analysis API known
as Bauhaus style.
as Bauhaus style. (Issue #845)
* Added new diffusionmap module for dimension reduction that follows the
Bauhaus API. (Issue #857)

Deprecations (Issue #599)
* Use of rms_fit_trj deprecated in favor of AlignTraj class (Issue #845)
Expand Down
371 changes: 371 additions & 0 deletions package/MDAnalysis/analysis/diffusionmap.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,371 @@
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for a documentation page to get built you also need to add a rst stub in package/doc/sphinx/source/documentation_pages/analysis (look at others in that dir) and a link to the page in documentation_pages/analysis_modules.rst

Each rst file becomes a page at docs.mdanalysis.org, and it's the automodule thing in the rst that basically just points to this source here for content.

# vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4
#
# MDAnalysis --- http://www.MDAnalysis.org
# Copyright (c) 2006-2015 Naveen Michaud-Agrawal, Elizabeth J. Denning, Oliver Beckstein
# and contributors (see AUTHORS for the full list)
#
# Released under the GNU Public Licence, v2 or any higher version
#
# Please cite your use of MDAnalysis in published work:
#
# N. Michaud-Agrawal, E. J. Denning, T. B. Woolf, and O. Beckstein.
# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
#

"""
Diffusion map --- :mod:`MDAnalysis.analysis.diffusionmap`
=====================================================================

:Authors: Eugen Hruska, John Detlefs
:Year: 2016
:Copyright: GNU Public License v3

The module contains the non-linear dimension reduction method diffusion map.
The diffusion map provides an estimate of the slowest collective
coordinates for a trajectory. This non-linear dimension reduction method
assumes that the trajectory is long enough to represent a probability
distribution of a protein close to the equilibrium. Furthermore, the diffusion
map assumes that the diffusion coefficients associated with the dynamical
motion of molecules in the system are constant. The eigenvectors with
the largest eigenvalues are the more dominant collective coordinates. Assigning
phyiscal meaning of the 'collective coordinates' is a fundamentally difficult
problem. The time complexity of the diffusion map is O(N^3), where N is the
number of frames in the trajectory, and the in-memory storage complexity is
O(N^2). Instead of a single trajectory a sample of protein structures
can be used. The sample should be equiblibrated, at least locally. Different
weights can be used to determine the anisotropy of the diffusion kernel.
The motivation for the creation of an anisotropic kernel is given on page 14 of
[Lafon1]_. The order of the sampled structures in the trajectory is irrelevant.

The :ref:`Diffusion-Map-tutorial` shows how to use diffusion map for dimension
reduction.

More details about diffusion maps are in [Lafon1]_ , [Ferguson1]_, and
[Clementi1]_.

.. _Diffusion-Map-tutorial:

Diffusion Map tutorial
--------------------

The example uses files provided as part of the MDAnalysis test suite
(in the variables :data:`~MDAnalysis.tests.datafiles.PSF` and
:data:`~MDAnalysis.tests.datafiles.DCD`). Notice that this trajectory
does not represent a sample from equilibrium. This violates a basic
assumption of the anisotropic kernel. The anisotropic kernel is created to act
as a discrete approximation for the Fokker-Plank equation from statistical
mechanics, results from a non-equilibrium trajectory shouldn't be interpreted
for this reason. This tutorial shows how to use the diffusionmap function.
First load all modules and test data ::

>>> import MDAnalysis
>>> import numpy as np
>>> import MDAnalysis.analysis.diffusionmap as diffusionmap
>>> from MDAnalysis.tests.datafiles import PSF,DCD

Given a universe or atom group, we can calculate the diffusion map from
that trajectory using :class:`DiffusionMap`:: and get the corresponding
eigenvalues and eigenvectors.

>>> u = MDAnalysis.Universe(PSF,DCD)

We leave determination of the appropriate scale parameter epsilon to the user,
[Clementi1]_ uses a complex method involving the k-nearest-neighbors of a
trajectory frame, whereas others simple use a trial-and-error approach with
a constant epsilon. Users wishing to use a complex method to determine epsilon
will have to initialize a distance matrix and scale it appropriately themselves
and then pass the new scaled distance matrix as a parameter.

>>> dmap = diffusionmap.DiffusionMap(dist_matrix, epsilon =2)
>>> dmap.run()
>>> eigenvalues = dmap.eigenvalues
>>> eigenvectors = dmap.eigenvectors

From here we can perform an embedding onto the k dominant eigenvectors. This
is similar to the idea of a transform in Principal Component Analysis, but the
non-linearity of the map means there is no explicit relationship between the
lower dimensional space and our original trajectory. However, this is an
isometry (distance preserving map), which means that points close in the lower
dimensional space are close in the higher-dimensional space and vice versa.
In order to embed into the most relevant low-dimensional space, there should
exist some number of dominant eigenvectors, whose corresponding eigenvalues
diminish at a constant rate until falling off, this is referred to as a
spectral gap and should be somewhat apparent for a system at equilibrium with a
high number of frames.

>>> num_eigenvectors = # some number less than the number of frames-1
>>> dmap.transform(num_eigenvectors)

From here it can be difficult to interpret the data, and is left as a task
for the user. The `diffusion distance` between frames i and j is best
approximated by the euclidean distance between rows i and j of
self.diffusion_space. A Jupyter notebook providing an analysis of protein
opening and closing is provided [here](#TODO url to notebook)

Classes
-------

.. autoclass:: DiffusionMap
.. autoclass:: DistMatrix

References
---------

If you use this Dimension Reduction method in a publication, please
reference:
..[Ferguson1] Ferguson, A. L.; Panagiotopoulos, A. Z.; Kevrekidis, I. G.;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the title of this paper?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach

Debenedetti, P. G. Chem. Phys. Lett. 2011, 509, 1−11
..[Lafon1] Coifman, Ronald R., Lafon, Stephane (2006) Diffusion maps.
Appl. Comput. Harmon. Anal. 21, 5–30.
..[Clementi1] Rohrdanz, M. A, Zheng, W, Maggioni, M, & Clementi, C. (2013)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check your references. For example this paper is from 2011

Determination of reaction coordinates via locally scaled
diffusion map. Journal of Chemical Physics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can also cite the 'introduction to diffusion maps` paper from de la Porte. That paper is to me the clearest explanation I found so far.


.. If you choose the default metric, this module uses the fast QCP algorithm
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this chown?

[Theobald2005]_ to calculate the root mean square distance (RMSD) between
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please check that this is correctly cross referenced in the produced docs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The [Theobald2005]_

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Max, I'm currently working on that 1 page paper. Will do.

two coordinate sets (as implemented
in :func:`MDAnalysis.lib.qcprot.CalcRMSDRotationalMatrix`).
When using this module in published work please cite [Theobald2005]_.


"""
from six.moves import range
import logging
import warnings

import MDAnalysis as mda
import numpy as np

from .rms import rmsd
from .base import AnalysisBase

logger = logging.getLogger("MDAnalysis.analysis.diffusionmap")


class DistanceMatrix(AnalysisBase):
""" Calculate the pairwise distance between each frame in a trajectory using
a given metric

Attributes
----------
atoms : AtomGroup
Selected atoms in trajectory subject to dimension reduction
dist_matrix : array
Array of all possible ij metric distances between frames in trajectory.
This matrix is symmetric with zeros on the diagonal.

Methods
-------
save(filename)
Save the `dist_matrix` to a given filename
"""
def __init__(self, u, select='all', metric=rmsd, cutoff=1E0-5,
weights=None, start=None, stop=None, step=None):
"""
Parameters
----------
u : trajectory `~MDAnalysis.core.AtomGroup.Universe`
The MD Trajectory for dimension reduction, remember that
computational cost of eigenvalue decomposition
scales at O(N^3) where N is the number of frames.
Cost can be reduced by increasing step interval or specifying a
start and stop.
select: str, optional
Any valid selection string for
:meth:`~MDAnalysis.core.AtomGroup.AtomGroup.select_atoms`
This selection of atoms is used to calculate the RMSD between
different frames. Water should be excluded.
metric : function, optional
Maps two numpy arrays to a float, is positive definite and
symmetric, Default: metric is set to rms.rmsd().
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should expand what a metric is and the API it follows.

cutoff : float, optional
Specify a given cutoff for metric values to be considered equal,
Default: 1EO-5
weights : array, optional
Weights to be given to coordinates for metric calculation
start : int, optional
First frame of trajectory to analyse, Default: 0
stop : int, optional
Last frame of trajectory to analyse, Default: -1
step : int, optional
Step between frames to analyse, Default: 1
"""

self._u = u
traj = self._u.trajectory
self.atoms = self._u.select_atoms(select)
self._metric = metric
self._cutoff = cutoff
self._weights = weights
self._calculated = False
# remember that this must be called before referencing self.nframes
self._setup_frames(traj, start, stop, step)

def _prepare(self):
self.dist_matrix = np.zeros((self.nframes, self.nframes))

def _single_frame(self):
iframe = self._ts.frame
i_ref = self.atoms.positions
# diagonal entries need not be calculated due to metric(x,x) == 0 in
# theory, _ts not updated properly. Possible savings by setting a
# cutoff for significant decimal places to sparsify matrix
for j, ts in enumerate(self._u.trajectory[iframe:self.stop:self.step]):
self._ts = ts
j_ref = self.atoms.positions
dist = self._metric(i_ref, j_ref, weights=self._weights)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can allow optional arguments for the metric. Because some might want to center as well.

self.dist_matrix[self._frame_index, j+self._frame_index] = (
dist if dist > self._cutoff else 0)
self.dist_matrix[j+self._frame_index, self._frame_index] = (
self.dist_matrix[self._frame_index, j+self._frame_index])
self._ts = self._u.trajectory[iframe]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make a comment why you are resetting the trajectory


def _conclude(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't good. If I want to change my choice of epsilon I still have to recalculate the complete distance matrix! Or close knowledge about the inner working of this class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all this valuable input @kain88-de! Perhaps you're right and I went the wrong way. I want to allow the function to be tuneable for custom purposes such that the way epsilon is calculated is left up to the user. In addition, it would be nice if someone working in a scripting environment could just write a function in there script and pass it as an argument. Given some explicit instructions, a user passes epsilon=function_written_for_choice_of_epsilon and the arguments necessary to execute the function. (Would this be in the form of **kwargs?, I am not entirely sure of the formalism). Of course, many epsilon functions would take the distance matrix as a parameter. Hopefully that makes sense and indicates why I think this wouldn't require recalculating the complete distance matrix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write up some usage examples that you are thinking of. They can later be used as documentation as well. Then it becomes clearer what you want to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, good suggestion thanks.

self._calculated = True

def save(self, filename):
np.save(filename, self.dist_matrix)
logger.info("Wrote the distance-squared matrix to file %r", filename)


class DiffusionMap(object):
"""Non-linear dimension reduction method

Dimension reduction with diffusion mapping of selected structures in a
trajectory.

Attributes
----------
eigenvalues: array
Eigenvalues of the diffusion map
eigenvectors: array
Eigenvectors of the diffusion map
diffusion_space : array
After calling `transform(num_eigenvectors)` the diffusion map embedding
into the lower dimensional diffusion space will exist here.

Methods
-------
run()
Constructs an anisotropic diffusion kernel and performs eigenvalue
decomposition on it.
transform(num_eigenvectors)
Perform an embedding of a frame into the eigenvectors representing
the collective coordinates.
"""

def __init__(self, u, epsilon=1, manifold_density=None,
timescale=1, **kwargs):
"""
Parameters
-------------
u : MDAnalysis Universe or DistanceMatrix object
Can be a Universe, in which case one must supply kwargs for the
initialization of a DistanceMatrix. Otherwise, this can be a
DistanceMatrix already initialized. Either way, this will be made
into a diffusion kernel.
epsilon : Float
Specifies the method used for the choice of scale parameter in the
diffusion map. More information in [1], [2] and [3], Default: 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a default of 1? What unit is the epsilon?

manifold_density: list, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is a manifold density? What would be good weights? Why would a user change the weight of a frame? From which paper did you get that this should be done?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool result from Lafon's Diffusion Maps Paper, Page 14. He provides a proof to show how diffusion maps can reflect a system that is modeled by the Fokker-Plank equation. (Does that sentence make sense?) If I'm not mistaken MD simulations are one such area, and it is taken into consideration in Dr. Clementi's locally scaled diffusion and free energy landscapes paper. I will gladly admit that these more advanced topics are a bit beyond me.

I can't tell if you want to see this gotten rid of or if I should work on a better description of this in the documentation. I can provide a link to my blog in the docs where I talked about this derivation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah Focker-Plank equation and diffusion are related I know (also good the correspondence can be shown for diffusion maps). Yes MD-simulations can be modeled with Fokker-Plank equations. What I would like to know is what is a manifold_density? Your answer doesn't give me the description. I still don't know what this parameter does from reading the docs nor do I know what sensible values are as a user nor could I guess them. Please explain what the manifold_density is for the diffusion maps and how they influence the result. If we can't explain it think about leaving it out.

Please don't link to blog posts. We don't know how long they will be around and MDAnalysis will likely be around for another 10 years so I rather keep everything in the docs self contained. References to publications are OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right, I will do some research and get back to you with a
satisfying explanation or eliminate it as a parameter.
On Fri, Jul 1, 2016 at 1:18 PM kain88-de [email protected] wrote:

In package/MDAnalysis/analysis/diffusionmap.py
#863 (comment):

  • """
  • def init(self, u, epsilon=1, manifold_density=None,
  •             timescale=1, **kwargs):
    
  •    """
    
  •    Parameters
    

  •    u : MDAnalysis Universe or DistanceMatrix object
    
  •        Can be a Universe, in which case one must supply kwargs for the
    
  •        initialization of a DistanceMatrix. Otherwise, this can be a
    
  •        DistanceMatrix already initialized. Either way, this will be made
    
  •        into a diffusion kernel.
    
  •    epsilon : Float
    
  •        Specifies the method used for the choice of scale parameter in the
    
  •        diffusion map. More information in [1], [2] and [3], Default: 1.
    
  •    manifold_density: list, optional
    

Yeah Focker-Plank equation and diffusion are related I know (also good the
correspondence can be shown for diffusion maps). Yes MD-simulations can be
modeled with Fokker-Plank equations. What I would like to know is what
is a manifold_density?
Your answer doesn't give me the description. I
still don't know what this parameter does from reading the docs nor do I
know what sensible values are as a user nor could I guess them. Please
explain what the manifold_density is for the diffusion maps and how they
influence the result. If we can't explain it think about leaving it out.

Please don't link to blog posts. We don't know how long they will be
around and MDAnalysis will likely be around for another 10 years so I
rather keep everything in the docs self contained. References to
publications are OK.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/MDAnalysis/mdanalysis/pull/863/files/b6c309a5dac5b2a5fc816fb0c2eccfd1a4a81b93#r69350871,
or mute the thread
https://github.com/notifications/unsubscribe/AKcARv5Jw8X2-0Ck7TR5MuS9Tcw5abKjks5qRXYagaJpZM4IpYIS
.

Have a wonderful day,
John J. Detlefs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so with 10 minutes of reading another Lafon paper he gives the density at a point x to be equal to $\int d(x,y) Pr{y} dy $ Pr{y} is a probability measure equal to e^U(y), where U is the potential function on the diffusion process. Does this make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that is a better answer. So you are saying you want to implement a anisotropic algorithm for the diffusion maps. I have so far not found that information in the docstrings for this class.

As I read the paper they provide an equation (11) for p_{epsilon} that can be calculated from the data. And I assume you are referring to the first unnumbered equations in chapter 4. anisotropic diffusion. Then the integral you have written is wrong. Second from the equation in the paper I read $\int d(y) k_{epsilon}(x,y) Pr{y} dy $ that the probability to be in state x is equal to the "sum" of all transitions to x from any point y considering the probability to be in y. Which sounds unnecessary confusing.

Or do you mean equation 13? Even in that case we can hardly give any estimate because the p_{epsilon} is some normalization and not a probability density. Giving this as a user is really hard, because having a good estimate of \mu is hard in high dimensions (more then 3) and would help us solve the complete state equation to a large extend.

I currently think your implementation has algorithms from different papers mixed up. Please once write up in latex the formulas you use. What each symbol stands for and from what paper you have copied the algorithm. Since they all provide test-case data we can compare against that.

Copy link
Member

@kain88-de kain88-de Jul 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we have any more math discussion I think it's the best when you once write down what exactly you are doing on one page in latex. That page will then be used as a reference to check your implementation.

At least to remove the confusion in my head in regard to what you are doing ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I really appreciate the help. When are you leaving for wedding prep by the way? I just posted a copy of the equations I am trying to follow by the way:
anisotropicdiffusion

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prepare the one sheet paper with citations. Then I can say more where I have questions about the details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this certainly the best medium for an easy-to-follow dialogue.

The list has to have the same length as the trajectory.
With 'None' the weight of each frame of the trajectory will be the
same, Default : None
timescale: int, optional
The number of steps in the random walk, large t reflects global
structure whereas small t indicates local structure, Default: 1
"""
if isinstance(u, mda.Universe):
self._dist_matrix = DistanceMatrix(u, **kwargs)
elif isinstance(u, DistanceMatrix):
self._dist_matrix = u
else:
raise ValueError("U is not a Universe or DistanceMatrix and"
" so the DiffusionMap has no data to work with.")
self._epsilon = epsilon
# important for transform function and length of .run() method
self._nframes = self._dist_matrix.nframes
if self._nframes > 2000:
warnings.warn("The distance matrix is very large, and can "
"be very slow to compute. Consider picking a larger "
"step size in distance matrix initialization.")

# determines length of diffusion process
self._t = timescale
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the same names as the input variables. This makes them easier to look up later.


if manifold_density is None:
# weights do not apply to metric but density of data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something you can read about in page 14 of the Lafon Paper or my diffusion maps blog post. The weight quantity is supposed to reflect how crowded points are on the free energy surface in the case of the adk simulation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well good if I'd know that my problem would be solved. The density is what determines the free energy landscape and hard to calculate accurately, especially if you talk about the density in the full conformation space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is not documented which would be incredible important here since the meaning of weight deviates from the usual meaning of weight inside of MDAnalysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll go back and have a look at the literature before I make any changes. I'll email @euhruska to see if he has any examples of literature using a density estimate.

self._weights_ker = np.ones((self._nframes,))
else:
if manifold_density.shape[0] != self._nframes:
raise ValueError("The weight should have the same length as "
'the trajectory')
else:
# density weights are constructed as relative to the mean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

self._weights_ker = (np.asarray(manifold_density,
dtype=np.float64) /
np.mean(manifold_density))

def run(self):
# will only run if distance matrix not already calculated
self._dist_matrix.run()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't this trigger a rerun of the distmatrix? There is nothing in the AnalysisBase code or DistMatrix that would stop it from being run again. You have to do the check here manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally right, I don't know why these mistakes are popping up, I'll try to be more careful.

self._scaled_matrix = self._dist_matrix.dist_matrix / self._epsilon
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distance squared

Copy link
Contributor Author

@jdetle jdetle Jul 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you looking at the Lafon paper and seeing h(||x-y||^2/ epsilon)? This is the thing you made me fix a while back. ||x-y||^2 is the metric that they use in the paper, but if I am not mistaken our metric d(x,y) satisfies this kernel property because it is symmetric, positive-definite. It looks like I am wrong, and it is supposed to be squared, yes.

# this should be a reference to the same object as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure when this comment used to make sense/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relic of Epsilon class, would permanently change dist_matrix.dist_matrix to reflect scaling. There weren't two matrices in memory.

# self.dist_matrix.dist_matrix
# take negative exponent of scaled matrix to create Isotropic kernel
self._kernel = np.exp(-self._scaled_matrix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why saving the _scaled_matrix and not just calculate the kernel directly. Let cpython reclaim some memory. I think cpython is also good with temporary objects.

# define an anistropic diffusion term q
q_vector = np.zeros((self._nframes, ))
# weights should reflect the density of the points on the manifold
for i in range(self._nframes):
q_vector[i] = np.dot(self._kernel[i, :], self._weights_ker)

# Form a new kernel from the anisotropic diffusion term q
self._kernel /= np.sqrt(q_vector[:, np.newaxis].dot(q_vector[np.newaxis]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From which paper is this. I'm not familiar what q_vector is doing here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lafon page 14

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are saying this q_vector comes from this paper?

Coifman, R. R., & Lafon, S. (2006). Diffusion maps. Applied and Computational Harmonic Analysis, 21(1), 5–30. http://doi.org/10.1016/j.acha.2006.04.006

This wouldn't calculate a kernel but some d vector according to that page.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we looking at the same thing?
anisotropicdiffusion


# Weighted Graph Laplacian normalization on this new graph
d_vector = np.zeros((self._nframes, ))
for i in range(self._nframes):
d_vector[i] = np.dot(self._kernel[i, :], self._weights_ker)

for i in range(self._nframes):
self._kernel[i, :] = self._kernel[i, :] * self._weights_ker

# Define anisotropic transition by dividing kernel by this term
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are creating a row-stochastic matrix here that "defines" a discrete markov process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't mutually exclusive terms. If you think that better indicates the underlying framework of diffusion maps I'd probably agree, but these comments were just left by me to indicate how I was following the algorithm supplied in page 14 of Lafon's paper for the construction of an anisotropic kernel.

self._kernel /= np.sqrt(d_vector[:, np.newaxis].dot(d_vector[np.newaxis]))

# Apply timescaling
for i in range(self._t):
if i > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for i in range(1, self._t)

Another thing. Isn't there a matrix power somewhere in numpy/scipy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Urgh I don't know why I didn't find this but there is numpy.linalg.matrix_power. Will change.

self._kernel = self._kernel.dot(self._kernel)

eigenvals, eigenvectors = np.linalg.eig(self._kernel)
eg_arg = np.argsort(eigenvals)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can already change the sorting direction here. Since these are indices it would also be nice if the name reflects that.

self.eigenvalues = eigenvals[eg_arg[::-1]]
self.eigenvalues = self.eigenvalues[1:]
self.eigenvectors = eigenvectors[eg_arg[::-1], :]
self.eigenvectors = self.eigenvectors[1:]
self._calculated = True

def transform(self, num_eigenvectors):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't check that decomposition was called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how would someone perform such a check? (Asking for a friend...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two ways. Either check that variables exists that are only created in decompose or create an is_decomposed boolean in __init__ which is set to True after decompose finishes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noted thanks

""" Embeds a trajectory via the diffusion map

Parameter
---------
num_eigenvectors : int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

standard naming for number of something is n_something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had no idea, thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is what we use in MDAnalysis. See n_atoms, n_frames, n_residues.

The number of dominant eigenvectors to be used for
diffusion mapping

Return
------
diffusion_space : array
The diffusion map embedding as defined by [Ferguson1]_.
This isn't a linear transformation, but an isometry
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this explanaition doesn't belong here. The first sentence is OK at this point but the rest should be at some other place. Either the general module docstring or in the class docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

between the higher dimensional space and the space spanned by
the eigenvectors.
"""
self.diffusion_space = (self.eigenvectors.T[:,:num_eigenvectors] *
self.eigenvalues[:num_eigenvectors])
return self.diffusion_space
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to a straight

return self.eigenvectors.T[:, :n_eigenvectors] * self.eigenvalues[:n_eigenvectors]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes total sense, will fix.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
.. automodule:: MDAnalysis.analysis.diffusionmap
Loading