Moving average #381

Merged
merged 24 commits into from
Sep 14, 2024
Changes from 17 commits
24 commits
809e0c6
Add moving_average function for visualization and convergence testing
jaclark5 Jul 24, 2024
ffb6df0
Update versionadded
jaclark5 Jul 24, 2024
5cba22e
Run Black
jaclark5 Jul 24, 2024
8023c03
Bug fix bar_.py states
jaclark5 Jul 24, 2024
917e6ff
Update Changelog
jaclark5 Jul 24, 2024
f8aff24
Update the docs
jaclark5 Jul 24, 2024
c739c0e
Add tests
jaclark5 Jul 24, 2024
b55552f
Formatting to align with Black
jaclark5 Jul 25, 2024
509b95a
Update tests
jaclark5 Jul 25, 2024
10ee4bc
Merge branch 'master' into moving_average
orbeckst Aug 26, 2024
5f9fff7
Refactor moving_average to align with forward_backward_convergence fu…
jaclark5 Aug 27, 2024
a54850f
Merge branch 'moving_average' of github.com:jaclark5/alchemlyb into m…
jaclark5 Aug 27, 2024
04718d0
Update tests
jaclark5 Aug 28, 2024
75aa16e
Update test_convergence and lambda tests in convergence.moving_average
jaclark5 Aug 28, 2024
ca391c3
Adjust convergence.py and tests for codecoverage
jaclark5 Aug 28, 2024
e2c94f6
black
jaclark5 Aug 28, 2024
ce88763
Update moving_average to block_average for more accurate descriptive …
jaclark5 Aug 30, 2024
fc3e4e8
Address reviewer comments
jaclark5 Sep 3, 2024
c286a6a
Update test to align with changed handling of dfs of different length…
jaclark5 Sep 3, 2024
f51d390
Remove incorrect popagation of error in BAR
jaclark5 Sep 10, 2024
a179048
Add tests and error catch for ill constructed BAR input, u_nk
jaclark5 Sep 10, 2024
3599d92
Merge branch 'master' into moving_average
jaclark5 Sep 10, 2024
6ecbe02
black
jaclark5 Sep 10, 2024
750cb0d
Updated version comments
jaclark5 Sep 10, 2024
8 changes: 8 additions & 0 deletions CHANGES
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,14 @@ The rules for this file:
* release numbers follow "Semantic Versioning" https://semver.org

------------------------------------------------------------------------------
??/??/2024 jaclark5

* 2.4.0

Enhancements
- Addition of `block_average` function in both `convergence` and
`visualization` (Issue #380, PR #381)


08/24/2024 xiki-tempula

25 changes: 25 additions & 0 deletions docs/convergence.rst
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,31 @@ is, where 0 fully-unequilibrated and 1.0 is fully-equilibrated. ::
>>> value = A_c(dhdl_list, tol=2)
0.7085

Block Average
-------------
If one obtains suspicious results from the forward / backward convergence plot,
it may be useful to view the block averages of the change in free energy over
the course of each step in lambda individually, using
:func:`~alchemlyb.convergence.block_average` and
:func:`~alchemlyb.visualisation.plot_block_average`. The following example is
for :math:`\lambda` = 0.

>>> from alchemtest.gmx import load_benzene
>>> from alchemlyb.parsing.gmx import extract_u_nk
>>> from alchemlyb.visualisation import plot_block_average
>>> from alchemlyb.convergence import block_average

>>> bz = load_benzene().data
>>> data_list = [extract_u_nk(xvg, T=300) for xvg in bz['Coulomb']]
>>> df = block_average(data_list, 'mbar')
>>> ax = plot_block_average(df)
>>> ax.figure.savefig('dF_t_block_average.png')

This will give a plot that looks like this:

.. figure:: images/dF_t_block_average.png

A plot of the block averages of the free energy estimate over the course of
the trajectory.

Convergence functions
---------------------
2 changes: 2 additions & 0 deletions docs/convergence/alchemlyb.convergence.convergence.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,3 +15,5 @@ All convergence functions are located in this submodule but for convenience they
.. autofunction:: alchemlyb.convergence.fwdrev_cumavg_Rc

.. autofunction:: alchemlyb.convergence.A_c

.. autofunction:: alchemlyb.convergence.block_average
Binary file added docs/images/dF_t_block_average.png
1 change: 1 addition & 0 deletions docs/visualisation.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ Plotting Functions
plot_ti_dhdl
plot_dF_state
plot_convergence
plot_block_average

.. _plot_overlap_matrix:

Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
alchemlyb.visualisation.plot\_block\_average
=============================================

.. currentmodule:: alchemlyb.visualisation

.. autofunction:: plot_block_average
2 changes: 1 addition & 1 deletion src/alchemlyb/convergence/__init__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
-from .convergence import forward_backward_convergence, fwdrev_cumavg_Rc, A_c
+from .convergence import forward_backward_convergence, fwdrev_cumavg_Rc, A_c, block_average
135 changes: 132 additions & 3 deletions src/alchemlyb/convergence/convergence.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,8 @@ def forward_backward_convergence(
Parameters
----------
df_list : list
-List of DataFrame of either dHdl or u_nk.
+List of DataFrame of either dHdl or u_nk, where each represents a
+different value of lambda.
estimator : {'MBAR', 'BAR', 'TI'}
Name of the estimators.
See the important note below on the use of "MBAR".
Expand Down Expand Up @@ -94,7 +95,16 @@ def forward_backward_convergence(
# select estimator class by name
my_estimator = estimators_dispatch[estimator](**kwargs)
logger.info(f"Use {estimator} estimator for convergence analysis.")


# Check that each df in the list has only one value of lambda
Collaborator

Sorry, do you mind reminding me why one cannot have more than one value of lambda?
I might be wrong but I think in principle, one could do forward_backward_convergence of more than one lambda?

Contributor Author

The purpose of this function is to assess whether a particular production run has converged. The lambda state of that system must be constant throughout a dataframe for this assessment. If the lambda state changes later on in the trajectory (toward the bottom of the rows of the dataframe), the result of this function would not make sense or be useful. A user might eventually find their mistake, or they may think that their trajectory is not long enough. This check will help a user quickly produce useful results.

Collaborator

Thank you for the clarification. I think there might be some misunderstanding here. It seems that you're referring to lambda dynamics, where a lambda value is constantly changing at different points in the trajectory. I agree that this isn't supported in this repository.

However, the use case I'm referring to involves MD engines like Gromacs, where multiple windows with different lambda values can run simultaneously. For example, you might have windows such as (coul lambda=0, vdw lambda=0), (coul lambda=0.5, vdw lambda=0), (coul lambda=1, vdw lambda=0), (coul lambda=1, vdw lambda=0.5), and (coul lambda=1, vdw lambda=1). Each lambda window represents an independent simulation, and within each simulation, the lambda value does not change.

The function in question would, for example, take the first 10% of data from all the windows to derive an MBAR estimate, then take the first 20% of data from all the windows to derive another MBAR estimate. This approach ensures that each independent lambda window is appropriately considered.
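
The cumulative slicing described in this comment can be sketched as follows. This is only an illustration: plain lists stand in for the per-window DataFrames, and `forward_slices` is a hypothetical helper, not a function in alchemlyb.

```python
def forward_slices(windows, num=10):
    """Yield (fraction, per-window samples) for growing leading portions of the data.

    `windows` is a list of sequences, one per independent lambda window.
    """
    for i in range(1, num + 1):
        fraction = i / num
        # Take the same leading portion of every window, so each window contributes.
        yield fraction, [w[: len(w) // num * i] for w in windows]


# Two toy "windows" of 10 samples each (placeholders for real u_nk or dHdl data).
windows = [list(range(10)), list(range(100, 110))]
slices = list(forward_slices(windows, num=5))
for fraction, subset in slices:
    print(f"{fraction:.0%}: {[len(s) for s in subset]} samples per window")
```

Each step would then be passed to the estimator as one concatenated sample, so every estimate uses all windows but only part of each trajectory.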

Contributor Author

I agree, and the way the forward_backward_convergence function handles that is to have each of those windows provided as a separate DataFrame in the df_list. This section is meant to ensure that each DataFrame has a constant set of lambda values independently, not that all the provided DataFrames contain the same set of lambda values.

Collaborator

@xiki-tempula xiki-tempula Sep 4, 2024

I see what you mean now. But what does

ind = [j for j in range(len(lambda_values[0])) if len(list(set([x[j] for x in lambda_values]))) > 1][0]

do? I guess if you want the lambda value to be the same then

if len(set(df.reset_index('time').index)) > 1:
    raise Exception

should be enough?

Contributor Author

I tried to make this flexible for the number of indices available in dataframe, as either fep-lambda, vdw-lambda, or coul-lambda could have multiple values. Given that lambda_values is a list of unique lambda sets, e.g., [[1]], [[0],[1]], [[0,0]], or [[0,0], [0,1]]. This line will identify the index that is changing so for [[0],[1]], ind=0, and for [[0,0], [0,1]], ind=1.
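
The behaviour of that line can be checked against the example inputs listed in this comment. The sketch below uses plain tuples in place of the DataFrame index entries, and `first_varying_index` is a hypothetical name used only for illustration.

```python
def first_varying_index(lambda_values):
    """Return the first position at which the unique lambda sets differ."""
    varying = [
        j
        for j in range(len(lambda_values[0]))
        if len({x[j] for x in lambda_values}) > 1
    ]
    return varying[0]


# The two multi-valued cases from the comment above.
assert first_varying_index([(0,), (1,)]) == 0
assert first_varying_index([(0, 0), (0, 1)]) == 1
```

When only one lambda set is present (for example `[(1,)]`) nothing varies and the list would be empty, which is why the diff evaluates this expression only when `len(lambda_values) > 1`.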

Collaborator

This is more complicated than I thought. I think my assumption is that for each column there will only be one float for either fep-lambda, vdw-lambda, or coul-lambda. Is LAMMPS giving this kind of output?

Contributor Author

@jaclark5 jaclark5 Sep 10, 2024

Yes I agree that each DataFrame will have lambda columns that each contain a single value of lambda, but that additional DataFrames may be added to the list with different lambda values.

The complication arises from the case where u_nk columns, vdw-lambda and coul-lambda are present, so len(df.index[0]) == 3, at this point it wouldn't matter which simulation engine was used to create u_nk.
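
To make the multi-level case concrete, here is a small sketch of the uniqueness check on index entries of the form `(time, coul-lambda, vdw-lambda)`. Plain tuples stand in for the pandas MultiIndex, and `unique_lambda_sets` is a hypothetical helper rather than code from the PR.

```python
def unique_lambda_sets(index):
    """Collect the distinct lambda tuples from (time, lambda, ...) index entries."""
    # Drop the leading time entry and keep the remaining lambda levels.
    return sorted({entry[1:] for entry in index})


# One window sampled at constant lambdas: a single unique set.
constant = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0), (2.0, 0.5, 0.0)]
# The first lambda level changes in the last row: two unique sets.
mixed = [(0.0, 0.5, 0.0), (1.0, 0.5, 0.0), (2.0, 1.0, 0.0)]

assert unique_lambda_sets(constant) == [(0.5, 0.0)]
assert len(unique_lambda_sets(mixed)) == 2
```

A DataFrame like `mixed` is the case the check in the diff rejects with a `ValueError`.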

for i, df in enumerate(df_list):
    lambda_values = list(set([x[1:] for x in df.index.to_numpy()]))
    if len(lambda_values) > 1:
        ind = [j for j in range(len(lambda_values[0])) if len(list(set([x[j] for x in lambda_values]))) > 1][0]
        raise ValueError(
            "Provided DataFrame, df_list[{}] has more than one lambda value in df.index[{}]".format(i, ind)
        )

logger.info("Begin forward analysis")
forward_list = []
forward_error_list = []
Expand Down Expand Up @@ -262,7 +272,7 @@ def fwdrev_cumavg_Rc(series, precision=0.01, tol=2):
float
Convergence time fraction :math:`R_c` [Fan2021]_
:class:`pandas.DataFrame`
-The DataFrame with moving average. ::
+The DataFrame with block average. ::

Forward Backward data_fraction
0 3.016442 3.065176 0.1
Expand Down Expand Up @@ -389,3 +399,122 @@ def A_c(series_list, precision=0.01, tol=2):
d_R_c = sorted_array[-i] - sorted_array[-i - 1]
result += d_R_c * sum(R_c_list <= element) / n_R_c
return result


def block_average(df_list, estimator="MBAR", num=10, **kwargs):
    """Free energy estimate for portions of the trajectory.

    Generate the free energy estimate for a series of blocks in time,
    with the specified number of equally spaced points.
    For example, setting `num` to 10 would give the block averages
    which is the free energy estimate from the first 10% alone, then the
    next 10% ... of the data.

    Parameters
    ----------
    df_list : list
        List of DataFrame of either dHdl or u_nk, where each represents a
        different value of lambda.
    estimator : {'MBAR', 'BAR', 'TI'}
        Name of the estimators.
        See the important note below on the use of "MBAR".
    num : int
        The number of time points.
    kwargs : dict
        Keyword arguments to be passed to the estimator.

    Returns
    -------
    :class:`pandas.DataFrame`
        The DataFrame with estimate data. ::

                     FE  FE_Error
            0  3.016442  0.052748
            1  3.078106  0.037170
            2  3.072561  0.030186
            3  3.048325  0.026070
            4  3.049769  0.023359
            5  3.034078  0.021260
            6  3.043274  0.019642
            7  3.035460  0.018340
            8  3.042032  0.017319
            9  3.044149  0.016405


    .. versionadded:: 2.4.0

    """
    logger.info("Start block averaging analysis.")
    logger.info("Check data availability.")
    if estimator not in (FEP_ESTIMATORS + TI_ESTIMATORS):
        msg = f"Estimator {estimator} is not available in {FEP_ESTIMATORS + TI_ESTIMATORS}."
        logger.error(msg)
        raise ValueError(msg)
    else:
        # select estimator class by name
        estimator_fit = estimators_dispatch[estimator](**kwargs).fit
        logger.info(f"Use {estimator} estimator for convergence analysis.")

    # Check that each df in the list has only one value of lambda
    for i, df in enumerate(df_list):
        lambda_values = list(set([x[1:] for x in df.index.to_numpy()]))
        if len(lambda_values) > 1:
            ind = [j for j in range(len(lambda_values[0])) if len(list(set([x[j] for x in lambda_values]))) > 1][0]
            raise ValueError(
                "Provided DataFrame, df_list[{}] has more than one lambda value in df.index[{}]".format(i, ind)
            )

    if estimator in ["BAR"] and len(df_list) > 2:
        raise ValueError(
            "Restrict to two DataFrames, one with a fep-lambda value and one its forward adjacent state for a "
            "meaningful result."
        )

    # Choose length of comparison trajectory
    lx_lambdas = [len(x) for x in df_list]
    if len(set(lx_lambdas)) > 1:
        lx = np.min(lx_lambdas)
        warn(
            "Not all trajectories for each lambda value are the same length, using minimum length for analysis: {}".format(
                " ".join([f"len(df[{i}])={len(df_list[i])}" for i in range(len(df_list))])
            )
        )
    else:
        lx = len(df_list[0])

    logger.info("Begin Moving Average Analysis")
    average_list = []
    average_error_list = []
    for i in range(1, num):
        logger.info("Moving Average Analysis: {:.2f}%".format(100 * i / num))
        ind1, ind2 = lx // num * (i - 1), lx // num * i
        sample = []
        for data in df_list:
            sample.append(data[ind1:ind2])
        sample = concat(sample)
        result = estimator_fit(sample)

        average_list.append(result.delta_f_.iloc[0, -1])
        if estimator.lower() == "bar":
            error = np.sqrt(
                sum(
                    [
                        result.d_delta_f_.iloc[i, i + 1] ** 2
                        for i in range(len(result.d_delta_f_) - 1)
                    ]
                )
            )
            average_error_list.append(error)
        else:
            average_error_list.append(result.d_delta_f_.iloc[0, -1])
        logger.info(
            "{:.2f} +/- {:.2f} kT".format(average_list[-1], average_error_list[-1])
        )

    convergence = pd.DataFrame(
        {
            "FE": average_list,
            "FE_Error": average_error_list,
        }
    )
    convergence.attrs = df_list[0].attrs
    return convergence
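
To see which rows each block covers, the index arithmetic from the loop above can be run on its own. This sketch reproduces only the `ind1`, `ind2` computation as written in this revision. `block_bounds` is a hypothetical helper, with `lx` as the trajectory length and `num` as the block count.

```python
def block_bounds(lx, num):
    """Return the (start, stop) row bounds used for each block."""
    step = lx // num
    # Mirrors `for i in range(1, num)` with ind1, ind2 = step * (i - 1), step * i.
    return [(step * (i - 1), step * i) for i in range(1, num)]


bounds = block_bounds(lx=100, num=10)
print(len(bounds), bounds[0], bounds[-1])
```

With these inputs the loop yields `num - 1` non-overlapping blocks and stops at row 90. Iterating over `range(1, num + 1)` instead would also include the final block.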
13 changes: 7 additions & 6 deletions src/alchemlyb/estimators/bar_.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,7 @@ def fit(self, u_nk):
# sort by state so that rows from same state are in contiguous blocks
u_nk = u_nk.sort_index(level=u_nk.index.names[1:])

-# get a list of the lambda states
+# get a list of the lambda states that are sampled
self._states_ = u_nk.columns.values.tolist()

# group u_nk by lambda states
Expand All @@ -97,18 +97,21 @@ def fit(self, u_nk):
(len(groups.get_group(i)) if i in groups.groups else 0)
for i in u_nk.columns
]

states = [x for i, x in enumerate(self._states_) if N_k[i] > 0]
Collaborator

Would be good to add a test to show why this is needed. So we would always preserve this behaviour.

Contributor Author

I'm not sure how to test if this is needed or not so changed the way this is defined to more directly pull the lambda states in the indices and throw an error if those lambda states aren't represented in the columns of u_nk. This error is now tested with test_block_average_error_3_bar
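
A check of the kind described here might look like the sketch below. It is an assumption about the approach, not the code from the PR: `check_states` is a hypothetical function, and plain lists stand in for the lambda values found in the `u_nk` index and columns.

```python
def check_states(index_states, column_states):
    """Raise ValueError if a sampled lambda state has no matching u_nk column."""
    missing = [s for s in index_states if s not in column_states]
    if missing:
        raise ValueError(
            f"Lambda states {missing} appear in the index but not in the columns of u_nk."
        )
    return list(index_states)


# States 0.0 and 0.5 are sampled and both have columns, so this passes.
sampled = check_states([0.0, 0.5], [0.0, 0.5, 1.0])
```

Raising early like this gives a test something concrete to assert on, which is harder to do for a silent filter such as `N_k[i] > 0`.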

# Now get free energy differences and their uncertainties for each step
deltas = np.array([])
d_deltas = np.array([])
for k in range(len(N_k) - 1):
if N_k[k] == 0 or N_k[k + 1] == 0:
continue
# get us from lambda step k
uk = groups.get_group(self._states_[k])
# get w_F
w_f = uk.iloc[:, k + 1] - uk.iloc[:, k]

# get us from lambda step k+1
uk1 = groups.get_group(self._states_[k + 1])

# get w_R
w_r = uk1.iloc[:, k] - uk1.iloc[:, k + 1]

Expand Down Expand Up @@ -150,13 +153,11 @@ def fit(self, u_nk):
ad_delta += np.diagflat(np.array(dout), k=j + 1)

# yield standard delta_f_ free energies between each state
-self._delta_f_ = pd.DataFrame(
-adelta - adelta.T, columns=self._states_, index=self._states_
-)
+self._delta_f_ = pd.DataFrame(adelta - adelta.T, columns=states, index=states)

# yield standard deviation d_delta_f_ between each state
self._d_delta_f_ = pd.DataFrame(
-np.sqrt(ad_delta + ad_delta.T), columns=self._states_, index=self._states_
+np.sqrt(ad_delta + ad_delta.T), columns=states, index=states
)
self._delta_f_.attrs = u_nk.attrs
self._d_delta_f_.attrs = u_nk.attrs
21 changes: 20 additions & 1 deletion src/alchemlyb/estimators/mbar_.py
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ class MBAR(BaseEstimator, _EstimatorMixOut):
.. versionchanged:: 2.3.0
The new default is now "BAR" as it provides a substantial speedup
over the previous default `None`.


method : str, optional, default="robust"
The optimization routine to use. This can be any of the methods
Expand Down Expand Up @@ -135,6 +135,25 @@ def fit(self, u_nk):
)
bar.fit(u_nk)
initial_f_k = bar.delta_f_.iloc[0, :]
states = [
x
for i, x in enumerate(self._states_[:-1])
if N_k[i] > 0 and N_k[i + 1] > 0
]
if len(bar.delta_f_.iloc[0, :]) != len(self._states_):
states = [
x
for i, x in enumerate(self._states_[:-1])
if N_k[i] > 0 and N_k[i + 1] > 0
]
initial_f_k = pd.Series(
[
initial_f_k.loc[x] if x in states else np.nan
for x in self._states_
],
index=self._states_,
dtype=float,
)
else:
initial_f_k = self.initial_f_k
