
NAMD - decorrelation and equilibration detection fails in v0.7 and 1.0 #274

Closed
EzryStIago opened this issue Nov 16, 2022 · 7 comments · Fixed by #275
EzryStIago commented Nov 16, 2022

Recent updates (to preprocessing.py, I believe) have broken our scripts' ability to use decorrelation and equilibration detection on NAMD output. I have attached an MWE that includes some sample data from alchemtest. The most recent version that works for these test cases is alchemlyb 0.6.0; I have included sample logs for runs using 0.6, 0.7, and 1.0.

Finally, it looks like decorrelation should work on the entire dataframe based on the documentation, but we have had to separate the dataframe by fep-lambda. Otherwise, decorrelate_u_nk returns just fep-lambda=0.

Any clarification or advice would be helpful!

MWE: alchemlyb_crash_MWE.zip

EzryStIago changed the title from "Decorrelation and Equilibrium Detection of Alchemtest NAMD Data" to "NAMD - decorrelation and equilibration detection fails in v0.7 and 1.0" on Nov 16, 2022
xiki-tempula (Collaborator) commented Nov 16, 2022

@EzryStIago Hi Ezry, many thanks for reporting.

This is a tricky issue. Part of the decorrelate_u_nk function is sanitisation of the data, which drops all rows that contain NaN. Before 0.7 the subsampling code base was inconsistent: some parts would drop rows with NaN while others wouldn't. Version 0.7 standardised the subsampling module so that all functions now behave the same way, which means they all implicitly drop rows containing NaN.

In this case, the data frame is

                    0.0       0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
time    fep-lambda                                                            
5000.0  0.0         0.0 -3.696651  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5010.0  0.0         0.0 -3.643142  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5020.0  0.0         0.0 -3.214734  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5030.0  0.0         0.0 -4.236102  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
5040.0  0.0         0.0 -4.873010  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
...                 ...       ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
49960.0 0.0         0.0 -2.994324  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
49970.0 0.0         0.0 -3.477750  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
49980.0 0.0         0.0 -3.645658  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
49990.0 0.0         0.0 -3.753682  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
50000.0 0.0         0.0 -3.558601  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
[4501 rows x 11 columns]

Dropping the NaN rows therefore empties the data frame, which causes the error.
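A minimal pandas sketch (not the alchemlyb code itself) reproduces this failure mode: in a NAMD-style u_nk frame, every row has NaN in the non-adjacent lambda columns, so a bare dropna() removes every row.

```python
import numpy as np
import pandas as pd

# Stand-in for the NAMD u_nk frame shown above: each row only carries
# energies for its own lambda and the adjacent one; the rest are NaN.
idx = pd.MultiIndex.from_tuples(
    [(5000.0, 0.0), (5010.0, 0.0), (5020.0, 0.0)],
    names=["time", "fep-lambda"],
)
u_nk = pd.DataFrame(
    {0.0: [0.0, 0.0, 0.0],
     0.1: [-3.70, -3.64, -3.21],
     0.2: [np.nan] * 3,
     0.3: [np.nan] * 3},
    index=idx,
)

# Every row contains at least one NaN, so dropna() discards all of them.
print(u_nk.dropna().shape)  # -> (0, 4)
```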

Sorry, I'm not too familiar with NAMD; would you mind giving some context on why there are so many NaNs?

@dotsdl I noticed that you added the line dropping rows with NaN (df = df.dropna()) to slicing in a655f3a#diff-82d2833cef709c2bd46823ce8a352cb3baac5d0a77386953ab3f538d908dd7a0R29. I wonder why you did this? Thank you.

EzryStIago (Author) commented

Hi @xiki-tempula, thank you for your response.

NAMD doesn’t calculate dE for all lambda pairs, just the adjacent lambdas; those NaNs correspond to the remaining pairs. Of course that restricts us to BAR estimation.

xiki-tempula (Collaborator) commented Nov 20, 2022

@EzryStIago Many thanks for the explanation. I have made a fix so that the pre-processing no longer drops the NaN rows. Would you mind testing whether this fits your purpose? #275

> Finally, it looks like decorrelation should work on the entire dataframe based on the documentation, but we have had to separate the dataframe by fep-lambda. Otherwise, decorrelate_u_nk returns just fep-lambda=0.

Sorry, I don't quite understand this part. Do you mean that you have to make sure each dataframe only has one fixed lambda value, instead of multiple different lambdas in one dataframe?

EzryStIago (Author) commented

Thank you @xiki-tempula!
#275 does indeed behave as expected. I'm getting slightly different results for my free energy estimates (up to a tenth of a kcal/mol) and slightly larger equilibrated/decorrelated dataframes. They're still within error of the original estimates, though.

> Do you mean that you have to make sure each dataframe only has one fixed lambda value? Instead of multiple different lambdas in one dataframe.

Exactly; perhaps this is a peculiarity of NAMD, but all data are output to the same file by default. To use decorrelation or equilibrium detection, I first split the dataframe by lambda, process each piece, and reassemble. Is that the intended usage? If so, this is a non-issue.
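The split-process-reassemble workflow described above can be sketched with plain pandas. Here the per-lambda step is a stand-in callable (in practice it would be something like alchemlyb's decorrelate_u_nk, which expects a single-lambda frame); the toy "decorrelation" just keeps every other row to show the plumbing.

```python
import pandas as pd

def decorrelate_per_lambda(u_nk, decorrelate):
    """Split a multi-lambda u_nk frame by fep-lambda, apply a
    per-lambda function, and reassemble the results."""
    pieces = [
        decorrelate(group)
        for _, group in u_nk.groupby(level="fep-lambda", sort=False)
    ]
    return pd.concat(pieces)

# Toy frame with two lambda windows of two samples each.
idx = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (10.0, 0.0), (0.0, 0.1), (10.0, 0.1)],
    names=["time", "fep-lambda"],
)
u_nk = pd.DataFrame(
    {0.0: [0.0, 0.0, 1.2, 1.3], 0.1: [-3.7, -3.6, 0.0, 0.0]},
    index=idx,
)

# Fake "decorrelation": keep every other row of each lambda window.
result = decorrelate_per_lambda(u_nk, lambda df: df.iloc[::2])
print(result.shape)  # one row kept per lambda window -> (2, 2)
```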

xiki-tempula (Collaborator) commented

@EzryStIago

> Exactly, perhaps this a peculiarity of NAMD, but all data are output to the same file by default. To use decorrelation or equilibrium detection, I first split the dataframe up by lambda, process, and reassemble. Is that the intended usage? If so, this is a non-issue.

I see. This is kind of a tricky thing. We currently only support a single lambda in subsampling.decorrelate_u_nk. I should state this in the documentation.

For NAMD, if all energies are dumped to the same file, then they need to be separated and decorrelated individually. I think the best solution would be for alchemlyb.parsing.namd.extract_u_nk to support an optional keyword that returns a list of dataframes instead of a single dataframe.

I don't really know anything about NAMD. I noticed that @jhenin and @ttjoseph have previously contributed to the NAMD parser. I wonder if they have any advice on the best way forward? Thanks.

jhenin (Contributor) commented Nov 23, 2022
Thanks for your suggestion @xiki-tempula! That seems doable; however, it involves a change in the extract_u_nk API. A lighter approach would be for the code calling extract_u_nk to split the dataframe into single lambdas, pass them to decorrelate_u_nk, and then concatenate them back. We could add a wrapper function alchemlyb.parsing.namd.decorrelate that does this. Better still, this step could be integrated into decorrelate_u_nk to make the whole process transparent and robust for users with custom workflows that pass multi-lambda dataframes.
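The transparent-wrapper idea could look roughly like this. The name decorrelate_u_nk_any and the dispatch logic are purely hypothetical (this is a sketch of the proposal, not alchemlyb code); the per-lambda routine is again passed in as a stand-in.

```python
import pandas as pd

def decorrelate_u_nk_any(u_nk, decorrelate_single):
    """Hypothetical wrapper: accept either a single-lambda or a
    multi-lambda u_nk frame. Single-lambda input goes straight to the
    existing routine; multi-lambda input is split, processed per
    lambda window, and concatenated back."""
    lambdas = u_nk.index.get_level_values("fep-lambda").unique()
    if len(lambdas) == 1:
        return decorrelate_single(u_nk)
    return pd.concat(
        [decorrelate_single(g)
         for _, g in u_nk.groupby(level="fep-lambda", sort=False)]
    )

idx = pd.MultiIndex.from_tuples(
    [(0.0, 0.0), (10.0, 0.0), (0.0, 0.5), (10.0, 0.5)],
    names=["time", "fep-lambda"],
)
u_nk = pd.DataFrame(
    {0.0: [0.1, 0.2, 0.3, 0.4], 0.5: [0.5, 0.6, 0.7, 0.8]},
    index=idx,
)

# With an identity stand-in, the multi-lambda round trip is lossless.
out = decorrelate_u_nk_any(u_nk, lambda df: df)
```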

orbeckst pushed a commit that referenced this issue on Dec 5, 2022
- Fix #274
- Removed dropping of rows with NaN in the pre-processing slicing() functions (the functionality was not documented and led to incorrect behavior with NAMD data)
- Updated tests
- Updated CHANGES
EzryStIago (Author) commented

Thank you @xiki-tempula!
