AMBER parser had a "too strict" regex to extract values #273

DrDomenicoMarson · 2022-11-14T09:48:37Z

This PR addresses issue #272.
This issue exposed a nasty bug inside the regex pattern we used to extract sections in the amber output file.

we are trying to match a sequence of type

filed = int/float

extracting the int/float corresponding to the field. Up till now, the pattern was

fr' {field}\s+=\s+(\*+|{_FP_RE}|\d+)'

were field was the string we want to extract the value for, and _FP_RE is defined to extract the float/int after the equal.

The problem arises when the field and/or the value are not separated from the = sign by a space (as was the case in the issue). It's strange this didn't appear before!

I just changed the + in \s+=\s+ to *.

Thinking about it, this should not break anything else, but I can't be 100% sure and the tests should catch any "macro" problems eventually introduced with this PR.

codecov · 2022-11-14T09:55:22Z

Codecov Report

Merging #273 (362b2a3) into master (6e19d4e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #273   +/-   ##
=======================================
  Coverage   98.69%   98.69%           
=======================================
  Files          26       26           
  Lines        1759     1762    +3     
  Branches      378      379    +1     
=======================================
+ Hits         1736     1739    +3     
  Misses          3        3           
  Partials       20       20

Impacted Files	Coverage Δ
src/alchemlyb/parsing/amber.py	`99.15% <100.00%> (+0.01%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

xiki-tempula · 2022-11-14T10:01:19Z

Many thanks for the quick fix. Do you mind add a new file to alchemtest to make sure that we don't break it in the future? Thanks. A sample file as short as

          -------------------------------------------------------
          Amber 20 PMEMD                              2020
          -------------------------------------------------------

--------------------------------------------------------------------------------
   2.  CONTROL  DATA  FOR  THE  RUN
--------------------------------------------------------------------------------

BioSimSpace Syst                                                                

General flags:
     imin    =       0, nmropt  =       0

Replica exchange
     numexchg=     100, rem=       3

Nature and format of input:
     ntx     =       1, irest   =       0, ntrx    =       1

Nature and format of output:
     ntxo    =       2, ntpr    =     200, ntrx    =       1, ntwr    =   25000
     iwrap   =       1, ntwx    =   25000, ntwv    =       0, ntwe    =       0
     ioutfm  =       1, ntwprt  =       0, idecomp =       0, rbornstat=      0

Potential function:
     ntf     =       1, ntb     =       2, igb     =       0, nsnb    =      25
     ipol    =       0, gbsa    =       0, iesp    =       0
     dielc   =   1.00000, cut     =   8.00000, intdiel =   1.00000

Frozen or restrained atoms:
     ibelly  =       0, ntr     =       0

Molecular dynamics:
     nstlim  =      1250, nscm    =      1000, nrespa  =         1
     t       =   0.00000, dt      =   0.00400, vlimit  =  -1.00000

Langevin dynamics temperature regulation:
     ig      =  771064
     temp0   = 298.00000, tempi   = 298.00000, gamma_ln=   2.00000

Pressure regulation:
     ntp     =       1
     pres0   =   1.01325, comp    =  44.60000, taup    =   1.00000
     Monte-Carlo Barostat:
     mcbarint  =     100

SHAKE:
     ntc     =       2, jfastw  =       0
     tol     =   0.00001

Free energy options:
     icfe    =       1, ifsc    =       1, klambda =       1
     clambda =  0.0000, scalpha =  0.5000, scbeta  =  1.0000
     sceeorder =       2
     dynlmb =  0.0000 logdvdl =       1

FEP MBAR options:
     ifmbar  =       1,  bar_intervall =      100
     mbar_states =       4

| Intermolecular bonds treatment:
|     no_intermolecular_bonds =       1

| Energy averages sample interval:
|     ene_avg_sampling =     200

Ewald parameters:
     verbose =       0, ew_type =       0, nbflag  =       1, use_pme =       1
     vdwmeth =       1, eedmeth =       1, netfrc  =       1
     Box X =   42.856   Box Y =   42.856   Box Z =   42.856
     Alpha =   90.000   Beta  =   90.000   Gamma =   90.000
     NFFT1 =   48       NFFT2 =   48       NFFT3 =   48
     Cutoff=    8.000   Tol   =0.100E-04
     Ewald Coefficient =  0.34864
     Interpolation order =    4

| PMEMD ewald parallel performance parameters:
|     block_fft =    0
|     fft_blk_y_divisor =    2
|     excl_recip =    0
|     excl_master =    0
|     atm_redist_freq =  320
     TI Mask 1 @1-33; matches     33 atoms
     TI Mask 2 @34-70; matches     37 atoms
     TI region 1:    7780 atoms
     TI region 2:    7784 atoms
     SC Mask 1 @29-33; matches      5 atoms
     SC Mask 2 @62-70; matches      9 atoms

    MBAR - lambda values considered:
       4 total:  0.0000 0.3333 0.6667 1.0000
    Extra energies will be computed     12 times.

|--------------------------------------------------------------------------------------------
| Extra TI control variables
|     gti_add_sc     =   5, gti_ele_gauss  =   0, gti_auto_alpha =   0, gti_scale_beta =   1
|     gti_ele_exp    =   2, gti_vdw_exp    =   2, gti_ele_sc     =   1, gti_vdw_sc     =   1
|     gti_cut        =   1, gti_cut_sc     =   2
|     gti_cut_sc_on    =  6.0000, gti_cut_sc_off    =  8.0000
|--------------------------------------------------------------------------------------------

| MONTE CARLO BAROSTAT IMPORTANT NOTE:
|   The Monte-Carlo barostat does not require the virial to adjust the system volume.
|   Since it is an expensive calculation, it is skipped for efficiency. A side-effect
|   is that the reported pressure is always 0 because it is not calculated.

--------------------------------------------------------------------------------
   3.  ATOMIC COORDINATES AND VELOCITIES
--------------------------------------------------------------------------------

BioSimSpace Syst                                                                
 begin time read from input coords =     0.000 ps

 
| REMD: Pressure/volume correction to exchange calc active for TREMD/HREMD.
     Molecule     1 is partially softcore
     Molecule     2 is partially softcore
 Number of triangulated 3-point waters found:     2581
 Number of shake restraints removed in TI region  1 :        0
 Number of shake restraints removed in TI region  2 :        0

     Sum of charges for TI region  1 =  -0.00000004
     Skip neutralizing charges...


     Sum of charges for TI region  2 =  -0.00000000
     Skip neutralizing charges...

| Dynamic Memory, Types Used:
| Reals              664467
| Integers           534749

| Nonbonded Pairs Initial Allocation:     1782276

| GPU memory information (estimate):
| KB of GPU memory in use:     52270
| KB of CPU memory in use:     17998

| Running AMBER/MPI version on    1 MPI task

 
--------------------------------------------------------------------------------
   4.  RESULTS
--------------------------------------------------------------------------------

 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

MBAR Energy analysis:
Energy at 0.0000 =    -24991.812173
Energy at 0.3333 =    -24988.328441
Energy at 0.6667 =    -23067.164349
Energy at 1.0000 =     31035.752870
 ------------------------------------------------------------------------------

 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

MBAR Energy analysis:
Energy at 0.0000 =    -25070.041628
Energy at 0.3333 =    -25073.157930
Energy at 0.6667 =    -25042.604724
Energy at 1.0000 =    -24808.803339
 ------------------------------------------------------------------------------

Would be fine.

DrDomenicoMarson · 2022-11-14T10:13:13Z

I don't think we need this file in particular. As it is, it's just like some of the "right" files in the current alchemtest.

The thing is, before this fix to extract the value from a field = value couple, a \s before and after the equal sign were mandatory. With the suggested new regex, the \s are no more mandatory, and so the case filed=value is caught by the regex.

I don't know which tests could be created to specifically check this as if this fails, almost all the tests should fail directly.

xiki-tempula · 2022-11-14T10:17:30Z

@DrDomenicoMarson So one purpose of the test is to ensure that the fix works.
The other purpose of the test is to ensure that it will work in the future. Say ten years into the future and some guy will come in and say the current amber parser is not good enough and need a refactor. We need to make sure that this new refactored parser would handle all the difficult cases that we had trouble before. Having a test would ensure that this case would be handled correctly in the future.

In this case, people would look at the current fr' {field}\s+=\s+(\*+|{_FP_RE}|\d+)' and say it looks good, we now know that this is not good enough and need a test case to show the future people why this won't work.

DrDomenicoMarson · 2022-11-14T10:30:55Z

Yeah, now I understand better, but I think that the file that needs to be added should have somewhere in the format "field =value" and/or "field=value" and/or "field= value", as these are the cases that are not covered by the current tests.
I agree this should be added, I didn't get it from the proposed test file :-)

xiki-tempula · 2022-11-14T10:46:57Z

but I think that the file that needs to be added should have somewhere in the format "field =value" and/or "field=value" and/or "field= value", as these are the cases that are not covered by the current tests.

Exactly, sorry for the confusion, I just find a random amber output file in my hand to show that in this case, one don't need the whole amber output file and a few lines showing the important bit is fine.

DrDomenicoMarson · 2022-11-14T10:52:25Z

Great, I added PR 86 in alchemtest, adding the two files needed to cover the two cases:
a "field=value" without spaces around the "=", and I think it would be good to raise an error when no "starting time" can be read, instead of just failing down the line, so I added the relevant test and files.

xiki-tempula · 2022-11-14T10:58:50Z

Many thanks for the fix, so for the PR which involves the edits in alchemtest, the recommended practice is to change the line

https://github.com/alchemistry/alchemlyb/blob/master/.github/workflows/ci.yaml#L58

to your PR in alchemtest https://github.com/DrDomenicoMarson/alchemtestDM/archive/refs/heads/master.zip

Then the test should all pass and you will ask for a review of the PR in both alchemtest and alchemlyb.
This should make the review process much easier. For this PR you would need a change log as well.

DrDomenicoMarson · 2022-11-14T11:20:28Z

I tried to follow the suggestion, I hope now it will work fine!
I added the relevant CHANGELOG, but I don't know if the format I used is correct.

src/alchemlyb/tests/parsing/test_amber.py

.github/workflows/ci.yaml

Co-authored-by: Zhiyi Wu <[email protected]>

xiki-tempula · 2022-11-14T13:02:55Z

Sorry didn't realise that the output is a dictionary, you could check if both dhdl and u_nk are dataframe

xiki-tempula

LGTM. Thanks for the quick fix.

@DrDomenicoMarson

- header - add @DrDomenicoMarson from PR #273

changed the regex

2cfd253

DrDomenicoMarson mentioned this pull request Nov 14, 2022

AMBER dVdl convert to pandas error #272

Closed

raise an error when no t0, and added the tests

12399fd

DrDomenicoMarson added 2 commits November 14, 2022 12:13

edited to target the new PR in alchemtest

c4a4e77

added the changelog for the new regex

07289c4

xiki-tempula requested changes Nov 14, 2022

View reviewed changes

src/alchemlyb/tests/parsing/test_amber.py Outdated Show resolved Hide resolved

xiki-tempula reviewed Nov 14, 2022

View reviewed changes

.github/workflows/ci.yaml Outdated Show resolved Hide resolved

xiki-tempula and others added 2 commits November 14, 2022 11:59

use the upstream alchemtest

ee051d5

Update src/alchemlyb/tests/parsing/test_amber.py

a6d4abd

Co-authored-by: Zhiyi Wu <[email protected]>

fixedtest

362b2a3

xiki-tempula approved these changes Nov 14, 2022

View reviewed changes

xiki-tempula merged commit 96d5a50 into alchemistry:master Nov 14, 2022

DrDomenicoMarson deleted the better_regex_for_extract_section branch November 14, 2022 15:11

orbeckst added a commit that referenced this pull request Nov 14, 2022

add 1.0.1 header to CHANGES

0916cba

- header - add @DrDomenicoMarson from PR #273

orbeckst added this to the 1.0.1 milestone Dec 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMBER parser had a "too strict" regex to extract values #273

AMBER parser had a "too strict" regex to extract values #273

DrDomenicoMarson commented Nov 14, 2022

codecov bot commented Nov 14, 2022 •

edited

Loading

xiki-tempula commented Nov 14, 2022

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022 •

edited

Loading

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022

xiki-tempula left a comment

AMBER parser had a "too strict" regex to extract values #273

AMBER parser had a "too strict" regex to extract values #273

Conversation

DrDomenicoMarson commented Nov 14, 2022

codecov bot commented Nov 14, 2022 • edited Loading

Codecov Report

xiki-tempula commented Nov 14, 2022

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022 • edited Loading

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022

DrDomenicoMarson commented Nov 14, 2022

xiki-tempula commented Nov 14, 2022

xiki-tempula left a comment

Choose a reason for hiding this comment

codecov bot commented Nov 14, 2022 •

edited

Loading

xiki-tempula commented Nov 14, 2022 •

edited

Loading