Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read APBS .dx.gz generated grids #70

Merged
merged 11 commits into from
Nov 7, 2019

Conversation

eloyfelix
Copy link
Contributor

@eloyfelix eloyfelix commented Oct 30, 2019

APBS can write gz compressed dx files: https://apbs-pdb2pqr.readthedocs.io/en/latest/apbs/input/elec/write.html
It's quite convenient when dealing with very big grids as it needs less disk space/RAM.

This PR adds DXGZ input format functionality.

@eloyfelix
Copy link
Contributor Author

Travis shows 'pending' state in github but it already finished.

This could be used like this:

from gridData import Grid

grid = Grid("data.dx.gz", file_format='DXGZ')

Is this something that you would merge into master?

Thanks

data.dx.gz

@kain88-de
Copy link
Member

Hi @eloyfelix this looks like a nice addition. Definitely good for merging. Can you add a test for this and add yourself to the authors?

I wonder if we should just allow this for all files. If it ends in .gz we just assume it is zipped.

@codecov
Copy link

codecov bot commented Nov 5, 2019

Codecov Report

Merging #70 into master will increase coverage by 0.3%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master      #70     +/-   ##
=========================================
+ Coverage   87.03%   87.34%   +0.3%     
=========================================
  Files           5        5             
  Lines         779      798     +19     
  Branches      113      118      +5     
=========================================
+ Hits          678      697     +19     
  Misses         60       60             
  Partials       41       41
Impacted Files Coverage Δ
gridData/OpenDX.py 82.88% <100%> (+0.66%) ⬆️
gridData/core.py 94.16% <100%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 818c8da...5dbd3e2. Read the comment docs.

@codecov
Copy link

codecov bot commented Nov 5, 2019

Codecov Report

Merging #70 into master will increase coverage by 0.27%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #70      +/-   ##
==========================================
+ Coverage   87.03%   87.31%   +0.27%     
==========================================
  Files           5        5              
  Lines         779      796      +17     
  Branches      113      115       +2     
==========================================
+ Hits          678      695      +17     
  Misses         60       60              
  Partials       41       41
Impacted Files Coverage Δ
gridData/OpenDX.py 82.41% <100%> (+0.19%) ⬆️
gridData/core.py 94.33% <100%> (+0.29%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 818c8da...40abc1b. Read the comment docs.

@eloyfelix
Copy link
Contributor Author

eloyfelix commented Nov 5, 2019

Added capability to write .dx.gz files and tests

@orbeckst
Copy link
Member

orbeckst commented Nov 5, 2019

The functionality is very useful – not sure why I didn't put it in the first place because I gzip my dx files, too.

But I don't like additional kwargs to indicate compression. I agree with @kain88-de that this should be universal and should just be based on the extension.

I wonder if we should just allow this for all files. If it ends in .gz we just assume it is zipped.

We have functionality in MDAnalysis openany() https://github.com/MDAnalysis/mdanalysis/blob/7f82b88f769f3882e44b5121463c93b899ef5b55/package/MDAnalysis/lib/util.py#L265-L454 to do this seamlessly. The problem is that MDA is GPL2 and GridDataFormats is LGPL. My understanding is that we cannot copy code from GPL to LGPL. Instead I have to find out who wrote the code, get everyone who contributed to agree to relicense, and then we can take it.

@orbeckst
Copy link
Member

orbeckst commented Nov 5, 2019

@eloyfelix you can either wait and see if we can just take the openany() code, which will just solve the problem, or if you are keen to move forward, change your code so that it looks at the filename extension to decide if it should uncompress or compress. In this way, the API will stay the same even if we add openany() later.

@orbeckst
Copy link
Member

orbeckst commented Nov 5, 2019

Relicensing of openany() under LGPL

From blame: lib/util.py L265-L454: Do you agree to relicense openany() (see link for exact lines of code) under LGPL? Please reply below (you can also tick the box but there must be a reply with your handle in this thread):

Thanks!

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for your contribution. As already mentioned in the comments, we would like this to work by filename extension instead of explicit kwargs.

  • Please change your code to use filename extension for deciding how to read/write the file.
  • Simplify the tests by running the same tests for compressed and uncompressed files: use pytest.mark.parametrize('filename', [datafiles.DX, datafiles.DXGZ]) and pytest.mark.parametrize('outfile', ["grid.dx", "grid.dx.gz"])
  • add an entry to CHANGELOG

Many thanks!

gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Show resolved Hide resolved
gridData/core.py Outdated Show resolved Hide resolved
gridData/core.py Outdated Show resolved Hide resolved
gridData/core.py Outdated Show resolved Hide resolved
gridData/tests/datafiles/__init__.py Outdated Show resolved Hide resolved
gridData/tests/test_dx.py Outdated Show resolved Hide resolved
gridData/tests/test_dx.py Outdated Show resolved Hide resolved
gridData/tests/test_dx.py Outdated Show resolved Hide resolved
@eloyfelix
Copy link
Contributor Author

eloyfelix commented Nov 5, 2019

Many thanks for your suggestions @orbeckst. I'll take a look at it asap and try to make the changes to better fit to your existing codebase.

@eloyfelix
Copy link
Contributor Author

this should be, at least, a bit closer :P

@eloyfelix eloyfelix requested a review from orbeckst November 6, 2019 00:06
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. Some additional things

for l in in_file:
out_file.write(l)
os.remove(filename)
if ext == '.gz':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compressed writing should be handled in dx.write() instead of working on the uncompressed file. I would generally try to avoid any intermediate file writing.

Sorry, I didn't notice on the first round of reviews.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the logic to dx.write but given how it was originally implemented it was a bit tricky. I needed to add few isinstance to check if I needed to write string or bytes depending on the stream

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also needed to implement a simple version of your 'openany' to deal with it:

https://github.com/eloyfelix/GridDataFormats/blob/DXGZ_input_format/gridData/OpenDX.py#L486

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I don't particularly like having to repeat the same code everywhere. But that's because of the poor structure of the original code (i.e., my fault ;-) where writing is actually done in each element of the DXfield. Can you think of a better code structure where we only need to do the encoding etc once? E.g., instead of having each element write, return a text representation that is then concatenated at the top level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added write_line function: https://github.com/eloyfelix/GridDataFormats/blob/DXGZ_input_format/gridData/OpenDX.py#L193

It's probably not optimal but it's the best that I could do without doing major codebase changes (which I'm not sure I'm ready to do)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your approach is fine, I just added a few minor change requests to clean up the code. Importantly, this is really cleaning up my mess – I would be very grateful if you could do it as part of this PR!


def test_read_dxgz():
g = Grid(datafiles.DXGZ, file_format='DXGZ')
@pytest.mark.parametrize("infile", [datafiles.DX, datafiles.DX+'.gz'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be explicit and use datafiles.DXGZ

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you also need to

  • add a test.dx.gz to test files
  • add DXGZ to datafiles/__init__.py

I don't know if one can make a fixture work in which you generate the gzipped file on the fly. I can think of the first part, but not how to use the fixture together with datafiles.DX in a parameterized test or fixture. But in any case, here's the code I can think of (uses tmpdir_factory because the temporary file should be created in a temporary location but the tmpfile fixture is not module level)

@pytest.fixture(scope="module")
def dxgz_file(tmpdir_factory, src=datafiles.DX):
    basename = os.path.basename(src) + '.gz'
    fn = tmpdir_factory.mktemp("compressed").join(basename)
    with open(src, 'rb') as inp:
        with gzip.open(str(fn), "wb") as out:
             out.write(inp.read())
    return str(fn)

# use dxgz_file fixture when needed ... although I am actually not sure how 
# you use a fixture as part of a parameterized fixture.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added test.dx.gz file and DXGZ to datafiles/init.py but I'm not sure about the fixture...

CHANGELOG Outdated Show resolved Hide resolved
gridData/tests/test_dx.py Show resolved Hide resolved
@utkbansal
Copy link
Member

I'm okay with the changes.

@eloyfelix
Copy link
Contributor Author

hope we are getting closer with the latest changes

gridData/OpenDX.py Outdated Show resolved Hide resolved
@kain88-de
Copy link
Member

LGTM. @orbeckst I leave you to merge it

@eloyfelix Nice work. Thanks

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good in principle, but it would be really nice if you could make some things clearer and cleaner. (The latter is just cleaning up my old crappy code – I would be very grateful if you could do this as part of polishing the PR.) — see comments.

You also need to make sure that your test file is bundled in the python package, so you need to add a pattern to setup.py near

GridDataFormats/setup.py

Lines 41 to 42 in 818c8da

package_data={'gridData': ['tests/datafiles/*.dx', 'tests/datafiles/*.ccp4',
'tests/datafiles/*.plt']},
e.g., something like "tests/datafiles/*.gz".

gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/OpenDX.py Outdated Show resolved Hide resolved
gridData/core.py Outdated Show resolved Hide resolved
for l in in_file:
out_file.write(l)
os.remove(filename)
if ext == '.gz':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your approach is fine, I just added a few minor change requests to clean up the code. Importantly, this is really cleaning up my mess – I would be very grateful if you could do it as part of this PR!

gridData/tests/test_dx.py Show resolved Hide resolved
@orbeckst orbeckst self-assigned this Nov 7, 2019
@eloyfelix
Copy link
Contributor Author

Thanks for the review, I believe I did all the proposed changes. The whole thing looks definitely cleaner now.
Hope to see a release soon! :P

@orbeckst orbeckst merged commit 37e6d0c into MDAnalysis:master Nov 7, 2019
@orbeckst
Copy link
Member

orbeckst commented Nov 7, 2019

Thank you for all the work and extra work addressing the review comments. This is a nice addition!

@eloyfelix
Copy link
Contributor Author

eloyfelix commented Nov 7, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants