Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writes compressed output of given format #2221

Merged
merged 22 commits into from
Apr 5, 2019

Conversation

fenilsuchak
Copy link
Member

@fenilsuchak fenilsuchak commented Mar 16, 2019

Fixes #2216

Changes made in this Pull Request:

  • Preferring file_format argument over filename in get_writer_for .
  • The compressed form of atomgroups can be written as ag.write(filename.writer.gz)
  • Tests are written

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@codecov
Copy link

codecov bot commented Mar 16, 2019

Codecov Report

Merging #2221 into develop will decrease coverage by 1.14%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2221      +/-   ##
===========================================
- Coverage    90.71%   89.57%   -1.15%     
===========================================
  Files           15      158     +143     
  Lines         1982    19686   +17704     
  Branches         0     2769    +2769     
===========================================
+ Hits          1798    17633   +15835     
- Misses         184     1458    +1274     
- Partials         0      595     +595
Impacted Files Coverage Δ
package/MDAnalysis/coordinates/GRO.py 93.87% <100%> (ø)
package/MDAnalysis/core/groups.py 98.25% <100%> (ø)
package/MDAnalysis/analysis/nuclinfo.py 90.14% <100%> (ø)
package/MDAnalysis/lib/util.py 88.47% <100%> (ø)
package/MDAnalysis/coordinates/base.py 93.3% <0%> (ø)
package/MDAnalysis/analysis/helanal.py 85.19% <0%> (ø)
package/MDAnalysis/topology/MinimalParser.py 100% <0%> (ø)
package/MDAnalysis/visualization/__init__.py 100% <0%> (ø)
package/MDAnalysis/core/qcprot.py 100% <0%> (ø)
... and 138 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9428866...2c92916. Read the comment docs.

@fenilsuchak
Copy link
Member Author

Not sure why codecov is failing

@fenilsuchak
Copy link
Member Author

I guess the test is already written for this in Class TestGetWriterFor in test_util.py, heres the test

def test_compressed_extension(self):
        for ext in ('.gz', '.bz2'):
            fn = 'test.gro' + ext
            writer = mda.coordinates.core.get_writer_for(filename=fn)
            assert writer == mda.coordinates.GRO.GROWriter
            # Make sure ``get_writer_for`` works with compressed file file names
def test_compressed_extension_fail(self):
        for ext in ('.gz', '.bz2'):
            fn = 'test.unk' + ext
            # Make sure ``get_writer_for`` fails if an unknown format is compressed
            with pytest.raises(TypeError):
                mda.coordinates.core.get_writer_for(filename=fn)

@jbarnoud
Copy link
Contributor

Thank you for your contribution!

Codecov complains because you added code which is not tested. You need to add a test that calls ag.write with a file name ending on .gz to go through the lines you added.

From what I see, your change would indeed allow to write a gziped file. Though, as you noticed, get_writer_for is already dealing with compressed file extensions. Therefore, I think a better solution would be to get rid of the redundant format logic in write rather than adding to it like you did.

@richardjgowers, it seems you wrote the lines around

format = os.path.splitext(filename)[1][1:] # strip initial dot!
. Do you remember why you extracted the format there rather than leaving get_writer_for deal with it?

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 18, 2019

@jbarnoud, ohh ok, I'll add such a test. from what I understood get_writer_for gets the relevant writer( PDB etc.), based on the file extension that is stored in format which is provided as an argument to it. I think format logic would be necessary as it needs to handle three cases

  1. Usual File Extension (Which has a writer)
  2. GZipped File
  3. No File extension
    format I guess is a single variable for selecting between these. I couldn't think of a better solution given the code structure. But if there is something I'm missing, I'm willing to change the code accordingly to remove redundancy, I'll add tests after the changes are confirmed.

@jbarnoud
Copy link
Contributor

@Fenil3510 The format discovery logic has to exist, indeed. My point is that it is already implemented in get_writer_for that is called in ag.write. In its current version, ag.write overwrite the format discovery logic and explicitly ask for a file format, instead of letting get_writer_for figure out what format to use.

The user should still be able to ask for a specific format by providing the format keyword argument to ag.write, though. But just passing the format argument untouched should be enough.

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 19, 2019

Oh, I get it @jbarnoud you are right. get_writer_for does handle compression thing with function calls get_ext and check_compressed_format, making the thing I did redundant. We can just pass format to writer(filename, format = None) and its done. I guess there is no need for format variable then.
I've made changes. Thanks

@jbarnoud
Copy link
Contributor

This is what I had in mind.

But, you are breaking the file_format keyword argument of ag.write. This keyword argument should allow to bypass the format detection, and it is rather handy from time to time. The default value is "PDB", which is confusing because we actually default to guessing... The default should probably be None. This change should be documented, and we need to test what happens when the format cannot be guessed.

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 19, 2019

But, you are breaking the file_format keyword argument of ag.write. This keyword argument should allow to bypass the format detection, and it is rather handy from time to time.

@jbarnoud yes that makes sense. Currently when the user has given argument filename.extenstion and if the extension is valid then the file_format keyword would be redundant and preference would be given to the former to identify the format. This is what is happening currently if I'm right. So giving preference to file_format over extraction from filename is what you mean? ie. reversing the preference.

The issue then will be that if neither file_format nor extraction from filename can guess the format, in that case, we could use a default value like PDB?

@jbarnoud
Copy link
Contributor

Look at the signature of get_writer_for.

def get_parser_for(filename, format=None):

It takes an optional format. The guessing is only done if that format is not explicitly provided.

# Only guess if format is not provided

This is good practice because, if the user tells the function to do something, then the function should do what the user asked.

What you did in your last commit is to disregard an explicit format that the user may have passed with the file_format argument of ag.write.

In the ideal scenario for ag.write:

  • the format is guessed by default
  • if a format is explicitly provided, then that format is used

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 19, 2019

Ok, @jbarnoud I understand what you say but my confusion is in get_writer_for vs get_selection_writer_for.
Supposedly user provides ag.write with arguments filename(without extension) , file_format = GRO so this calls get_writer_for which throws an error (current implmentation) so get_selection_writer_for gets the control and writes in GRO format. So if it is done such that get_writer_for handles all possible scenarios, then get_selection_writer_for would be redundant, wouldn't it? Or is get_selection_writer_for for some different purpose?

@jbarnoud
Copy link
Contributor

get_selection_writer_for is about getting a selection writer. It is a different things, but ag.write can write either a structure (writer) or a selection (selection writer).

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 19, 2019

In the ideal scenario for ag.write:

  • the format is guessed by default
  • if a format is explicitly provided, then that format is used

Ok, thanks for the clarification about get_selection_writer. I will make changes as quoted above.

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 20, 2019

@jbarnoud a doubt, if the user writes ag.write(filename.pdb.gz, file_format = 'GRO') this results in a conflict in the things I'm about to change. Do these cases need to be take care of? If file_format is to be given preference as you said then it must be saved as filename.gro.gz ? or we throw an error?

@jbarnoud
Copy link
Contributor

As a general rule, it is very frustrating for users when a software tries to outsmart them. Indeed, the example you give presents a contradiction, but the request is clear: the user wants to write a GRO file named "filename.pdb.gz"; therefore it is what we should generate.

@richardjgowers
Copy link
Member

format is meant to overrule everything, the filename is just a (often very good) hint at the format.

I would have thought that you couldn't compress the file frame by frame, but instead you'd need to see the entire file before compressing. It would be useful to see a test that writes a 'test.pdb.gz', then a 'test.pdb' which gets compressed and check the relative file sizes? Also just a simple round trip test, so make a test.pdb.gz, create a new Universe from this file and check that positions & names match.

@fenilsuchak
Copy link
Member Author

As a general rule, it is very frustrating for users when a software tries to outsmart them. Indeed, the example you give presents a contradiction, but the request is clear: the user wants to write a GRO file named "filename.pdb.gz"; therefore it is what we should generate.

Noted @jbarnoud, thanks

@fenilsuchak
Copy link
Member Author

format is meant to overrule everything, the filename is just a (often very good) hint at the format.

Also just a simple round trip test, so make a test.pdb.gz, create a new Universe from this file and check that positions & names match.

Ok @richardjgowers . I'll write the test as suggested.

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Mar 21, 2019

@jbarnoud @richardjgowers I've made a fix. But this is by no means robust. There will be a bit of an issue in the compression part. How would the user indicate that a compressed format is needed? This would be via extraction filename itself, right?. The file_format argument is only to specify writers according to the current docstring. Should we extend the argument file_format to take values like pdb.gz and gro.gz?
Currently if filename = name.pdb.gz and file_format = pdb then gz(extracted from filename) file is created, but as a user, I think its a bit confusing or can we change the argument name from file_format to file_writer, because it essentially selects writer currently.

@fenilsuchak
Copy link
Member Author

The build is failing as get_selection_writer_for is expecting file_format to be a default one(previously 'PDB'). As there is no default_format now, should I change get_selection_writer_for to use PDB if file_format is None or let it throw error?

@orbeckst
Copy link
Member

I haven't reviewed the code but I note that although the tests pass now, the coverage is still low. Do you still have to write tests?

@fenilsuchak
Copy link
Member Author

@orbeckst, yes haven't added the tests yet. Will do as soon as changes are reviewed. There are some issues that need to be addressed that I've mentioned in the previous comments before tests be written.

@fenilsuchak
Copy link
Member Author

@Fenil3510 write the tests for what you have – people don't always have time to do reviews, and when they can make time then it's better that everything is already there, even if that means that you have to change things after the review.

I'm sorry, will follow it. I'm writing tests.

@fenilsuchak
Copy link
Member Author

fenilsuchak commented Apr 2, 2019

It would be useful to see a test that writes a 'test.pdb.gz', then a 'test.pdb' which gets compressed and check the relative file sizes? Also just a simple round trip test, so make a test.pdb.gz, create a new Universe from this file and check that positions & names match.

Tests were written as per this and from what I understood. Are any other tests expected?

I got a weird failure in PR MDAnalysis#2231 in Travis job
https://travis-ci.com/MDAnalysis/mdanalysis/jobs/189553940 :

   Exception occurred:
     File "/home/travis/miniconda/envs/test/lib/python3.5/site-packages/sphinx/ext/napoleon/docstring.py", line 703, in _parse_raises_section
       m = self._name_rgx.match(_type).groupdict()
   AttributeError: 'NoneType' object has no attribute 'groupdict'

and the docs build find locally (under Python 3.6)
@orbeckst
Copy link
Member

orbeckst commented Apr 2, 2019

The failure of the doc build https://travis-ci.com/MDAnalysis/mdanalysis/jobs/189731062 is not your fault, I just had the same thing in PR #2231 . Maybe my patch there will fix this.

If you haven't added your name yet to AUTHORS then do that, too.

Also add an entry to CHANGELOG.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding tests. See comments (formatting, tests for capitalized extensions). Also add yoursel to AUTHORS and and entry to CHANGELOG.

@@ -326,7 +326,7 @@ def __init__(self, filename, convert_units=None, n_atoms=None, **kwargs):
w.write(u.atoms)

"""
self.filename = util.filename(filename, ext='gro')
self.filename = util.filename(filename, ext='gro', keep = True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8; no space around = in args.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we also restricting the number of characters per line to 80/79 as per pep8?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the Style Guide: yes, we try to keep within 79 characters. (Not all our code rigorously adheres to it, but that's what we're striving for)

@@ -3054,8 +3054,7 @@ def write(self, filename=None, file_format="PDB",
if filename is None:
trjname, ext = os.path.splitext(os.path.basename(trj.filename))
filename = filenamefmt.format(trjname=trjname, frame=trj.frame)
filename = util.filename(filename, ext=file_format.lower(), keep=True)

filename = util.filename(filename, ext= file_format or 'PDB', keep=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8 !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test that you can omit .lower()??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like file_format or 'PDB'. Better be explicit so that only None really gets defaulted:

file_format if file_format is not None else 'PDB'

package/MDAnalysis/core/groups.py Outdated Show resolved Hide resolved
@@ -167,7 +167,27 @@ def test_write_frame_none(self, u, tmpdir, extension):
u.atoms.positions[None, ...], new_positions, decimal=2
)

@pytest.mark.parametrize('extension', ('xtc', 'dcd', 'pdb', 'xyz'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Se should capitalized extensions, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't understand. Should I capitalize all the extensions?. I've used the parameters from the previous test already written.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test for a capitalized extension 'PDB'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, clearly I should proof-read what I am typing... yes, test for capitalized extensions was what I was looking for.

testsuite/MDAnalysisTests/core/test_atomgroup.py Outdated Show resolved Hide resolved
@orbeckst orbeckst requested a review from jbarnoud April 2, 2019 20:42
@orbeckst
Copy link
Member

orbeckst commented Apr 2, 2019

@jbarnoud can you please also have a quick peek? Thanks.

@orbeckst
Copy link
Member

orbeckst commented Apr 3, 2019

I merged the latest develop into your branch, which should fix the doc build failure (via 4f8dc17).

@orbeckst
Copy link
Member

orbeckst commented Apr 3, 2019

oops, I killed travis #2233 ... fixing (so that the tests ACTUALLY run)

@orbeckst orbeckst mentioned this pull request Apr 3, 2019
4 tasks
fenilsuchak and others added 5 commits April 3, 2019 10:13
* pin the version of sphinx used
in Travis doc builds until upstream
issues are resolved in version 2.0
@fenilsuchak
Copy link
Member Author

oops, I killed travis #2233 ... fixing (so that the tests ACTUALLY run)

Would need to commit again for Travis to build? As I see it is fixed now.

@orbeckst
Copy link
Member

orbeckst commented Apr 4, 2019

Merged develop to get the travis fix from PR #2234.

@orbeckst orbeckst merged commit 1be9f73 into MDAnalysis:develop Apr 5, 2019
@orbeckst
Copy link
Member

orbeckst commented Apr 5, 2019

Congratulations @Fenil3510 to your first commit. Thank you for your effort.

@orbeckst
Copy link
Member

orbeckst commented Apr 5, 2019

@Fenil3510 I sent you a GitHub invite so that you can be part of @MDAnalysis/contributors.

@fenilsuchak
Copy link
Member Author

Happy to be a part of the team!

orbeckst added a commit that referenced this pull request Dec 21, 2023
Update of AUTHORS and CHANGELOG with inferred author contributions.

*  Removed duplicate mattwthompson in 0.20.0 changelog entry.: mattwthompson was placed twice by accident, this removes this duplication.

* Addition of missing authors.

   An retrospective analysis of the authors found via `git shortlog -s -n --email --branches="develop"` found several commits by authors which were not present in the `AUTHORS.md` file.

    - Zhenbo Li: commited via PR: Started tests for gnm. #803 and Make Travis run tests on OSX. #771,
    - Jenna M. Swarthout via PR Update CoC according to suggestions from current CoC committee #4289 and Point to new reporting form link (owned by [email protected]) #4298,
    - Bradley Dice via PR   Fix GSD Windows compatibility #2384 ,
    - David Minh via PR #2668

   There seemed to be no indications in the above mentioned PRs that the author did not want to be in the authors file, it looks like it was just forgotten.

* Addition of missing entries from the changelog

   Continued cross referencing of the git shortlog output but also accounting for the changelog identified several individuals that had not been included in the changelog entries for the release they contributed under. They were added to the relevant entry of the changelog based on the merge commit date.

   Please note that for Tone Bengsten, we a) had no github handle (so they were assigned @tbengtsen), and b) no specific commit. Given we know that this individual was including alongside the encore merge, they were assigned to the 0.16.0 release.

* Update CHANGELOG
* PR #1218
* PR #1284 and #1408
* PR #4109
* based on git history
* PRs #803 and #771 (author addition, changelog addition)
* PR #2255 and #2221
* PR #1225
* PR #4289 and #4298
* PR #4031
* PR #4085
* PR #3635
* PR #2356
* PR #2559
* No GH handle - Encore author addition @tbengtsen
* PR #4184
* PR #2614
* PR #2219
* PR #2384
* PR #2668
* Add missing entry for Jenna

Thanks to @fiona-naughton for helping out with digging into this data :)

Co-authored-by: Fiona Naughton <[email protected]>
Co-authored-by: Oliver Beckstein <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants