Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the BioPython PDBParser for topology and trajectory readers #777

Closed
kain88-de opened this issue Mar 17, 2016 · 25 comments
Closed

Remove the BioPython PDBParser for topology and trajectory readers #777

kain88-de opened this issue Mar 17, 2016 · 25 comments

Comments

@kain88-de
Copy link
Member

In the discussion in #775 it became clear that the BioPython Parser should be removed completely.

@jdetle
Copy link
Contributor

jdetle commented Mar 28, 2016

I will go ahead and take care of this in the coming days.

@jbarnoud
Copy link
Contributor

I do not know if anybody calls the PDB parser directly, but it may be best to have a deprecation period for it. Anyway, no other part of the code should call it, and the documentation should be updated.

@kain88-de
Copy link
Member Author

You should also deprecate the permissive flag in the universe creation.

@richardjgowers
Copy link
Member

I think the point here is we're trying to make our parser do everything the other one does, so it doesn't need deprecating as no functionality is lost?

@jdetle as step 0 for this, I'd try reading every .pdb file we have in the test suite with PrimitivePDBParser and see if you get any failures. Then try comparing to the results of PDBParser

@kain88-de
Copy link
Member Author

@richardjgowers does our testsuite even use the BioPython Trajectory reader. As far as I can tell from the coordinates tests only the primitive reader is used. I actually started to notice a bunch of PDB related issues when I tried to use the new common-API reader tests class defined in MDAnalysisTests/coordinates/base.

@richardjgowers
Copy link
Member

This is getting a little confusing, I thought we were talking about the Parser (reads topology). Either way, I find the whole permissive/primitive thing confusing and don't really understand what one does that the other doesn't

@kain88-de
Copy link
Member Author

Oh sorry I meant the trajectory reader, I never looked at the topology parsers so I don't have a clear idea about them

@richardjgowers
Copy link
Member

@kain88-de there's a similar split and it could do with removing too. I guess you could rename this issue to cover both

@orbeckst orbeckst changed the title Remove the BioPython PDBParser Remove the BioPython PDBParser for topology and trajectory readers Mar 31, 2016
@orbeckst
Copy link
Member

We want to get rid of the Bio.PDB.PDBParser which is being used when using Universe(..., permissive=False). The permissive keyword was supposed to indicate that the original Bio.PDB parser is very strict about the PDB format whereas our PrimitivePDBReader is supposedly more lenient.

@jdetle
Copy link
Contributor

jdetle commented Mar 31, 2016

Duly noted, I am working on what @richardjgowers said earlier. I think in in these initial stages everything is going to take longer than I think it should because I'm getting familiar with the codebase and python itself. I just ran into an interesting error that I'm investigating regarding object equality. By my understanding:

def setUp(self):
   self.universe = mda.Universe(PDB_small, permissive=True)
    # 3 decimals in PDB spec
    # http://www.wwpdb.org/documentation/format32/sect9.html#ATOM
   self.prec = 3
def test_PDB(self):
        from MDAnalysis.coordinates.PDB import PrimitivePDBReader
        primitiveReader = PrimitivePDBReader(PDB_small, n_atoms=3341)
        print("checking equal:" + str(self.universe.trajectory == primitiveReader))
        assert_equal(self.universe.trajectory, primitiveReader)

the print statement should yield "checking equal: True" but instead yields False, furthermore the test doesn't pass:

======================================================================
FAIL: test_PDB (MDAnalysisTests.coordinates.test_pdb.TestPrimitivePDBReader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jdetlefs/github/MDAnalysis/testsuite/MDAnalysisTests/coordinates/test_pdb.py", line 126, in test_PDB
    assert_equal(self.universe.trajectory, primitiveReader)
  File "/usr/lib/python2.7/dist-packages/numpy/testing/utils.py", line 317, in assert_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
 ACTUAL: <PrimitivePDBReader /home/jdetlefs/github/MDAnalysis/testsuite/MDAnalysisTests/data/adk_open.pdb with 1 frames of 3341 atoms>
 DESIRED: <PrimitivePDBReader /home/jdetlefs/github/MDAnalysis/testsuite/MDAnalysisTests/data/adk_open.pdb with 1 frames of 3341 atoms>

Is this done intentionally for some reason or are we missing some __eq__(object, other): methods in places?

@jbarnoud
Copy link
Contributor

jbarnoud commented Apr 1, 2016

I am not sure readers implement any comparison operator. If so, the python documentation tells us that objects are compared by their identity so == behave like is:

If no cmp(), eq() or ne() operation is defined, class instances are compared by object identity (“address”).

This means an object will only be equal to itself. In your example the two objects represent the same thing (a PDB reader for the same file), but they are different instances: i.e. id(self.universe.trajectory) != id(primitiveReader).

In that case, the distinction is important, because a reader store a state and changing the sate of a reader does not change the state of an identical reader.

@kain88-de
Copy link
Member Author

We don't implement a comparison for the readers. It also wouldn't make much sense to have that. From your code snippet I guess you want to check if the reader attached to the universe is of type PrimitivePDBReader that can be done with the isinstance function.

@jdetle
Copy link
Contributor

jdetle commented Apr 2, 2016

@richardjgowers maybe I misunderstood, but I went ahead and tried to start testing the parsers, however I ran into an issue that I don't know how to get around. all the modules are cythonized

from MDAnalysis.topology import PDBParser, PrimitivePDBParser
......
BioPParser = PDBParser(filename)

throws the exception "Module not callable" how do I go about testing the .py versions of the modules? Does this require changing how my packages are linked something far simpler?

@dotsdl
Copy link
Member

dotsdl commented Apr 2, 2016

@jdetle none of the parsers are written in Cython; that's not the issue. The problem here is that if have a look at the source tree, PDBParser and PrimitivePDBParser are modules, which is what the exception tells you. You will need to something like:

from MDAnalysis.topology.PDBParser import PDBParser

to make this work.

@jdetle
Copy link
Contributor

jdetle commented Apr 2, 2016

@dotsdl ah okay! Thank you.

@jdetle
Copy link
Contributor

jdetle commented Apr 2, 2016

@dotsdl Did you write the test_topology file? It seems like you actually have already done most of what @richardjgowers suggested I do with the _TestTopology and TestPDB classes in test_topology.py, I wrote a kind of shoddy script from inspecting failures with this

class testParsers(object):
    def test_PrimitiveAndBioPython(self):
        files = [PDB_NAMD,
        PDB_small,
        PDB_closed,
        PDB_multiframe,
        PDB_helix,
        PDB_conect,
        PDB_full,
        PDB_HOLE,
        unordered_res]

        for filename in files:
            print(filename)
            ParsePrim = PrimitivePDBParser(filename).parse()
            ParseBIO = PDBParser(filename).parse()

            if(filename == PDB_multiframe or filename == PDB_conect
                or filename == PDB_full):
                print('skip')
                #do nothing
            else:
                assert_equal(ParsePrim['atoms'] == ParseBIO['atoms'], True, "Atoms not equal")

My gut tells me that I should be separating this out into a test for each filename so that it would be three failures and the rest pass, I can do that but since we are removing the BioPython parser anyways it seems moot. The parsed atom arrays are not equal for PDB_multiframe, PDB_conect and PDB_full. Given that the PrimitiveParser is tested extensively in test_topology, I think its safe to say that these are instances in which the strict parser fails and further reinforce why we are getting rid of it. If this conclusion seems valid I'll go ahead and delete BioPython parser where it comes up and make a pull request.

@richardjgowers
Copy link
Member

@jdetle so what you've described is called a test generator. An example of using them is here:

https://github.com/MDAnalysis/mdanalysis/blob/develop/testsuite/MDAnalysisTests/test_util.py#L643

Where the for loop goes over the list called formats and yields many tests

So a simpler example is...

class TestAddition(object):  # can't use TestCase!
    def _check_addition(self, a, b, ref):
        assert_equal(a+b, ref)

    def test_addition(self):
        for x, y, z in ((1, 2, 3), (7, 10, 1), (3, 4, 7)):
            yield self._check_addition, x, y, z

So the yield statement is creating tests, 3 tests will get created, with the 2nd test failing but the 3rd still running

@jdetle
Copy link
Contributor

jdetle commented Apr 2, 2016

@richardjgowers Awesome, I will certainly use that for a problem in the future. Big question: Do we still need the BioPython trajectory writer? If this is not true, by my understanding, we could get rid of the permissive flag in its entirety, but the pull request would be somewhat more substantial.

@richardjgowers
Copy link
Member

Yeah I think we're trying to get rid of permissive variants of everything. But it's sometimes a lot easier for 1 issue to have 3 PRs if you can split it up nicely.

@jdetle
Copy link
Contributor

jdetle commented Apr 2, 2016

Sorry I think I'm having a jargon issue, my understanding is that BioPython.PDBReader is a strict reader, in that it is not 'permissive' of weird pdb formats. We are getting rid of permissive = False variants of everything by my understanding. Oh I think I also forgot that there are more than just PDB Readers, so I'm no longer sure if we could get rid of the permissive flag.

@richardjgowers
Copy link
Member

Yeah I get them confused too. We're killing the bio versions whatever
they're called

On Sat, 2 Apr 2016 23:07 John Detlefs, [email protected] wrote:

Sorry I think I'm having a jargon issue, my understanding is that
BioPython is a strict reader, in that it is not 'permissive' of weird pdb
formats. We are getting rid of permissive = False variants of everything by
my understanding.


You are receiving this because you were mentioned.

Reply to this email directly or view it on GitHub
#777 (comment)

@jdetle
Copy link
Contributor

jdetle commented Apr 2, 2016

Yea okay so now I am fairly certain that we could get rid of the 'permissive_pdb_reader' flag in MDAnalyis/core/init.py'

@kain88-de
Copy link
Member Author

Yes we can get rid of that

orbeckst pushed a commit that referenced this issue Apr 22, 2016
- 'permissive'=False has no effect anymore
- Added deprecation warnings for Primitive Readers/Writers and Parsers.
- Changed doc strings to eliminate references to
  BioPython Reader/Writer.
- Updated CHANGELOG to reflect changes.
orbeckst pushed a commit that referenced this issue Apr 22, 2016
- Changed tests to eliminate known failures caused by BioPython
- Updated CHANGELOGs and fixed wrong version number in AtomGroup.py,
- fixed indentation issue in PDB.py
- Fixed doc references to version strings and the permissive flags,
- got rid of extraneous text in PrimitivePDBParser, fixed scope of warnings
- Used boolean property of collections as suggested by QC
orbeckst pushed a commit that referenced this issue Apr 22, 2016
- 'permissive'=False has no effect anymore
- Added deprecation warnings for Primitive Readers/Writers and Parsers.
- Changed doc strings to eliminate references to
  BioPython Reader/Writer.
- Updated CHANGELOG to reflect changes.
orbeckst pushed a commit that referenced this issue Apr 22, 2016
- Changed tests to eliminate known failures caused by BioPython
- Updated CHANGELOGs and fixed wrong version number in AtomGroup.py,
- fixed indentation issue in PDB.py
- Fixed doc references to version strings and the permissive flags,
- got rid of extraneous text in PrimitivePDBParser, fixed scope of warnings
- Used boolean property of collections as suggested by QC
richardjgowers added a commit that referenced this issue Apr 24, 2016
@orbeckst
Copy link
Member

@jdetle well done, that was sizable chunk of work. I look forward to your GSoC contribution!

@jdetle
Copy link
Contributor

jdetle commented Apr 24, 2016

@orbeckst Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants