Consolidate atom collection classes into a single class #378

claudejrogers · 2022-03-25T23:45:52Z

Hello,
I noticed a very minor bug in .pdb file output and, while investigating how much work it would take to fix, was struck by the complexity that having three atom containers introduces in the code, especially in the structure io module. To me, the only difference between the three classes is the shape of the coords array, since an Atom is an AtomArray of coordinate shape (1, 3) or (3,), and an AtomArrayStack is, or could be, an AtomArray of shape (m, n, 3). I implemented an example of a single atom container in this gist.

For simplicity, I consolidated the non-coordinate atom data in a numpy structured array and ignored boxes and bonds for now. There is an internal _is_stack attribute to simplify indexing. I added to_pdb() and from_pdb() for the sake of illustration. The to_pdb method supports putting multiple models in a single file, adds TER lines, and aligns atom names.
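
To make the idea concrete without opening the gist, here is a minimal sketch of what such a single container might look like. It is not the gist's code: the class name `Atoms`, the field names in `ANNOT_DTYPE`, and the indexing rules are assumptions chosen for illustration, and only indexing along the first axis is handled.

```python
import numpy as np

# Hypothetical annotation layout; the gist may use other names and widths.
ANNOT_DTYPE = np.dtype([
    ("chain_id", "U4"),
    ("res_id", "i4"),
    ("res_name", "U5"),
    ("atom_name", "U6"),
    ("element", "U2"),
])


class Atoms:
    """One container for a single atom, a single model, or a stack of models."""

    def __init__(self, annot, coords):
        self.annot = np.asarray(annot, dtype=ANNOT_DTYPE)
        self.coords = np.asarray(coords, dtype=float)
        if self.coords.shape[-1] != 3:
            raise ValueError("last coordinate axis must have length 3")
        # (3,) -> atom, (n, 3) -> model, (m, n, 3) -> stack of m models
        self._is_stack = self.coords.ndim == 3

    def __getitem__(self, index):
        # Simplification: only first-axis indexing (int or slice) is supported.
        if self._is_stack:
            # Indexing a stack selects models; annotations are shared.
            return Atoms(self.annot, self.coords[index])
        # Indexing a model selects atoms, so annotations are indexed as well.
        return Atoms(self.annot[index], self.coords[index])
```

With this layout, `stack[0]` yields a single model with coordinates of shape (n, 3), and indexing that model again yields one atom with coordinates of shape (3,).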

Is this of any interest? I understand it would be a massive undertaking to implement even if it was.

padix-key · 2022-03-27T11:23:02Z

Maintainer

Hi. Indeed, implementing your proposed change, or a similar consolidation of all containers into a single one, would be quite a large endeavour. Hence, I think this thread is a good place to compare the pros and cons of the proposed change. Here are my initial thoughts:

In retrospect, one class could probably have covered all three containers, similar to ndarray, which is able to cover data of any dimensionality.
The AtomArray and AtomArrayStack are at the heart of biotite.structure and some extension packages. Almost all functions in this subpackage accept one of these as parameters. Hence, reworking all functionality for the new class would be considerable work, as you already stated. Furthermore, this would be an API change that would affect almost all Biotite users. However, careful design of the API of the new class could give some compatibility with the current AtomArray and AtomArrayStack, mitigating the work and the incompatible API changes at least to a certain extent.
Some functionalities only work for AtomArray or AtomArrayStack. Therefore, specifying the input type as one of these gives the user a clear message whether the function accepts only a single-model structure, only a multi-model structure, or both.
I think the overhead of handling all 3 containers is not that high: effectively, Atom is only used to create a handcrafted AtomArray or to print single atoms and does not play a major role in the package. AtomArray and AtomArrayStack have a quite similar API. The annotation arrays and coordinates can be obtained by the same attribute names; only the returned dimensions of the coordinates are different. Also, most methods are the same, since they originate from _AtomArrayBase. I agree with you that in PDBFile and some other File classes the complexity is quite high, since both classes are handled separately, which also makes implementing #131 (Support writing multiple AtomArray to structure file) more tedious. However, the main reason for the current separation when writing a PDBFile is that the largest part of an ATOM or HETATM record is identical for all models in an AtomArrayStack; only the coordinates differ. Therefore, the common part of the record is reused, which increases performance. However, this also required (from my point of view) the separate implementation, which increases complexity. Without considering performance, the PDB writer could have been implemented as a nested loop for both AtomArray and AtomArrayStack, analogous to the implementation in your gist.
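
The record reuse described in the last point can be combined with a nested loop. The following is a rough sketch under simplifying assumptions, not Biotite's actual writer: `record_prefixes` and `write_models` are hypothetical helpers, atoms are plain tuples, atom names are simply left-aligned, and occupancy, B-factor and element columns are omitted.

```python
def record_prefixes(atoms):
    """Format the model-independent columns 1-30 of each ATOM record once.

    `atoms` is a list of (atom_name, res_name, chain_id, res_id) tuples.
    """
    return [
        f"ATOM  {serial:>5d} {name:<4s} {res_name:>3s} {chain:1s}{res_id:>4d}    "
        for serial, (name, res_name, chain, res_id) in enumerate(atoms, start=1)
    ]


def write_models(atoms, models):
    """Return PDB-like lines for one or more models sharing the same atoms.

    `models` is a sequence of per-model coordinate lists; a single model is
    passed as a one-element sequence, so one code path covers both cases.
    """
    prefixes = record_prefixes(atoms)  # shared by every model
    lines = []
    for number, coords in enumerate(models, start=1):
        lines.append(f"MODEL {number:>8d}")
        for prefix, (x, y, z) in zip(prefixes, coords):
            # Only the coordinate columns (31-54) differ between models.
            lines.append(f"{prefix}{x:8.3f}{y:8.3f}{z:8.3f}")
        lines.append("ENDMDL")
    return lines
```

Whether precomputing the prefixes is measurably faster than formatting each full record inside the loop would have to be benchmarked.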

Based on these considerations, I currently favor the existing data model, though I am open to a discussion about this topic. Nevertheless, I think we could have another look at PDBFile and other classes in structure.io to see if there is potential to reduce complexity. For instance, although performance was the reason for handling AtomArray and AtomArrayStack separately in PDBFile, as pointed out, I never actually benchmarked it against the alternative.


claudejrogers · 2022-03-27T20:42:31Z

Author

These are all good points, but some things to keep in mind include:

As long as the exposed API of a new class is similar, the impact to users could be minimal.
Since the APIs among the existing classes are already similar, changing the implementation would probably be less painful than it seems (that said, I haven't used a lot of the non-PDB-facing code, so I could be totally wrong...).
Deleting 2 classes means there is less code to maintain.
I find aspects of the current API non-intuitive. For example, why does calling get_structure on a PDBFile with a single model return an AtomArrayStack by default? Why doesn't calling read parse the file? Why shouldn't users be able to write an Atom to a PDB file? The .get_array() method accepts 0 as an index, but PDBFile.get_structure doesn't.
In my opinion, checking the shape or ndim attribute of an ndarray is preferable to calling isinstance. This could be swapped now, mitigating a future transition, if it happened. In other words, I think it would be possible to transition the existing code to decouple the atom data from the implementation.
The overhead of three atom collections may not be high in terms of performance, but it does impact code complexity/maintainability. For example, PDBFile.set_structure has a minor bug where atom names aren't properly aligned. To fix that requires editing two identical blocks of code to generate the lines. Moreover, reasonable feature requests are painful to implement.
For functions that expect an (n, 3) coordinate array (e.g., PDB writing, Kabsch-based transformations, etc.), I think adding an extra dimension to the coordinates and running a one-iteration outer for-loop (as I did in my to_pdb example) is worth the tiny performance cost, since the loop body runs only once for an (n, 3) coordinate array, if it means not having to repeat large blocks of code. For 2D coordinates, the code could look like:

# ...
    coords = atoms.coords
    if coords.ndim == 2:
        coords = coords[np.newaxis, ...]
    transformed = []
    for mol in coords:
        # this loop is called once for 2D coordinates, but in exchange the same code works
        # for single molecules or collections
        transformed.append(func_taking_2darray_returning_2darray(mol))
    atoms.coords = np.array(transformed).squeeze()
# ...
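
A self-contained variant of the same pattern is sketched below. The helper name `apply_per_model` and the example function `center` are made up for illustration. One deliberate difference from the snippet above: instead of calling `squeeze()`, which would also drop the leading axis of a stack that happens to contain a single model (or any other length-1 axis), the sketch removes only the axis it added itself.

```python
import numpy as np


def apply_per_model(coords, func):
    """Apply `func`, which maps an (n, 3) array to an (n, 3) array, to each model.

    Accepts either (n, 3) or (m, n, 3) coordinates and returns the same shape.
    """
    coords = np.asarray(coords, dtype=float)
    single = coords.ndim == 2
    if single:
        # Promote (n, 3) to (1, n, 3) so that one loop handles both cases.
        coords = coords[np.newaxis, ...]
    result = np.stack([func(model) for model in coords])
    # Drop only the axis added above.
    return result[0] if single else result


def center(model):
    """Example transformation: move the centroid of one model to the origin."""
    return model - model.mean(axis=0)
```

For instance, `apply_per_model(atoms.coords, center)` would work unchanged for a single model and for a stack.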

Note that in my gist example the non-coordinate atom data was only read/stored once.
I don't think the PDBFile code is particularly performant, as it stores the entire contents of the input file in memory, iterates through .lines at least 4 or 5 times when parsing, and does lots of string concatenation when setting the structure. I'm pretty sure my example is faster/more efficient.

If we could agree what an acceptable API for a single atom collection might look like, I think it would be possible to transition the code without introducing a disruptive change.

1 reply

claudejrogers Mar 27, 2022
Author

The code could also contain "type aliases" to preserve the user API, e.g.:

class AtomArray(Atoms):
    pass


class AtomArrayStack(Atoms):
    pass
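
One detail that may matter for compatibility: empty subclasses and plain name aliases behave differently under isinstance(). The sketch below uses a stand-in `Atoms` class and a hypothetical `AtomArrayAlias` name purely to show the difference.

```python
class Atoms:
    """Stand-in for the proposed single container."""


# Option 1: empty subclasses, as in the snippet above.
class AtomArray(Atoms):
    pass


class AtomArrayStack(Atoms):
    pass


# Option 2: a plain alias, i.e. a second name for the very same class.
AtomArrayAlias = Atoms

# A subclass instance is always an Atoms instance ...
print(isinstance(AtomArray(), Atoms))       # True
# ... but a plain Atoms instance is not an AtomArray,
print(isinstance(Atoms(), AtomArray))       # False
# whereas it does pass a check against an alias.
print(isinstance(Atoms(), AtomArrayAlias))  # True
```

So with empty subclasses, existing isinstance(x, AtomArray) checks would only keep working if library functions returned the matching subclass rather than a bare `Atoms` object.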

padix-key · 2022-03-29T17:21:18Z

Maintainer

Given the impact of such a change, I think it would need a longer discussion, preferably with input from other Biotite users. In my opinion, the current implementation and your proposal differ mainly in how the user (and also library code) separates single-model and multi-model structures. Currently, this is done at the class level (isinstance()); in your proposal it would be done by checking the shape of the coord (and box). Therefore, I think the decision mostly comes down to personal preference.

Here are some replies to your comments:

Deleting 2 classes means there is less code to maintain.

The common functionality of the two classes is already consolidated into _AtomArrayBase. The individual code primarily deals with the different kinds of indexing.

I find aspects of the current API non-intuitive. For example, why does calling get_structure on a PDBFile with a single model return an AtomArrayStack by default?

This behavior was implemented to give the user a defined output type (or coordinate shape in your approach) independent of the input file.

Why doesn't calling read parse the file?

In all of Biotite's file classes, read() means loading data into memory without too many time-consuming computations. In the context of text files, this usually means separating the lines of the file. The user can then choose which information should be extracted from the file.
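
As a rough illustration of this split between loading and parsing, consider the sketch below. `LineFile` and its methods are hypothetical and much simpler than Biotite's real file classes: read() only splits the text into lines, and each getter scans those lines on demand.

```python
import io


class LineFile:
    """Hypothetical text-file class with cheap loading and on-demand parsing."""

    def __init__(self):
        self.lines = []

    @classmethod
    def read(cls, file):
        # Loading step: keep the raw lines, interpret nothing yet.
        obj = cls()
        obj.lines = file.read().splitlines()
        return obj

    def get_model_count(self):
        # Parsing step 1: only looks at MODEL records.
        return sum(1 for line in self.lines if line.startswith("MODEL"))

    def get_coord_lines(self):
        # Parsing step 2: only looks at atom records.
        return [
            line for line in self.lines
            if line.startswith(("ATOM", "HETATM"))
        ]


pdb_text = "MODEL        1\nATOM      1  N\nHETATM    2  O\nENDMDL\n"
pdb_file = LineFile.read(io.StringIO(pdb_text))
```

A caller who only needs the model count never pays for parsing coordinates, at the cost of holding all lines in memory and scanning them once per request.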

Why shouldn't users be able to write an Atom to a PDB file?

This is a good point in principle. However, I think the use case for this is quite narrow.

The .get_array() method accepts 0 as an index, but PDBFile.get_structure doesn't.

This is indeed a little inconsistent. At the time it was implemented, this type of indexing was chosen to be consistent with the model numbering in PDB and CIF files. However, this inconsistency would not be resolved by a new data model.
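
The two conventions can be bridged with a small conversion. The function below is a hypothetical helper, not part of Biotite; it assumes model numbers start at 1 as in PDB and CIF files, that negative numbers count from the last model, and that 0 is rejected.

```python
def model_to_index(model, model_count):
    """Convert a 1-based model number into a 0-based array index.

    Negative model numbers count from the end (-1 is the last model).
    """
    if model == 0:
        raise ValueError("model numbers start at 1; 0 is not a valid model")
    index = model - 1 if model > 0 else model_count + model
    if not 0 <= index < model_count:
        raise ValueError(
            f"model {model} is out of range for {model_count} models"
        )
    return index
```

With five models, model 1 maps to index 0, and both model 5 and model -1 map to index 4.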

In my opinion, checking the shape or ndim attribute of an ndarray is preferable to calling isinstance. This could be swapped now, mitigating a future transition, if it happened. In other words, I think it would be possible to transition the existing code to decouple the atom data from the implementation.

I think for future code this is already OK, since it makes the same separation as isinstance().

The overhead of three atom collections may not be high in terms of performance, but it does impact code complexity/maintainability. For example, PDBFile.set_structure has a minor bug where atom names aren't properly aligned. To fix that requires editing two identical blocks of code to generate the lines. Moreover, reasonable feature requests are painful to implement.

It is possible to reuse code for both AtomArray and AtomArrayStack as demonstrated in PR #380.


claudejrogers · 2022-03-29T17:54:52Z

Author

Yes, I think you're right. Having worked with the library a bit more over the past few days, working with multiple atom collection classes is not as cumbersome as I initially thought. Plus, since my use cases are small compared to the scope of the library, I'm sure I'm not fully appreciating the utility of some of the design choices.

