Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor clean up of PDBFile.set_structure #380

Merged
merged 2 commits into from
Mar 29, 2022

Conversation

claudejrogers
Copy link
Contributor

Removed duplicated code for writing AtomArray vs AtomArrayStack data,
fixed a minor bug affecting atom name alignment.

Removed duplicated code for writing AtomArray vs AtomArrayStack data,
fixed a minor bug affecting atom name alignment.
@padix-key
Copy link
Member

Thank you for the cleanup. In my opinion the changes increase the readability and maintainability of the structure.io.pdb module and also facilitates implementation of #131.

I also benchmarked the changes and found a significantly decreased performance for writing PDB files. I tested it on the multi-model structure 1GYA.

import timeit
import biotite.structure.io.pdb as pdb


FILE_NAME = "path/to/1gya.pdb"
N = 100


pdb_file = pdb.PDBFile.read(FILE_NAME)

time = timeit.timeit(
    "pdb_file.get_structure()",
    "from __main__ import pdb_file",
    number=N
)
print(f"Reading PDB: {time * 1e3 / N :.2f} ms")


atoms = pdb_file.get_structure()

time = timeit.timeit(
    "pdb_file.set_structure(atoms)",
    "from __main__ import pdb_file, atoms",
    number=N
)
print(f"Writing PDB: {time * 1e3 / N :.2f} ms")

Output prior to change:

Reading PDB: 78.35 ms
Writing PDB: 109.76 ms

Output after change:

Reading PDB: 76.75 ms
Writing PDB: 187.55 ms

Nevertheless, I am in favor of this change, since in my opinion a maintainable code is more important than performance, if the performance penalty is in the demonstrated order of magnitude, especially since fastpdb can be alternatively used, if high performance is required.

I am also in favor of the atom name alignment change. Basically it implements this sentence from the PDB specification:

Alignment of one-letter atom name such as C starts at column 14, while two-letter atom name such as FE starts at column 13.

Finally, you could add yourself to the __author__ attribute of structure/io/pdb/file.py, as I consider this PR quite a large contribution to the module.

@claudejrogers
Copy link
Contributor Author

I think I could recover the performance by converting the non-coordinate atom data to numpy.char.array objects, then adding the first half of the pdb data and second half of the pdb data outside the loop. Then the loop could become:

is_stack = coords.shape[0] > 1
for model_num, coord_i in enumerate(coords, start=1):
    # for an ArrayStack, this is run once
    # only add model lines if is_stack
    if is_stack:
        self.lines.append(f"MODEL     {model_num:4}")
    coordinates = np.char.array(
        [f"   {x:>8.3f}{y:>8.3f}{z:>8.3f}" for (x, y, z) in coord_i]
    )
   self.lines.extend((first_half + coordinates + second_half).tolist())
   if is_stack:
       self.lines.append("ENDMDL")

Using the wisdom of the previous approach, the non-coordinate array data
was concatenated together using np.char.array objects to speed up
set_structure.
@padix-key
Copy link
Member

I ran the benchmark on your recent change and the runtime it improved measurably.

Reading PDB: 74.86 ms
Writing PDB: 130.27 ms

Are you finished with the changes, so I can merge this PR?

@claudejrogers
Copy link
Contributor Author

Well, to be honest I'm not super happy that the code is slower. I made a small example of a cython version here: https://github.com/claudejrogers/bitotite_test.

A truncated set_structure call goes from ~160 ms to ~40 ms on my system. Do you think it's worth it? In my opinion, the code is still readable.

@padix-key
Copy link
Member

I am not sure how safe the raw pointer handling is in your Cython prototype, especially if the input arrays are somehow malformed. Furthermore, I think your pure Python alternative is more clear. Since the performance improvement is still 'only' 3x times the pure Python version, and a safe and fast Rust implementation exists with fastpdb (although it requires installation of an extra package), I rather prefer the PR as it is.

@claudejrogers
Copy link
Contributor Author

Fair enough. Feel free to merge.

Safer alternatives to some of the c string functions exist, e.g., strlcpy and strlcat, but I don't know how portable they are. Since the public interfaces (non cdef functions) take numpy inputs, it may be possible to restrict the inputs to prevent buffer overflows and guarantee all inputs are null terminated, e.g.:

>>> import numpy as np
>>> np.array(["aaaaaaaaaaaaaaaaaaaaaaaa"], dtype="S1")
array([b'a'], dtype="|S1") 

That said, probably not worth the effort for 100 ms.

@padix-key padix-key merged commit 8794704 into biotite-dev:master Mar 29, 2022
@padix-key
Copy link
Member

OK, thank you very much for the PR. Due to the cleaned structure of the module I will probably also work on #131 in the next days.

@claudejrogers
Copy link
Contributor Author

Wasn't #131 implemented already (including before my PR), at least for PDB files? Maybe I'm not understanding the issue correctly, but all the models in an AtomArrayStack will get written to a pdb file currently.

And, not to belabor a dead issue, but I updated my example repo to show that all inputs can be trusted (numpy sanitizes the strings and I check array sizes). I don't think it's possible to cause a buffer overflow or access an incorrect portion of memory with the public functions.

@padix-key
Copy link
Member

Wasn't #131 implemented already (including before my PR), at least for PDB files? Maybe I'm not understanding the issue correctly, but all the models in an AtomArrayStack will get written to a pdb file currently.

Each model in an AtomArrayStack has different coordaintes, but the same atoms in the same order. However, currently it is not possible to give multiple models with different atoms to a single PDBFile, i.e. a list of AtomArray. Issue #130 addresses this shortcoming.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants