-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor clean up of PDBFile.set_structure #380
Conversation
Removed duplicated code for writing AtomArray vs AtomArrayStack data, fixed a minor bug affecting atom name alignment.
Thank you for the cleanup. In my opinion the changes increase the readability and maintainability of the I also benchmarked the changes and found a significantly decreased performance for writing PDB files. I tested it on the multi-model structure 1GYA. import timeit
import biotite.structure.io.pdb as pdb
FILE_NAME = "path/to/1gya.pdb"
N = 100
pdb_file = pdb.PDBFile.read(FILE_NAME)
time = timeit.timeit(
"pdb_file.get_structure()",
"from __main__ import pdb_file",
number=N
)
print(f"Reading PDB: {time * 1e3 / N :.2f} ms")
atoms = pdb_file.get_structure()
time = timeit.timeit(
"pdb_file.set_structure(atoms)",
"from __main__ import pdb_file, atoms",
number=N
)
print(f"Writing PDB: {time * 1e3 / N :.2f} ms") Output prior to change:
Output after change:
Nevertheless, I am in favor of this change, since in my opinion a maintainable code is more important than performance, if the performance penalty is in the demonstrated order of magnitude, especially since fastpdb can be alternatively used, if high performance is required. I am also in favor of the atom name alignment change. Basically it implements this sentence from the PDB specification:
Finally, you could add yourself to the |
I think I could recover the performance by converting the non-coordinate atom data to is_stack = coords.shape[0] > 1
for model_num, coord_i in enumerate(coords, start=1):
# for an ArrayStack, this is run once
# only add model lines if is_stack
if is_stack:
self.lines.append(f"MODEL {model_num:4}")
coordinates = np.char.array(
[f" {x:>8.3f}{y:>8.3f}{z:>8.3f}" for (x, y, z) in coord_i]
)
self.lines.extend((first_half + coordinates + second_half).tolist())
if is_stack:
self.lines.append("ENDMDL") |
Using the wisdom of the previous approach, the non-coordinate array data was concatenated together using np.char.array objects to speed up set_structure.
I ran the benchmark on your recent change and the runtime it improved measurably.
Are you finished with the changes, so I can merge this PR? |
Well, to be honest I'm not super happy that the code is slower. I made a small example of a cython version here: https://github.com/claudejrogers/bitotite_test. A truncated set_structure call goes from ~160 ms to ~40 ms on my system. Do you think it's worth it? In my opinion, the code is still readable. |
I am not sure how safe the raw pointer handling is in your Cython prototype, especially if the input arrays are somehow malformed. Furthermore, I think your pure Python alternative is more clear. Since the performance improvement is still 'only' 3x times the pure Python version, and a safe and fast Rust implementation exists with fastpdb (although it requires installation of an extra package), I rather prefer the PR as it is. |
Fair enough. Feel free to merge. Safer alternatives to some of the c string functions exist, e.g., >>> import numpy as np
>>> np.array(["aaaaaaaaaaaaaaaaaaaaaaaa"], dtype="S1")
array([b'a'], dtype="|S1") That said, probably not worth the effort for 100 ms. |
OK, thank you very much for the PR. Due to the cleaned structure of the module I will probably also work on #131 in the next days. |
Wasn't #131 implemented already (including before my PR), at least for PDB files? Maybe I'm not understanding the issue correctly, but all the models in an AtomArrayStack will get written to a pdb file currently. And, not to belabor a dead issue, but I updated my example repo to show that all inputs can be trusted (numpy sanitizes the strings and I check array sizes). I don't think it's possible to cause a buffer overflow or access an incorrect portion of memory with the public functions. |
Each model in an |
Removed duplicated code for writing AtomArray vs AtomArrayStack data,
fixed a minor bug affecting atom name alignment.