Minor clean up of PDBFile.set_structure #380

claudejrogers · 2022-03-28T00:31:00Z

Removed duplicated code for writing AtomArray vs AtomArrayStack data,
fixed a minor bug affecting atom name alignment.

padix-key · 2022-03-28T16:12:09Z

Thank you for the cleanup. In my opinion the changes increase the readability and maintainability of the structure.io.pdb module and also facilitate the implementation of #131.

I also benchmarked the changes and found that writing PDB files became significantly slower. I tested it on the multi-model structure 1GYA.

import timeit
import biotite.structure.io.pdb as pdb


FILE_NAME = "path/to/1gya.pdb"
N = 100


pdb_file = pdb.PDBFile.read(FILE_NAME)

time = timeit.timeit(
    "pdb_file.get_structure()",
    "from __main__ import pdb_file",
    number=N
)
print(f"Reading PDB: {time * 1e3 / N :.2f} ms")


atoms = pdb_file.get_structure()

time = timeit.timeit(
    "pdb_file.set_structure(atoms)",
    "from __main__ import pdb_file, atoms",
    number=N
)
print(f"Writing PDB: {time * 1e3 / N :.2f} ms")

Output prior to change:

Reading PDB: 78.35 ms
Writing PDB: 109.76 ms

Output after change:

Reading PDB: 76.75 ms
Writing PDB: 187.55 ms

Nevertheless, I am in favor of this change. In my opinion maintainable code is more important than performance when the penalty is of the order of magnitude shown above, especially since fastpdb can be used as an alternative if high performance is required.

I am also in favor of the atom name alignment change. Basically it implements this sentence from the PDB specification:

Alignment of one-letter atom name such as C starts at column 14, while two-letter atom name such as FE starts at column 13.
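As a rough illustration of that rule, here is a minimal sketch; it is not the code from this PR. The helper name align_atom_name is hypothetical, and the sketch assumes the element symbol is known for each atom and that a four-character atom name always fills columns 13-16 regardless of element.

```python
def align_atom_name(atom_name: str, element: str) -> str:
    """Return the 4-character atom name field (PDB columns 13-16).

    Hypothetical helper for illustration only.
    """
    if len(atom_name) > 4:
        raise ValueError(f"Atom name {atom_name!r} exceeds 4 characters")
    # Names of one-letter elements start at column 14, so shift them right
    # by one column, unless the name needs all four columns.
    if len(element) == 1 and len(atom_name) < 4:
        atom_name = " " + atom_name
    # Pad on the right so the field is always 4 characters wide.
    return atom_name.ljust(4)


carbon_alpha = align_atom_name("CA", "C")   # C-alpha, element carbon
calcium = align_atom_name("CA", "CA")       # calcium ion
iron = align_atom_name("FE", "FE")
```

With this convention the C-alpha atom and a calcium ion, which share the name CA, end up in different columns and stay distinguishable in the written file.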

Finally, you could add yourself to the __author__ attribute of structure/io/pdb/file.py, as I consider this PR quite a large contribution to the module.

claudejrogers · 2022-03-28T19:55:10Z

I think I could recover the performance by converting the non-coordinate atom data to numpy.char.array objects, then adding the first half of the pdb data and second half of the pdb data outside the loop. Then the loop could become:

is_stack = coords.shape[0] > 1
for model_num, coord_i in enumerate(coords, start=1):
    # for a single AtomArray, this loop runs once
    # only add model lines if is_stack
    if is_stack:
        self.lines.append(f"MODEL     {model_num:4}")
    coordinates = np.char.array(
        [f"   {x:>8.3f}{y:>8.3f}{z:>8.3f}" for (x, y, z) in coord_i]
    )
    self.lines.extend((first_half + coordinates + second_half).tolist())
    if is_stack:
        self.lines.append("ENDMDL")

Using the wisdom of the previous approach, the non-coordinate array data was concatenated together using np.char.array objects to speed up set_structure.
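For readers who want to try the idea outside of PDBFile, the following is a self-contained sketch under stated assumptions, not the merged implementation. The function name format_models and the toy input data are hypothetical. It also swaps np.char.array for np.char.add on plain string arrays, because chararray strips trailing whitespace when elements are indexed, which is easy to trip over with fixed-width records.

```python
import numpy as np


def format_models(first_half, coords, second_half):
    """Build PDB-style lines from precomputed per-atom string halves.

    first_half, second_half: sequences of str, one entry per atom
    coords: float array of shape (n_models, n_atoms, 3)
    (Hypothetical helper; names are not part of Biotite.)
    """
    first_half = np.asarray(first_half)
    second_half = np.asarray(second_half)
    lines = []
    is_stack = coords.shape[0] > 1
    for model_num, coord_i in enumerate(coords, start=1):
        # MODEL/ENDMDL records are only written for multi-model input
        if is_stack:
            lines.append(f"MODEL     {model_num:4}")
        # Only the coordinates differ between models, so only they are
        # formatted inside the loop
        coordinates = np.array(
            [f"   {x:>8.3f}{y:>8.3f}{z:>8.3f}" for x, y, z in coord_i]
        )
        atom_lines = np.char.add(
            np.char.add(first_half, coordinates), second_half
        )
        lines.extend(atom_lines.tolist())
        if is_stack:
            lines.append("ENDMDL")
    return lines


# Toy data: 2 atoms; the halves are padded to 27 and 26 characters
first = [f"{'ATOM      1  N   MET A   1':<27}",
         f"{'ATOM      2  CA  MET A   1':<27}"]
second = [f"{'  1.00  0.00           N':<26}",
          f"{'  1.00  0.00           C':<26}"]
two_models = np.arange(12, dtype=float).reshape(2, 2, 3)

stack_lines = format_models(first, two_models, second)
single_lines = format_models(first, two_models[:1], second)
```

Each atom line comes out 80 characters wide (27 + 27 + 26), and the single-model call produces no MODEL or ENDMDL records.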

padix-key · 2022-03-29T16:35:12Z

I ran the benchmark on your recent change and the runtime improved measurably.

Reading PDB: 74.86 ms
Writing PDB: 130.27 ms

Are you finished with the changes, so I can merge this PR?

claudejrogers · 2022-03-29T17:35:19Z

Well, to be honest I'm not super happy that the code is slower. I made a small example of a Cython version here: https://github.com/claudejrogers/bitotite_test.

A truncated set_structure call goes from ~160 ms to ~40 ms on my system. Do you think it's worth it? In my opinion, the code is still readable.

padix-key · 2022-03-29T19:16:29Z

I am not sure how safe the raw pointer handling is in your Cython prototype, especially if the input arrays are somehow malformed. Furthermore, I think your pure Python alternative is clearer. Since the performance improvement is still 'only' 3x over the pure Python version, and a safe and fast Rust implementation exists in fastpdb (although it requires installing an extra package), I prefer the PR as it is.

claudejrogers · 2022-03-29T20:04:24Z

Fair enough. Feel free to merge.

Safer alternatives to some of the C string functions exist, e.g., strlcpy and strlcat, but I don't know how portable they are. Since the public interfaces (non-cdef functions) take numpy inputs, it may be possible to restrict the inputs to prevent buffer overflows and guarantee all inputs are null terminated, e.g.:

>>> import numpy as np
>>> np.array(["aaaaaaaaaaaaaaaaaaaaaaaa"], dtype="S1")
array([b'a'], dtype='|S1')

That said, probably not worth the effort for 100 ms.

padix-key · 2022-03-29T20:44:24Z

OK, thank you very much for the PR. Thanks to the cleaner structure of the module I will probably also work on #131 in the coming days.

claudejrogers · 2022-03-30T16:14:50Z

Wasn't #131 implemented already (including before my PR), at least for PDB files? Maybe I'm not understanding the issue correctly, but all the models in an AtomArrayStack will get written to a pdb file currently.

And, not to belabor a dead issue, but I updated my example repo to show that all inputs can be trusted (numpy sanitizes the strings and I check array sizes). I don't think it's possible to cause a buffer overflow or access an incorrect portion of memory with the public functions.

padix-key · 2022-03-30T16:31:40Z

Wasn't #131 implemented already (including before my PR), at least for PDB files? Maybe I'm not understanding the issue correctly, but all the models in an AtomArrayStack will get written to a pdb file currently.

Each model in an AtomArrayStack has different coordinates, but the same atoms in the same order. However, it is currently not possible to give multiple models with different atoms, i.e. a list of AtomArray objects, to a single PDBFile. Issue #131 addresses this shortcoming.
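The constraint can be pictured with plain NumPy arrays. This is only an analogy for the coordinate layout described above, not Biotite's AtomArrayStack class, and the variable names are made up for the illustration.

```python
import numpy as np

# Two models of the same 3 atoms: their coordinates fit into one
# array of shape (n_models, n_atoms, 3)
model_a = np.zeros((3, 3))
model_b = np.ones((3, 3))
stack_coord = np.stack([model_a, model_b])

# A model with a different number of atoms cannot join that array,
# so such models would have to be kept as a list of separate arrays
model_c = np.zeros((4, 3))
try:
    np.stack([model_a, model_c])
    stackable = True
except ValueError:
    stackable = False
```

Here stack_coord has shape (2, 3, 3), while stacking model_a with model_c fails because their shapes differ.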

Minor clean up of PDBFile.set_structure

b1ab87d

Removed duplicated code for writing AtomArray vs AtomArrayStack data, fixed a minor bug affecting atom name alignment.

Precalculate non-coordinate sections of PDB lines

dd82791

Using the wisdom of the previous approach, the non-coordinate array data was concatenated together using np.char.array objects to speed up set_structure.

padix-key merged commit 8794704 into biotite-dev:master Mar 29, 2022
