Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor _read_into_buffer #47

Merged
merged 12 commits into from
Feb 18, 2022
Merged

Conversation

rhpvorderman
Copy link
Collaborator

When making #39 _read_into_buffer was begging for a refactoring, but that would mess up the diff immensely.

So here is my refactor that I did over the weekend. I realize that I have been bombarding this project with PRs, but the good news is: this is the last one. Everything I have to add is now a PR or an issue.

This refactor:

  • self.c_buf -> self.buffer . self.buf and self.buf_view are now gone. There is now only a C buffer. The original code had three representations of the same buffer and I thought that was a bit confusing.
  • self.bufend -> self.bytes_in_buffer. A more clear description of the variable meaning.
  • PyMem_Realloc instead of creating a new bytesarray object when the buffer is too small. This uses less code and is easier to understand.
  • memmove instead of bytearray semantics. This is equally understandable.
  • self.record_start is now set to 0 just below the memmove part. This makes the code easier to follow. Previously it was all the way down at the end of the function because its value was used by the EOF checks.
  • Clearer EOF checks. The common non-error case is now at the top. The checks have more self-explanatory code.

The one disadvantage is that readinto cannot be used for C buffers, so this uses read and memcpy. This does not matter for speed because io.BufferedReader and gzip.GzipFile use the same implementation of readinto (and these cover >99% of our input use cases). It is a bit more verbose though than readinto, but quite understandable. The advantage is that we read a filechunk and that we can use filechunk_size to see if we reached EOF. That is a very idiomatic python thing to do.

There are no speed advantages or disadvantages to this PR. It does the same thing semantically as the old code. There is however a size advantage:

x before after
_core.c size 852K 488K
_core.c lines 22555 12530
_core.*.so 1672K 852K
dnaio wheel 516K 272K

This is because cython does not generate all sorts of generic memoryview and bytearray method code.

self.record_start = 0
self.file = file
if buffer_size < 1:
raise ValueError("Starting buffer size too small")

def __dealloc__(self):
if self.buffer != NULL:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, you don’t need to explicitly check for a null pointer because PyMem_Free will just not do anything if it gets passed one.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

Copy link
Owner

@marcelm marcelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven’t gone through everything in detail, but I like the individual improvements. I’ll trust the tests to have caught any problems this time. I added a single comment, but that’s essentially just a FYI and you don’t need to change it if you prefer. Please merge this one yourself!

@rhpvorderman
Copy link
Collaborator Author

I haven’t gone through everything in detail, but I like the individual improvements. I’ll trust the tests to have caught any problems this time.

Your test suite is pretty thorough, which is why I dared to refactor like this in the first place. Thanks for the review!

@rhpvorderman rhpvorderman merged commit faba811 into marcelm:main Feb 18, 2022
@rhpvorderman rhpvorderman deleted the refactorbuffer branch February 18, 2022 11:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants