Refactor _read_into_buffer #47

rhpvorderman · 2022-02-15T12:24:20Z

While working on #39, _read_into_buffer was begging for a refactoring, but doing it there would have messed up the diff immensely.

So here is my refactor that I did over the weekend. I realize that I have been bombarding this project with PRs, but the good news is: this is the last one. Everything I have to add is now a PR or an issue.

This refactor:

self.c_buf -> self.buffer. self.buf and self.buf_view are now gone, so there is only a C buffer. The original code had three representations of the same buffer, which I found a bit confusing.
self.bufend -> self.bytes_in_buffer. A clearer description of what the variable means.
PyMem_Realloc instead of creating a new bytearray object when the buffer is too small. This uses less code and is easier to understand.
memmove instead of bytearray semantics. This is equally understandable.
self.record_start is now set to 0 just below the memmove part. This makes the code easier to follow. Previously it was all the way down at the end of the function because its value was used by the EOF checks.
Clearer EOF checks. The common non-error case is now at the top, and the checks themselves are more self-explanatory.
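
As a rough illustration of the buffer bookkeeping listed above, here is a minimal sketch in plain Python that uses ctypes to stand in for the C buffer. It is not the code from _core.pyx: the class CBufferSketch and its methods are hypothetical, and allocating a larger buffer plus a copy is only an approximation of what PyMem_Realloc does.

```python
import ctypes


class CBufferSketch:
    """Hypothetical model of a single C buffer with explicit bookkeeping."""

    def __init__(self, buffer_size):
        if buffer_size < 1:
            raise ValueError("Starting buffer size too small")
        self.buffer_size = buffer_size
        self.buffer = ctypes.create_string_buffer(buffer_size)
        self.bytes_in_buffer = 0  # number of valid bytes in the buffer
        self.record_start = 0     # offset of the first unparsed record

    def make_room(self):
        """Prepare the buffer for the next read; return the free byte count."""
        leftover = self.bytes_in_buffer - self.record_start
        if self.record_start == 0 and leftover == self.buffer_size:
            # One incomplete record fills the whole buffer: double its size.
            # Assumption: a new allocation plus a copy stands in for
            # PyMem_Realloc.
            new_size = self.buffer_size * 2
            bigger = ctypes.create_string_buffer(new_size)
            ctypes.memmove(bigger, self.buffer, leftover)
            self.buffer = bigger
            self.buffer_size = new_size
        elif self.record_start > 0:
            # Move the incomplete record to the front of the buffer.
            source = ctypes.addressof(self.buffer) + self.record_start
            ctypes.memmove(self.buffer, source, leftover)
        # Reset right after the move, as described in the list above.
        self.record_start = 0
        self.bytes_in_buffer = leftover
        return self.buffer_size - leftover

    def append(self, data):
        """Copy data behind the valid bytes (the memcpy step)."""
        if len(data) > self.buffer_size - self.bytes_in_buffer:
            raise ValueError("data does not fit in the buffer")
        destination = ctypes.addressof(self.buffer) + self.bytes_in_buffer
        ctypes.memmove(destination, data, len(data))
        self.bytes_in_buffer += len(data)

    def contents(self):
        """Return the valid bytes as a bytes object."""
        return self.buffer.raw[: self.bytes_in_buffer]
```

In this sketch, a caller would parse records, advance record_start past the last complete record, call make_room(), and then append() the next chunk read from the file.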

The one disadvantage is that readinto cannot be used with a C buffer, so this uses read and memcpy. This does not matter for speed because io.BufferedReader and gzip.GzipFile use the same implementation of readinto (and these cover >99% of our input use cases). It is a bit more verbose than readinto, but still quite understandable. The advantage is that we read a filechunk and can use filechunk_size to see whether we have reached EOF. That is a very idiomatic Python thing to do.
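
The read-then-copy approach can be illustrated with a small sketch. The helper name read_chunk is hypothetical, and a bytearray stands in for the C buffer purely so the example runs as plain Python. The point is that the size of the chunk returned by read tells us whether EOF was reached, and that an oversized chunk from a misbehaving file object can be rejected before anything is copied.

```python
import io


def read_chunk(file, buffer, bytes_in_buffer):
    """Read into the free tail of buffer; return (new fill level, at_eof).

    Hypothetical helper: buffer is a bytearray standing in for the C buffer.
    """
    empty_bytes = len(buffer) - bytes_in_buffer
    if empty_bytes < 1:
        # read(0) always returns b"", which would look like EOF.
        raise ValueError("buffer has no free space")
    filechunk = file.read(empty_bytes)
    filechunk_size = len(filechunk)
    if filechunk_size > empty_bytes:
        # Guard against read() implementations that return too much data.
        raise ValueError("read() returned more data than requested")
    # Stand-in for memcpy(buffer + bytes_in_buffer, filechunk, filechunk_size).
    buffer[bytes_in_buffer:bytes_in_buffer + filechunk_size] = filechunk
    # An empty chunk means the end of the file was reached.
    return bytes_in_buffer + filechunk_size, filechunk_size == 0


source = io.BytesIO(b"ACGTACGTAC")
buffer = bytearray(8)
fill, at_eof = read_chunk(source, buffer, 0)
```

Only an empty chunk is treated as EOF here; a short but non-empty chunk simply leaves free space for the next call.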

There are no speed advantages or disadvantages to this PR. It does the same thing semantically as the old code. There is however a size advantage:

Metric	before	after
_core.c size	852K	488K
_core.c lines	22555	12530
_core.*.so	1672K	852K
dnaio wheel	516K	272K

This is because Cython no longer generates all sorts of generic memoryview and bytearray method code.

marcelm · 2022-02-18T09:29:51Z

src/dnaio/_core.pyx

        self.record_start = 0
        self.file = file
        if buffer_size < 1:
            raise ValueError("Starting buffer size too small")

+    def __dealloc__(self):
+        if self.buffer != NULL:


Strictly speaking, you don’t need to explicitly check for a null pointer because PyMem_Free will just not do anything if it gets passed one.

Fixed. Thanks!
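
The behaviour mentioned in this review comment can be checked from plain Python. This is only a sketch and assumes CPython, where ctypes.pythonapi exposes the C API; the variable name pointer is illustrative.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
PyMem_Malloc = ctypes.pythonapi.PyMem_Malloc
PyMem_Malloc.argtypes = [ctypes.c_size_t]
PyMem_Malloc.restype = ctypes.c_void_p

PyMem_Free = ctypes.pythonapi.PyMem_Free
PyMem_Free.argtypes = [ctypes.c_void_p]
PyMem_Free.restype = None

pointer = PyMem_Malloc(16)  # allocate a small block
PyMem_Free(pointer)         # the normal case: free an allocated block
PyMem_Free(None)            # None is passed as NULL; no operation is performed
```

Because freeing NULL is harmless, a __dealloc__ method can call PyMem_Free on the buffer pointer unconditionally.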

marcelm

I haven’t gone through everything in detail, but I like the individual improvements. I’ll trust the tests to have caught any problems this time. I added a single comment, but that’s essentially just an FYI and you don’t need to change anything if you prefer not to. Please merge this one yourself!

rhpvorderman · 2022-02-18T11:25:15Z

I haven’t gone through everything in detail, but I like the individual improvements. I’ll trust the tests to have caught any problems this time.

Your test suite is pretty thorough, which is why I dared to refactor like this in the first place. Thanks for the review!

rhpvorderman added 11 commits February 13, 2022 06:21

Rename c_buf to buffer

78378df

Rename self.bufend -> self.bytes_in_buffer

93cda92

Refactor _read_into_buffer to use memory directly

a8b6c0c

Refactor comment

2dadab2

Free buffer when FastqIter object is garbage collected

520ee09

Slightly more clear code

7aa1a7b

Typo

cea5641

remove unused variables

55cd723

Better comment

52782e3

Guard against corrupt read implementations

2f95d2f

Merge branch 'marcelm:main' into refactorbuffer

bf95b04

marcelm mentioned this pull request Feb 18, 2022

Reduce size of wheels by stripping debug information pycompression/python-isal#108

Merged

2 tasks

marcelm reviewed Feb 18, 2022


marcelm approved these changes Feb 18, 2022


Do not check for NULL pointer on freeing

c9d2c57

rhpvorderman merged commit faba811 into marcelm:main Feb 18, 2022

rhpvorderman deleted the refactorbuffer branch February 18, 2022 11:25
