Adding support for streaming through the input arrays of a pipeline #182
Conversation
This seems nice. I'm thinking about this vs. the existing PDAL/Python streaming interface (executeStream). It's been a long time since I've looked at this code, but I think that what you've done will limit memory consumption in Python by filling output arrays (going to PDAL) in pieces, but it's still running in PDAL standard mode because it runs within the context of "execute" instead of "executeStream", which means that all the data will get loaded into PDAL memory. I can see a benefit to this for execution of some PDAL pipeline that will only run in standard mode, but in the example (LAS -> LAS), this isn't the case. Anyway, interested in your thoughts. Perhaps there's a way to support this in PDAL stream mode pretty easily since it's already handling Numpy arrays one at a time, but now they all need to be available at the start of execution.
In the example snippet I provided, both pipeline executions will use the streaming interface you mentioned (executeStream) under the hood. So the data loaded in memory can already be capped based on the value chosen for each chunk size, for each numpy array included as input of the Pipeline. My proposal extends this logic to support alternatively passing a pair instead of a plain array.
Before settling on this solution I tried to find a way that wouldn't require changing the existing Pipeline interface, but I didn't end up finding a better one. Happy to hear your thoughts and alternative approaches.
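To make the intent concrete, here is a minimal sketch of the kind of chunked input I have in mind (names and shapes are illustrative, not the actual pdal-python API): a generator that produces structured numpy arrays one chunk at a time, so only one chunk is ever resident in memory.

```python
import numpy as np

def chunked_points(n_total, chunk_size):
    """Yield structured point arrays chunk by chunk so that at most
    `chunk_size` points are held in memory at any moment."""
    dtype = np.dtype([("X", np.float64), ("Y", np.float64), ("Z", np.float64)])
    for start in range(0, n_total, chunk_size):
        n = min(chunk_size, n_total - start)
        chunk = np.zeros(n, dtype=dtype)
        # Fill X with a running index just so each chunk has distinct data.
        chunk["X"] = np.arange(start, start + n, dtype=np.float64)
        yield chunk

# Each yielded chunk has at most `chunk_size` rows; together they cover all points.
sizes = [len(c) for c in chunked_points(10, 4)]
```

Such a generator (paired with a chunk-size handler on the output side) is what the proposed interface would consume in place of a fully materialized array.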
I haven't forgotten this. I need time to look again. Thanks for the above note.
Seeing this PR reminded me that I've had in mind to allow the caller to control which dimensions pipelines should use, to control memory usage in situations where you don't want everything. It is available in #184. Having this might have gotten you through without having to do all this work (which is very much appreciated!). When I merge your PR into #184, I can't get all the tests to pass, however. I'm not sure why, but I haven't spent much time investigating yet.
Force-pushed b379d7b to aa3f7b6.
Feature #184 seems a useful addition. I'm afraid it wouldn't have gotten me through, though, because my input data, read from a 3rd-party lib, couldn't fit in memory :/
I've just rebased my branch to try it locally and all tests seem to be passing. Happy to assist with any necessary changes.
I think this is good. You've written doc/examples, but I'm not sure it's in the actual documentation.
I think almost all my comments are opinion that you can ignore. I prefer to have explicit ownership (avoid shared_ptr) unless necessary because it makes me wonder how things are being shared. And I'm not a huge std::optional fan. But you can disagree and do what you think is best. :)
throw pdal_error("Unable to create numpy iterator.");
std::optional<int> stream_chunk_size = std::nullopt;
if (m_stream_handler) {
    stream_chunk_size = (*m_stream_handler)();
The handler returns uint64_t, but assigns to int.
Is there any reason for optional? Seems like initializing to 0 would work fine.
I've changed the type to int64_t instead of uint64_t, just for the sake of presenting a clearer error message to the Python developer in case they mistakenly return a negative number. It will be this:
RuntimeError: Stream chunk size not in the range of array length: -1
Instead of this:
RuntimeError: Unable to cast Python instance of type <class 'int'> to C++ type '?' (#define PYBIND11_DETAILED_ERROR_MESSAGES or compile in debug mode for details)
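In Python terms, the effect of accepting a signed value and validating it can be sketched like this (a hypothetical helper mirroring the range check, not the actual binding code): a negative return from the handler produces a precise error instead of a generic pybind11 cast failure.

```python
def next_stream_chunk_size(handler, array_length):
    """Ask the stream handler for the next chunk size and validate it,
    mirroring the 0 <= size <= length check on the C++ side."""
    size = int(handler())
    if not (0 <= size <= array_length):
        raise RuntimeError(
            f"Stream chunk size not in the range of array length: {size}")
    return size

# A well-behaved handler passes the check unchanged.
ok = next_stream_chunk_size(lambda: 8, 10)
```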
src/pdal/PyArray.cpp (outdated)
m_stride = *NpyIter_GetInnerStrideArray(m_iter);
m_size = *NpyIter_GetInnerLoopSizePtr(m_iter);
if (stream_chunk_size) {
    if (0 <= *stream_chunk_size && *stream_chunk_size <= m_size) {
Seems you could use 0 as a sentinel since a chunk size of 0 seems pointless.
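In Python terms the suggestion reads roughly like this (illustrative only, not the binding code): since a chunk of zero points is never useful, 0 can double as "no chunking requested" and the optional wrapper disappears.

```python
def effective_chunk(stream_chunk_size, loop_size):
    """Resolve the chunk size for one inner loop pass.
    A chunk size of 0 would be pointless, so it serves as the sentinel
    for "no chunking": fall back to the full inner loop size."""
    if stream_chunk_size == 0:
        return loop_size
    return min(stream_chunk_size, loop_size)

full = effective_chunk(0, 100)    # sentinel: consume the whole inner loop
part = effective_chunk(32, 100)   # explicit chunking caps the pass at 32
```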
src/pdal/PyArray.cpp (outdated)
char *itererr;
m_iterNext = NpyIter_GetIterNext(m_iter, &itererr);
if (!m_iterNext)
{
    NpyIter_Deallocate(m_iter);
    throw pdal_error(std::string("Unable to create numpy iterator: ") + itererr);
This seems like a misleading error message.
src/pdal/PyArray.cpp (outdated)
    NpyIter_Deallocate(m_iter);
    throw pdal_error("Unable to reset numpy iterator.");
    }
}
Set m_iter to nullptr here?
Good catch. I've fixed it for the error case of m_iterNext below as well.
src/pdal/PyArray.cpp (outdated)
- ArrayIter::ArrayIter(PyArrayObject* np_array)
+ void ArrayIter::resetIterator(std::optional<PyArrayObject*> np_array = {})
It seems like there are really two separate cases here: an initialize case and a reset case, variously triggered by the optional argument. Perhaps it would be clearer to have two functions?
Hmm. This didn't come out as nicely as I was thinking, so I may regret my comment ;)
I've addressed all your comments. Let me know what you think.
Thanks.
My only remaining question is about documentation. Thoughts on this, @hobu?
I forgot to comment on this, but I'm happy to write a quick and simple example of how to use this new functionality. The example would show how I used it: reading a large E57 file with pye57 in stream mode and then writing a LAZ file, also in stream mode, using PDAL. All of that with very low memory consumption.
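The overall pattern can be sketched without the library specifics (pye57 and PDAL calls elided, since the exact API is better left to the real example): read a chunk from the source, hand it to the streamed pipeline, repeat, so peak memory stays at roughly one chunk.

```python
import numpy as np

def stream_copy(read_chunk, write_chunk, chunk_size):
    """Pump points from a chunked reader to a chunked writer.
    Only one chunk is alive at any moment; returns total points moved."""
    total = 0
    while True:
        chunk = read_chunk(chunk_size)
        if chunk is None or len(chunk) == 0:
            break  # source exhausted
        write_chunk(chunk)
        total += len(chunk)
    return total

# Simulated source of 10 points, read 4 at a time, written to a list.
data = np.arange(10.0)
state = {"pos": 0}

def read_chunk(n):
    chunk = data[state["pos"]:state["pos"] + n]
    state["pos"] += len(chunk)
    return chunk

out = []
total = stream_copy(read_chunk, out.append, 4)
```

In the real example, `read_chunk` would wrap the pye57 scan reader and `write_chunk` would feed the streamed PDAL pipeline writing LAZ.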
We don't really have proper docs. https://github.com/PDAL/python/blob/main/README.rst has been our documentation, such as it is, so I suppose a section there with an example showing how to use this would be sufficient for now.
Thanks for the docs! If you have anything else to do, open a new PR.
I've faced a couple of cases where I wasn't able to fit all my input data in memory in order to then run a (streamed) PDAL pipeline on it. With these changes, you'd be able to stream the input data into the Pipeline, which has helped me in those cases.
See below a very simple example of usage, just for illustrative purposes: streaming the read and write of a very large LAZ file, with support for tweaking both input and output chunk sizes.