arrow/parquet : limit the parquet-reader memory usage #163
include/s3select_parquet_intrf.h
@@ -747,6 +747,15 @@ class PARQUET_EXPORT RowGroupReader {
std::unique_ptr<Contents> contents_;
};

//TODO external setting? RGW options ??
#define RGW_buffer_size 1024*1024*16
Can you pass it as a parameter to the function that initializes the parquet reader?
It should probably be an RGW conf parameter passed in from the outside.
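A minimal sketch of what this suggestion could look like, assuming a small config struct handed to the reader's init function. All names here (`parquet_reader_config`, `init_parquet_reader`, `default_parquet_buffer_size`) are hypothetical and are not the s3select or RGW API. A typed constant also avoids a pitfall of the unparenthesized macro: an expression such as `x % RGW_buffer_size` would expand to `x % 1024*1024*16`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; this is not the s3select API.
// Typed default (16 MiB) instead of an unparenthesized #define.
constexpr int64_t default_parquet_buffer_size = int64_t{16} * 1024 * 1024;

// Settings the caller (for example RGW) would pass in from its own configuration.
struct parquet_reader_config {
    int64_t read_buffer_size = default_parquet_buffer_size;
};

// Stand-in for the function that initializes the parquet reader.
// It only validates the setting and returns the buffer size the reader would use.
inline int64_t init_parquet_reader(const parquet_reader_config& cfg) {
    assert(cfg.read_buffer_size > 0);
    return cfg.read_buffer_size;
}
```

With this shape the caller decides the value, and the library only keeps a default for the standalone case.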
@yuvalif
Sure, that is the plan (see the TODO comment above).
I measured whether it resolved the issue of huge memory consumption.
OK, but we should probably merge only after the change that takes the size as a parameter, so that we can integrate it into the RGW code.
Yes, the interface (RGW option) is included in this PR.
With the standalone application, the impact of this change is visible (memory consumption, RSS).
It is also possible to measure other aspects of these changes, such as the number of calls to the storage system.
This parameter probably has more impact when using S3 storage; we should strive for a default size that balances RSS and throughput.
(Users should not "play" with this parameter too much, as this may cause other issues.)
How can an RGW option be included in an s3-select PR?
I thought that this PR would expose an interface where this value can be set, and that the matching RGW PR (ceph/ceph#59465) would add a new option in src/common/options/rgw.yaml.in
and pass it to that API.
Yes.
I meant the referenced PR, ceph/ceph#59465.
What would be the behavior when the max buffer size is exceeded?
The current behavior is that there is no limitation on the buffer size. With the current fix (the buffer size limitation), it loads part after part according to the buffer size. The arrow library does not raise an exception.
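The part-after-part behavior described above can be modeled roughly as follows. This is a simplified sketch, not Arrow's implementation: `read_stats` and `read_column_chunk` are hypothetical names, and a non-positive `buffer_size` is used here only to stand for the old, unlimited behavior. It shows the trade-off raised in this thread: a smaller buffer lowers peak memory but increases the number of reads issued to storage.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of buffered reading; not Arrow's implementation.
struct read_stats {
    int64_t calls = 0;        // number of reads issued to the storage layer
    int64_t peak_buffer = 0;  // largest single buffer held in memory
};

// Simulates reading one column chunk of chunk_size bytes.
// buffer_size <= 0 models the old behavior: the whole chunk is read at once.
inline read_stats read_column_chunk(int64_t chunk_size, int64_t buffer_size) {
    read_stats stats;
    if (buffer_size <= 0) {
        if (chunk_size > 0) {
            stats.calls = 1;
            stats.peak_buffer = chunk_size;
        }
        return stats;
    }
    // Bounded case: load part after part, each no larger than buffer_size.
    for (int64_t offset = 0; offset < chunk_size; offset += buffer_size) {
        const int64_t part = std::min(buffer_size, chunk_size - offset);
        ++stats.calls;
        stats.peak_buffer = std::max(stats.peak_buffer, part);
    }
    return stats;
}
```

For a 100-byte chunk, the unlimited case issues one read and holds 100 bytes, while a 16-byte buffer issues seven reads and never holds more than 16 bytes.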
@galsalomon66 is 7ef7e67 related to the change?
the |
While testing the arrow/parquet change (buffer size limitation), the standalone application has no impact on RGW. I refactored the code referring to the buffer response size.
it is useful for very big row-groups (huge number of values per column).
refactor of the send-back-result-response.
add external configuration per parquet read-buffer.
fix for result printing.
remove debug.
Signed-off-by: Gal Salomon <[email protected]>
7ef7e67 to 4b126a7
Using reader-properties to set the buffer size when reading column chunks.
This is useful for very big row-groups (a huge number of values per column).
https://bugzilla.redhat.com/show_bug.cgi?id=2252403