arrow/parquet : limit the parquet-reader memory usage #163

galsalomon66 · 2024-08-27T13:53:40Z

Use reader properties to set the buffer size when reading column chunks.
This is useful for very big row-groups (a huge number of values per column).

https://bugzilla.redhat.com/show_bug.cgi?id=2252403

yuvalif · 2024-08-27T17:59:46Z

include/s3select_parquet_intrf.h

@@ -747,6 +747,15 @@ class PARQUET_EXPORT RowGroupReader {
  std::unique_ptr<Contents> contents_;
 };

+//TODO external setting? RGW options ??
+#define RGW_buffer_size 1024*1024*16


Can you pass it as a parameter to the function that initializes the parquet reader?
It should probably be an RGW conf parameter passed in from the outside.
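
A minimal sketch of what this review comment asks for, not the code from the PR. The names `parquet_reader_config`, `make_reader_config` and `DEFAULT_PARQUET_BUFFER_SIZE` are hypothetical. The buffer size becomes an argument supplied by the caller (for example, from an RGW option) instead of a compile-time macro. The sketch also shows why an unparenthesized macro body such as `1024*1024*16` is fragile once it appears inside a larger expression.

```cpp
#include <cstdint>
#include <stdexcept>

// As written in the diff: the expansion is not parenthesized.
#define RGW_buffer_size 1024*1024*16

// Hypothetical typed alternative (the name is an assumption, not from the PR).
constexpr int64_t DEFAULT_PARQUET_BUFFER_SIZE = int64_t{16} * 1024 * 1024;

// Hypothetical settings object handed to whatever initializes the parquet reader.
struct parquet_reader_config {
  int64_t buffer_size;
};

// The caller (e.g. RGW) passes the size in; the library only validates it.
inline parquet_reader_config make_reader_config(
    int64_t buffer_size = DEFAULT_PARQUET_BUFFER_SIZE) {
  if (buffer_size <= 0) {
    throw std::invalid_argument("buffer_size must be positive");
  }
  return parquet_reader_config{buffer_size};
}

// Macro pitfall: `total / RGW_buffer_size` expands to
// `total / 1024*1024*16`, i.e. ((total / 1024) * 1024) * 16.
inline int64_t chunks_with_macro(int64_t total) { return total / RGW_buffer_size; }

// The typed constant divides as intended.
inline int64_t chunks_with_constant(int64_t total) {
  return total / DEFAULT_PARQUET_BUFFER_SIZE;
}
```

With a 1 GiB input, `chunks_with_constant` yields 64, while `chunks_with_macro` yields 16 times the input. A parameter or typed constant avoids this class of error and lets the embedding application choose the value.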

galsalomon66 · 2024-08-28

@yuvalif
Sure, that is the plan (see the TODO comment above).

I measured whether it resolves the issue of huge memory consumption.

yuvalif · 2024-08-28

OK, but we should probably merge only after the change that takes the size as a parameter, so we can integrate it into the RGW code.

galsalomon66 · 2024-08-29

Yes, the interface (RGW option) is included in this PR.

With the standalone application, the impact of this change is visible (memory consumption, RSS).
It is also possible to measure other aspects of these changes, such as the number of calls to the storage system.

This parameter probably has more impact when using S3 storage; we should strive for a default size that balances RSS and throughput.
(Users should not "play" with this parameter too much; that may cause other issues.)
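
To make the RSS/throughput trade-off concrete, here is a small arithmetic sketch. The helper `reads_needed` and the 1 GiB chunk size are illustrative assumptions, not measurements from this PR, and the real number of storage calls depends on how the Arrow reader and the storage layer issue requests.

```cpp
#include <cstdint>

// Number of bounded reads needed to cover `chunk_bytes` with a buffer of
// `buffer_bytes` (ceiling division). Both arguments must be positive.
constexpr int64_t reads_needed(int64_t chunk_bytes, int64_t buffer_bytes) {
  return (chunk_bytes + buffer_bytes - 1) / buffer_bytes;
}

constexpr int64_t MiB = int64_t{1} << 20;
constexpr int64_t GiB = int64_t{1} << 30;

// Illustrative 1 GiB column chunk (a made-up size).
static_assert(reads_needed(1 * GiB, 1 * MiB) == 1024, "small buffer: low RSS, many reads");
static_assert(reads_needed(1 * GiB, 16 * MiB) == 64, "the 16 MiB value from the diff");
static_assert(reads_needed(1 * GiB, 256 * MiB) == 4, "large buffer: few reads, high RSS");
```

A smaller buffer lowers peak memory but multiplies the number of reads, which is why a fairly high default seems reasonable when each read is a remote call.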

yuvalif · 2024-08-29

How can an RGW option be included in an s3-select PR?
I thought that this PR would expose an interface where this value could be set, and that the matching RGW PR (ceph/ceph#59465) would add a new option in src/common/options/rgw.yaml.in and pass it to that API.

galsalomon66 · 2024-08-29

Yes, I meant the referenced PR, ceph/ceph#59465.

yuvalif · 2024-08-27T18:00:28Z

what would be the behavior when the max buffer size is exceeded?

galsalomon66 · 2024-08-28T08:11:02Z

> what would be the behavior when the max buffer size is exceeded?

In the current behavior (before this fix) there is no limit on the buffer size.
The issue was discovered while processing a large parquet object with only a few row-groups (that is, a badly structured parquet object).
It caused an OOM on a low-end server; see the attached BZ.

With the current fix (the buffer size limit), the reader loads the column chunk part after part, according to the buffer size.
This will impact throughput.
Thus, the default setting should be quite high.

The Arrow library does not raise an exception.
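
A self-contained sketch of the "part after part" behavior described above. It does not call Arrow; the PR presumably relies on Arrow's `parquet::ReaderProperties` (`enable_buffered_stream()`, `set_buffer_size()`), which is an assumption here. The types `fake_column_chunk` and `scan_stats` and the function `scan_with_limit` are hypothetical stand-ins. The point is that a column chunk larger than the buffer is not an error: it is consumed through one reusable buffer, so peak memory stays at the configured size.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for a column chunk held in storage; counts how often it is read.
struct fake_column_chunk {
  std::vector<uint8_t> bytes;
  int64_t read_calls = 0;

  int64_t size() const { return static_cast<int64_t>(bytes.size()); }

  // Copies up to `len` bytes starting at `offset` into `out`.
  // Precondition: 0 <= offset < size(). Returns the number of bytes copied.
  int64_t read_at(int64_t offset, int64_t len, uint8_t* out) {
    ++read_calls;
    const int64_t n = std::min(len, size() - offset);
    std::copy_n(bytes.begin() + offset, n, out);
    return n;
  }
};

struct scan_stats {
  int64_t bytes_seen = 0;
  int64_t peak_buffer = 0;  // size of the only buffer ever allocated
  uint64_t checksum = 0;    // stands in for "processing" the values
};

// Reads the whole chunk through one reusable buffer of `buffer_size` bytes.
inline scan_stats scan_with_limit(fake_column_chunk& chunk, int64_t buffer_size) {
  if (buffer_size <= 0) {
    throw std::invalid_argument("buffer_size must be positive");
  }
  scan_stats st;
  std::vector<uint8_t> buf(static_cast<std::size_t>(buffer_size));
  st.peak_buffer = buffer_size;

  int64_t offset = 0;
  while (offset < chunk.size()) {
    // Each iteration loads the next part; the buffer is overwritten, not grown.
    const int64_t n = chunk.read_at(offset, buffer_size, buf.data());
    for (int64_t i = 0; i < n; ++i) st.checksum += buf[i];
    st.bytes_seen += n;
    offset += n;
  }
  return st;
}
```

For a 1000-byte chunk and a 256-byte buffer this performs four reads and never holds more than 256 bytes; with a 4096-byte buffer it performs a single read.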

galsalomon66 · 2024-09-04T04:32:59Z

https://pulpito.ceph.com/gsalomon-2024-09-04_02:26:19-rgw:verify-limit_mem_usage_on_parquet_flow-distro-default-smithi/

yuvalif · 2024-09-09T09:00:37Z

@galsalomon66 is 7ef7e67 related to the change?
if not, can you please squash: 9b9f357, a8cafe8 and 7ef7e67
to one commit ?

galsalomon66 · 2024-09-09T14:39:47Z

> @galsalomon66 is 7ef7e67 related to the change? if not, can you please squash: 9b9f357, a8cafe8 and 7ef7e67 to one commit ?

The "remove debug" commit is related to this PR; I used that debug output for monitoring different settings.

yuvalif · 2024-09-09T18:35:01Z

> > @galsalomon66 is 7ef7e67 related to the change? if not, can you please squash: 9b9f357, a8cafe8 and 7ef7e67 to one commit ?
>
> the remove debug related to this PR, i used that for monitoring different settings.

ok. what about: ac758c1 ?

galsalomon66 · 2024-09-10T06:39:58Z

> > > @galsalomon66 is 7ef7e67 related to the change? if not, can you please squash: 9b9f357, a8cafe8 and 7ef7e67 to one commit ?
> >
> > the remove debug related to this PR, i used that for monitoring different settings.
>
> ok. what about: ac758c1 ?

While testing the arrow/parquet change (the buffer size limit),
I found that the standalone application used a wrong setting in the parquet flow.
The wrong setting relates to the response size (which can be quite big, depending on the type of query/object).

The standalone application has no impact on RGW.

I refactored the code that handles the response buffer size.

it is useful for very big row-groups (huge number of values per column).
refactor of the send-back-result-response.
add external configuration per parquet read-buffer.
fix for result printing.
remove debug.

Signed-off-by: Gal Salomon <[email protected]>

This was referenced Aug 27, 2024

rgw/s3select : fix for error flow. add an option to disable s3select-request. ceph/ceph#56834

Merged

rgw/s3select: limit memory usage on Parquet flow ceph/ceph#59465

Merged

yuvalif reviewed Aug 27, 2024

yuvalif self-requested a review September 5, 2024 07:50

yuvalif approved these changes Sep 5, 2024

galsalomon66 force-pushed the limit_memory_usage_per_big_row_groups branch from 7ef7e67 to 4b126a7 Compare September 10, 2024 13:34

galsalomon66 merged commit 0a0f6d4 into master Oct 6, 2024
2 checks passed

ktdreyer deleted the limit_memory_usage_per_big_row_groups branch December 12, 2024 17:03
