Fix memory usage in plan_to_object_store
#71
Merged
Conversation
The Parquet writer buffers a whole row group in memory before writing it out to the output stream, and the default row group size is ~1M rows. Limit the row group size to 65536 rows to mitigate this.
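A minimal sketch of what this looks like with the Rust `parquet` crate (the actual writer setup in the PR may differ):

```rust
use parquet::file::properties::WriterProperties;

/// Build writer properties that cap each row group at 65536 rows
/// instead of the default (~1M), bounding how much data the writer
/// buffers in memory before flushing a group to the output stream.
fn small_row_group_props() -> WriterProperties {
    WriterProperties::builder()
        .set_max_row_group_size(65536)
        .build()
}
```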
This is less secure (paths can go away or get recycled), but we need it in order to move the temporary partition file into the local object store (when we're using a local FS store).
If the backing object store is local, we support a "fast upload", which just moves the file into the store's filesystem (we could write to the actual object store FS directly, but then we'd lose the temp file deletion niceties).
If we're dealing with the local FS, move the temporary file there directly instead of reading it back (saving memory). We're still going to read the partition file to get its stats/hash, but this is step 1.
Make a dummy local FS store pointing to the directory with the temporary Parquet file, so that we don't have to load the whole partition in memory to get its stats.
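A sketch of that idea, assuming the `object_store` crate; the helper name and layout are illustrative, not the PR's actual code:

```rust
use std::sync::Arc;
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

/// Hypothetical helper: expose the directory holding a temporary Parquet
/// file as its own local FS object store, so readers can fetch byte
/// ranges (footer, column chunks) on demand instead of loading the
/// whole partition into memory to compute its stats.
fn temp_file_store(
    dir: &std::path::Path,
    file_name: &str,
) -> object_store::Result<(Arc<dyn ObjectStore>, Path)> {
    let store: Arc<dyn ObjectStore> =
        Arc::new(LocalFileSystem::new_with_prefix(dir)?);
    Ok((store, Path::from(file_name)))
}
```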
Use a streaming hasher and make it work with Tokio so that it doesn't block the rest of the app. If we're not using a local FS object store, this will still result in a read, but in the other cases we get to stream the partition around and consume the minimum amount of RAM.
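A minimal sketch of streaming hashing on Tokio; SHA-256 via the `sha2` crate and the 64 KiB buffer are assumptions, not necessarily what the PR uses:

```rust
use sha2::{Digest, Sha256};
use tokio::{fs::File, io::AsyncReadExt};

/// Hash a partition file in fixed-size chunks, so peak memory stays at
/// one buffer's worth regardless of how large the partition is.
async fn hash_file(path: &std::path::Path) -> std::io::Result<[u8; 32]> {
    let mut file = File::open(path).await?;
    let mut hasher = Sha256::new();
    let mut buf = vec![0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf).await?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    Ok(hasher.finalize().into())
}
```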
(Try to rename; if that fails, copy and delete the original.)
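A sketch of that move-with-fallback using plain `std::fs` (the function name is hypothetical):

```rust
use std::{fs, io, path::Path};

/// "Fast upload" move: renaming only works within one filesystem, so if
/// it fails (e.g. crossing mount points), fall back to copying the file
/// and removing the original.
fn move_file(src: &Path, dst: &Path) -> io::Result<()> {
    match fs::rename(src, dst) {
        Ok(()) => Ok(()),
        Err(_) => {
            fs::copy(src, dst)?;
            fs::remove_file(src)
        }
    }
}
```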
On Fly.io's free tier (with the 256MB limit), it causes OOM errors when doing `CREATE EXTERNAL TABLE` + `CREATE TABLE AS`, so disabling it for now. This reverts commit 4b89565.
mildbyte force-pushed the bugfix/memory-usage branch from 207398b to 67c1d1b on August 30, 2022.
Get the `CREATE EXTERNAL TABLE` + `CREATE TABLE AS` flow to fit under the 256MB free-tier Fly.io memory limit (used in the tutorial).
(Didn't investigate memory usage in depth with mimalloc, since it doesn't seem to be profileable by standard tools.)

Before (measured with bytehound): the 1G peaks are us buffering each partition before writing it out as Parquet; the final 200M plateau at the top is each partition getting loaded for hashing/indexing at the end.
After: heap usage consistently below 80M (as we buffer each row group), with no plateau at the end.