Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: configurable IO runtime #2789

Merged
merged 3 commits into from
Aug 19, 2024
Merged

Conversation

ion-elgreco
Copy link
Collaborator

@ion-elgreco ion-elgreco commented Aug 17, 2024

Description

A configured Tokio handle can be passed into the TableBuilder nested in the RuntTime enum, conveniently added a default which simply creates a new runtime and uses that handle.

Related issues:

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ion-elgreco, looking great.

The only comment is around how to enable this... As we are looking into some performance optimizations, the delta_rs_storage_hadler, may actually also be the place where we may want to introduce some simple caching for the commit files (specifically the json commits) in order to keep the log replay logic as simple as possible.

The above is just another argument, why i think we should consider passing the config via the DeltaTableOptions...

this will also allow us to pass in configuration for the runtime itself. For now it's ok, but if we GA this feature, we probably want to expose some sort of way to pass configuration to the runtime.

crates/core/src/storage/mod.rs Outdated Show resolved Hide resolved
@ion-elgreco
Copy link
Collaborator Author

@roeap do you mean the DeltaTableLoadOptions which is used in the table builder?

@roeap
Copy link
Collaborator

roeap commented Aug 18, 2024

i thought maybe this one:

/// Configuration options for delta table
#[derive(Debug, Serialize, Deserialize, Clone, PartialEq)]
#[serde(rename_all = "camelCase")]
pub struct DeltaTableConfig {
/// Indicates whether our use case requires tracking tombstones.
/// This defaults to `true`
///
/// Read-only applications never require tombstones. Tombstones
/// are only required when writing checkpoints, so even many writers
/// may want to skip them.
pub require_tombstones: bool,
/// Indicates whether DeltaTable should track files.
/// This defaults to `true`
///
/// Some append-only applications might have no need of tracking any files.
/// Hence, DeltaTable will be loaded with significant memory reduction.
pub require_files: bool,
/// Controls how many files to buffer from the commit log when updating the table.
/// This defaults to 4 * number of cpus
///
/// Setting a value greater than 1 results in concurrent calls to the storage api.
/// This can decrease latency if there are many files in the log since the
/// last checkpoint, but will also increase memory usage. Possible rate limits of the storage backend should
/// also be considered for optimal performance.
pub log_buffer_size: usize,
/// Control the number of records to read / process from the commit / checkpoint files
/// when processing record batches.
pub log_batch_size: usize,
}

@ion-elgreco
Copy link
Collaborator Author

@roeap Hmm does it need to be accessible later though?

Below could work also I believe, we can still save it in the DeltaTableConfig but wondering where else it will be used

#[derive(Debug)]
pub struct DeltaTableLoadOptions {
  ....
    pub log_batch_size: usize,
    pub io_runtime_config: Option<IORuntimeConfig>, // If passed enable experimental IO runtime with custom config
}   

Then DeltaTableBuilder has with_io_runtime_config, during build we can pass the IORuntimeConfig into build_storage which can wrap the objectstore into the delta_rs_handler with the runtime specification

@roeap
Copy link
Collaborator

roeap commented Aug 18, 2024

Yeah, this triggers some memories - at some point i felt there is a bunch of config distributed in various ways throughout our codebase, and we need to consolidate into a single way of handling config. The duplication of log_buffer_size we already see here is a good indication if this... I guess my main question would be - if you were to keep only on of these, which one would it be 😆, and then pick that one.

Naming wise I do find that "..LoadOptions" is the worse name, as it implies that this only relates to the initial table load?

@roeap
Copy link
Collaborator

roeap commented Aug 18, 2024

@ion-elgreco - after a quick glance at the relevant code, it feels maybe the some options should be flattened into the DeltaTableBuilder, and the config be added as a field on the builder instead of the load options?

@ion-elgreco
Copy link
Collaborator Author

@ion-elgreco - after a quick glance at the relevant code, it feels maybe the some options should be flattened into the DeltaTableBuilder, and the config be added as a field on the builder instead of the load options?

Yeah that makes sense! Now it's double 😆

I'll do another round on this tomorrow then to refactor the table builder a bit and add the IO runtime config on the table config.

Will have to take a look at what is normally configurable on a Tokio runtime :)

@ion-elgreco ion-elgreco force-pushed the feat/io_runtime branch 3 times, most recently from c45e0fc to 5a949e9 Compare August 19, 2024 10:10
@ion-elgreco ion-elgreco changed the title feat: experimental objectstore wrapper with IO runtime feat: configurable IO runtime Aug 19, 2024
@ion-elgreco ion-elgreco force-pushed the feat/io_runtime branch 2 times, most recently from 4f39227 to aecfb60 Compare August 19, 2024 16:36
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this turned out quote nice, great work! 👍

@ion-elgreco ion-elgreco added this pull request to the merge queue Aug 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants