feat: Enable GCS, S3, R2 and MinIO as object stores for local runs #1843
Conversation
Verified it works on GCS:

```
$ export GOOGLE_SERVICE_ACCOUNT=/Users/markogrujic/GlareDB/glaredb-dev-playground-99fdd8e83a7c.json
$ cargo run --bin glaredb -- -l gs://gruuya-data-store-test
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /Users/markogrujic/GlareDB/glaredb/py-glaredb/Cargo.toml
workspace: /Users/markogrujic/GlareDB/glaredb/Cargo.toml
    Finished dev [unoptimized + debuginfo] target(s) in 0.66s
     Running `target/debug/glaredb -l 'gs://gruuya-data-store-test'`
GlareDB (v0.5.0)
Using in-memory catalog
Type \help for help.
> create table test as values (1, 'one'), (2, 'two');
Table created
> select * from test where column1 = 2;
┌─────────┬─────────┐
│ column1 │ column2 │
│      ── │      ── │
│   Int64 │    Utf8 │
╞═════════╪═════════╡
│       2 │     two │
└─────────┴─────────┘
```

@scsmithr @vrongmeal @tychoish as for the S3 family of object stores, there's the matter of deciding what to do with the atomic copy-if-not-exists capability required for leasing. Cloudflare R2 has a header that enables this action: https://docs.rs/object_store/latest/object_store/aws/enum.S3CopyIfNotExists.html

For the others (basically Amazon proper and MinIO) I see a few options (not necessarily exclusive).
```diff
@@ -27,8 +27,7 @@ pub async fn start_inprocess_local(
 /// Starts an in-process metastore service, returning a client for the service.
 ///
 /// Useful for some tests, as well as when running GlareDB locally for testing.
 /// This should never be used in production.
```
Nice! 🫣
The first option would be preferred to avoid having external system dependencies. What I think we can do is change lease initialization logic to:

- Note that lease initialization does not automatically acquire the lease, and so an additional call to
- Not requiring
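The truncated notes above seem to describe an API where setting up a lease and taking it are separate steps; a purely hypothetical sketch of that shape (none of these names come from the PR):

```rust
/// Hypothetical lease handle over an object-store path.
pub struct Lease {
    path: String,
    held: bool,
}

impl Lease {
    /// Set up lease state (e.g. the backing object path) without
    /// attempting to take the lease.
    pub fn initialize(path: impl Into<String>) -> Self {
        Lease { path: path.into(), held: false }
    }

    /// Explicitly attempt to take the lease, e.g. via an atomic
    /// copy-if-not-exists against the backing object store.
    pub fn acquire(&mut self) -> Result<(), String> {
        self.held = true;
        Ok(())
    }
}
```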
Is there a third option that's "indirect [some?] names (or append random keys when writing to avoid clashes)?" Is there a fourth option that's doing naming/renaming (again of some level?) as a (relatively simple) two-phase commit?

I think we have to avoid clashing with ourselves more than anything: I don't think we need to solve "how do we avoid breaking things/being broken when someone manually moves files around while we're writing data." Am I misunderstanding things?
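On the "append random keys" idea specifically, a toy sketch (`put_unique` is a made-up helper using the `uuid` crate, and it assumes an `object_store` version where `put` takes `Bytes`):

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};
use uuid::Uuid;

/// Write under a freshly generated key so concurrent writers cannot
/// clash; publishing the returned key (e.g. in a manifest) would then
/// act as the commit step of a simple two-phase scheme.
async fn put_unique(
    store: &dyn ObjectStore,
    prefix: &str,
    data: Bytes,
) -> object_store::Result<Path> {
    let path = Path::from(format!("{prefix}/{}", Uuid::new_v4()));
    store.put(&path, data).await?;
    Ok(path)
}
```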
crates/metastore/Cargo.toml
Outdated
```diff
@@ -6,6 +6,7 @@ edition = {workspace = true}
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

 [dependencies]
 anyhow = "1.0"
```
Minor nitpick, but we would want to add errors to the enum instead of using `anyhow` in this crate.
Oh I see; changing, thanks.
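As an illustration of the suggestion (a sketch only; `MetastoreError` and its variants are hypothetical names, assuming the common `thiserror` pattern):

```rust
use thiserror::Error;

/// Hypothetical crate-local error enum replacing `anyhow`.
#[derive(Debug, Error)]
pub enum MetastoreError {
    /// Example variant for a malformed storage option.
    #[error("invalid storage option: {0}")]
    InvalidStorageOption(String),

    /// Errors bubbling up from the object store keep their own type.
    #[error(transparent)]
    ObjectStore(#[from] object_store::Error),
}

pub type Result<T, E = MetastoreError> = std::result::Result<T, E>;
```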
crates/object_store_util/Cargo.toml
Outdated
```diff
@@ -6,6 +6,7 @@ edition = { workspace = true }
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

 [dependencies]
 anyhow = "1.0"
```
Same comment as for metastore.
```rust
    pub storage_options: Vec<(String, String)>,
}

fn parse_key_value_pair(key_value_pair: &str) -> Result<(String, String)> {
```
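For context, a guess at what the body of such a parser might look like (split on the first `=`; the PR's actual implementation and error type may differ):

```rust
/// Parse a single `key=value` argument into a pair.
fn parse_key_value_pair(key_value_pair: &str) -> Result<(String, String), String> {
    match key_value_pair.split_once('=') {
        Some((key, value)) if !key.is_empty() => {
            Ok((key.trim().to_string(), value.trim().to_string()))
        }
        _ => Err(format!("expected `key=value`, got `{key_value_pair}`")),
    }
}
```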
what if you specify options which aren't supported, or options that are specified more than once? will order matter (should we sort by key?)

While I think it's good to have some kind of flexible config here (given that not all backing storage is going to have the same arguments), it's my assumption that we'll (eventually) have all this specified by a config file, or that most people will specify the options interactively via SQL/Python.
> what if you specify options which aren't supported

Unsupported options will be ignored, but the initialization/setup will error out with a clear message (either from us, or courtesy of the object store crate itself) if some of the required keys are missing.

> or options that are specified more than once? will order matter (should we sort by key?)

The last option value will be used in this case.
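One simple way to get that last-wins behavior, shown only as an illustration (not necessarily what the PR does): collecting the pairs into a `HashMap` overwrites earlier entries for a repeated key.

```rust
use std::collections::HashMap;

fn main() {
    let storage_options = vec![
        ("bucket".to_string(), "first".to_string()),
        ("bucket".to_string(), "second".to_string()),
    ];
    // HashMap's FromIterator inserts in order, so the last value wins.
    let opts: HashMap<String, String> = storage_options.into_iter().collect();
    assert_eq!(opts.get("bucket").map(String::as_str), Some("second"));
}
```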
```diff
@@ -35,6 +35,9 @@ pub struct LocalClientOpts {
     #[clap(short = 'c', long, value_parser)]
     pub cloud_url: Option<Url>,

     #[clap(flatten)]
```
I kept wanting there to be something like "flatten with prefix" so we could still get strict parsing that we didn't have to write ourselves, but it seems like this isn't to be. I also sort of wanted `ArgGroup` to be able to get us more parsing, but again, no such luck.
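For reference, a bare-bones sketch of the `flatten` pattern under discussion (field and flag names here are made up, not the PR's):

```rust
use clap::{Args, Parser};

#[derive(Debug, Args)]
struct StorageOpts {
    /// Repeatable key=value options forwarded to the object store,
    /// e.g. `--storage-option bucket=my-bucket` (hypothetical flag).
    #[clap(long = "storage-option")]
    storage_options: Vec<String>,
}

#[derive(Debug, Parser)]
struct LocalOpts {
    #[clap(short = 'c', long)]
    cloud_url: Option<String>,

    // `flatten` splices the group's flags into this parser as-is;
    // there is no built-in "flatten with prefix" to namespace them.
    #[clap(flatten)]
    storage: StorageOpts,
}

fn main() {
    let opts = LocalOpts::parse();
    println!("{opts:?}");
}
```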
```rust
/// Storage configuration for the compute engine.
///
/// The configuration defined here alongside the configuration passed in through
/// the proxy will be used to connect to database storage.
#[derive(Debug, Clone)]
pub enum EngineStorageConfig {
    Gcs { service_account_key: String },
    Local { path: PathBuf },
    S3 {
```
does it make sense to split the storage config into authentication components, which are provider specific, and general options (e.g. bucket name, path prefix, etc.)?

also, does it make sense to have a "path prefix" so that you could theoretically store multiple engines on one bucket?
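To make the suggestion concrete, one hypothetical shape for such a split (all names illustrative, not the PR's API):

```rust
/// Provider-specific authentication.
pub enum StorageAuth {
    Gcs { service_account_key: String },
    S3 { access_key_id: String, secret_access_key: String, region: String },
    Local,
}

/// Provider-agnostic settings shared by all backends.
pub struct CommonStorageOptions {
    /// Bucket (or root directory for local storage).
    pub bucket: String,
    /// Optional prefix so multiple engines can share one bucket.
    pub path_prefix: Option<String>,
}

pub struct EngineStorageConfig {
    pub auth: StorageAuth,
    pub options: CommonStorageOptions,
}
```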
> does it make sense to split the storage config into authentication components

Fwiw, I did try this initially; will try to change to that setup now as well.

> also, does it make sense to have a "path prefix"

Good question; usually in these cases the bucket parameter can hold an arbitrary path as well, so I figured that would work, but after a quick test now I see it doesn't, so I'll need to investigate/fix that.
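Worth noting: recent `object_store` releases ship a wrapper that scopes all keys under a prefix, which could serve the multiple-engines-per-bucket case (a sketch, assuming `PrefixStore` is available in the pinned version):

```rust
use object_store::{memory::InMemory, prefix::PrefixStore};

fn main() {
    // Wrap any ObjectStore so every key lands under `engine-a/`;
    // InMemory stands in for a real GCS/S3 store here.
    let _store = PrefixStore::new(InMemory::new(), "engine-a");
}
```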
I think we should get that into main before cutting a release, but can merge this PR and then add prefixes in, if you want.
Ok, will go ahead with merge, as I have other follow-up work as well (i.e. functional tests for this feature). Opened #1873 to track those two.
crates/sqlexec/src/engine.rs
Outdated
```rust
        )
    });

    let service_account_key = fs::read_to_string(service_account_path)?;

    let bucket = bucket.or(opts.get("bucket").cloned());
    // Buket potentially provided as a part of the location URL, try to extract it.
```
Suggested change:

```diff
- // Buket potentially provided as a part of the location URL, try to extract it.
+ // Bucket potentially provided as a part of the location URL, try to extract it.
```
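The surrounding logic presumably pulls the bucket out of the location URL; a self-contained guess at that step (using the `url` crate; not the PR's actual code):

```rust
use url::Url;

/// Extract the bucket from a location like `gs://my-bucket/some/prefix`:
/// for these URLs the host component is the bucket name.
fn bucket_from_location(location: &str) -> Option<String> {
    let url = Url::parse(location).ok()?;
    url.host_str().map(|host| host.to_string())
}

fn main() {
    assert_eq!(
        bucket_from_location("gs://gruuya-data-store-test").as_deref(),
        Some("gruuya-data-store-test")
    );
}
```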
(just for the record, I think this is mergeable modulo the changes discussed)
This PR introduces support for using external object stores as data/catalog storage in locally run GlareDB. There are still a couple of things missing.
Closes #1806, and also partially addresses #1818.
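For readers trying this out, the S3-family invocation presumably mirrors the GCS transcript at the top; the `-l` flag is the one shown there, while the credential environment variables below are an assumption (exact handling may differ):

```
$ export AWS_ACCESS_KEY_ID=<key id>
$ export AWS_SECRET_ACCESS_KEY=<secret>
$ cargo run --bin glaredb -- -l s3://my-bucket
```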