feat: Enable GCS, S3, R2 and MinIO as object stores for local runs #1843
@@ -1,3 +1,4 @@
use anyhow::anyhow;
use anyhow::Result;
use clap::{Parser, ValueEnum};
use std::fmt::Write as _;

@@ -81,3 +82,23 @@ pub struct PgProxyArgs {
    #[clap(long)]
    pub cloud_auth_code: String,
}

#[derive(Debug, Clone, Parser)]
pub struct StorageOptionsArgs {
    /// URL of the object store in which to keep the data.
    #[clap(short, long)]
    pub location: String,

    /// Storage options for the object store.
    #[clap(short, long, requires = "location", value_parser = parse_key_value_pair)]
    pub opts: Vec<(String, String)>,
}
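For a concrete sense of the flag surface, here is a purely illustrative sketch (not part of the diff; the binary name and the standalone parse are assumptions, since in practice this struct is presumably flattened into the larger CLI). It exercises the parser directly via clap's `parse_from`:

```rust
use clap::Parser;

fn main() {
    // Hypothetical invocation; repeated --opts flags accumulate as (key, value) pairs.
    let args = StorageOptionsArgs::parse_from([
        "glaredb",
        "--location",
        "s3://my-bucket",
        "--opts",
        "access_key_id=AKIA...",
        "--opts",
        "secret_access_key=examplesecret",
    ]);
    assert_eq!(args.opts.len(), 2);
}
```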
fn parse_key_value_pair(key_value_pair: &str) -> Result<(String, String)> {
    key_value_pair
        .split_once('=')
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .ok_or(anyhow!(
            "Expected key-value pair delimited by an equals sign, got '{key_value_pair}'"
        ))
}

Review comment: What if you specify options which aren't supported, or options that are specified more than once? Will order matter (should we sort by key)? While I think it's good to have some kind of flexible config here (given that not all backing storage is going to have the same arguments), my assumption is that we'll eventually have all of this specified by a config file, or that most people will specify the options interactively via SQL/Python.

Reply: Unsupported options will be ignored, but the initialization/setup will error out with a clear message (either from us, or courtesy of the object store crate itself) if some of the required keys are missing. If an option is specified more than once, the last value will be used.
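To make the duplicate-key behavior described in the reply concrete, here is a sketch (not code from this PR): collecting the repeated `--opts` pairs into the `HashMap` expected by the engine config naturally keeps only the last value for a given key.

```rust
use std::collections::HashMap;

// Sketch only: a later value for the same key overwrites the earlier one on
// insert, which is exactly the "last option value wins" behavior.
fn opts_to_map(opts: Vec<(String, String)>) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for (key, value) in opts {
        map.insert(key, value);
    }
    map
}
```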
@@ -27,8 +27,7 @@ pub async fn start_inprocess_local(

/// Starts an in-process metastore service, returning a client for the service.
///
/// Useful for some tests, as well as when running GlareDB locally for testing.
/// This should never be used in production.

Review comment: Nice! 🫣

/// Useful for tests, as well as when running GlareDB locally.
pub async fn start_inprocess(
    store: Arc<dyn ObjectStore>,
) -> Result<MetastoreServiceClient<Channel>> {
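A minimal usage sketch (not from this diff; the in-memory store is an assumption that fits tests and quick local runs):

```rust
use std::sync::Arc;
use object_store::memory::InMemory;

// Hypothetical helper: back the in-process metastore with an in-memory store.
// Any other ObjectStore implementation (GCS, S3, ...) could be passed instead.
async fn metastore_for_tests() -> Result<MetastoreServiceClient<Channel>> {
    let store = Arc::new(InMemory::new());
    start_inprocess(store).await
}
```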
@@ -3,7 +3,10 @@ use crate::context::remote::RemoteSessionContext;
use crate::errors::{ExecError, Result};
use crate::metastore::client::{MetastoreClientSupervisor, DEFAULT_METASTORE_CLIENT_CONFIG};
use crate::session::Session;
use std::collections::HashMap;

use object_store::aws::AmazonS3ConfigKey;
use object_store::gcp::GoogleConfigKey;
use std::fs;
use std::ops::{Deref, DerefMut};
use std::path::Path;

@@ -13,6 +16,7 @@ use std::sync::Arc;

use crate::metastore::catalog::SessionCatalog;
use datafusion_ext::vars::SessionVars;
use datasources::common::url::{DatasourceUrl, DatasourceUrlType};
use datasources::native::access::NativeTableStorage;
use object_store_util::conf::StorageConfig;
use protogen::gen::metastore::service::metastore_service_client::MetastoreServiceClient;

@@ -30,30 +34,116 @@ pub struct SessionStorageConfig {
    pub gcs_bucket: Option<String>,
}

// TODO: There's a significant amount of overlap with `StorageConfig`; it would be
// good to consider consolidating them into one.
/// Storage configuration for the compute engine.
///
/// The configuration defined here alongside the configuration passed in through
/// the proxy will be used to connect to database storage.
#[derive(Debug, Clone)]
pub enum EngineStorageConfig {
    Gcs { service_account_key: String },
    Local { path: PathBuf },
    S3 {
        access_key_id: String,
        secret_access_key: String,
        region: Option<String>,
        endpoint: Option<String>,
        bucket: Option<String>,
    },
    Gcs {
        service_account_key: String,
        bucket: Option<String>,
    },
    Local {
        path: PathBuf,
    },
    Memory,
}

Review comment: Does it make sense to split the storage config into authentication components, which are provider specific, and general options (e.g. bucket name, path prefix, etc.)? Also, does it make sense to have a "path prefix" so that you could theoretically store multiple engines on one bucket?

Reply: Fwiw, I did try the split initially; will try to change to that setup now as well. Good question on the prefix; usually in these cases the bucket parameter can hold an arbitrary path as well, so I figured that would work, but after a quick test now I see it doesn't, so I'll need to investigate/fix that.

Review comment: I think we should get that into main before cutting a release, but we can merge this PR and then add prefixes in, if you want.

Reply: Ok, will go ahead with the merge, as I have other follow-up work as well (i.e. functional tests for this feature). Opened #1873 to track those two.
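On the prefix question, a hedged sketch of one possible direction (nothing below is in this PR; #1873 tracks the actual follow-up, and the `PrefixStore` wrapper's name and availability depend on the `object_store` version in use): wrapping whatever store gets built with a per-engine prefix would let several engines share one bucket.

```rust
use object_store::memory::InMemory;
use object_store::prefix::PrefixStore;

// Hypothetical: scope each engine's data under its own prefix inside a shared
// bucket. InMemory stands in for the real GCS/S3/R2/MinIO store here.
fn engine_store(engine_id: &str) -> PrefixStore<InMemory> {
    PrefixStore::new(InMemory::new(), format!("engines/{engine_id}"))
}
```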
impl EngineStorageConfig {
    pub fn try_from_options(location: &String, opts: HashMap<String, String>) -> Result<Self> {
        if location.starts_with("memory://") {
            return Ok(EngineStorageConfig::Memory);
        }

        let datasource_url = DatasourceUrl::try_new(location)?;
        Ok(match datasource_url {
            DatasourceUrl::File(path) => EngineStorageConfig::Local { path },
            DatasourceUrl::Url(ref url) => {
                // Bucket potentially provided as a part of the location URL, try to extract it.
                let bucket = url.host_str().map(|h| h.to_string());

                match datasource_url.datasource_url_type() {
                    DatasourceUrlType::Gcs => {
                        let service_account_path =
                            opts.get("service_account_path").cloned().unwrap_or_else(|| {
                                std::env::var(GoogleConfigKey::ServiceAccount.as_ref().to_uppercase())
                                    .expect(
                                        "'service_account_path' in provided storage options or 'GOOGLE_SERVICE_ACCOUNT' as env var",
                                    )
                            });

                        let service_account_key = fs::read_to_string(service_account_path)?;

                        let bucket = bucket.or(opts.get("bucket").cloned());
                        EngineStorageConfig::Gcs {
                            service_account_key,
                            bucket,
                        }
                    }
                    DatasourceUrlType::S3 | DatasourceUrlType::Http => {
                        let access_key_id = opts.get("access_key_id").cloned().unwrap_or_else(|| {
                            std::env::var(AmazonS3ConfigKey::AccessKeyId.as_ref().to_uppercase())
                                .expect("'access_key_id' in provided storage options or 'AWS_ACCESS_KEY_ID' as env var")
                        });
                        let secret_access_key =
                            opts.get("secret_access_key").cloned().unwrap_or_else(|| {
                                std::env::var(AmazonS3ConfigKey::SecretAccessKey.as_ref().to_uppercase())
                                    .expect("'secret_access_key' in provided storage options or 'AWS_SECRET_ACCESS_KEY' as env var")
                            });

                        let mut endpoint = opts.get("endpoint").cloned();
                        let region = opts.get("region").cloned();
                        let bucket = bucket.or(opts.get("bucket").cloned());
                        if !location.starts_with("s3") && !location.contains("amazonaws.com") {
                            // For now we don't allow proper HTTP object stores as storage
                            // locations, so interpret this case as either Cloudflare R2 or a
                            // MinIO instance.
                            endpoint = Some(location.clone());
                        }

                        EngineStorageConfig::S3 {
                            access_key_id,
                            secret_access_key,
                            region,
                            endpoint,
                            bucket,
                        }
                    }
                    _ => unreachable!(),
                }
            }
        })
    }
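As a rough illustration of the mapping above (illustrative values only; assumes this sits in the same module as the definitions), an `s3://` location plus explicit credentials resolves to the S3 variant, with the bucket taken from the URL host and missing credentials falling back to the corresponding `AWS_*` environment variables; a location that is neither `s3://` nor an `amazonaws.com` URL is instead treated as an R2/MinIO endpoint.

```rust
use std::collections::HashMap;

fn s3_config_example() -> Result<EngineStorageConfig> {
    // Hypothetical credentials; "my-bucket" is picked up from the URL host.
    let opts = HashMap::from([
        ("access_key_id".to_string(), "AKIA...".to_string()),
        ("secret_access_key".to_string(), "examplesecret".to_string()),
        ("region".to_string(), "us-east-1".to_string()),
    ]);
    EngineStorageConfig::try_from_options(&"s3://my-bucket".to_string(), opts)
}
```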

    /// Create a native table storage config from values in the engine and
    /// session configs.
    ///
    /// Errors if the engine config is incompatible with the session config.
    fn storage_config(&self, session_conf: &SessionStorageConfig) -> Result<StorageConfig> {
    pub fn storage_config(&self, session_conf: &SessionStorageConfig) -> Result<StorageConfig> {
        Ok(match (self.clone(), session_conf.gcs_bucket.clone()) {
            // GCS bucket storage.
            // GCS bucket defined via session config or at the engine config level
            (
                EngineStorageConfig::Gcs {
                    service_account_key,
                    ..
                },
                Some(bucket),
            )
            | (
                EngineStorageConfig::Gcs {
                    service_account_key,
                    bucket: Some(bucket),
                },
                None,
            ) => StorageConfig::Gcs {
                service_account_key,
                bucket,

@@ -64,6 +154,22 @@ impl EngineStorageConfig {
                    "Missing bucket on session configuration",
                ))
            }
            (
                EngineStorageConfig::S3 {
                    access_key_id,
                    secret_access_key,
                    region,
                    endpoint,
                    bucket,
                },
                _,
            ) => StorageConfig::S3 {
                access_key_id,
                secret_access_key,
                region,
                endpoint,
                bucket,
            },
            // Local disk storage.
            (EngineStorageConfig::Local { path }, None) => StorageConfig::Local { path },
            // In-memory storage.

Review comment: I kept wanting there to be something like "flatten with prefix" so we could still get strict parsing that we didn't have to write ourselves, but it seems like this isn't to be. I also sort of wanted ArgGroup to be able to get us more parsing, but again, no such luck.
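For what it's worth, here is a sketch of what stricter, hand-rolled key validation could look like (purely hypothetical; the allowlist just mirrors the option names consumed in `try_from_options` above), so that unsupported keys fail fast instead of being silently ignored:

```rust
use anyhow::{anyhow, Result};

// Hypothetical allowlist of recognized storage option keys.
fn validate_opt_key(key: &str) -> Result<()> {
    const KNOWN_KEYS: &[&str] = &[
        "service_account_path",
        "access_key_id",
        "secret_access_key",
        "region",
        "endpoint",
        "bucket",
    ];
    if KNOWN_KEYS.contains(&key) {
        Ok(())
    } else {
        Err(anyhow!("unsupported storage option: '{key}'"))
    }
}
```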