Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382
Conversation
@alamb @devinjdangelo if you have time, a look would be perfect. Please ignore the proto errors; I just want to start a discussion around this.
Thanks @metesynnada -- could you possibly highlight some examples in this PR where the user API changed that you would like feedback on? I am being lazy and asking you in the hopes you will know how to do this without much effort rather than trying to grok the PR myself. If you don't have the time I will find some time to do so.
Is there a reason that we need to repeat the
@@ -33,7 +33,7 @@ c2 VARCHAR
) STORED AS CSV
WITH HEADER ROW
DELIMITER ','
OPTIONS ('escape' '\"')
This test was wrong, since the changed code was using `as_bytes()[0]`, but now we directly reject the value if there is more than one char. `\` is not an escape for `.slt` files.
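For illustration, a corrected version of this option with a single-character escape (table name and location are hypothetical) might look like:

CREATE EXTERNAL TABLE t (
c1 VARCHAR,
c2 VARCHAR
) STORED AS CSV
WITH HEADER ROW
DELIMITER ','
OPTIONS ('escape' '"')
LOCATION 'data.csv';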
max_statistics_size 123,
bloom_filter_fpp 0.001,
bloom_filter_ndv 100
'format.parquet.compression' snappy,
This is a solid example of how we change the option handling.
It would be nice if the config options didn't need to be literal-quoted, e.g. `'format.parquet.compression'`. I think it would be ideal if `DFParser` were aware of the custom syntax we are creating for options, including the dots and double colons, so a user could simply write:
COPY table
TO 'file.parquet'
(
format.parquet.compression snappy,
format.parquet.compression::col1 zstd(5)
)
Related is #9274
Sounds good. We can enhance this behaviour after this PR.
I carefully reviewed this PR and it LGTM.
After we merge this, let's make a quick follow-on PR to handle the following (see the sketch just below):
- Supporting options/configuration items without quotes (`foo.bar.baz` instead of `'foo.bar.baz'`).
- Supporting both `WITH HEADER ROW` and `OPTIONS (csv.has_header true)`, with the former overriding if present.
- Making the syntax uniform in `COPY TO` and elsewhere to use `STORED AS PARQUET`, and making `OPTIONS (parquet.foo.bar baz)` only legal when the format is specified through the `STORED AS` qualifier. This way we don't have the weird `format parquet` within the options clause at all.
Other than these three items we will tackle in the follow-on, I left only a few minor in-line comments to fix/improve in this PR.
Great work, @metesynnada 🚀
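A sketch of what the uniform syntax from the third item might look like once the follow-on lands (illustrative only; the exact grammar is to be settled in that PR):

COPY source_table
TO 'output/data.parquet'
STORED AS PARQUET
OPTIONS (
    parquet.compression zstd(5),
    parquet.compression::col1 snappy
)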
I addressed @ozankabak's review.
Thank you @metesynnada
I skimmed this PR (didn't study the implementation carefully, but looked at the tests carefully) and I think it is a very nice improvement and consolidation of the configuration. I think the pattern you introduce will extend very well.
I think it might make sense to update the documentation in https://arrow.apache.org/datafusion/user-guide/sql/write_options.html#available-options a bit to reflect the new format prefix.
I also marked the PR as "api-change", as the names of configuration settings now have the appropriate format prefix:
COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, 'parquet.compression' 'zstd(10)');
Instead of
COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, 'compression' 'zstd(10)');
}

/// Asynchronously registers an object store and its configuration extensions
async fn create_external_table(
❤️
@@ -396,12 +455,12 @@ mod tests {
    // Missing region, use object_store defaults
    let sql = format!("CREATE EXTERNAL TABLE test STORED AS PARQUET
-       OPTIONS('access_key_id' '{access_key_id}', 'secret_access_key' '{secret_access_key}') LOCATION '{location}'");
+       OPTIONS('aws.access_key_id' '{access_key_id}', 'aws.secret_access_key' '{secret_access_key}') LOCATION '{location}'");
I think prefixing the configuration with a cloud-provider-specific prefix makes sense.
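For example, a full statement with the new cloud-prefixed keys might look like this (bucket path and credential values are placeholders):

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS (
    'aws.access_key_id' 'my-key-id',
    'aws.secret_access_key' 'my-secret'
)
LOCATION 's3://my-bucket/path/';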
config_namespace_with_hashmap! {
    pub struct ColumnOptions {
        /// Sets if bloom filter is enabled for the column path.
This is really nice to see in structured format here.
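For reference, these per-column settings surface in SQL through the `::column` suffix discussed earlier in the thread; a hypothetical example:

COPY source_table TO 'output.parquet'
(format parquet, 'parquet.bloom_filter_enabled::col1' 'true');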
.with_header(value.has_header)
.with_delimiter(value.delimiter);

if let Some(v) = &value.date_format {
This is a very nice example of how the new programmatic API improves things.
@@ -65,7 +65,7 @@ SHOW datafusion.execution.batch_size
datafusion.execution.batch_size 1

# set variable unknown variable
-statement error DataFusion error: External error: could not find config namespace for key "aabbcc"
+statement error DataFusion error: Invalid or Unsupported Configuration: could not find config namespace for key "aabbcc"
👍
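The `.slt` hunk above shows only the expected error message; a plausible trigger statement (hypothetical, not shown in the hunk) would be:

SET aabbcc.xyz = 5;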
Given the large possibility of conflicts with this PR, I am merging it in now. Thanks again, everyone. I also have a follow-on idea I will file.
Filed #9575 for backwards compatibility for options.
I noticed two other breaking changes with Parquet bloom filters (and possibly other options too):
Thanks for bringing these to our attention @progval. I will discuss with @metesynnada to get more clarity on how to proceed (which of the changes you observed make sense, and which don't and should be fixed).
Solved the issues in #9604; can you check if this is the expected behaviour?
@metesynnada It is, thanks!
Well, issue 2 is. Issue 1 is unchanged.
The options to write_parquet changed. write_json has a new argument that I defaulted to None. We can expose that config later. Ref: apache/datafusion#9382
* deps: upgrade datafusion to 37.1.0
* feat: re-implement SessionContext::tables
  The method was removed upstream but is used in many tests for `datafusion-python`.
  Ref: apache/datafusion#9627
* feat: upgrade dataframe write_parquet and write_json
  The options to write_parquet changed. write_json has a new argument that I defaulted to None. We can expose that config later.
  Ref: apache/datafusion#9382
* feat: impl new ExecutionPlanProperties for DatasetExec
  Ref: apache/datafusion#9346
* feat: add upstream variant and method params
  - `WindowFunction` and `AggregateFunction` have `null_treatment` options.
  - `ScalarValue` and `DataType` have new variants.
  - `SchemaProvider::table` now returns a `Result`.
* lint: allow(deprecated) for make_scalar_function
* feat: migrate functions.rs
  `datafusion` completed an Epic that ported many of the `BuiltInFunctions` enum to `ScalarUDF`. I created new macros to simplify the port, and used these macros to refactor a few existing functions.
  Ref: apache/datafusion#9285
* fixme: commented out last failing test
  This is a bug upstream in datafusion:
  FAILED datafusion/tests/test_functions.py::test_array_functions - pyo3_runtime.PanicException: range end index 9 out of range for slice of length 8
* chore: update Cargo.toml package info
Which issue does this PR close?
Closes #8994.
Related to #9369 as well.
Rationale for this change
Currently, in our implementation of 'Create External Table' and 'Copy To', the configuration options are not systematically organized, leading to potential confusion and complexity for users.
What changes are included in this PR?
- Creating `Writers` as they create `Readers`, which removes the writer from the `FileSinkConfig`.
- `TableOptions` is extendable, which gives us the power of uniting the `object_store` configurations, like AWS.
- Removing `StatementOptions`.

The issues I want to talk about:
- `write_csv`, `write_parquet` etc. do not have a `CsvReadOptions` or `ParquetReadOptions` counterpart. I think we can add a structure to ensure a symmetric design (like `CsvWriteOptions`), which reduces the friction on user behaviour.
- Option keys like `format.csv.sink.dateformat` or `format.csv.scan.has_header` (distinguishing sink and scan settings).

Note: This PR is not compiling; since it has a huge impact on the proto file and its conversions, I will implement them after we settle on a design.
Are these changes tested?
With existing tests.
Are there any user-facing changes?
Yes, there is a breaking change here. Now, COPY options besides `partitioned_by` and `format` must have a prefix, and the same applies to CREATE EXTERNAL TABLE.