Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382
Conversation
@alamb @devinjdangelo if you have time, a look would be perfect. Please ignore the proto errors; I just want to start a discussion around this.
Thanks @metesynnada -- could you possibly highlight some examples in this PR where the user API changed that you would like feedback on? I am being lazy and asking you in the hopes you will know how to do this without much effort rather than trying to grok the PR myself. If you don't have the time I will find some time to do so.
Is there a reason that we need to repeat the
@@ -33,7 +33,7 @@ c2 VARCHAR
) STORED AS CSV
WITH HEADER ROW
DELIMITER ','
OPTIONS ('escape' '\"')
This test was wrong, since the changed code was using `as_bytes()[0]`, but now we directly reject the value if there is more than one char. `\` is not an escape for `.slt` files.
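For illustration, a corrected version of this option with a single-character escape (table name and location are hypothetical) might look like:

CREATE EXTERNAL TABLE t (
c1 VARCHAR,
c2 VARCHAR
) STORED AS CSV
WITH HEADER ROW
DELIMITER ','
OPTIONS ('escape' '"')
LOCATION 'data.csv';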
max_statistics_size 123,
bloom_filter_fpp 0.001,
bloom_filter_ndv 100
'format.parquet.compression' snappy,
This is a solid example of how we change the option handling.
It would be nice if the config options didn't need to be literal-quoted, e.g. `'format.parquet.compression'`. I think it would be ideal if `DFParser` were aware of the custom syntax we are creating for options, including the dots and double colons, so a user could simply write:
COPY table
TO 'file.parquet'
(
format.parquet.compression snappy,
format.parquet.compression::col1 zstd(5)
)
Related is #9274
Sounds good. We can enhance this behaviour after this PR.
I carefully reviewed this PR and it LGTM.
After we merge this, let's make a quick follow-on PR to handle the following (see the sketch just below):
- Supporting options/configuration items without quotes (`foo.bar.baz` instead of `'foo.bar.baz'`).
- Supporting both `WITH HEADER ROW` and `OPTIONS (csv.has_header true)`, with the former overriding if present.
- Making the syntax uniform in `COPY TO` and elsewhere to use `STORED AS PARQUET`, and making `OPTIONS (parquet.foo.bar baz)` only legal when the format is specified through the `STORED AS` qualifier. This way we don't have the weird `format parquet` within the options clause at all.
Other than these three items we will tackle in the follow-on, I left only a few minor in-line comments to fix/improve in this PR.
Great work, @metesynnada 🚀
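A sketch of what the uniform syntax from the third item might look like once the follow-on lands (illustrative only; the exact grammar is to be settled in that PR):

COPY source_table
TO 'output/data.parquet'
STORED AS PARQUET
OPTIONS (
    parquet.compression zstd(5),
    parquet.compression::col1 snappy
)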
I addressed @ozankabak's review.
Thank you @metesynnada
I skimmed this PR (didn't study the implementation carefully, but looked at the tests carefully) and I think it is a very nice improvement and consolidation of the configuration. I think the pattern you introduce will extend very well.
I think it might make sense to update the documentation in https://arrow.apache.org/datafusion/user-guide/sql/write_options.html#available-options a bit to reflect the new format prefix.
I also marked the PR as "api-change", as the names of configuration settings now have the appropriate format prefix:
COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, 'parquet.compression' 'zstd(10)');
Instead of
COPY source_table TO 'test_files/scratch/copy/table/' (format parquet, 'compression' 'zstd(10)');
}

/// Asynchronously registers an object store and its configuration extensions
async fn create_external_table(
❤️
@@ -396,12 +455,12 @@ mod tests {
    // Missing region, use object_store defaults
    let sql = format!("CREATE EXTERNAL TABLE test STORED AS PARQUET
-       OPTIONS('access_key_id' '{access_key_id}', 'secret_access_key' '{secret_access_key}') LOCATION '{location}'");
+       OPTIONS('aws.access_key_id' '{access_key_id}', 'aws.secret_access_key' '{secret_access_key}') LOCATION '{location}'");
I think prefixing the configuration with a cloud-provider-specific prefix makes sense.
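For example, a full statement with the new cloud-prefixed keys might look like this (bucket path and credential values are placeholders):

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS (
    'aws.access_key_id' 'my-key-id',
    'aws.secret_access_key' 'my-secret'
)
LOCATION 's3://my-bucket/path/';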
config_namespace_with_hashmap! {
    pub struct ColumnOptions {
        /// Sets if bloom filter is enabled for the column path.
This is really nice to see in structured format here.
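For reference, these per-column settings surface in SQL through the `::column` suffix discussed earlier in the thread; a hypothetical example:

COPY source_table TO 'output.parquet'
(format parquet, 'parquet.bloom_filter_enabled::col1' 'true');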
.with_header(value.has_header)
.with_delimiter(value.delimiter);

if let Some(v) = &value.date_format {
This is a very nice example of how the new programmatic API improves things.
@@ -65,7 +65,7 @@ SHOW datafusion.execution.batch_size
datafusion.execution.batch_size 1

# set variable unknown variable
-statement error DataFusion error: External error: could not find config namespace for key "aabbcc"
+statement error DataFusion error: Invalid or Unsupported Configuration: could not find config namespace for key "aabbcc"
👍
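The `.slt` hunk above shows only the expected error message; a plausible trigger statement (hypothetical, not shown in the hunk) would be:

SET aabbcc.xyz = 5;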
Given the large possibility of conflicts with this PR, I am merging it in now. Thanks again, everyone. I also have a follow-on idea I will file.
Filed #9575 for backwards compatibility for options.
I noticed two other breaking changes with Parquet bloom filters (and possibly other options too):
Thanks for bringing these to our attention @progval. I will discuss with @metesynnada to get more clarity on how to proceed (which of the changes you observed make sense, and which don't and should be fixed).
Solved the issues in #9604; can you check if this is the expected behaviour?
@metesynnada It is, thanks!
Well, issue 2 is. Issue 1 is unchanged.
The options to write_parquet changed. write_json has a new argument that I defaulted to None. We can expose that config later. Ref: apache/datafusion#9382
* deps: upgrade datafusion to 37.1.0
* feat: re-implement SessionContext::tables
  The method was removed upstream but is used in many tests for `datafusion-python`.
  Ref: apache/datafusion#9627
* feat: upgrade dataframe write_parquet and write_json
  The options to write_parquet changed. write_json has a new argument that I defaulted to None. We can expose that config later.
  Ref: apache/datafusion#9382
* feat: impl new ExecutionPlanProperties for DatasetExec
  Ref: apache/datafusion#9346
* feat: add upstream variant and method params
  - `WindowFunction` and `AggregateFunction` have `null_treatment` options.
  - `ScalarValue` and `DataType` have new variants.
  - `SchemaProvider::table` now returns a `Result`.
* lint: allow(deprecated) for make_scalar_function
* feat: migrate functions.rs
  `datafusion` completed an Epic that ported many of the `BuiltInFunctions` enum to `ScalarUDF`. I created new macros to simplify the port, and used these macros to refactor a few existing functions.
  Ref: apache/datafusion#9285
* fixme: commented out last failing test
  This is a bug upstream in datafusion:
  FAILED datafusion/tests/test_functions.py::test_array_functions - pyo3_runtime.PanicException: range end index 9 out of range for slice of length 8
* chore: update Cargo.toml package info
Which issue does this PR close?
Closes #8994.
Related to #9369 as well.
Rationale for this change
Currently, in our implementation of 'Create External Table' and 'Copy To', the configuration options are not systematically organized, leading to potential confusion and complexity for users.
What changes are included in this PR?
- Creating `Writers` as they create `Readers`, which removes the writer from the `FileSinkConfig`.
- `TableOptions` is extendable, which gives us the power of uniting the `object_store` configurations, like AWS.
- Removing `StatementOptions`.

The issues I want to talk about:
- `write_csv`, `write_parquet` etc. do not have a `CsvReadOptions` or `ParquetReadOptions` counterpart. I think we can add a structure to ensure a symmetric design (like `CsvWriteOptions`), which reduces the friction on user behaviour.
- Option keys like `format.csv.sink.dateformat` or `format.csv.scan.has_header` (distinguishing sink and scan settings).

Note: This PR is not compiling; since it has a huge impact on the proto file and its conversions, I will implement them after we settle on a design.
Are these changes tested?
With existing tests.
Are there any user-facing changes?
Yes, there is a breaking change here. Now, COPY options besides `partitioned_by` and `format` must have a prefix, and the same applies to CREATE EXTERNAL TABLE.