Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer file creation to write #8539

Merged
merged 5 commits into from
Dec 14, 2023
Merged

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #.

Rationale for this change

This is a follow up to the work to decouple streaming and listing tables. Historically we needed to eagerly create any directories, in order for the append logic to work correctly, as it relied on performing head requests to determine the size of the existing files. This is no longer the case and so we can remove this complexity.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 14, 2023
@@ -57,7 +57,7 @@ CREATE EXTERNAL TABLE dictionary_encoded_parquet_partitioned(
b varchar,
)
STORED AS parquet
LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned'
LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is necessary because now that the directories no longer exist it falls back to using the trailing / to determining if a directory or file is desired.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that inserting (appending) to an individual file is no longer a thing ListingTable needs to concern itself with, I wonder if the concept of a "single file" table could be completely removed. I.e. whether there is a trailing slash or not, the LOCATION is interpreted as the directory where the data files will be written to / read from.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is definitely the eventual goal, is this something you would be interested in working on?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usecase of writing to /foo/bar/1.parquet (a single file) is important, but perhaps that is triggered by the name ending in .parquet 🤔

I think that would still be achievable with the proposal above, I just wanted to point it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, COPY has a SINGLE_FILE_OUTPUT option that gates this, but for CREATE EXTERNAL TABLE single files don't support insert

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket for this #8548

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to take a stab at removing the single file table option next week, when I also plan to look at adding support for writing to arrow files.

Comment on lines 83 to 84
| SINGLE_FILE | If true, indicates that this external table is backed by a single file. INSERT INTO queries will append to this file. | false |
| INSERT_MODE | Determines if INSERT INTO queries should append to existing files or append new files to an existing directory. Valid values are append_to_file, append_new_files, and error. Note that "error" will block inserting data into this table. | CSV and JSON default to append_to_file. Parquet defaults to append_new_files |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is my eventual goal to remove all of these

| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- |
| SINGLE_FILE | If true, indicates that this external table is backed by a single file. INSERT INTO queries will append to this file. | false |
| CREATE_LOCAL_PATH | If true, the folder or file backing this table will be created on the local file system if it does not already exist when running INSERT INTO queries. | false |
| INSERT_MODE | Determines if INSERT INTO queries should append to existing files or append new files to an existing directory. Valid values are append_to_file, append_new_files, and error. Note that "error" will block inserting data into this table. | CSV and JSON default to append_to_file. Parquet defaults to append_new_files |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only valid INSERT_MODE is append_new_files so we might as well remove this from the documentation

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tustvold and @devinjdangelo -- this change makes sense to me

}
}
statement_options.take_bool_option("create_local_path")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please leave an comment here explaining why this is being ignored? Maybe even with a ticket reference to a ticket tracking removing it?

I can file such a ticket if it would be helpful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #8547

@tustvold tustvold merged commit a971f1e into apache:main Dec 14, 2023
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants