Defer file creation to write #8539
@@ -57,7 +57,7 @@ CREATE EXTERNAL TABLE dictionary_encoded_parquet_partitioned(
 b varchar,
 )
 STORED AS parquet
-LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned'
+LOCATION 'test_files/scratch/insert_to_external/parquet_types_partitioned/'
This is necessary because, now that the directories no longer exist, it falls back to using the trailing / to determine whether a directory or a file is desired.
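The fallback described above can be sketched roughly as follows. This is a minimal illustration, not DataFusion's actual code: the function name `is_directory_location` and the `exists_as_dir` parameter are hypothetical, introduced only to show the rule that a trailing `/` decides the question when nothing exists at the location yet.

```rust
/// Hypothetical helper (not part of DataFusion): decide whether a table
/// LOCATION should be treated as a directory.
///
/// `exists_as_dir` is `Some(..)` when something already exists at the
/// location and we know what it is, and `None` when nothing exists yet.
fn is_directory_location(location: &str, exists_as_dir: Option<bool>) -> bool {
    match exists_as_dir {
        // Something is already there: trust what is on disk.
        Some(is_dir) => is_dir,
        // Nothing exists yet: the trailing slash is the only signal left.
        None => location.ends_with('/'),
    }
}
```

Under this sketch, a location such as `parquet_types_partitioned` with no trailing slash and no existing directory is treated as a file, which is why the test needed the trailing `/`.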
Now that inserting (appending) to an individual file is no longer something ListingTable needs to concern itself with, I wonder if the concept of a "single file" table could be removed completely. That is, whether or not there is a trailing slash, the LOCATION is interpreted as the directory where the data files will be written to and read from.
Yes, this is definitely the eventual goal. Is this something you would be interested in working on?
The use case of writing to /foo/bar/1.parquet (a single file) is important, but perhaps that could be triggered by the name ending in .parquet 🤔
I think that would still be achievable with the proposal above; I just wanted to point it out.
Yes, COPY has a SINGLE_FILE_OUTPUT option that gates this, but for CREATE EXTERNAL TABLE, single files don't support insert.
Ticket for this: #8548
I'm happy to take a stab at removing the single file table option next week, when I also plan to look at adding support for writing to arrow files.
Force-pushed from f50300e to c2de9d3
| SINGLE_FILE | If true, indicates that this external table is backed by a single file. INSERT INTO queries will append to this file. | false |
| INSERT_MODE | Determines if INSERT INTO queries should append to existing files or append new files to an existing directory. Valid values are append_to_file, append_new_files, and error. Note that "error" will block inserting data into this table. | CSV and JSON default to append_to_file. Parquet defaults to append_new_files |
It is my eventual goal to remove all of these.
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------- |
| SINGLE_FILE | If true, indicates that this external table is backed by a single file. INSERT INTO queries will append to this file. | false |
| CREATE_LOCAL_PATH | If true, the folder or file backing this table will be created on the local file system if it does not already exist when running INSERT INTO queries. | false |
| INSERT_MODE | Determines if INSERT INTO queries should append to existing files or append new files to an existing directory. Valid values are append_to_file, append_new_files, and error. Note that "error" will block inserting data into this table. | CSV and JSON default to append_to_file. Parquet defaults to append_new_files |
The only valid INSERT_MODE is now append_new_files, so we might as well remove this from the documentation.
Thank you @tustvold and @devinjdangelo -- this change makes sense to me
}
}
statement_options.take_bool_option("create_local_path")?;
Can we please leave a comment here explaining why this is being ignored? Maybe even with a reference to a ticket tracking its removal?
I can file such a ticket if it would be helpful
Created #8547
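For illustration, the kind of comment requested above might look like the sketch below. The `Options` type and its `take_bool_option` method are hypothetical stand-ins written for this example, not DataFusion's real statement-options API, and the stated reason for ignoring the option (keeping existing statements parseable) is an assumption rather than something confirmed in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a statement-options container.
struct Options(HashMap<String, String>);

impl Options {
    /// Removes `key` if present and parses its value as a boolean.
    fn take_bool_option(&mut self, key: &str) -> Result<Option<bool>, String> {
        match self.0.remove(key) {
            None => Ok(None),
            Some(v) => v
                .to_lowercase()
                .parse::<bool>()
                .map(Some)
                .map_err(|_| format!("invalid boolean for {}: {}", key, v)),
        }
    }
}

/// Consumes options that are still accepted but no longer have any effect.
fn consume_deprecated(opts: &mut Options) -> Result<(), String> {
    // `create_local_path` is parsed and then deliberately ignored: paths are
    // now created on write, so the flag does nothing. Assumption: it is still
    // accepted so that existing statements keep working. Removal is tracked
    // in #8547.
    opts.take_bool_option("create_local_path")?;
    Ok(())
}
```

Taking the value out of the map, rather than leaving it in place, means any later "unknown option" check would not trip over the deprecated key.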
Which issue does this PR close?
Closes #.
Rationale for this change
This is a follow-up to the work to decouple streaming and listing tables. Historically we needed to eagerly create any directories for the append logic to work correctly, because it relied on performing HEAD requests to determine the size of the existing files. This is no longer the case, so we can remove this complexity.
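The idea of deferring creation to write time can be sketched as follows. This is a simplified, local-filesystem-only illustration using the Rust standard library; `DeferredFile` is a hypothetical type invented for this example and does not reflect how DataFusion or its object store layer implements writes.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::PathBuf;

/// Hypothetical sink that touches the file system only on first write.
struct DeferredFile {
    path: PathBuf,
    file: Option<File>,
}

impl DeferredFile {
    /// Records the target path; nothing is created yet.
    fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into(), file: None }
    }

    /// Creates parent directories and the file lazily, then writes `bytes`.
    fn write_all(&mut self, bytes: &[u8]) -> io::Result<()> {
        if self.file.is_none() {
            if let Some(parent) = self.path.parent() {
                fs::create_dir_all(parent)?;
            }
            self.file = Some(File::create(&self.path)?);
        }
        // The file is guaranteed to be open at this point.
        self.file.as_mut().unwrap().write_all(bytes)
    }
}
```

With this shape, declaring a table over a location that does not exist yet has no side effects; directories appear only once data is actually written.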
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?