-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add ability to query the remote http(s) location directly in datafusion-cli #9150
Conversation
Signed-off-by: Nikolay Ulmasov <[email protected]>
Signed-off-by: Nikolay Ulmasov <[email protected]>
let url: &Url = table_url.as_ref(); | ||
match state.runtime_env().object_store_registry.get_store(url) { | ||
Ok(_) => {} | ||
Err(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please, why the http client gets built on get_store error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw it as an easy way to avoid re-registering the store for the same url. If you query the same url multiple times, it will err on first call so the store will be registered, on subsequent queries it will not. Happy to change if you can suggest something different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is confusing but follows the existing pattern in datafusion-cli, which is basically to defer registering the object store instance until something actually tries to access it (so, for example, we don't get S3 config errors b/c the config isn't setup if the user isn't actually querying S3)
As written this code will support http from select, which is great, but it doesn't support other url types (like s3://...
etc.
Instead of creating an HttpBuilder directly, would it be possible to call
So all the types of object stores are covered?
It would be great to add additional comments explaining the deferred registration logic, but i would be happy to do that as a follow on PR as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to follow up with a PR for other protocols so will try to re-use the existing logic and add more comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice initiative thanks @r3stl355 it saves multiple efforts when dataset is remote.
Im wondering if the source will be reread each time user runs the query?
What happens if http connection breaks in the middle?
@comphead thanks for the review. As far as I can tell, it downloads the file every time but maybe there are some caching settings I don't know about, so yeah - a complete connection failure will likely cause a failure but that's at the object-store level as far as I can tell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @r3stl355 -- I think this PR is well tested and documented and implements the requested functionality.
There are a few improvements (like supporting other object store types and improving comments) that I think would improve the code but are not necessary in my opinion and we could do them as a follow on PR.
I also tested it locally (don't worry about the speed, my internet connection here is pretty terrible):
DataFusion CLI v35.0.0
❯ select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| COUNT(*) |
+----------+
| 1000000 |
+----------+
1 row in set. Query took 7.586 seconds.
Thank you @comphead for your review 🙏
let url: &Url = table_url.as_ref(); | ||
match state.runtime_env().object_store_registry.get_store(url) { | ||
Ok(_) => {} | ||
Err(_) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is confusing but follows the existing pattern in datafusion-cli, which is basically to defer registering the object store instance until something actually tries to access it (so, for example, we don't get S3 config errors b/c the config isn't setup if the user isn't actually querying S3)
As written this code will support http from select, which is great, but it doesn't support other url types (like s3://...
etc.
Instead of creating an HttpBuilder directly, would it be possible to call
So all the types of object stores are covered?
It would be great to add additional comments explaining the deferred registration logic, but i would be happy to do that as a follow on PR as well
I'm glad I kinda followed a right path here. As you saw @alamb, I left a
If it's OK I suggest I (or someone else) addresses additional remote protocols in a follow up PR, I can create a relevant issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree a follow on ticket / PR would be great. Thank you @r3stl355 |
I think at the moment these settings are not provided by the execution context but are provided by certain statement (e.g. |
Follow up ticket #9167 |
Thanks again @r3stl355 |
Which issue does this PR close?
Closes #9133
Rationale for this change
Query the remote http(s) location directly without creating a table
What changes are included in this PR?
Are these changes tested?
Added a dedicated unit test
Are there any user-facing changes?
Added an example to the docs showing how to query the remote parquet file via HTTP