Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ability to query the remote http(s) location directly in datafusion-cli #9150

Merged
merged 2 commits into from
Feb 9, 2024

Conversation

r3stl355
Copy link
Contributor

@r3stl355 r3stl355 commented Feb 7, 2024

Which issue does this PR close?

Closes #9133

Rationale for this change

Query the remote http(s) location directly without creating a table

What changes are included in this PR?

Are these changes tested?

Added a dedicated unit test

Are there any user-facing changes?

Added an example to the docs showing how to query the remote parquet file via HTTP

Signed-off-by: Nikolay Ulmasov <[email protected]>
let url: &Url = table_url.as_ref();
match state.runtime_env().object_store_registry.get_store(url) {
Ok(_) => {}
Err(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, why the http client gets built on get_store error ?

Copy link
Contributor Author

@r3stl355 r3stl355 Feb 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw it as an easy way to avoid re-registering the store for the same url. If you query the same url multiple times, it will err on first call so the store will be registered, on subsequent queries it will not. Happy to change if you can suggest something different

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is confusing but follows the existing pattern in datafusion-cli, which is basically to defer registering the object store instance until something actually tries to access it (so, for example, we don't get S3 config errors b/c the config isn't setup if the user isn't actually querying S3)

As written this code will support http from select, which is great, but it doesn't support other url types (like s3://... etc.

Instead of creating an HttpBuilder directly, would it be possible to call

https://github.com/apache/arrow-datafusion/blob/10ae9343368a893012aa80b66c02d45b4f461f9f/datafusion-cli/src/exec.rs#L305-L339

So all the types of object stores are covered?

It would be great to add additional comments explaining the deferred registration logic, but i would be happy to do that as a follow on PR as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to follow up with a PR for other protocols so will try to re-use the existing logic and add more comments

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice initiative thanks @r3stl355 it saves multiple efforts when dataset is remote.

Im wondering if the source will be reread each time user runs the query?
What happens if http connection breaks in the middle?

@r3stl355
Copy link
Contributor Author

r3stl355 commented Feb 7, 2024

Im wondering if the source will be reread each time user runs the query?
What happens if http connection breaks in the middle?

@comphead thanks for the review. As far as I can tell, it downloads the file every time but maybe there are some caching settings I don't know about, so yeah - a complete connection failure will likely cause a failure but that's at the object-store level as far as I can tell

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @r3stl355 -- I think this PR is well tested and documented and implements the requested functionality.

There are a few improvements (like supporting other object store types and improving comments) that I think would improve the code but are not necessary in my opinion and we could do them as a follow on PR.

I also tested it locally (don't worry about the speed, my internet connection here is pretty terrible):

DataFusion CLI v35.0.0
❯ select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| COUNT(*) |
+----------+
| 1000000  |
+----------+
1 row in set. Query took 7.586 seconds.

Thank you @comphead for your review 🙏

let url: &Url = table_url.as_ref();
match state.runtime_env().object_store_registry.get_store(url) {
Ok(_) => {}
Err(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is confusing but follows the existing pattern in datafusion-cli, which is basically to defer registering the object store instance until something actually tries to access it (so, for example, we don't get S3 config errors b/c the config isn't setup if the user isn't actually querying S3)

As written this code will support http from select, which is great, but it doesn't support other url types (like s3://... etc.

Instead of creating an HttpBuilder directly, would it be possible to call

https://github.com/apache/arrow-datafusion/blob/10ae9343368a893012aa80b66c02d45b4f461f9f/datafusion-cli/src/exec.rs#L305-L339

So all the types of object stores are covered?

It would be great to add additional comments explaining the deferred registration logic, but i would be happy to do that as a follow on PR as well

datafusion-cli/src/catalog.rs Show resolved Hide resolved
docs/source/user-guide/cli.md Show resolved Hide resolved
@r3stl355
Copy link
Contributor Author

r3stl355 commented Feb 8, 2024

I'm glad I kinda followed a right path here. As you saw @alamb, I left a TODO comment there to implement this for other remote resources (e.g. s3:/) for couple of reasons:

  1. the issue talks about http(s) specifically so I didn't want to expand the scope of the RP, (more chances for it to get merged too 😁 ) and cover the other protocols in a follow up PR
  2. I didn't want to modify https://github.com/apache/arrow-datafusion/blob/10ae9343368a893012aa80b66c02d45b4f461f9f/datafusion-cli/src/exec.rs#L305-L339 as I'd have to lift it to pub(crate) level, and more importantly, I didn't have access to command options (those are needed for S3 etc auth) in the place I modified the code, so I need to figure out how to get those if they are provided by the executions context (I could just use an empty HashMap so the config values are picked from the environment but that would be limilted)

If it's OK I suggest I (or someone else) addresses additional remote protocols in a follow up PR, I can create a relevant issue.

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @r3stl355 and @alamb for second review, all remaining things we can cover in following PRs

@alamb
Copy link
Contributor

alamb commented Feb 8, 2024

I agree a follow on ticket / PR would be great. Thank you @r3stl355

@alamb
Copy link
Contributor

alamb commented Feb 8, 2024

so I need to figure out how to get those if they are provided by the executions context (I could just use an empty HashMap so the config values are picked from the environment but that would be limited)

I think at the moment these settings are not provided by the execution context but are provided by certain statement (e.g. COPY ... (options) so passing an empty HashMap might be great

@r3stl355
Copy link
Contributor Author

r3stl355 commented Feb 8, 2024

Follow up ticket #9167

@alamb alamb merged commit 071dc99 into apache:main Feb 9, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 9, 2024

Thanks again @r3stl355

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support selecting directly from http:// / https:// urls in datafusion-cli
3 participants