Scan does not work as expected #495
Comments
I performed another test using the Tabular catalog, attempting to scan the sandbox warehouse in the examples namespace, specifically targeting the nyc_taxi_yellow table, but it returned no results.
I found the problem. I don't know how to solve it yet, but I will try. While testing with Tabular, I'm receiving a 403 error from S3. So we have two issues to solve: one is to expose read errors to the user, and the other is to understand why we are getting these access-denied errors.
For the Tabular example, I encountered an 'access denied' problem: the FileIO does not work with remote signing. For the MinIO example, the problem was solved when I added a match statement to return the error from tasks.next().
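The match-statement fix described above can be sketched in isolation. This is an illustrative stand-in, not the actual iceberg-rust API: `collect_tasks` plays the role of the loop over the task stream, with plain `String`s in place of real scan tasks and errors.

```rust
// Sketch of the fix: propagate errors from the task stream instead of
// silently dropping them. `tasks` stands in for the file-scan-task
// stream; the types here are illustrative, not the real API.
fn collect_tasks(
    tasks: impl Iterator<Item = Result<String, String>>,
) -> Result<Vec<String>, String> {
    let mut out = Vec::new();
    for next in tasks {
        match next {
            Ok(task) => out.push(task),
            // Without this arm being surfaced to the caller, a 403 from
            // S3 looks exactly like an empty result set.
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```

The key point is that the error arm returns to the caller rather than being swallowed, so an access-denied response becomes a visible failure instead of "no results."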
To scan with remote signing, we need to implement this.
I'm guessing #498 should close this issue. Would you like to verify it? |
Yes and no. I'm not sure if this is the flow, because I haven't found any documentation; this is based on my understanding from reading the Python implementation. It's a presign process, but it's not the client's responsibility to presign. The get-config call returns s3.signer.uri, and the load-table call returns s3.remote-signing-enabled as true along with some other S3 configurations. With that, we need to "presign" using the token returned by load-table. The specification for the server responsible for the signing is s3-signer-open-api.yaml.
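Based on my reading of s3-signer-open-api.yaml, the exchange with the signer server looks roughly like the following. The endpoint path, field names, and values here are my interpretation of the spec and are illustrative only:

```
POST {s3.signer.uri}/v1/aws/s3/sign
Authorization: Bearer <token returned by load-table>

{
  "region": "us-east-1",
  "method": "GET",
  "uri": "s3://bucket/path/to/data.parquet",
  "headers": { "Host": ["bucket.s3.example.com"] }
}

-- response: the client replaces its request headers with the signed ones --

{
  "uri": "s3://bucket/path/to/data.parquet",
  "headers": { "Authorization": ["AWS4-HMAC-SHA256 ..."], "x-amz-date": ["..."] }
}
```

So the client never holds S3 credentials; it forwards each S3 request to the signer, which returns the signed headers to attach.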
I'm not comfortable closing this issue without a regression test that guarantees the expected behavior.
+1 on this. Currently we don't have regression tests on the whole read path, which involves integrating with external systems such as Spark.
Got it. So, we need to support
I think we can start with very basic tests, like just scanning the whole table.
The reason I haven't started this yet is that I want to do it after the integration with DataFusion. @ZENOTME and I did integration tests in icelake before, and I have to say that without SQL engine support, those tests are painful to maintain.
I agree that we need a SQL engine to make testing easier. However, maintaining basic unit tests along these lines is still worthwhile:

```rust
// catalog / FileIO setup, balbalba
let table = balabala();
let scan = table.scan().select_all().build().unwrap();
dbg!(&scan);
let batch_stream = scan.to_arrow().await.unwrap();
let batches: Vec<_> = batch_stream.try_collect().await.unwrap();
```
Issue #504 created |
Correctly writing data into Iceberg is not supported yet, so we need external systems such as Spark to ingest data. Shipping pre-generated Parquet files may be an approach, but that requires maintaining binaries in the repo.
I've got some code in the perf testing branch that might help. It downloads NYC taxi data, and uses MinIO, the REST catalog, and a Spark container to create a table and insert the NYC taxi data into it.
I have fixed the issue where errors were not returned to the user in #535.
I believe this should have been fixed. Please feel free to open a new issue if it still exists.
I'm testing using the Iceberg REST image from Tabular as a catalog.
Here's the docker-compose.yml file:
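(The original file is not preserved here. As a rough illustration only, a minimal compose setup pairing the Tabular REST catalog image with MinIO might look like the following; the service names, credentials, and environment variables are assumptions, not the author's actual file:)

```yaml
# Illustrative sketch, not the author's original docker-compose.yml.
services:
  rest:
    image: tabulario/iceberg-rest
    ports:
      - "8181:8181"
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
```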
I created some data with PyIceberg:
And queried with PyIceberg to verify if it's okay:
It returns 4.
And then with the Rust implementation:
It's returning nothing.
We have to define the S3 configuration explicitly because the Tabular image does not return the S3 credentials in the get-config response.
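Supplying that configuration client-side amounts to passing the S3 properties alongside the catalog properties. A minimal sketch of such a property map, using the standard Iceberg `s3.*` property keys (the endpoint and credentials below are the usual local MinIO defaults, assumed for illustration, not values from this issue):

```rust
use std::collections::HashMap;

// Client-side S3 overrides for when the catalog's get-config response
// does not include credentials. Keys follow the Iceberg `s3.*` property
// convention; the values are illustrative local-MinIO defaults.
fn s3_override_props() -> HashMap<String, String> {
    let mut props = HashMap::new();
    props.insert("s3.endpoint".to_string(), "http://localhost:9000".to_string());
    props.insert("s3.region".to_string(), "us-east-1".to_string());
    props.insert("s3.access-key-id".to_string(), "admin".to_string());
    props.insert("s3.secret-access-key".to_string(), "password".to_string());
    props
}
```

These properties would then be merged into the catalog configuration when constructing the REST catalog, so the FileIO can reach the object store directly.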