Empty array giving error #2439

Closed
kesavkolla opened this issue May 4, 2022 · 8 comments

@kesavkolla

I have a parquet file whose schema contains nested arrays. At several levels the type is an array but the data is empty. When I use this data with DataFusion I get this error:

ArrowError(ComputeError("concat requires input of at least one array"))

Even if I use a projection with only non-nested columns, I still get the above error. Can someone share thoughts on how I can get around this error?
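The error text says a concat step was handed zero input arrays. One plausible reading, assumed here rather than confirmed in the thread, is that the reader collected no chunks for an all-empty nested column and then tried to concatenate them. The sketch below is a std-only toy model of that failure mode; `Chunk`, `concat_strict`, and `concat_or_empty` are hypothetical names and are not part of arrow-rs or DataFusion.

```rust
// Toy model only: these names are hypothetical and do not come from
// arrow-rs or DataFusion. Only the error message is taken from the report.

/// A column chunk, modelled as a plain vector of optional values.
type Chunk = Vec<Option<i64>>;

/// Strict concatenation: an empty list of chunks is reported as an error,
/// mirroring the message quoted in the report above.
fn concat_strict(chunks: &[Chunk]) -> Result<Chunk, String> {
    if chunks.is_empty() {
        return Err("concat requires input of at least one array".to_string());
    }
    // Flatten all chunks, in order, into a single chunk.
    Ok(chunks.iter().flatten().cloned().collect())
}

/// Tolerant variant: a caller that may legitimately end up with zero chunks
/// falls back to an empty chunk instead of surfacing the error.
fn concat_or_empty(chunks: &[Chunk]) -> Chunk {
    concat_strict(chunks).unwrap_or_default()
}
```

Calling `concat_strict(&[])` returns the error, while `concat_or_empty(&[])` returns an empty chunk. Which layer should handle the empty case in the real libraries is a design decision for the maintainers.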

@kesavkolla added the bug label May 4, 2022
@andygrove
Member

@kesavkolla Could you provide a minimal repro parquet file that we can use for testing?

@kesavkolla
Author

kesavkolla commented May 4, 2022

@andygrove

I had to create a zip file from the compressed parquet file because GitHub does not allow uploading it otherwise.

Just unzip the file and try to load it via DataFusion, and you get the above error.

Here is my code snippet:

use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Register the unzipped parquet file as a table named "patient".
    ctx.register_parquet(
        "patient",
        "part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // Select only a non-nested column; this still triggers the error.
    let df = ctx.sql("SELECT patient.id FROM patient LIMIT 10").await?;
    df.show().await?;
    Ok(())
}

part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet.zip

@alamb
Contributor

alamb commented May 13, 2022

I believe this will be fixed in arrow 15 (in a few weeks) due to @tustvold's work in apache/arrow-rs#1682

@tustvold
Contributor

tustvold commented May 13, 2022

Yup, I can confirm that with apache/arrow-rs#1700 it is possible for the parquet reader to handle the file. Unfortunately, work remains to support this on DataFusion's side, e.g. #2453. I fully intend to get this working; there may just be many bugs along the way 😅

@kesavkolla
Author

Looking forward to doing early testing.

@kesavkolla
Author

> I believe this will be fixed in arrow 15 (in a few weeks) due to @tustvold's work in apache/arrow-rs#1682

I wish this were included in arrow 14 itself. :-)

@HaoYang670
Contributor

Could we close this issue?

@alamb
Contributor

alamb commented Sep 21, 2022

I just tried this file with @kesavkolla's example using datafusion-cli, and it does appear to run without error now:

❯ create external table patient stored as parquet location 'part-00000-f6337bce-7fcd-4021-9f9d-040413ea83f8-c000.snappy.parquet';
0 rows in set. Query took 0.002 seconds.
❯ SELECT patient.id FROM patient LIMIT 10;
+--------------------------------------+
| id                                   |
+--------------------------------------+
| eTplvxRvcd-eT1nEI8BvQRQ3             |
| 1127421b-66fd-85ff-b92e-827c9a280be2 |
| 3bdac299-3731-0acd-9cd9-fbba40236e3a |
| 47e6dfcb-30a9-2a66-c8fb-984da359ead1 |
| 8aec8d06-74af-aed5-0132-5b77fc6b418b |
| 03af4547-1d67-68ae-ebbc-cc5a9fc6c898 |
| 180aefff-e7cc-940e-cb2c-3ed99c1aed39 |
| 966070ad-f22e-631f-acf2-657c90f903f1 |
| d94a1af1-bf9b-a705-bc54-2d0009c39981 |
| 7f217143-f36c-707a-b77c-f4a1cf70f952 |
+--------------------------------------+
10 rows in set. Query took 0.003 seconds.

so closing as complete
