Cannot query parquet files generated by Apache Spark from datafusion-cli #1648

Closed
andygrove opened this issue Jan 23, 2022 · 3 comments · Fixed by #1665
Labels: bug, good first issue, help wanted

Comments

@andygrove
Member

Describe the bug

I have a data set created by Apache Spark and I tried to query it from the DataFusion CLI. It failed, saying that a parquet file was corrupt.

❯ CREATE EXTERNAL TABLE store_sales STORED AS PARQUET LOCATION 'store_sales.dat';
0 rows in set. Query took 0.002 seconds.
❯ select count(*) from store_sales;
Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))

I added some debug logging and found that it was actually trying to read the following file, which is not a Parquet file.

store_sales.dat/.part-00005-5142b177-bacb-499d-b14f-12de4b94d9d9-c000.snappy.parquet.crc

To Reproduce
Create a non-Parquet file with a non-Parquet extension and put it in a directory along with some valid Parquet files. Then register the directory as an external table and query it.

Expected behavior
DataFusion should only try to read files with the .parquet file extension.
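
As a rough illustration of this expected behavior, here is a minimal sketch that filters a directory listing by suffix. It assumes the listing is already available as a slice of path strings; `filter_by_suffix` is a hypothetical helper name and is not part of DataFusion's API.

```rust
/// Keep only the paths that end with `suffix`.
/// Hypothetical helper for illustration only; not a DataFusion function.
fn filter_by_suffix<'a>(paths: &[&'a str], suffix: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        // A checksum file such as `....snappy.parquet.crc` ends with
        // `.crc`, not `.parquet`, so it is dropped here.
        .filter(|p| p.ends_with(suffix))
        .collect()
}
```

With a suffix of `.parquet`, the `.crc` file from the report above would be skipped instead of being handed to the Parquet reader.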

Additional context
None

andygrove added the bug label on Jan 23, 2022
@houqp
Member

houqp commented Jan 23, 2022

This is because we are not providing a file extension as the search suffix in https://github.com/apache/arrow-datafusion/blob/9c5ccae240ce38b084128e8d7ff0752d0e2318a6/datafusion/src/execution/context.rs#L232

I think the right behavior is to provide a default extension suffix and let users override it if they are using something different.
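
One way this could look, sketched under assumptions: an options struct whose default suffix is `.parquet`, with a builder-style method to override it. The names `ScanOptions`, `with_file_extension`, and `matches` are hypothetical and chosen for illustration; they are not taken from the DataFusion code base.

```rust
/// Hypothetical scan options; illustrative only, not DataFusion's API.
#[derive(Debug, Clone)]
struct ScanOptions {
    /// Only files whose path ends with this suffix are read.
    file_extension: String,
}

impl Default for ScanOptions {
    /// Default to the conventional Parquet suffix.
    fn default() -> Self {
        Self {
            file_extension: ".parquet".to_string(),
        }
    }
}

impl ScanOptions {
    /// Override the suffix for data sets that use a different extension.
    fn with_file_extension(mut self, ext: &str) -> Self {
        self.file_extension = ext.to_string();
        self
    }

    /// Return true if `path` should be included in the scan.
    fn matches(&self, path: &str) -> bool {
        path.ends_with(self.file_extension.as_str())
    }
}
```

A caller who stores Parquet data under another extension would write something like `ScanOptions::default().with_file_extension(".parq")`, while everyone else gets the `.parquet` filter without any configuration.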

houqp added the good first issue and help wanted labels on Jan 23, 2022
@Ted-Jiang
Member

@houqp plz assign this to me 😊.

@andygrove
Member Author

It would also be nice if the error message could include the name of the corrupt file, to make these issues easier to debug.
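
A minimal sketch of that idea, assuming the reader's error is available as a plain `String`: wrap it in an error type that carries the file path. `FileReadError` and `with_path` are hypothetical names for illustration and do not reflect how DataFusion or the parquet crate actually model errors.

```rust
use std::fmt;

/// Hypothetical error type that records which file failed to read.
#[derive(Debug)]
struct FileReadError {
    path: String,
    message: String,
}

impl fmt::Display for FileReadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Put the path first so it is easy to spot in logs.
        write!(f, "error reading '{}': {}", self.path, self.message)
    }
}

impl std::error::Error for FileReadError {}

/// Attach `path` to the error side of `result`; the success value passes through unchanged.
fn with_path<T>(path: &str, result: Result<T, String>) -> Result<T, FileReadError> {
    result.map_err(|message| FileReadError {
        path: path.to_string(),
        message,
    })
}
```

With this kind of wrapper, the "Corrupt footer" message above would name the `.crc` file directly, without extra debug logging.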
