Appending to a parquet file #47

Open
GrandChaman opened this issue Sep 3, 2021 · 4 comments
Labels
feature A new feature

Comments

@GrandChaman

Hi!

I was wondering whether it would be possible to add support for appending data to the end of a parquet file?

That would probably mean truncating the footer, writing the next RowGroup, and then writing the footer again.

@jorgecarleitao added the "feature" (A new feature) label Sep 3, 2021
@jorgecarleitao
Owner

Interesting proposal. I do not know whether other parquet implementations even support this. Valid proposal regardless.

With std::io::Read + std::io::Write + std::io::Seek it should be possible and not a very complex task:

  1. read the footer and store it
  2. check that the schema of the new row group matches the existing schema
  3. seek to the end of the last row group (known from the footer)
  4. write the new row group and update the stored footer metadata to include it
  5. write the updated footer

It does require some internal changes, though.
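
The steps above can be sketched at the byte level as follows. This is a minimal illustration, not parquet2's API: `read_footer`, `append_row_group`, and the `update_footer` callback are hypothetical names, the footer is treated as opaque bytes rather than decoded Thrift metadata, and the schema check (step 2) is omitted. It relies only on the Parquet file layout, where a file ends with the footer metadata, its length as a 4-byte little-endian integer, and the magic bytes `PAR1`.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

const MAGIC: &[u8; 4] = b"PAR1";

/// Step 1: locate and read the raw footer metadata.
/// Returns (offset where the footer starts, footer bytes).
fn read_footer<F: Read + Seek>(file: &mut F) -> io::Result<(u64, Vec<u8>)> {
    // The last 8 bytes are: footer length (u32, little-endian) + "PAR1".
    let tail_start = file.seek(SeekFrom::End(-8))?;
    let mut tail = [0u8; 8];
    file.read_exact(&mut tail)?;
    if tail[4..] != MAGIC[..] {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PAR1 magic"));
    }
    let len = u64::from(u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]));
    let start = tail_start.checked_sub(len).ok_or_else(|| {
        io::Error::new(io::ErrorKind::InvalidData, "footer length exceeds file size")
    })?;
    file.seek(SeekFrom::Start(start))?;
    let mut footer = vec![0u8; len as usize];
    file.read_exact(&mut footer)?;
    Ok((start, footer))
}

/// Steps 3 to 5: overwrite the old footer with the new row group,
/// then write the updated footer, its length, and the magic bytes.
/// `update_footer` (hypothetical) receives the old footer bytes and the
/// offset at which the new row group is written, and returns the new footer.
fn append_row_group<F: Read + Write + Seek>(
    file: &mut F,
    row_group: &[u8],
    update_footer: impl FnOnce(Vec<u8>, u64) -> Vec<u8>,
) -> io::Result<()> {
    let (footer_start, old_footer) = read_footer(file)?;
    // The last row group ends where the old footer began.
    file.seek(SeekFrom::Start(footer_start))?;
    file.write_all(row_group)?;
    let new_footer = update_footer(old_footer, footer_start);
    file.write_all(&new_footer)?;
    file.write_all(&(new_footer.len() as u32).to_le_bytes())?;
    file.write_all(MAGIC)?;
    Ok(())
}
```

The sketch assumes the row group plus the new footer is at least as long as the old footer, so no stale bytes remain after the final magic; a real implementation would also need to decode and re-encode the footer metadata.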

@curtisalexander

Hello! I have a project where I read a large file into a Chunk<Box<dyn Array>> (from arrow2) and then write it back out to a parquet file. To keep memory usage low, I want to read the large file in chunks of rows and append each chunk to the parquet file. I am able to append to the file and write the footer without any errors. However, when I try to read the parquet file (using pyarrow), I get errors about the page size not matching. I suspect this is because the footer is written from the details of the latest FileWriter only, and not from the details of the entire file (including what had previously been written).

Is my understanding of how the footer is written correct?

Would the feature described here address the issue I am having?

Thank you!

@jorgecarleitao
Owner

@curtisalexander, do you have a minimal example of what you are doing?

@curtisalexander

In developing a minimal example, I figured out that I had a bug in my project! The exercise of producing that example was most helpful. So please ignore my original question / comment.

My apologies for reaching out prematurely without better testing. The bug I observed had to do with overwriting parquet files and the order in which they were overwritten. If the existing file was larger (say I had written 10 rows to it) and I then tried to overwrite it with a smaller one (only 5 rows), I would get read errors. It ultimately came down to the fact that I was not truncating the existing file, so I was only overwriting its beginning. I now have an integration test for this scenario in my project.
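
For anyone hitting the same problem, here is a small sketch of the difference, using only the standard library. The helper `overwrite` is a hypothetical name for illustration, and plain bytes stand in for parquet data. Opening an existing file for writing without `truncate(true)` leaves the old tail in place, so a reader still finds the previous, longer file's trailing bytes (for parquet, the old footer).

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Writes `data` at the start of `path` and returns the resulting file length.
/// With `truncate == false`, bytes beyond `data.len()` from an earlier,
/// longer file are left untouched.
fn overwrite(path: &Path, data: &[u8], truncate: bool) -> io::Result<u64> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(truncate)
        .open(path)?;
    file.write_all(data)?;
    file.flush()?;
    Ok(fs::metadata(path)?.len())
}
```

Writing 10 bytes and then 5 bytes with `truncate` set to `false` leaves a 10-byte file whose last 5 bytes come from the first write. Passing `true`, or opening the file with `std::fs::File::create`, which truncates an existing file, avoids this.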

Again, sorry for the distraction. Thanks for your willingness to help with what I reported. And thanks for this crate and the arrow2 crate!
