-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ReaderAt #31
Add ReaderAt #31
Conversation
Thanks for the update. It may take me until the weekend to review your code. |
Hi Frederick, Before I can review the code I need to understand what you want to achieve. Apparently you want to have an xz decompressor that supports the ReaderAt interface. It would be helpful to understand what the actual use case is you want to support here. Questions are how large will be the file segments you want to read with ReadAt. Do you want to read the same file from multiple go routines? The answer to those questions is important because they will guide the design of the solution. The problem with Lempel-Ziv files that they are sequential, because Lempel-Ziv decompression requires that you need to have the complete decompression window available to read the whole file. The only way around this is to split the compressed file stream in independent blocks, so we would need to change the Writer in such a way that is produces those independent blocks. We would need also to decide, whether the blocking should happen on LZMA or xz level. I'm also thinking it would be easier to make the normal Reader an io.Seeker. Using Seek and Read the ReadAt function would be easy to implement. If we have agreed the solution design we could agree than on the changes that need to be made to the code. Kind regards, Ulrich |
Hi Ulrich, thanks for taking the time to review. Responses inline
My application needs access to a very large lookup table. Fortunately, the table compresses very well. I would like to be able to access the contents of this table without having to ever store the entire thing either on disk or in memory.
Since I will be creating the original compressed archive, I will be able to control this. It's a tradeoff where larger segments will get better overall compression ratio, but cause slower lookups. Right now I am using 500KiB blocks which seems to give a still-reasonable compression ratio on my data. I haven't really dialed this in yet though.
Right now I am not reading the same file from different goroutines, however it would be nice to have this option in the future if it doesn't add too much complexity to the design here.
For my use case it is not important that the writer be able to create these files. I create these read-only lookup tables separately and provide them to my application. I have been creating them like this:
I know I suggested adding
Hope this helps you understand my use case and that you think it's an okay fit for your library. I think it could be a useful feature for other users as well edit: One more thought - even if the random access feature itself is not widely useful, I think it will create the opportunity for increasing decompression speed on files with multiple blocks. It should be easy to tweak this PR so that each block is decompressed in its own goroutine. Fred |
Now I have a lot of questions. The file stores comma-separated values? Do you have an example of the content? How do you know the offset to read? How large is the file? How many records? It is true that parallel decompression requires independent blocks. That is something I could look into. |
It is just a huge CSV, ordered by one of the columns. Here's a sample:
I am basically doing binary search on offset.
It's only ~3GB uncompressed right now, but could grow in the future.
|
} | ||
blockAmtToRead := blockEndPos - blockStartPos | ||
|
||
r.readBlockAt( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To do this in parallel this readBlockAt
could be changed to errgroup.Go(readBlockAt)
Thanks for the answers. I understand now better. For a local program you might be better off to seek the file directly, but if you want to run it in a web service or distribute the program using the compressed file would be of course better. |
Hi Ulrich, Just wanted to check in and see how you feeling about this change? No rush if you're still thinking about it. |
Frederick, the software will have ReaderAt and I would like to implement it together with parallel processing. I'm currently working on better LZ sequencing including a slow but optimal sequence generators, so there are things happening, which are not visible right now. |
Thanks for the update! Let me know if I can help with anything. |
This is the change I proposed a while ago in #28 to add a random access xz reader.
It's functional but not super polished yet. Would you be open to an addition like this?