[C++] Parquet reader is unable to read LargeString columns #39682
Comments
@assignUser - thank you! Unfortunately the closed environment does not connect to the internet. I will point the IT team I've asked for help internally at that link and see if they'd be willing to test it or upload it for me so I can test.
Given that the bug was introduced in version 14.0.0, I am wondering if it relates to #37274.
@felipecrv Mind giving this a quick skim over? I'm not sure whether this is at the R or C++ layer, but one of the places in the codebase when I grep for the
@thisisnic sure. I'm going to take a look now.
The regular string builder is based on 32-bit offsets. Digging into the Parquet reader code, I find this comment:

```cpp
// XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
// will be lost because they are first created as int32 and then cast to int64.
```

...leading to a commit from November. This is from PR #8632, which fixed Issue #26405 about writing LargeString into Arrow, but doesn't fix reading it back. I suppose the value in this is that C++/R/Python scripts can produce files that the Java Parquet reader can read without problems. Next steps:
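For context, a minimal sketch of where the 2 GB boundary comes from (using pyarrow here, which goes through the same C++ core the R package binds; the row and byte figures are illustrative, not from the reporter's file):

```python
import pyarrow as pa

# string/binary arrays address their character data with signed 32-bit
# offsets, so a single array (or chunk) can hold at most 2**31 - 1 bytes;
# large_string/large_binary use 64-bit offsets instead.
print(pa.string())        # "string"        -> int32 offsets
print(pa.large_string())  # "large_string"  -> int64 offsets

max_chunk_bytes = 2**31 - 1          # ~2 GiB of character data per chunk
rows = 100_000_000                   # illustrative row count
bytes_per_id = 24                    # e.g. a 24-character hashed key
print(rows * bytes_per_id > max_chunk_bytes)  # True: int32 offsets would overflow
```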
Thanks for figuring that one out @felipecrv. I've updated the title to reflect that it's a C++ bug, though I don't have the capacity to implement the fix myself.
Do you mean 14.0.2 or another version? I did a bug fix on 14.0.2 (#38784); I'm not sure whether this is related.
It could be related. We haven't been able to release 14.0.2 to CRAN, so it'll be 14.0.0.2, which is 14.0.0 but with some R patches.
Haha. I would never have guessed that a perf regression fix also fixes the reading of large binary arrays :)
Sigh, see #38577. The binary behavior changes caused both more memory allocation and a regression...
Sure, but @nicki-dese probably has enough RAM to read the file they want to read. Is it possible now, since #38784?
@felipecrv this error is raised from BinaryBuilder (or similar), which limits the string/binary size to 2GB 🤔 Maybe it's not related to RAM.
@mapleFU I understand the error is raised from the BinaryBuilder.
@felipecrv A previous patch (#35825) wanted to address this but was not merged in, sigh. Maybe we can re-check that.
Hi @felipecrv - thank you for looking at this. I have 128 GB of RAM, which is enough to read the file, and I have successfully read it since rolling back to 13.0.0.1. Regarding testing since #38784: unfortunately the nature of the closed environment I'm in does not allow me to test anything until it's released on CRAN, and the current CRAN version (14.0.0.2 in R) is the one where I found the bug.
Plenty of RAM :) The problem is not memory allocation per se, but the representation instantiated by the Parquet reader being based on 32-bit integers.
The actual fix would be the attempt linked by @mapleFU in the last message: #35825. It looks like a PR that was abandoned by its author.
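To illustrate the point, a hedged sketch in pyarrow (not the reader's actual code path): widening offsets from 32 to 64 bits is only lossless if it happens per chunk, before any offset exceeds the int32 range.

```python
import pyarrow as pa

# Casting a small string chunk to large_string simply widens its int32
# offsets to int64; nothing is lost because every offset still fits.
chunk = pa.array(["aaa", "bbbb", "cc"], type=pa.string())
widened = chunk.cast(pa.large_string())
print(widened.type)          # large_string
print(widened.to_pylist())   # ['aaa', 'bbbb', 'cc']

# The reported bug is the opposite order: for a >2 GiB chunk the reader
# materialized offsets as int32 first, so the high bits were already gone
# by the time they were cast to int64.
```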
Sorry for the late reply, @nicki-dese. Has this problem been solved with a later R release or 14.0.2? If not, I'll dive into it now.
Is this still considered a blocker for the release?
We should get @mapleFU's and @felipecrv's input on this; from what I've read, there's an issue in our C++ Parquet reader that is causing this. From reading #41104 and #35825 there's a proposal, but it doesn't look close enough to finished to make it into the release (please do correct me if I'm wrong on that!). I'm going to remove the blocker label so we don't hold up this release, though we absolutely should fix this issue in the Parquet reader.
Personally I think this is likely to be solved in 14.0.2. The point is that the file can be read in 13.0, so it could be handled by #38784. But I didn't get enough input on this :-(
Thanks for the update! Yeah, I tried a gentle nudge there too; let's see if that goes forward for the next release (17.0.0)?
I think large columns can be read now, and for the next release we may support reading large dictionaries. If there are any reproducible bugs, just report them and let me fix them.
OK, so I understand that this should be fixed for 17.0.0 and no further action is required. Should we close this issue then? If we want to keep the issue open, should we remove the blocker label and the 17.0.0 milestone?
Oh, sorry, I misunderstood this! That's great that this is fixed for strings, just not dictionaries. Do we have a test for this somewhere in the C++ suite that confirms it? I could probably add one in R too, but that's a bit higher-level and superfluous if we already have a C++ one.
You're right, I think I'd better add a test case for this 🤔 The read-large-column logic is tested, but I guess this particular behavior was left behind. I'm busy on workdays these days; I'll try to construct a unit test this weekend. Technically, Parquet has the limit below:
but when reading a batch or a table the data can be greater than that, so when reading strings the Parquet reader uses a group of APIs to split the data into smaller chunks using multiple smaller arrays.
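A sketch of that observable behavior (assuming pyarrow and a hypothetical local file and column name): with enough string data, the column comes back as a ChunkedArray split across several chunks.

```python
import pyarrow.parquet as pq

# Reading a Parquet string column whose total character data exceeds the
# 32-bit offset limit should yield a ChunkedArray with several chunks,
# each individually small enough for int32 offsets.
table = pq.read_table("big_strings.parquet")   # hypothetical file
col = table.column("id")                       # hypothetical column name
print(col.num_chunks)                          # > 1 for very large columns
print(max(chunk.nbytes for chunk in col.chunks))
```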
@mapleFU yeah. The issue stems from offsets of … I think a better strategy to support … I have code that casts string-views on my machine (PR soon). A …
Apologies for the delay in replying, @mapleFU. I've just tested arrow 16.1.0 and read_parquet successfully read in a previously problematic file with large strings, so the bug appears fixed from my end.
As we have a few follow-up issues for the underlying problems and the user-facing issue is fixed, I'll close this.
Describe the bug, including details regarding any error messages, version, and platform.
read_parquet() is giving the following error with large parquet files:
Versions etc from sessionInfo:
Descriptive info on example problematic table, with two columns:
The id is a hashed string, 24 characters long. It is not practical to change it, as it's the joining key.
Note, the data above is stored as a data.table in R and left that way when saving it with write_parquet(). But I've converted it to an arrow table for the above descriptive stats, because I thought they'd be more useful to you!
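For reference, a hedged repro sketch in Python (pyarrow binds the same C++ Parquet reader the R package uses; the column names and row count here are illustrative, and the real file needs roughly 90 million 24-character ids before the string data crosses 2 GiB):

```python
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000                  # bump toward ~90_000_000 to exceed 2 GiB of ids
ids = pa.array(["%024d" % i for i in range(n)])        # 24-character keys
values = pa.array([float(i) for i in range(n)])        # second, numeric column
table = pa.table({"id": ids, "value": values})

pq.write_table(table, "repro.parquet")
roundtrip = pq.read_table("repro.parquet")   # fails on affected 14.0.x builds
print(roundtrip.schema)
```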
Other relevant information:
(Unfortunately I'm not sure which version, but it was working late November/early December. We work in a closed environment and use Posit Package Manager, and VMs rebuild every 30 days, so it would have been a fairly recent version.)
Note: I haven't been able to roll back to an earlier version of arrow. Because we only have earlier source versions and not binaries, and I'm using Windows, I get libarrow errors. If there is a workaround for this, please let me know.
UPDATE
With the help of IT, I have been able to install earlier versions of arrow in my environment, and have shown that:
Component(s)
Parquet, R