[BUG] GPU ORC reader fails when the read schema specifies columns that do not exist in the file schema. #3058
Comments
When ansi mode is on, for CPU, the action
So the CPU ORC reader does not throw an exception in this case, even when ANSI mode is on. |
We should decide whether this should be handled at a lower layer in cuIO. Related issue rapidsai/cudf#5447 |
It looks like the feature of adding new columns is already covered by rapidsai/cudf#5447. However, it can be supported quickly if we implement it on the plugin side. Parquet already partially supports this in the plugin: it adds new columns, but only top-level ones. For ORC we can go a little further and also support nested columns. So, personally, I think we can implement just this feature in the plugin until cuDF supports it, and remove the logic once cuDF has it. |
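To make the proposal concrete, here is a minimal sketch of the null-filling idea in plain Python. It is not plugin or cuDF code: `fill_missing`, the dict-based schema model, and the sample column names are all hypothetical, chosen only to show how top-level and nested columns that are absent from the file schema could be filled with nulls after matching columns by name.

```python
from typing import Any, Dict, Optional

# Hypothetical schema model: a leaf column maps its name to a type name (str),
# a struct column maps its name to a nested dict of fields.
Schema = Dict[str, Any]


def fill_missing(row: Optional[dict], file_schema: Schema,
                 read_schema: Schema) -> Optional[dict]:
    """Project one row onto read_schema, matching columns by name.

    Columns (top-level or nested) that are requested in read_schema but
    absent from file_schema are filled with None.
    """
    if row is None:
        # A null struct stays null; there is nothing to fill inside it.
        return None
    out = {}
    for name, want in read_schema.items():
        if name not in file_schema:
            # The column does not exist in the file: produce a null.
            out[name] = None
            continue
        have = file_schema[name]
        if isinstance(want, dict) and isinstance(have, dict):
            # Struct column: recurse so missing nested fields are filled too.
            out[name] = fill_missing(row.get(name), have, want)
        else:
            # Column exists in the file: pass the value through unchanged.
            out[name] = row.get(name)
    return out


# Sample data (assumed for illustration only).
file_schema = {"id": "int", "info": {"name": "string"}}
read_schema = {"id": "int", "info": {"name": "string", "age": "int"},
               "extra": "string"}
row = {"id": 1, "info": {"name": "a"}}
result = fill_missing(row, file_schema, read_schema)
```

With the sample data, `result` keeps `id` and `info.name` and holds `None` for both the nested `info.age` and the top-level `extra`. A real implementation would work on columnar batches rather than rows, and would not cover type changes, which need casts.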
rapidsai/cudf#5447 was filed under a mistaken assumption. I saw that the Java ORC code had schema evolution built into it, and I assumed that it was part of the standard. That turned out not to be true. It is very likely that rapidsai/cudf#5447 will never be done on the cudf side. If you want me to push on that issue so we come to some kind of resolution I can, but personally I think we just need to make this happen on our own and close the cudf issue. Fundamentally it comes down to two different operations after lining up the names/positions of the columns accordingly.
If you need help with this please let me know.
This is harder because our cast implementations are not great in all cases, so as we add in this type of support we need to be sure that we test corner cases and look at the ORC code to see what corner cases there might be. In the worst case we might need a flag like isAnsi but isOrc so we know if we have to do something special for the ORC cases. @sameerz if you agree with me on this I will close the cudf feature request. |
@revans2 agreed, we can close the cudf feature request and focus on this in the spark-rapids plugin. |
OK, I will work on the null columns first. |
Describe the bug
The Spark CPU ORC reader produces nulls for columns that do not exist in the file schema, but the GPU ORC reader fails to read the file.
GPU:
CPU:
Steps/Code to reproduce bug
Read the attached ORC file (test.log) on the GPU with the read schema below.
Expected behavior
The GPU ORC reader should output the same data as the CPU ORC reader.