-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support nested types in ORC writer #3696
Conversation
Signed-off-by: Firestarman <[email protected]>
Depending on rapidsai/cudf#9334. So mark it as draft. |
Signed-off-by: Firestarman <[email protected]>
|
||
/** | ||
* (We could try to merge this with `parquetWriterOptionsFromSchema` after fixing the issue | ||
* https://github.com/rapidsai/cudf/issues/7654) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This issue is circumvented by pruning masks ourselves and we are already calling the remove_validity_if_needed
from the writeORCChunk
in TableJni.cpp. So I don't see why this should be preventing us from merging the two methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info. But I tried locally and still getting the exception below when running the Parquet test test_write_map_nullable
after changing the nullable to go back to the actual setting.
E Caused by: ai.rapids.cudf.CudfException: cuDF failure at: /home/liangcail/work/projects/on_github/cudf/cpp/src/io/parquet/writer_impl.cu:377: Mismatch in metadata prescribed nullability and input column nullability. Metadata for nullable input column cannot prescribe nullability = false
E at ai.rapids.cudf.Table.writeParquetChunk(Native Method)
E at ai.rapids.cudf.Table.access$300(Table.java:39)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a bug in plugin. Working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it by using m.valueContainsNull
instead of the parameter nullable
when building options from MapType
.
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Nested map is not supported yet. Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
@razajafri Could you review it again ? |
@@ -21,7 +21,8 @@ import java.util.Optional | |||
import scala.collection.mutable.ArrayBuffer | |||
import scala.language.implicitConversions | |||
|
|||
import ai.rapids.cudf.{ColumnView, Table} | |||
import ai.rapids.cudf._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good way to import all from cudf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good suggestion, but this is done by IDE, if you prefer, i can change it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. Since it's a breaking change, let's merge it as soon as possible.
@@ -21,7 +21,8 @@ import java.util.Optional | |||
import scala.collection.mutable.ArrayBuffer | |||
import scala.language.implicitConversions | |||
|
|||
import ai.rapids.cudf.{ColumnView, Table} | |||
import ai.rapids.cudf._ | |||
import ai.rapids.cudf.ColumnWriterOptions._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find ColumnWriterOptions in cudf side. Is there a pending PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it depends on rapidsai/cudf#9334. I will merge it once this PR gets approvals.
overall, LTGM |
Will trigger CI after rapidsai/cudf#9334 being merged. |
Marked as draft just because rapidsai/cudf#9334 is not merged yet. When it is merged this can go back to ready. |
rapidsai/cudf#9334 is a breaking change to plugin, and this PR should be merged as soon as possible after #9334 being merged. So I plan to get an approval first, then merge the cudf PR and trigger the CI. Then I can merge this PR once CI is done. Otherwise it will take more time to get this in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build |
Signed-off-by: Firestarman <[email protected]>
build |
|
||
orc_write_basic_struct_gen = StructGen([['child'+str(ind), sub_gen] for ind, sub_gen in enumerate(orc_write_basic_gens)]) | ||
|
||
# Some array/struct gens, but not all because of nesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mind expanding on this comment a bit, why "not all because of nesting"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, I will update it in a following PR since this PR should be merged as soon as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this comment.
@@ -852,7 +852,8 @@ object GpuOverrides extends Logging { | |||
(OrcFormatType, FileFormatChecks( | |||
cudfRead = (TypeSig.commonCudfTypes + TypeSig.ARRAY + TypeSig.DECIMAL_64 + | |||
TypeSig.STRUCT + TypeSig.MAP).nested(), | |||
cudfWrite = TypeSig.commonCudfTypes, | |||
cudfWrite = (TypeSig.commonCudfTypes + TypeSig.ARRAY + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also I get confused with TypeSig
. So we can read nested maps, but not write them? Or did I misunderstand? Also is there a follow on issue to handle the same types that the reader supports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the truth. We can read nested map but can not write nested map. Which is limited by the cudf native orc writer.
I will file an issue to cudf first. Once cudf supported it, we can update the TypeSig here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I removed the map support due to the test failures only happened in pre-merge builds.
We have the issue #3784 to track this.
builder.withDecimalColumn(name, dt.precision, nullable) | ||
case TimestampType => | ||
builder.withTimestampColumn(name, writeInt96, nullable) | ||
case s: StructType => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous version of this code for struct and array defaulted both to nullable (and had a comment that is missing in your change).
// we are setting this to nullable, in case the parent is a Map's key and wants to
// set this to false
Why don't these columns need to be set nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had discussion above for Zara's comments.
I moved similar comments to the beginning of method writerOptionsFromSchema
.
Here setting to nullable with the comment before because nullale was hard-coded to true
due to the issue rapidsai/cudf#7654.
But this issue has been circumvented by PR rapidsai/cudf#9061, so we can still keep it being nullable, but the comment is no longer needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see the structBuilder
and listBuilder
are still using nullable
for parent columns, while for child columns, we should use containsNull
for array and valueContainsNull
for map to tell whether the children are nullable.
build |
Can not find out which test failed from the log, re-build again. |
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved first to unblock the CI.
Try to fix a build error in premerge builds which running tests in parallel. Signed-off-by: Firestarman <[email protected]>
build |
Because map tests failed in premerge builds where tests run in parallel. Signed-off-by: Firestarman <[email protected]>
build |
Tests of orc writing maps always fail, so removed the map support. It can not reproduce locally. We will add map support back in the future. |
@revans2 @razajafri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approve to unblock CI. Please file following issues later
Thanks, and the test failure can be tracked by #3784 . |
This fixes #3494 .
And also adds support for lists.
Signed-off-by: Firestarman [email protected]