Support nested types in ORC writer #3696
```diff
@@ -852,7 +852,8 @@ object GpuOverrides extends Logging {
     (OrcFormatType, FileFormatChecks(
       cudfRead = (TypeSig.commonCudfTypes + TypeSig.ARRAY + TypeSig.DECIMAL_64 +
           TypeSig.STRUCT + TypeSig.MAP).nested(),
-      cudfWrite = TypeSig.commonCudfTypes,
+      cudfWrite = (TypeSig.commonCudfTypes + TypeSig.ARRAY +
+          TypeSig.STRUCT).nested() + TypeSig.MAP,
       sparkSig = (TypeSig.atomics + TypeSig.STRUCT + TypeSig.ARRAY + TypeSig.MAP +
           TypeSig.UDT).nested())))
```

Review thread on the cudfWrite change:

Reviewer: Also, I get confused with
Author: Yes, this is the truth: we can read nested maps but cannot write them. This is limited by the cudf native ORC writer.
Author: Now I removed the map support, because the test failures only happened in pre-merge builds.
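For illustration, the new write signature can be read on its own. The sketch below uses only the TypeSig combinators that already appear in the diff above; the un-nested MAP term reflects the review note that nested maps cannot be written by the cudf ORC writer.

```scala
// Sketch of the new ORC write signature, assuming the TypeSig combinators from the diff.
// ARRAY and STRUCT are accepted at any nesting depth via .nested(), while MAP is added
// outside of .nested(), so only top-level maps of otherwise supported types are allowed.
val orcCudfWriteSig =
  (TypeSig.commonCudfTypes + TypeSig.ARRAY + TypeSig.STRUCT).nested() + TypeSig.MAP
```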
```diff
@@ -21,7 +21,8 @@ import java.util.Optional
 import scala.collection.mutable.ArrayBuffer
 import scala.language.implicitConversions

-import ai.rapids.cudf.{ColumnView, Table}
+import ai.rapids.cudf._
+import ai.rapids.cudf.ColumnWriterOptions._
 import com.nvidia.spark.rapids.RapidsPluginImplicits.AutoCloseableProducingSeq
 import org.apache.orc.TypeDescription
```

Review thread on `import ai.rapids.cudf._`:

Reviewer: I don't think importing everything from cudf is a good idea.
Author: Good suggestion, but this was done by the IDE. I can change it back if you prefer.
Reviewer: Not really. Since it's a breaking change, let's merge it as soon as possible.

Review thread on `import ai.rapids.cudf.ColumnWriterOptions._`:

Reviewer: I didn't find ColumnWriterOptions on the cudf side. Is there a pending PR?
Author: Yes, it depends on rapidsai/cudf#9334. I will merge it once this PR gets approvals.
```diff
@@ -208,4 +209,70 @@ object SchemaUtils extends Arm {
       case _ => col
     }
   }
+
+  private def writerOptionsFromField[T <: NestedBuilder[_, _], V <: ColumnWriterOptions](
+      builder: NestedBuilder[T, V],
+      dataType: DataType,
+      name: String,
+      nullable: Boolean,
+      writeInt96: Boolean): T = {
+    dataType match {
+      case dt: DecimalType =>
+        builder.withDecimalColumn(name, dt.precision, nullable)
+      case TimestampType =>
+        builder.withTimestampColumn(name, writeInt96, nullable)
+      case s: StructType =>
+        builder.withStructColumn(
+          writerOptionsFromSchema(
+            structBuilder(name, nullable),
+            s,
+            writeInt96).build())
+      case a: ArrayType =>
+        builder.withListColumn(
+          writerOptionsFromField(
+            listBuilder(name, nullable),
+            a.elementType,
+            name,
+            a.containsNull,
+            writeInt96).build())
+      case m: MapType =>
+        // It is ok to use `StructBuilder` here for key and value, since either
+        // `OrcWriterOptions.Builder` or `ParquetWriterOptions.Builder` is actually an
+        // `AbstractStructBuilder`, and here only handles the common column metadata things.
+        builder.withMapColumn(
+          mapColumn(name,
+            writerOptionsFromField(
+              structBuilder(name, nullable),
+              m.keyType,
+              "key",
+              nullable = false,
+              writeInt96).build().getChildColumnOptions()(0),
+            writerOptionsFromField(
+              structBuilder(name, nullable),
+              m.valueType,
+              "value",
+              m.valueContainsNull,
+              writeInt96).build().getChildColumnOptions()(0)))
+      case _ =>
+        builder.withColumns(nullable, name)
+    }
+    builder.asInstanceOf[T]
+  }
+
+  /**
+   * Build writer options from schema for both ORC and Parquet writers.
+   *
+   * (There is an open issue https://github.com/rapidsai/cudf/issues/7654 for the Parquet writer,
+   * but it is circumvented by https://github.com/rapidsai/cudf/pull/9061, so the nullable can
+   * go back to the actual setting, instead of the hard-coded nullable=true used before.)
+   */
+  def writerOptionsFromSchema[T <: NestedBuilder[_, _], V <: ColumnWriterOptions](
+      builder: NestedBuilder[T, V],
+      schema: StructType,
+      writeInt96: Boolean = false): T = {
+    schema.foreach(field =>
+      writerOptionsFromField(builder, field.dataType, field.name, field.nullable, writeInt96)
+    )
+    builder.asInstanceOf[T]
+  }
 }
```

Review thread on the StructType and ArrayType cases:

Reviewer: The previous version of this code for struct and array defaulted both to nullable (and had a comment that is missing in your change). Why don't these columns need to be set nullable?
Author: We had a discussion above for Zara's comments. The previous code set these to nullable (with that comment) because nullable was hard-coded to true.
Author: You can see the
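Below is a hedged usage sketch, not part of the PR, of how writerOptionsFromSchema could be driven from a nested Spark schema. The schema is made up for illustration, and the `OrcWriterOptions.builder()` entry point is only assumed from the `OrcWriterOptions.Builder` reference in the code comment above and from rapidsai/cudf#9334, so that call is left commented out rather than presented as the actual cudf API.

```scala
import org.apache.spark.sql.types._

// Hypothetical nested schema: a plain column, a struct, an array of structs,
// and a top-level map (the shapes the new write signature is meant to allow).
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("point", StructType(Seq(
    StructField("x", DoubleType),
    StructField("y", DoubleType)))),
  StructField("history", ArrayType(
    StructType(Seq(StructField("ts", TimestampType), StructField("note", StringType))),
    containsNull = true)),
  StructField("tags", MapType(StringType, StringType, valueContainsNull = true))))

// writerOptionsFromSchema walks each field and dispatches on its DataType in
// writerOptionsFromField; the concrete builder comes from the caller, e.g. an ORC or
// Parquet writer options builder. The builder entry point below is an assumption:
// val orcOptions =
//   SchemaUtils.writerOptionsFromSchema(OrcWriterOptions.builder(), schema).build()
```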
Review thread on a code comment ("not all because of nesting"):

Reviewer: Mind expanding on this comment a bit: why "not all because of nesting"?
Author: Good suggestion. I will update it in a follow-up PR, since this PR should be merged as soon as possible.
Author: I removed this comment.