ORC Sink

vnnv01 edited this page Mar 6, 2018 · 1 revision

WARNING: Spark's ORC integration has generally lagged behind the robustness of its Parquet integration. As a result, the ORC Sink currently does not allow configuring a custom OutputFormat, etc., as you may be used to with the Parquet Sink.


The ORC Sink is a specific subset of the File Sink. As such, file sink options may also be configured on an ORC Sink. As always, the path option of the File Sink must be configured. Please review the File Sink if you are not already familiar with it.
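For example, a minimal ORC Sink only needs the File Sink's required path option (the path value shown here is purely illustrative):

SAVE STREAM foo
TO ORC
OPTIONS(
  'path'='/data/orc/foo'
);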

Options

compression

The compression codec to use when generating ORC files. If not specified directly on the ORC Sink, the Hadoop setting orc.compress (set, for example, via spark.hadoop.orc.compress) will be used. Valid values include:

  • none
  • uncompressed
  • snappy
  • zlib
  • lzo

Defaults to snappy.

SAVE STREAM foo
TO ORC
OPTIONS(
  'compression'='zlib'
);

spark.hadoop.orc.*

Allows specifying internal settings for the underlying ORC file writers. Typically, users will only modify these settings for use cases requiring fine-grained tuning. ORC exposes its own compression setting, but users should prefer the compression option exposed directly by the ORC Sink.

If you are unfamiliar with these settings, you can use OrcConf as a reference.

-- spark.properties: spark.hadoop.orc.memory.pool=0.1
SAVE STREAM foo
TO ORC
OPTIONS();