Skip to content

Latest commit

 

History

History
28 lines (20 loc) · 1.48 KB

README.md

File metadata and controls

28 lines (20 loc) · 1.48 KB

Prereqs:

To create Iceberg tables on top of a TPC-DS dataset:

  • Generate base TPC-DS data files using dsdgen - see run-dsdgen.sh for example command line. You can download source for dsdgen that can compile on MacOS at https://github.com/gregrahn/tpcds-kit.
  • Run gen-create-tables-script.sh to generate a SQL script that can be run via Spark SQL to create Iceberg tables from the generated text data files.

For example, if I want to create my Iceberg tables under ~/warehouse/tpcds, and I will register that location as a catalog in Spark as 'tpcds':

./gen-create-tables-script.sh tpcds /path/to/tpcds/files > tpcds.sql

You can then run the script in Spark using something like:

$HOME/spark/spark-3.2.0-bin-hadoop3.2/bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.type=hive
--conf spark.sql.catalog.tpcds=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.tpcds.type=hadoop
--conf spark.sql.catalog.tpcds.warehouse=$HOME/warehouse/tpcds
--conf spark.sql.iceberg.handle-timestamp-without-timezone=true
--conf spark.executor.memory=4g
--conf spark.driver.memory=4g
-f tpcds.sql