Skip to content

sgcowell/tpcds-iceberg-utils

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Prereqs:

To create Iceberg tables on top of a TPC-DS dataset:

  • Generate base TPC-DS data files using dsdgen - see run-dsdgen.sh for example command line. You can download source for dsdgen that can compile on MacOS at https://github.com/gregrahn/tpcds-kit.
  • Run gen-create-tables-script.sh to generate a SQL script that can be run via Spark SQL to create Iceberg tables from the generated text data files.

For example, if I want to create my Iceberg tables under ~/warehouse/tpcds, and I will register that location as a catalog in Spark as 'tpcds':

./gen-create-tables-script.sh tpcds /path/to/tpcds/files > tpcds.sql

You can then run the script in Spark using something like:

$HOME/spark/spark-3.2.0-bin-hadoop3.2/bin/spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.type=hive
--conf spark.sql.catalog.tpcds=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.tpcds.type=hadoop
--conf spark.sql.catalog.tpcds.warehouse=$HOME/warehouse/tpcds
--conf spark.sql.iceberg.handle-timestamp-without-timezone=true
--conf spark.executor.memory=4g
--conf spark.driver.memory=4g
-f tpcds.sql

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published