Upgrade to Spark 3.1.1 with testing (#349)
* Testing Spark 3 upgrade. WIP

* Skip tests. WIP

* Update readme and setup for pyspark. WIP

* Fix circle ci version and bump mem value

* Bump memory, fix nit, bump pyhive version

* Pyhive version change

* Enable SASL for metastore

* Explicit server2 host port

* Try showing debug-level logs

* Rm -n4

* Move to godatadriven latest Spark image

* restore to 2 to check output

* Restore debug and parallelized to check output

* Revert to 3.0

* Revert to normal state

* Open-source Spark image

* Change to pyspark image

* Testing with gdd spark 3.0 for thrift

* Switch back to dbt user pass

* Spark 3.1.1 gdd image without configs

* Clean up

* Skip session test

* Clean up for review

* Update CHANGELOG

Co-authored-by: Jeremy Cohen <[email protected]>
nssalian and jtcohen6 authored Jun 28, 2022
1 parent 120ec42 commit 0082e73
Showing 6 changed files with 10 additions and 23 deletions.
19 changes: 1 addition & 18 deletions .circleci/config.yml
@@ -33,29 +33,12 @@ jobs:
       DBT_INVOCATION_ENV: circle
     docker:
       - image: fishtownanalytics/test-container:10
-      - image: godatadriven/spark:2
+      - image: godatadriven/spark:3.1.1
         environment:
           WAIT_FOR: localhost:5432
         command: >
           --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
-          --name Thrift JDBC/ODBC Server
-          --conf spark.hadoop.javax.jdo.option.ConnectionURL=jdbc:postgresql://localhost/metastore
-          --conf spark.hadoop.javax.jdo.option.ConnectionUserName=dbt
-          --conf spark.hadoop.javax.jdo.option.ConnectionPassword=dbt
-          --conf spark.hadoop.javax.jdo.option.ConnectionDriverName=org.postgresql.Driver
-          --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
-          --conf spark.jars.packages=org.apache.hudi:hudi-spark-bundle_2.11:0.9.0
-          --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
-          --conf spark.driver.userClassPathFirst=true
-          --conf spark.hadoop.datanucleus.autoCreateTables=true
-          --conf spark.hadoop.datanucleus.schema.autoCreateTables=true
-          --conf spark.hadoop.datanucleus.fixedDatastore=false
-          --conf spark.sql.hive.convertMetastoreParquet=false
-          --hiveconf hoodie.datasource.hive_sync.use_jdbc=false
-          --hiveconf hoodie.datasource.hive_sync.mode=hms
-          --hiveconf datanucleus.schema.autoCreateAll=true
-          --hiveconf hive.metastore.schema.verification=false
       - image: postgres:9.6.17-alpine
         environment:
           POSTGRES_USER: dbt
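For context, the dropped `--conf` flags correspond one-to-one to Spark configuration properties. A minimal PySpark sketch of the same metastore and Hudi setup, with values copied from the removed flags (the session-builder form is an illustration, not part of this PR):

```python
# Illustrative only: the removed `--conf` flags, expressed as SparkSession
# settings. Assumes a Postgres metastore at localhost, as in the CI setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Thrift JDBC/ODBC Server")
    # Hive metastore backed by Postgres (the javax.jdo.* flags)
    .config("spark.hadoop.javax.jdo.option.ConnectionURL",
            "jdbc:postgresql://localhost/metastore")
    .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "dbt")
    .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "dbt")
    .config("spark.hadoop.javax.jdo.option.ConnectionDriverName",
            "org.postgresql.Driver")
    # Hudi support (serializer, bundle jar, session extension)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.11:0.9.0")
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .enableHiveSupport()
    .getOrCreate()
)
```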
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -18,6 +18,7 @@
 ### Features
 - Add session connection method ([#272](https://github.com/dbt-labs/dbt-spark/issues/272), [#279](https://github.com/dbt-labs/dbt-spark/pull/279))
 - rename file to match reference to dbt-core ([#344](https://github.com/dbt-labs/dbt-spark/pull/344))
+- Upgrade Spark version to 3.1.1 ([#348](https://github.com/dbt-labs/dbt-spark/issues/348), [#349](https://github.com/dbt-labs/dbt-spark/pull/349))

 ### Under the hood
 - Add precommit tooling to this repo ([#356](https://github.com/dbt-labs/dbt-spark/pull/356))
@@ -29,6 +30,7 @@
 ### Contributors
 - [@JCZuurmond](https://github.com/dbt-labs/dbt-spark/pull/279) ([#279](https://github.com/dbt-labs/dbt-spark/pull/279))
 - [@ueshin](https://github.com/ueshin) ([#320](https://github.com/dbt-labs/dbt-spark/pull/320))
+- [@nssalian](https://github.com/nssalian) ([#349](https://github.com/dbt-labs/dbt-spark/pull/349))

 ## dbt-spark 1.1.0b1 (March 23, 2022)
2 changes: 1 addition & 1 deletion README.md
@@ -26,7 +26,7 @@ more information, consult [the docs](https://docs.getdbt.com/docs/profile-spark)

 ## Running locally
 A `docker-compose` environment starts a Spark Thrift server and a Postgres database as a Hive Metastore backend.
-Note that this is spark 2 not spark 3 so some functionalities might not be available.
+Note: dbt-spark now supports Spark 3.1.1 (formerly on Spark 2.x).

 The following command would start two docker containers
 ```
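Once the two containers are up, the Thrift endpoint on port 10000 can be smoke-tested directly. A sketch using PyHive (the driver dbt-spark's `thrift` connection method builds on); the host, port, and `dbt` username mirror the compose/CI defaults, not anything added by this PR:

```python
# Sketch: verify the local Thrift server started by docker-compose is
# reachable. Connection details assume the repo's default local setup.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000, username="dbt")
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")   # should at least return `default`
print(cursor.fetchall())
cursor.close()
conn.close()
```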
4 changes: 2 additions & 2 deletions docker-compose.yml
@@ -1,8 +1,8 @@
 version: "3.7"
 services:

-  dbt-spark2-thrift:
-    image: godatadriven/spark:3.0
+  dbt-spark3-thrift:
+    image: godatadriven/spark:3.1.1
     ports:
       - "10000:10000"
       - "4040:4040"
4 changes: 3 additions & 1 deletion docker/spark-defaults.conf
@@ -1,7 +1,9 @@
+spark.driver.memory 2g
+spark.executor.memory 2g
 spark.hadoop.datanucleus.autoCreateTables true
 spark.hadoop.datanucleus.schema.autoCreateTables true
 spark.hadoop.datanucleus.fixedDatastore false
 spark.serializer org.apache.spark.serializer.KryoSerializer
-spark.jars.packages org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0
+spark.jars.packages org.apache.hudi:hudi-spark3-bundle_2.12:0.10.0
 spark.sql.extensions org.apache.spark.sql.hudi.HoodieSparkSessionExtension
 spark.driver.userClassPathFirst true
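The bumped bundle (`hudi-spark3-bundle_2.12:0.10.0`) can be sanity-checked with a small Hudi round trip once a session picks up these defaults. A sketch; the table name, path, and columns are illustrative, not from the PR:

```python
# Sketch: write and read back a tiny Hudi table to confirm the bundle loads.
# Assumes the session was launched with docker/spark-defaults.conf applied.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", "2022-06-28"), (2, "b", "2022-06-28")], ["id", "msg", "ds"]
)
(df.write.format("hudi")
   .option("hoodie.table.name", "hudi_smoke")                  # illustrative
   .option("hoodie.datasource.write.recordkey.field", "id")
   .option("hoodie.datasource.write.precombine.field", "ds")
   .option("hoodie.datasource.write.partitionpath.field", "ds")
   .mode("overwrite")
   .save("/tmp/hudi_smoke"))                                   # illustrative

spark.read.format("hudi").load("/tmp/hudi_smoke").show()
```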
2 changes: 1 addition & 1 deletion tests/functional/adapter/test_basic.py
@@ -82,4 +82,4 @@ def project_config_update(self):

 @pytest.mark.skip_profile('spark_session')
 class TestBaseAdapterMethod(BaseAdapterMethod):
-    pass
\ No newline at end of file
+    pass
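The `skip_profile` marker used here is custom, defined in the test suite's conftest. For readers unfamiliar with the pattern, a minimal sketch of how such a marker can be wired up, assuming the active profile arrives via a `--profile` pytest option (dbt-spark's real conftest may differ):

```python
# conftest.py — minimal sketch of a profile-based skip marker.
# The `--profile` option and its default are assumptions for illustration.
import pytest

def pytest_addoption(parser):
    parser.addoption("--profile", action="store", default="apache_spark")

def pytest_configure(config):
    config.addinivalue_line(
        "markers", "skip_profile(name): skip test for the given target profile"
    )

def pytest_collection_modifyitems(config, items):
    profile = config.getoption("--profile")
    for item in items:
        for marker in item.iter_markers(name="skip_profile"):
            if profile in marker.args:
                item.add_marker(
                    pytest.mark.skip(reason=f"skipped on profile {profile}")
                )
```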
