
fix(pyspark): set catalog and database with USE instead of pyspark api #9620

Merged
5 commits merged into ibis-project:main on Jul 19, 2024

Conversation

@gforsyth (Member) commented Jul 17, 2024

Description of changes

As reported in #9604, there is apparently a mismatch in the permissioning structure
on Databricks, such that you can set a catalog and database (schema, ugh) with SQL
by running `USE CATALOG`, but not via the PySpark API calls that (ostensibly) do
the same thing.

`USE CATALOG`, however, is not part of standard Spark SQL; it is
Databricks-specific Spark SQL.

So we try the Databricks-specific SQL first, and if that throws a parser error, we
fall back to the PySpark API methods.
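
A minimal sketch of that fallback, assuming a live `SparkSession` named `session` (the helper name and imports here are illustrative, not the exact ibis code):

```python
try:
    from pyspark.errors import ParseException  # PySpark 3.4+
except ImportError:
    from pyspark.sql.utils import ParseException  # older PySpark

def use_catalog_and_database(session, catalog: str, database: str) -> None:
    # Unity Catalog permits the USE statements even when the equivalent
    # pyspark API calls are blocked, so try the SQL route first.
    try:
        session.sql(f"USE CATALOG {catalog}")
        session.sql(f"USE DATABASE {database}")
    except ParseException:
        # Vanilla Spark SQL rejects USE CATALOG with a parser error;
        # fall back to the pyspark API methods.
        session.catalog.setCurrentCatalog(catalog)
        session.catalog.setCurrentDatabase(database)
```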

Issues closed

Resolves #9604

@cpcloud (Member) left a comment


Hard approve.

Is there any chance of us testing this?

@cpcloud added the `pyspark` (The Apache PySpark backend) label on Jul 17, 2024
@gforsyth (Member, Author) commented Jul 17, 2024

Is there any chance of us testing this?

Let me look at what's involved in setting up a separate external catalog, because I would like better testing around this stuff.

@jstammers (Contributor) commented

Thanks for the PR, @gforsyth. I've just tested this and got the following error:

ParseException: [[PARSE_SYNTAX_ERROR](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#parse_syntax_error)] Syntax error at or near 'USE': extra input 'USE'.(line 2, pos 0)

== SQL ==
USE CATALOG hive_metastore;
USE SCHEMA default;

maybe they need to be executed as separate statements?

    135     catalog = table_loc.catalog or None
    136     database = table_loc.db or None
--> 138 table_schema = self.get_schema(name, catalog=catalog, database=database)
    139 return ops.DatabaseTable(
    140     name,
    141     schema=table_schema,
    142     source=self,
    143     namespace=ops.Namespace(catalog=catalog, database=database),
    144 ).to_expr()
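
A minimal sketch of the fix suggested above, assuming a `SparkSession` named `session`: `session.sql()` parses a single statement at a time, so the two `USE` statements are issued as separate calls rather than as one semicolon-joined string.

```python
# Each call parses exactly one statement, avoiding the PARSE_SYNTAX_ERROR
# raised when both statements are sent in a single string.
session.sql("USE CATALOG hive_metastore")
session.sql("USE SCHEMA default")
```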

@gforsyth (Member, Author) commented

maybe they need to be executed as separate statements?

Ahh, interesting, I bet you're right. Let me try that.

Thanks for testing it out!!

@gforsyth force-pushed the spark_catalog_db_sql branch from 668f20a to f4db01b on July 17, 2024 16:36
@gforsyth (Member, Author) commented

Hey @jstammers -- can you give this PR another shot when you have a chance? Many thanks!

@jstammers (Contributor) commented

That seems to work great for me. Thanks for resolving this so quickly!

@cpcloud (Member) commented Jul 17, 2024

We should probably have at least one test for this if possible. Is creating another catalog a huge pain?

@gforsyth (Member, Author) commented

I agree. I'm trying to sift through some docs to get that working.

@gforsyth force-pushed the spark_catalog_db_sql branch from f4db01b to 426c361 on July 18, 2024 17:57
gforsyth added 4 commits on July 18, 2024 16:02:

The Unity Catalog on Databricks has weird permissioning where it doesn't
allow (at least one) user to run `setCurrentCatalog` or `setCurrentDatabase`.
It _does_ allow setting those via `USE CATALOG mycat;` and `USE DATABASE
mydb;`. This, however, is not part of standard Spark SQL; it is
Databricks-specific Spark SQL. So we try the Databricks-specific SQL first,
and if that throws a parser error, we fall back to the PySpark API methods.

If you call `setCurrentCatalog`, Spark will default to using the
`default` database in that catalog, so we need to note which database
was being used so we can switch back to it correctly.
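
A hedged sketch of that bookkeeping as a context manager (the name `_active_catalog` and the structure are illustrative, not the exact ibis code; `currentCatalog`/`setCurrentCatalog` assume PySpark 3.4+):

```python
from contextlib import contextmanager

@contextmanager
def _active_catalog(session, catalog: str):
    # setCurrentCatalog drops the session into that catalog's `default`
    # database, so capture the current catalog and database up front and
    # restore both on the way out.
    prev_catalog = session.catalog.currentCatalog()
    prev_database = session.catalog.currentDatabase()
    try:
        session.catalog.setCurrentCatalog(catalog)
        yield
    finally:
        session.catalog.setCurrentCatalog(prev_catalog)
        session.catalog.setCurrentDatabase(prev_database)
```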
@gforsyth force-pushed the spark_catalog_db_sql branch from 426c361 to 30020cd on July 18, 2024 20:07
@gforsyth (Member, Author) commented

Ok, this can probably be cleaned up at some point, but I think it's all working now.

try:
    # PySpark 3.4+ exposes ParseException under `pyspark.errors`
    from pyspark.errors import ParseException as PySparkParseException
except ImportError:
    # older PySpark releases keep it in `pyspark.sql.utils`
    from pyspark.sql.utils import ParseException as PySparkParseException
@cpcloud (Member) replied with an image.

@cpcloud added this to the 9.2 milestone on Jul 18, 2024
@cpcloud merged commit 6991f04 into ibis-project:main on Jul 19, 2024 (82 checks passed)
@cpcloud added the `bug` (Incorrect behavior inside of ibis) label on Jul 19, 2024
@gforsyth deleted the spark_catalog_db_sql branch on July 19, 2024 13:09