
fix(pyspark): set catalog and database with USE instead of pyspark api #9620

Merged
5 commits merged into ibis-project:main on Jul 19, 2024

Conversation

@gforsyth (Member) commented Jul 17, 2024

Description of changes

As reported in #9604, there is apparently a mismatch in the permissioning structure
on Databricks, such that you can set a catalog and database (schema, ugh) with SQL
by running `USE CATALOG`, but not via the PySpark API calls that (ostensibly) do
the same thing.

`USE CATALOG`, however, is not part of standard Spark SQL; it is
Databricks-specific Spark SQL.

So we try the Databricks-specific SQL first, and if that throws a parser error, we
fall back to the PySpark API methods.
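
A minimal sketch of that fallback, assuming a live `SparkSession` named `session` (the helper name and imports here are illustrative, not the exact ibis code):

```python
try:
    from pyspark.errors import ParseException  # PySpark 3.4+
except ImportError:
    from pyspark.sql.utils import ParseException  # older PySpark

def use_catalog_and_database(session, catalog: str, database: str) -> None:
    # Unity Catalog permits the USE statements even when the equivalent
    # pyspark API calls are blocked, so try the SQL route first.
    try:
        session.sql(f"USE CATALOG {catalog}")
        session.sql(f"USE DATABASE {database}")
    except ParseException:
        # Vanilla Spark SQL rejects USE CATALOG with a parser error;
        # fall back to the pyspark API methods.
        session.catalog.setCurrentCatalog(catalog)
        session.catalog.setCurrentDatabase(database)
```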

Issues closed

Resolves #9604

@cpcloud (Member) left a comment


Hard approve.

Is there any chance of us testing this?

@cpcloud added the `pyspark` (The Apache PySpark backend) label on Jul 17, 2024
@gforsyth (Member, Author) commented Jul 17, 2024

Is there any chance of us testing this?

Let me look at what's involved in setting up a separate external catalog, because I would like better testing around this stuff.

@jstammers (Contributor) commented

Thanks for the PR, @gforsyth. I've just tested this and got the following error:

ParseException: [[PARSE_SYNTAX_ERROR](https://docs.microsoft.com/azure/databricks/error-messages/error-classes#parse_syntax_error)] Syntax error at or near 'USE': extra input 'USE'.(line 2, pos 0)

== SQL ==
USE CATALOG hive_metastore;
USE SCHEMA default;

maybe they need to be executed as separate statements?

    135     catalog = table_loc.catalog or None
    136     database = table_loc.db or None
--> 138 table_schema = self.get_schema(name, catalog=catalog, database=database)
    139 return ops.DatabaseTable(
    140     name,
    141     schema=table_schema,
    142     source=self,
    143     namespace=ops.Namespace(catalog=catalog, database=database),
    144 ).to_expr()
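
A minimal sketch of the fix suggested above, assuming a `SparkSession` named `session`: `session.sql()` parses a single statement at a time, so the two `USE` statements are issued as separate calls rather than as one semicolon-joined string.

```python
# Each call parses exactly one statement, avoiding the PARSE_SYNTAX_ERROR
# raised when both statements are sent in a single string.
session.sql("USE CATALOG hive_metastore")
session.sql("USE SCHEMA default")
```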

@gforsyth (Member, Author) commented

maybe they need to be executed as separate statements?

Ahh, interesting, I bet you're right. Let me try that.

Thanks for testing it out!!

@gforsyth force-pushed the spark_catalog_db_sql branch from 668f20a to f4db01b on July 17, 2024 16:36
@gforsyth (Member, Author) commented

Hey @jstammers -- can you give this PR another shot when you have a chance? Many thanks!

@jstammers (Contributor) commented

That seems to work great for me. Thanks for resolving this so quickly!

@cpcloud (Member) commented Jul 17, 2024

We should probably have at least one test for this if possible. Is creating another catalog a huge pain?

@gforsyth (Member, Author) commented

I agree. I'm trying to sift through some docs to get that working.

@gforsyth force-pushed the spark_catalog_db_sql branch from f4db01b to 426c361 on July 18, 2024 17:57
gforsyth added 4 commits on July 18, 2024 16:02:

The Unity Catalog on Databricks has weird permissioning where it doesn't
allow (at least one) user to run `setCurrentCatalog` or `setCurrentDatabase`.
It _does_ allow setting those via `USE CATALOG mycat;` and `USE DATABASE
mydb;`. This, however, is not part of standard Spark SQL; it is
Databricks-specific Spark SQL. So we try the Databricks-specific SQL first,
and if that throws a parser error, we fall back to the PySpark API methods.

If you call `setCurrentCatalog`, Spark will default to using the
`default` database in that catalog, so we need to note which database
was being used so we can switch back to it correctly.
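
A hedged sketch of that bookkeeping as a context manager (the name `_active_catalog` and the structure are illustrative, not the exact ibis code; `currentCatalog`/`setCurrentCatalog` assume PySpark 3.4+):

```python
from contextlib import contextmanager

@contextmanager
def _active_catalog(session, catalog: str):
    # setCurrentCatalog drops the session into that catalog's `default`
    # database, so capture the current catalog and database up front and
    # restore both on the way out.
    prev_catalog = session.catalog.currentCatalog()
    prev_database = session.catalog.currentDatabase()
    try:
        session.catalog.setCurrentCatalog(catalog)
        yield
    finally:
        session.catalog.setCurrentCatalog(prev_catalog)
        session.catalog.setCurrentDatabase(prev_database)
```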
@gforsyth force-pushed the spark_catalog_db_sql branch from 426c361 to 30020cd on July 18, 2024 20:07
@gforsyth (Member, Author) commented

Ok, this can probably be cleaned up at some point, but I think it's all working now.

try:
    # PySpark 3.4+ exposes ParseException under `pyspark.errors`
    from pyspark.errors import ParseException as PySparkParseException
except ImportError:
    # older PySpark releases keep it in `pyspark.sql.utils`
    from pyspark.sql.utils import ParseException as PySparkParseException
@cpcloud (Member) replied with an image.

@cpcloud added this to the 9.2 milestone on Jul 18, 2024
@cpcloud merged commit 6991f04 into ibis-project:main on Jul 19, 2024 (82 checks passed)
@cpcloud added the `bug` (Incorrect behavior inside of ibis) label on Jul 19, 2024
@gforsyth deleted the spark_catalog_db_sql branch on July 19, 2024 13:09