
feat(datasets): support setting Ibis table schemas #833

Open

deepyaman opened this issue Sep 14, 2024 · 8 comments

@deepyaman

Description

I want to set the database schema for the Ibis table referenced by the ibis.TableDataset.

Context

From @mark-druffel on Slack:

Has anyone used ibis.TableDataset with duckdb schemas? If I set a schema on a data catalog entry, I get the error Invalid Input Error: Could not set option "schema" as a global option.

bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  backend: duckdb
  database: data.duckdb
  schema: bronze

I can reproduce this error with vanilla ibis:

con = ibis.duckdb.connect(database="data.duckdb", schema="bronze")

Found a related question on Ibis' GitHub; it sounds like DuckDB can't set the schema globally, so it has to be done in the table functions. Wondering if this would require a change to ibis.TableDataset, and if so, would this pattern work the same with other backends?

Possible Implementation

Wondering if this would require a change to ibis.TableDataset,
If I understand correctly, this would be a request to pass schema (actually, database, since schema is deprecated as an argument to table) as a table_arg or something in the dataset.
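
For reference, this is roughly what the per-table form looks like in vanilla Ibis (a minimal sketch; the file and table names are illustrative):

import ibis

# Connect without trying to set the schema globally; for DuckDB, `database`
# here is the database file, as in do_connect().
con = ibis.duckdb.connect(database="data.duckdb")

# The location is instead passed per call; since `schema=` is deprecated on
# table(), the DuckDB schema goes in `database=`.
t = con.table("x", database="bronze")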

@deepyaman commented Sep 14, 2024

and if so, would this pattern work the same with other backends?

I think so, because https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/__init__.py#L47, for example (called in the table() function), is generic to SQL backends.

@deepyaman

From @mark-druffel:

Yea I was just looking through some of the other backends, agree. I'm trying to check pyspark now to recall if table(database) is actually equivalent to catalog, database, or schema in hive.
On the ibis side, it feels like do_connect using a database parameter is confusing. For example:

import ibis

con = ibis.duckdb.connect(database="data/db/spotify/spotify.duckdb")
con.create_database(name="bronze")
con.create_database(name="silver")
con.create_database(name="gold")
con.table("x", database="bronze")

The create_database & table calls use database to mean something completely different.
On the kedro-datasets side, my question becomes: would it make sense to accept an argument called "schema" and just pass that to table(database=schema), since the "database" argument is already used in the connection string for do_connect?

I'd be inclined to match the Ibis name, but create a section for table_args (i.e. arguments that get passed to the underlying table() call), since that's more in line with how most Kedro datasets are structured in my experience, but I haven't thought about it that much.

@deepyaman

I'd be inclined to match the Ibis name, but create a section for table_args (i.e. arguments that get passed to the underlying table() call)

Actually, looking at the code, it's more like load_args and save_args. We already pass save_args to create_table or create_view, which is where you can specify database="bronze" (the same will hold true for inserts, pending #834). The table call only happens on load, so it makes sense for this to be load_args that we pass there (right now, load_args aren't passed to the table call, only the read_* call, but I expect read_* to move into FileDataset after #828).

So this would be very trivial to do. It just seems a bit dumb that you'd pass the schema name twice (to load_args and save_args), so I guess we can just specify database as a top-level key (not under connection) and feed it to load_args/save_args in the __init__ method.

@mark-druffel curious to know what you think of this approach? This is what it would look like in your example:

bronze_x:
  type: ibis.TableDataset
  filepath: x.csv
  file_format: csv
  table_name: x
  connection:  # Nit: Moved the keys under `connection`; assume that was just an oversight in your example
    backend: duckdb
    database: data.duckdb
  database: bronze  # Won't bother with `schema`, to be consistent with Ibis
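
Roughly, the wiring inside the dataset could look like this (a simplified sketch, not the actual TableDataset code; connection handling, materialized, and most other options are omitted):

import ibis


class TableDataset:  # simplified sketch, not the real kedro-datasets class
    def __init__(self, *, table_name, connection, database=None,
                 load_args=None, save_args=None):
        self._table_name = table_name
        self._connection_config = dict(connection)
        self._load_args = dict(load_args or {})
        self._save_args = dict(save_args or {})
        if database is not None:
            # The top-level `database` is fed to both sides so users don't
            # have to repeat it under load_args and save_args.
            self._load_args.setdefault("database", database)
            self._save_args.setdefault("database", database)

    @property
    def connection(self):
        # Build the backend connection from the `connection` config,
        # e.g. {"backend": "duckdb", "database": "data.duckdb"}.
        config = dict(self._connection_config)
        backend = getattr(ibis, config.pop("backend"))
        return backend.connect(**config)

    def load(self):
        # On load, `database` ends up in the table() call.
        return self.connection.table(self._table_name, **self._load_args)

    def save(self, data):
        # On save, it ends up in create_table() (or create_view()).
        self.connection.create_table(self._table_name, data, **self._save_args)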

@mark-druffel

@deepyaman Staying consistent with ibis definitely makes sense. I do think the differing use of database in do_connect() and table() is confusing, but that's an ibis question/discussion and I'm not sure what parameter names would be better...

@mark-druffel

@deepyaman I've been looking back through this given my pyspark example over Slack. I'm still thrown by the duplicative use of database as a parameter in Ibis. I've always used language aligned with the image below and find myself getting a bit confused in Ibis because schema is different, but the confluence of database and data frame language makes things a bit more complicated in Ibis I think 🤷
[image: database/schema/table terminology diagram from a referenced SO post]

Given that, I think using your prior suggestion of table_args might be a bit clearer to end users. I could be wrong, but it also feels more akin to FileDataset's load_args / save_args as you mentioned, so I think it would be more natural:

# Duckdb
bronze_x:
  type: ibis.TableDataset
  table_name: x
  connection:  
    backend: duckdb
    database: data.duckdb
  table_args:
    database: schema  # i.e. the DuckDB schema name

# Databricks w/ Unity Catalog
bronze_x:
  type: ibis.TableDataset
  table_name: x
  connection:  
    backend: pyspark
    session: session_name #Not sure if this actually works yet, databricks default works with no named session but will test
  table_args:
    database: catalog.schema

I'm updating TableDataset and doing some testing on my side, but would love to open a PR if the change makes sense to others?

@deepyaman

I'm updating TableDataset and doing some testing on my side, but would love to open a PR if the change makes sense to others?

That makes sense and would be much appreciated! Appreciate any help in making the dataset as useful as possible for people such as yourself. :)

I'm sorry for not getting back to you on the Slack thread (persistent reference: https://kedro.hall.community/loading-ibis-tabledataset-from-unity-catalog-fails-with-cannot-be-found-error-0m1YHF44RYjD), but would the update you're making address this? I know you mentioned some nuances with DuckDB, which I haven't parsed myself.

@mark-druffel commented Oct 22, 2024

I'm updating TableDataset and doing some testing on my side, but would love to open a PR if the change makes sense to others?

That makes sense and would be much appreciated! Appreciate any help in making the dataset as useful as possible for people such as yourself. :)

I'm sorry for not getting back to you on the Slack thread (persistent reference: https://kedro.hall.community/loading-ibis-tabledataset-from-unity-catalog-fails-with-cannot-be-found-error-0m1YHF44RYjD), but would the update you're making address this? I know you mentioned some nuances with DuckDB, which I haven't parsed myself.

@deepyaman no worries, you've been so helpful with every question I've had. It's so appreciated!

To answer your direct question, yes. My main goal for the PR would be to make working with pyspark in Databricks easier.

Additional Context

Basically, using the table() method works fine in Databricks, but do_connect() does not.

Using table method

We're using the pyspark backend with Unity Catalog (UC) so we need to be able to call table() with the database parameter. table() works fine with UC:

# Ibis
import ibis
from ibis import _

ibis.options.interactive = True
con = ibis.pyspark.connect()
con.table(name="raw_tracks", database="comms_media_dev.dart_extensions")

# pyspark equivalent
spark.read.table("comms_media_dev.dart_extensions.raw_tracks")


Using do_connect method

do_connect() accepts **kwargs, "used to configure the SparkSession", but the required configs spark.sql.defaultCatalog and spark.sql.catalog.spark_catalog.defaultDatabase are not accepted by Databricks.

# Error: "Cannot modify the value of a static config: spark.sql.catalog.spark_catalog.defaultDatabase."
con = ibis.pyspark.connect(
    **{
        "spark.sql.defaultCatalog": "comms_media_dev",
        "spark.sql.catalog.spark_catalog.defaultDatabase": "dart_extensions",
    }
)

If I use the spark.sql.catalog.spark_catalog configuration the code runs, but doesn't work:

con = ibis.pyspark.connect(**{"spark.sql.catalog.spark_catalog": "comms_media_dev"})


Databricks documentation says:

Serverless compute does not support setting most Spark properties for notebooks or jobs. The following are the properties you can configure:

  • spark.sql.legacy.timeParserPolicy (Default value is EXCEPTION)
  • spark.sql.session.timeZone (Default value is Etc/UTC)
  • spark.sql.shuffle.partitions (Default value is auto)
  • spark.sql.ansi.enabled (Default value is true)

Given that, it seems the only way to update Unity Catalog's default catalog & database would be to do so in the cluster configurations. That would mean pipelines couldn't use multiple catalogs and / or databases (maybe could with hooks, but seems overly complicated given table() works without any of that).

@mark-druffel commented Oct 24, 2024

@deepyaman I added table_args to TableDataset and got it working, but I'm having second thoughts. I think it's confusing where things go between table_args and save_args. I'm wondering, though: will TableDataset keep save_args & load_args in the future, or is this just to avoid a breaking change on the initial release of FileDataset?

# With save_args
bronze_x:
  type: ibis.TableDataset
  table_name: x
  connection:
    backend: pyspark
  table_args:
    database:  my_catalog.bronze
  save_args:
    materialized: table
    overwrite: True

# Without save_args
bronze_x:
  type: ibis.TableDataset
  connection:
    backend: pyspark
  table_args:
    name: x # Maybe move table_name into table_args if there's no save_args just so that all arguments passed are in one place? 
    database: my_catalog.bronze
    materialized: table
    overwrite: True    
