Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Daft Catalog API #3036

Merged
merged 9 commits into from
Nov 17, 2024
Merged

[FEAT] Daft Catalog API #3036

merged 9 commits into from
Nov 17, 2024

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Oct 14, 2024

Adds a DaftCatalog API to help cement Daft's catalog data access patterns. Here is the intended UX:

import daft

###
# Registering external catalogs
#
# We recognize `PyIceberg.Catalog`s and `UnityCatalog` for now
# TODO: This should be configurable via a Daft catalog config file (.toml or .yaml)
###

from pyiceberg.catalog import load_catalog

catalog = load_catalog(...)
daft.catalog.register_python_catalog(catalog)

###
# Adding named tables
###

df = daft.register_table(df, "foo")

###
# Reading tables
###

df1 = daft.read_table("foo")  # first priority is named tables

df2 = daft.read_table("x.y.z")  # next priority is the registered default catalog
df2 = daft.read_table("default.x.y.z")  # equivalent to the previous call

df3 = daft.read_table("my_other_catalog.x.y.z")  # Supports named catalogs other than default one

Other APIs which will be nice as follow-ons:

  • Integrate this with the SQLCatalog API that our SQL stuff uses
  • Detection of catalog from a YAML ~/.daft.yaml config file
  • Allow for configuring table access (e.g. daft.read_table("iceberg_table", options=daft.catalog.IcebergReadOptions(...)))
  • Implementations for other catalogs that isn't a Python catalog, and can support other table types (e.g. Hive and Delta):
    • daft.catalog.register_aws_glue()
    • daft.catalog.register_hive_metastore()
  • df.write_table("table_name", mode="overwrite|append", create_if_missing=True)
  • df.upsert("table_name", match_columns={...}, update_columns={...}, insert_columns={...})
  • DDL: allow for easy creation of tables, erroring out if the selected backend does not support a given table format
    • daft.catalog.create_table_parquet(...)
    • daft.catalog.create_table_iceberg(...)
    • daft.catalog.create_table_deltalake(...)
    • daft.catalog.list_tables(...)

@github-actions github-actions bot added the enhancement New feature or request label Oct 14, 2024
Copy link

codspeed-hq bot commented Oct 14, 2024

CodSpeed Performance Report

Merging #3036 will degrade performances by 45.6%

Comparing jay/catalogs (cc29cad) with main (0709691)

Summary

⚡ 1 improvements
❌ 1 regressions
✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main jay/catalogs Change
test_iter_rows_first_row[100 Small Files] 317.6 ms 283.2 ms +12.16%
test_show[100 Small Files] 22.8 ms 41.9 ms -45.6%

Copy link

codecov bot commented Oct 14, 2024

Codecov Report

Attention: Patch coverage is 31.73653% with 228 lines in your changes missing coverage. Please review.

Project coverage is 75.89%. Comparing base (c4e1ab2) to head (cc29cad).
Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-catalog/python-catalog/src/python.rs 4.44% 86 Missing ⚠️
src/daft-catalog/src/lib.rs 70.29% 30 Missing ⚠️
daft/catalog/unity.py 0.00% 28 Missing ⚠️
src/daft-catalog/src/python.rs 25.80% 23 Missing ⚠️
daft/catalog/pyiceberg.py 0.00% 20 Missing ⚠️
daft/catalog/__init__.py 56.25% 14 Missing ⚠️
daft/catalog/python_catalog.py 0.00% 13 Missing ⚠️
src/daft-catalog/src/errors.rs 0.00% 13 Missing ⚠️
daft/__init__.py 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3036      +/-   ##
==========================================
- Coverage   77.50%   75.89%   -1.61%     
==========================================
  Files         666      675       +9     
  Lines       81333    83898    +2565     
==========================================
+ Hits        63036    63677     +641     
- Misses      18297    20221    +1924     
Files with missing lines Coverage Δ
daft/io/catalog.py 86.95% <100.00%> (+0.91%) ⬆️
src/common/error/src/error.rs 84.84% <ø> (ø)
src/lib.rs 95.94% <100.00%> (+0.11%) ⬆️
daft/__init__.py 23.68% <0.00%> (-0.65%) ⬇️
daft/catalog/python_catalog.py 0.00% <0.00%> (ø)
src/daft-catalog/src/errors.rs 0.00% <0.00%> (ø)
daft/catalog/__init__.py 56.25% <56.25%> (ø)
daft/catalog/pyiceberg.py 0.00% <0.00%> (ø)
src/daft-catalog/src/python.rs 25.80% <25.80%> (ø)
daft/catalog/unity.py 0.00% <0.00%> (ø)
... and 2 more

... and 139 files with indirect coverage changes

Comment on lines 26 to 45
named_data_catalogs: HashMap<String, Arc<dyn DataCatalog>>,
view_catalog: HashMap<String, LogicalPlanBuilder>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using interior mutability (e.g., Mutex) for named_data_catalogs and view_catalog. This would allow modifications through shared references, enabling use of Arc<DaftMetaCatalog> without &mut self in register methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points.... I was trying to find a way around this also specifically because I wanted to enable lazy materialization of PyIceberg objects inside of the GlueCatalog (if we receive information from Glue that a given table is an "iceberg" table, then we can lazily initialize the PyIceberg Glue client to help us interface with it)

I'm not actively working on this PR yet until perhaps next week, but will work in interior mutability here which is needed to make these happen.

@universalmind303 universalmind303 self-requested a review October 16, 2024 17:19
@universalmind303
Copy link
Contributor

nit(opinion): daft.register_aws_glue(region="us-west-2", catalog_name="my-glue")
To prevent clutter in the global namespace, I think we should have the catalog methods scoped under daft.catalog or similar.

@jaychia jaychia marked this pull request as draft October 17, 2024 17:06
@jaychia jaychia force-pushed the jay/catalogs branch 3 times, most recently from e6d0350 to f63513e Compare November 4, 2024 05:42
@jaychia
Copy link
Contributor Author

jaychia commented Nov 4, 2024

I ended up removing the glue implementation to simplify the PR, and immediately unlock registration of PyIceberg catalogs as in #3901

I also made this easily extendible from Python via simple PythonCatalog and PythonCatalogTable abstract classes. This now just wraps the Unity and PyIceberg implementations, but I can see this being useful for potentially other Python-based clients in the future.

@jaychia jaychia force-pushed the jay/catalogs branch 2 times, most recently from a2e4847 to 85b6a9b Compare November 8, 2024 04:19
Add DataCatalogTable trait

Add pyiceberg

nit

Wire up register_table and read_table

Implement pyiceberg catalog

Implement python catalog

Integrate with iceberg tests

Fix merge conflicts
@jaychia jaychia marked this pull request as ready for review November 14, 2024 02:22
@jaychia
Copy link
Contributor Author

jaychia commented Nov 14, 2024

@universalmind303 this is ready for review now. Specifically I'm interested in how this can be integrated with our SQLContext on the SQL side.

src/daft-catalog/src/lib.rs Outdated Show resolved Hide resolved
src/daft-catalog/src/lib.rs Show resolved Hide resolved
src/daft-catalog/src/lib.rs Outdated Show resolved Hide resolved
src/daft-catalog/src/lib.rs Outdated Show resolved Hide resolved
src/daft-catalog/src/lib.rs Show resolved Hide resolved
daft/catalog/__init__.py Outdated Show resolved Hide resolved
src/daft-catalog/src/lib.rs Show resolved Hide resolved
@jaychia jaychia merged commit bb506a4 into main Nov 17, 2024
43 of 46 checks passed
@jaychia jaychia deleted the jay/catalogs branch November 17, 2024 21:23
@jaychia jaychia mentioned this pull request Dec 16, 2024
52 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants