Enable initializing DataFusion IcebergCatalogProvider using specific schemas (rather than all) #601

Open
a-agmon opened this issue Sep 4, 2024 · 1 comment

Comments

@a-agmon
Contributor

a-agmon commented Sep 4, 2024

We use Glue to store metadata for hundreds of thousands of tables spanning dozens of schemas (many of these tables are not Iceberg tables).
It would be great to be able to initialize an IcebergCatalogProvider for querying via DataFusion using only specific schemas, mostly to avoid traversing the entire catalog.
It looks like this can be done with a small modification to the IcebergCatalogProvider::try_new function - see here

This way, the user can choose to initialize the catalog with one or more specific schemas:

/// attempts to create a schema provider for each namespace, and
/// collects these providers into a `HashMap`.
pub async fn try_new(client: Arc<dyn Catalog>, schemas: Option<Vec<String>>) -> Result<Self> {
    // TODO:
    // Schemas and providers should be cached and evicted based on time
    // As of right now; schemas might become stale.
    let schema_names: Vec<_> = if let Some(schemas) = schemas {
        schemas
    } else {
        client
            .list_namespaces(None)
            .await?
            .iter()
            .flat_map(|ns| ns.as_ref().clone())
            .collect()
    };
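
To make the selection logic easier to reason about in isolation, here is a minimal std-only sketch of the same branch. The helper name `resolve_schema_names` and the `list_all` closure are hypothetical and not part of iceberg-rust; the closure stands in for the `list_namespaces` call, so no catalog or async runtime is needed to try it out.

```rust
/// Hypothetical helper (not part of iceberg-rust): picks the namespaces to load.
/// `list_all` stands in for the catalog's namespace listing and is only
/// invoked when the caller did not request specific schemas.
pub fn resolve_schema_names<F>(requested: Option<Vec<String>>, list_all: F) -> Vec<String>
where
    F: FnOnce() -> Vec<Vec<String>>,
{
    match requested {
        // The caller named the schemas: skip the catalog traversal entirely.
        Some(names) => names,
        // No filter given: flatten every namespace identifier into its parts,
        // mirroring the `flat_map` in the snippet above.
        None => list_all().into_iter().flatten().collect(),
    }
}
```

The point of passing the listing as a closure is that the expensive traversal never runs when a schema list is supplied, which is the behaviour the proposed `schemas: Option<Vec<String>>` parameter is after.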

What do you think?

@liurenjie1024
Collaborator

Thanks @a-agmon for raising this. The reason we currently fetch all schemas at once is to simplify the implementation, since DataFusion's catalog API is currently not async. Rather than adding more parameters to catalog construction, I think a better approach is to put a cache between the async Iceberg catalog and the DataFusion catalog, which is good for both performance and functionality.
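
As a rough illustration of the caching idea, here is a minimal std-only sketch of a time-based cache that a synchronous lookup could consult. Everything in it is an assumption for illustration: `NamespaceCache`, `get_or_load`, and the `load` closure are hypothetical names, and `load` is a synchronous stand-in for the async Iceberg catalog call. A real implementation would need to bridge into the async runtime and would likely avoid holding the lock while loading.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical time-based cache; none of these names exist in iceberg-rust.
/// Maps a namespace name to the table names last loaded for it.
pub struct NamespaceCache {
    ttl: Duration,
    entries: Mutex<HashMap<String, (Instant, Vec<String>)>>,
}

impl NamespaceCache {
    pub fn new(ttl: Duration) -> Self {
        Self {
            ttl,
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached table names for `namespace`, calling `load` only
    /// when the entry is missing or at least `ttl` old.
    /// `load` stands in for the (async) catalog call.
    pub fn get_or_load<F>(&self, namespace: &str, load: F) -> Vec<String>
    where
        F: FnOnce() -> Vec<String>,
    {
        // Simplification: the lock is held while `load` runs.
        let mut entries = self.entries.lock().unwrap();
        if let Some((loaded_at, tables)) = entries.get(namespace) {
            if loaded_at.elapsed() < self.ttl {
                return tables.clone();
            }
        }
        // Missing or stale: reload and record the load time.
        let tables = load();
        entries.insert(namespace.to_string(), (Instant::now(), tables.clone()));
        tables
    }
}
```

With a cache like this, only the namespaces that are actually queried get loaded, and stale entries are refreshed on the next access after the TTL expires, without adding a parameter to catalog construction.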
