Enable initializing DataFusion IcebergCatalogProvider using specific schemas (rather than all) #601

a-agmon · 2024-09-04T17:52:22Z

We use Glue to store metadata for hundreds of thousands of tables spanning dozens of schemas (many of which are not Iceberg tables).
It would be great to be able to initialize an IcebergCatalogProvider for querying via DataFusion using only specific schemas, mainly to avoid traversing everything.
It looks like this can be done with a small modification to the IcebergCatalogProvider::try_new function - see here

This way, the user can choose to initialize the catalog with only one or more specific schemas:

    /// attempts to create a schema provider for each namespace, and
    /// collects these providers into a `HashMap`.
    pub async fn try_new(client: Arc<dyn Catalog>, schemas: Option<Vec<String>>) -> Result<Self> {
        // TODO:
        // Schemas and providers should be cached and evicted based on time
        // As of right now; schemas might become stale.
        let schema_names: Vec<_> = if let Some(schemas) = schemas {
            // Caller supplied an explicit list: skip listing every namespace.
            schemas
        } else {
            // No list given: fall back to enumerating all namespaces.
            client
                .list_namespaces(None)
                .await?
                .iter()
                .flat_map(|ns| ns.as_ref().clone())
                .collect()
        };

What do you think?
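
For anyone who wants to try the selection logic in isolation, here is a minimal standard-library sketch. It is not the crate's code: `resolve_schema_names` is a hypothetical helper, and the `list_all` closure is an assumed stand-in for `client.list_namespaces(None)`, returning each namespace as its list of name parts.

```rust
/// Hypothetical stand-in for the namespace-selection step in `try_new`.
/// `list_all` plays the role of `client.list_namespaces(None)` and is only
/// invoked when the caller did not pass an explicit schema list.
fn resolve_schema_names<F>(schemas: Option<Vec<String>>, list_all: F) -> Vec<String>
where
    F: FnOnce() -> Vec<Vec<String>>,
{
    match schemas {
        // Explicit list: use it as-is, no catalog traversal.
        Some(names) => names,
        // No list: flatten every namespace's name parts, as in the snippet above.
        None => list_all().into_iter().flatten().collect(),
    }
}

fn main() {
    // Pretend catalog contents (made-up names for illustration).
    let all = || vec![vec!["sales".to_string()], vec!["hr".to_string()]];

    let picked = resolve_schema_names(Some(vec!["sales".to_string()]), all);
    println!("explicit: {:?}", picked);

    let everything = resolve_schema_names(None, all);
    println!("fallback: {:?}", everything);
}
```

The point of taking the lister as a closure is that the expensive listing call never runs when an explicit list is given.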

liurenjie1024 · 2024-09-12T02:57:37Z

Thanks @a-agmon for raising this. We currently fetch all schemas at once to simplify the implementation, since DataFusion's catalog API is not async. I think that, instead of adding more parameters to catalog construction, a better approach is to put a cache between the async Iceberg catalog and the DataFusion catalog, which is good for both performance and functionality.
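
A rough sketch of what such a cache could look like, using only the standard library. Everything here is an assumption for illustration: `SchemaCache`, `put`, and `get` are hypothetical names, the cached value is simplified to a list of table names per schema, and the async fetch that would call `put` is left out.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Hypothetical cache between an async catalog client and a synchronous
/// lookup API. Each entry remembers when it was stored so it can expire.
struct SchemaCache {
    ttl: Duration,
    entries: RwLock<HashMap<String, (Instant, Vec<String>)>>,
}

impl SchemaCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: RwLock::new(HashMap::new()) }
    }

    /// Would be called from the async side once a catalog fetch completes.
    fn put(&self, schema: &str, tables: Vec<String>) {
        let mut entries = self.entries.write().unwrap();
        entries.insert(schema.to_string(), (Instant::now(), tables));
    }

    /// Synchronous read: returns `None` for missing or expired entries,
    /// which is the signal that the async side should refetch.
    fn get(&self, schema: &str) -> Option<Vec<String>> {
        let entries = self.entries.read().unwrap();
        entries.get(schema).and_then(|(stored_at, tables)| {
            if stored_at.elapsed() < self.ttl {
                Some(tables.clone())
            } else {
                None
            }
        })
    }
}

fn main() {
    let cache = SchemaCache::new(Duration::from_secs(60));
    cache.put("sales", vec!["orders".to_string()]);
    println!("sales: {:?}", cache.get("sales"));
    println!("hr: {:?}", cache.get("hr"));
}
```

The synchronous `get` never blocks on the network, which is what a non-async catalog API needs; the time-based expiry also matches the TODO in `try_new` about stale schemas.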
