Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow creating native tables with different formats #2053

Open
universalmind303 opened this issue Nov 3, 2023 · 4 comments
Open

Allow creating native tables with different formats #2053

universalmind303 opened this issue Nov 3, 2023 · 4 comments
Labels
feat 🎇 New feature or request

Comments

@universalmind303
Copy link
Contributor

Description

With the upcoming support of lance format I'd like to be able to create native tables that use lance instead of delta

Some dialects use the USING keyword. others (clickhouse) use ENGINE = . Probably some other keywords used in other dialects as well.

create table t1 as (select 1) USING lance
create table t1 as (select 1) (ENGINE = 'lance')
-- defaults to delta if unspecified
create table t1 as (select 1)
-- is the same as
create table t1 as (select 1) USING delta

As a potential followup:
I think it would be cool if we could inspect the schema of the table (maybe statistics too) to determine the optimal table representation.

For example, Lance is highly optimized for ml use cases FixedSizeList, Binary, ... while Delta is optimized for more traditional OLAP workloads.

We could likely infer if the workload is ML, or standard OLAP based off of the schema + statistics. Then we could use different backends (delta or lance) depending on the workload.

@universalmind303 universalmind303 added the feat 🎇 New feature or request label Nov 3, 2023
@jordandakota
Copy link

I don't know why but when I read this I think of an old presentation around good clean code, where the prof is quoting other leaders in industry and one of the quotes is, "a method is something that does one thing well."

I think adding additional storage options means adding additional readers and writers and adds a lot of maintenence.

@universalmind303
Copy link
Contributor Author

I think adding additional storage options means adding additional readers and writers and adds a lot of maintenence.

But the readers and writers for the given formats would already exist? I don't think end users would face any additional maintenance as it'd be an opt-in feature.

Since it is opt-in, I see this as a feature for "power users" that are aware of the tradeoffs of using different storage formats.

@tychoish
Copy link
Collaborator

tychoish commented Nov 7, 2023

Format Selectivity

Having a way to specify data format (as well as some tuning options like chunk/batch size and per-column compression formats, potentially) seems like a great capability. I think I'm vaguely partial to WITH STORAGE SETTINGS (k=v, k=v) or something like that, but negotiable on the details.

Having said that I'd characterize all of it as "column oriented" and therefore OLAP centric, in all of its form (not just delta, but datafusion itself).

Automatic Format Detection

I think it would be cool if we could inspect the schema of the table (maybe statistics too) to determine the optimal table representation.

This is definitely a feature that sounds nice, but I think would be a lot of different work. I also can't necessarily imagine the situation/user who knows the schema of the table that they're creating but doesn't know what the table's going to be used for? Also what happens when the table is going to be used for more than one thing? How should people decide? (what do we do in this situation?) Is there a situation where people might be doing both kinds of workloads over the same data set?

In a lot of ways, these data format decisions are sort of akin to "deciding which indexes to build for a table" in a OLTP database, and "which fields to index" is a thing which you can do "automatically" (though it's hard to write really good code that does that as well automatically/generically as a human would do?) In a lot of ways the whole history and design of relational databases (and maybe databases themselves) revolve around "providing users with the ability to tell the database engine how the user expects the data to be used."

The other sort of challenge with the "automatic view" operation is that it makes adding new/arbitrary data formats harder, as you have to predict the workload, potentially? That seems like a pretty high burden. I'm also aware that when you create a table, we won't, yet, have the information about the usage statistics, and we also don't have a system for collecting (and storing and retaining) that data, so there are a lot of pieces.

@tychoish
Copy link
Collaborator

tychoish commented Dec 4, 2023

I've been poking around some of this code in the last week, and I was confused for a moment. I think we should definitely call this "engine" [delta, lance] as "format" made me think about CSV/JSON/etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat 🎇 New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants