
dal: Refactor to make it easier for contributors to add new storage type #3436

Closed · Tracked by #3677
Xuanwo opened this issue Dec 13, 2021 · 12 comments
Labels: A-storage (Area: databend storage), community-take

@Xuanwo (Member) commented Dec 13, 2021

Summary

Refactor dal to make it easier for contributors to add new storage types.

Proposal: #3662

@Veeupup (Contributor) commented Dec 13, 2021

@Xuanwo I'm working on dal now; maybe you can check this draft PR: #3433.

@Xuanwo (Member, Author) commented Dec 13, 2021

/assignme

@dantengsky (Member)

Hi @Xuanwo,

There are some other feature requirements; maybe they could be taken into account together:

  • A factory, which can build instances of concrete DataAccessors, given a uri and credentials (see the sketch at the end of this comment).

    uri is something like s3://load/files/ or azure://myaccount.blob.core.windows.net/mycontainer/files/

    credentials is something like

    (aws_key_id='1a2b3c' aws_secret_key='4x5y6z')
    or
    (azure_sas_token='?sv=2016-05-31&ss=b&srt=sco&sp=rwdl&se=2018-06-27T10:05:50Z&st=2017-06-27T02:05:50Z&spr=https,http&sig=bgqQwoXwxzuD2GJfagRg7VOS8hzNr3QLT7rhS8OFRLQ%3D')

    The examples above are taken from https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#examples

  • For s3, the factory needs to resolve the region itself,

    perhaps by using the GetBucketLocation API?

Background:

  • We are implementing something like Snowflake's CREATE STAGE statement.

    From https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#examples:

    create stage my_ext_stage1
      url='s3://load/files/'
      credentials=(aws_key_id='1a2b3c' aws_secret_key='4x5y6z');

    create stage mystage
      url='azure://myaccount.blob.core.windows.net/mycontainer/files/'
      credentials=(azure_sas_token='?sv=2016-05-31&ss=b&srt=sco&sp=rwdl&se=2018-06-27T10:05:50Z&st=2017-06-27T02:05:50Z&spr=https,http&sig=bgqQwoXwxzuD2GJfagRg7VOS8hzNr3QLT7rhS8OFRLQ%3D');

    This implies that we should be able to create different kinds of DataAccessors according to the URLs, but

  • currently, DataAccessors can only be obtained from QueryContext::get_data_accessor,

    which creates instances of DataAccessor, but only of the one type specified in StorageConfig.
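
A minimal sketch of what such a factory could look like in Rust. Only the DataAccessor name is taken from the discussion above; the constructors, field names, and credential keys below are hypothetical stand-ins chosen to match the Snowflake-style examples, not databend's actual API:

    // Hypothetical factory sketch: dispatch on the uri scheme and build a
    // concrete accessor. All types here are illustrative stand-ins.
    use std::collections::HashMap;

    trait DataAccessor {}

    struct S3Accessor { bucket: String, path: String }
    impl DataAccessor for S3Accessor {}

    struct AzureBlobAccessor { account: String, container: String, path: String }
    impl DataAccessor for AzureBlobAccessor {}

    fn build_accessor(
        uri: &str,
        credentials: &HashMap<String, String>,
    ) -> Result<Box<dyn DataAccessor>, String> {
        // Split "s3://load/files/" into scheme = "s3" and rest = "load/files/".
        let (scheme, rest) = uri
            .split_once("://")
            .ok_or_else(|| format!("invalid uri: {uri}"))?;
        match scheme {
            // s3://<bucket>/<path>
            "s3" => {
                let (bucket, path) = rest.split_once('/').unwrap_or((rest, ""));
                let _key_id = credentials.get("aws_key_id");
                let _secret = credentials.get("aws_secret_key");
                Ok(Box::new(S3Accessor { bucket: bucket.into(), path: path.into() }))
            }
            // azure://<account>.blob.core.windows.net/<container>/<path>
            "azure" => {
                let (host, rest) = rest.split_once('/').ok_or("missing container")?;
                let (container, path) = rest.split_once('/').unwrap_or((rest, ""));
                let account = host.split('.').next().unwrap_or(host).to_string();
                let _sas = credentials.get("azure_sas_token");
                Ok(Box::new(AzureBlobAccessor {
                    account,
                    container: container.into(),
                    path: path.into(),
                }))
            }
            other => Err(format!("unsupported storage scheme: {other}")),
        }
    }

    fn main() {
        let creds = HashMap::from([
            ("aws_key_id".to_string(), "1a2b3c".to_string()),
            ("aws_secret_key".to_string(), "4x5y6z".to_string()),
        ]);
        let _accessor = build_accessor("s3://load/files/", &creds).unwrap();
    }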

@dantengsky (Member)

@sundy-li please correct me if I misunderstood the requirements of "create stage".

@Xuanwo (Member, Author) commented Dec 14, 2021

  • A factory, which can build instances of concrete DataAccessors, given a uri and credentials

Makes sense; it's indeed in my plan.

  • For s3, the factory needs to resolve the region itself

I'm afraid region is a required value for initializing an s3 service, although there are some tricks to work around it.

@Xuanwo (Member, Author) commented Dec 14, 2021

For uri and credentials, I plan to support a connection string, something like:

s3://s3.amazonaws.com/bucket_name/path/to/workdir?access_key_id=xxxx&secret_access_key=yyyy&storage_class=STANDARD_IA

What do you think?
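
A minimal sketch of parsing such a connection string, assuming the third-party url crate; parse_connection_string and its return shape are hypothetical, and the query keys simply follow the example above:

    // Hypothetical sketch: split a connection string into endpoint, bucket,
    // working directory, and query-string options, using the `url` crate.
    use std::collections::HashMap;
    use url::Url;

    fn parse_connection_string(
        s: &str,
    ) -> Result<(String, String, String, HashMap<String, String>), url::ParseError> {
        let url = Url::parse(s)?;
        // The host is the endpoint; the first path segment is the bucket and
        // the remainder is the working directory.
        let endpoint = url.host_str().unwrap_or_default().to_string();
        let mut segments = url.path().trim_start_matches('/').splitn(2, '/');
        let bucket = segments.next().unwrap_or_default().to_string();
        let workdir = segments.next().unwrap_or_default().to_string();
        // Credentials and options arrive as ordinary query parameters.
        let options: HashMap<String, String> = url
            .query_pairs()
            .map(|(k, v)| (k.into_owned(), v.into_owned()))
            .collect();
        Ok((endpoint, bucket, workdir, options))
    }

    fn main() {
        let (endpoint, bucket, workdir, opts) = parse_connection_string(
            "s3://s3.amazonaws.com/bucket_name/path/to/workdir?access_key_id=xxxx&secret_access_key=yyyy&storage_class=STANDARD_IA",
        )
        .unwrap();
        println!("{endpoint} {bucket} {workdir} {:?}", opts.get("storage_class"));
    }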

@dantengsky (Member)

I'm afraid region is a required value for initializing an s3 service, although there are some tricks to work around it.

Yes, it is.

After discussing with @sundy-li and browsing the relevant Snowflake docs, I think

For uri and credentials, I plan to support a connection string, something like:

s3://s3.amazonaws.com/bucket_name/path/to/workdir?access_key_id=xxxx&secret_access_key=yyyy&storage_class=STANDARD_IA

What do you think?

sounds great to me :)

@Xuanwo (Member, Author) commented Dec 14, 2021

it is reasonable/convenient for the end-user to omit the region while specifying the location of s3 resources

Yes, I agree with this conclusion.

I guess Snowflake uses the GetBucketLocation API to resolve the region.

There is something I need to clarify here.

Connecting to the s3 service is different from connecting to a bucket:

  • To init a service, we don't need a region, and we can get a bucket's region from the ListBuckets API.
  • To init a bucket, a region is required.

It looks like we can connect to a service and let users choose a bucket, so that we can automate all the other work for them (see the sketch below).
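
A minimal sketch of the region resolution step, assuming the aws-sdk-s3 and aws-config crates; resolve_region is a hypothetical helper, not something databend ships:

    // Hypothetical sketch: resolve a bucket's region via GetBucketLocation.
    use aws_sdk_s3::Client;

    async fn resolve_region(client: &Client, bucket: &str) -> Result<String, aws_sdk_s3::Error> {
        let out = client.get_bucket_location().bucket(bucket).send().await?;
        // GetBucketLocation reports an empty location constraint for us-east-1.
        Ok(out
            .location_constraint()
            .map(|c| c.as_str().to_string())
            .filter(|s| !s.is_empty())
            .unwrap_or_else(|| "us-east-1".to_string()))
    }

    #[tokio::main]
    async fn main() -> Result<(), aws_sdk_s3::Error> {
        let config = aws_config::load_from_env().await;
        let client = Client::new(&config);
        println!("region: {}", resolve_region(&client, "my-bucket").await?);
        Ok(())
    }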

@sundy-li (Member)

We can call GetBucketLocation to initialize the region in the stage create/update command.

@dantengsky (Member)

We can call GetBucketLocation to initialize the region in the stage create/update command.

Agreed.

@Xuanwo Xuanwo moved this from 📋 Backlog to 🔨 In Progress in Xuanwo's Work Dec 20, 2021
@Xuanwo (Member, Author) commented Dec 27, 2021

@dantengsky @sundy-li I created a proposal about my vision of the Databend DAL; please take a look: #3662.

@BohuTANG BohuTANG added the A-storage Area: databend storage label Jan 6, 2022
@Xuanwo Xuanwo moved this from 🔨 In Progress to 📋 Backlog in Xuanwo's Work Feb 12, 2022
@Xuanwo (Member, Author) commented Feb 14, 2022

dal2 has replaced dal in #4081.
