feat(iceberg): support iceberg engine table (in local env) #19577

Merged
merged 60 commits into from
Dec 5, 2024

Conversation

chenzl25
Contributor

@chenzl25 chenzl25 commented Nov 26, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

  • Tracking: Iceberg engine table #19418
  • Support ./risedev d iceberg-engine to set up a local environment to run the iceberg engine table. Use ./risedev d full directly.
  • Support create/drop/select iceberg engine table. The ddl in this PR is not atomic, but it should be fine in the first version.
  • The iceberg catalog is stored in our sql meta backend and the data is stored in S3 compatible cloud storage.
  • This PR retrieves the meta backend connection info and S3 warehouse info from environment variables for simplicity; a later PR will fetch this info from meta instead.
  • Make the iceberg sink's S3 access key and secret key optional, and add an enable_config_load field to load credentials from the default credential provider chain.
  • No compaction in this PR.
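
As a rough illustration of the optional-credentials bullet above, here is a minimal sketch of how such a choice could be resolved. The names `S3Options`, `CredentialSource`, and `resolve_credentials` are hypothetical and are not taken from the PR. The precedence shown (explicit keys win over `enable_config_load`) is an assumption.

```rust
/// Hypothetical, simplified view of the S3-related sink options.
#[derive(Debug, Clone, Default)]
pub struct S3Options {
    pub access_key: Option<String>,
    pub secret_key: Option<String>,
    pub enable_config_load: bool,
}

/// Where the credentials would come from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum CredentialSource {
    /// Keys given explicitly in the sink options.
    Static { access_key: String, secret_key: String },
    /// Defer to the default credential provider chain (env, profile, ...).
    DefaultChain,
}

/// Assumed precedence: explicit keys first, then the default chain if enabled.
pub fn resolve_credentials(opts: &S3Options) -> Result<CredentialSource, String> {
    match (&opts.access_key, &opts.secret_key) {
        (Some(ak), Some(sk)) => Ok(CredentialSource::Static {
            access_key: ak.clone(),
            secret_key: sk.clone(),
        }),
        (None, None) if opts.enable_config_load => Ok(CredentialSource::DefaultChain),
        (None, None) => Err("no keys given and enable_config_load is false".to_string()),
        // Exactly one of the two keys is present.
        _ => Err("access key and secret key must be set together".to_string()),
    }
}
```

The real option parsing in the sink may differ, for example in how a half-specified key pair is treated.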

Example

create table t(id int primary key, name varchar) engine = iceberg;
insert into t values(1, 'xxx');
select * from t;

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@xxchan xxchan added the ci/run-backwards-compat-tests Run backwards compatibility tests in your PR. label Nov 29, 2024
src/meta/model/src/table.rs (outdated)
ci/scripts/e2e-iceberg-engine-test.sh (outdated)
src/frontend/src/handler/create_table.rs (outdated)
let catalog_writer = session.catalog_writer()?;
// TODO(iceberg): make iceberg engine table creation ddl atomic
Member

Yeah, this is a critical issue, especially if we create the table first, before the source and sink. The table is self-contained, while create source employs a validation stage to check whether the upstream system really works, so it has a high chance of failing. Shall we create the source/sink before the table?

Contributor Author

There are some dependencies here. To create an iceberg source, we first need an iceberg table. To create an iceberg table, we need to create an iceberg sink (with create_table_if_not_exists). To create an iceberg sink, we need to create a hummock table first. So we end up with this order: hummock table -> iceberg sink -> iceberg source.
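
The dependency chain described in this comment can be sketched as a small ordering computation. This only illustrates the reasoning and is not code from the PR: `creation_order` and `iceberg_engine_deps` are hypothetical helpers, and the object names are plain labels.

```rust
use std::collections::BTreeMap;

/// Dependencies as described in the comment: each object lists what must exist first.
pub fn iceberg_engine_deps() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("hummock table", vec![]),
        // The sink (with create_table_if_not_exists) creates the iceberg table.
        ("iceberg sink", vec!["hummock table"]),
        // The source reads the iceberg table that the sink created.
        ("iceberg source", vec!["iceberg sink"]),
    ])
}

/// Returns the names in an order where every object comes after its dependencies.
pub fn creation_order(deps: &BTreeMap<&str, Vec<&str>>) -> Result<Vec<String>, String> {
    let mut order: Vec<String> = Vec::new();
    let mut remaining: Vec<&str> = deps.keys().copied().collect();
    while !remaining.is_empty() {
        // Objects whose dependencies have all been created already.
        let ready: Vec<&str> = remaining
            .iter()
            .copied()
            .filter(|name| deps[name].iter().all(|d| order.iter().any(|o| o == d)))
            .collect();
        if ready.is_empty() {
            return Err("cyclic or unknown dependency".to_string());
        }
        remaining.retain(|name| !ready.contains(name));
        order.extend(ready.into_iter().map(String::from));
    }
    Ok(order)
}
```

Because each object has exactly one prerequisite, the order is forced. This is why the table cannot simply be created last, as the review suggests.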

let catalog_writer = session.catalog_writer()?;
// TODO(iceberg): make iceberg engine table creation ddl atomic
catalog_writer
.create_table(source, table, graph, job_type)
Member

Here the source is passed as well, can it work? 🤔

What I am thinking is that, for an ordinary table with a connector, the corresponding source is supposed to generate changes that are applied to the table. But here the Iceberg source is just for batch reads, which I think is actually unrelated to the iceberg table internally.

Contributor Author

@chenzl25 chenzl25 Dec 3, 2024

The source here is something like a Kafka or Postgres CDC connector, not the iceberg source. For example:

create table t (a int) with (connector = 'kafka' ...) engine = iceberg.

src/frontend/src/handler/drop_table.rs
@chenzl25 chenzl25 requested review from fuyufjh and xxchan December 3, 2024 10:29
@chenzl25 chenzl25 enabled auto-merge December 4, 2024 08:05
Contributor

@xiangjinwu xiangjinwu left a comment

for Cargo.lock

@chenzl25 chenzl25 added this pull request to the merge queue Dec 5, 2024
Merged via the queue into main with commit 59fa5f8 Dec 5, 2024
38 of 39 checks passed
@chenzl25 chenzl25 deleted the dylan/support_create_iceberg_engine_table branch December 5, 2024 10:51
meta_store_database.clone()
)
}
MetaBackend::Sqlite | MetaBackend::Sql | MetaBackend::Mem => {
Member

MetaBackend::Sql will be widely adopted since #19560. Shall we support it as well?

Contributor Author

Oh, I previously thought it was deprecated. For iceberg jdbc right now, we need to know the underlying database implementation to choose the right driver.

Member

I think we can simply add a jdbc: prefix to the database URL? 🤣

Member

When the endpoint is fetched from meta, it has already been converted into a form prefixed with postgres: or mysql:, even when the backend is configured as MetaBackend::Sql. So MetaBackend::Sql should be unreachable here.

Unrelated to this PR, 🤔 I think we'd better use the sql config in risedev only, for testing purposes. To support users and passwords that contain special characters, it's best to specify them separately through env in both production and cloud environments. That's why a subdivided backend was introduced in #17530.

Contributor Author

If the underlying database is Oracle or SQL Server, I think that's acceptable. However, I still want to verify the underlying database type. For instance, SQLite is not a suitable catalog for Iceberg. Concurrent updates to SQLite by both the metadata service and Iceberg can easily cause SQLite to become unresponsive.
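
To make the thread above concrete, here is a minimal sketch of deriving a JDBC URL from a meta-store endpoint while still checking the database type. `to_jdbc_url` is a hypothetical helper, not the PR's code, and the endpoint shape `scheme://rest` is an assumption. The PostgreSQL JDBC driver uses the `jdbc:postgresql:` scheme, so the sketch maps the scheme rather than blindly prefixing `jdbc:`.

```rust
/// Hypothetical helper: map a meta-store endpoint to a JDBC URL for the
/// iceberg JDBC catalog, rejecting backends that are unsuitable as a catalog.
pub fn to_jdbc_url(endpoint: &str) -> Result<String, String> {
    // Assumed endpoint shape: "<scheme>://<rest>".
    let (scheme, rest) = endpoint
        .split_once("://")
        .ok_or_else(|| format!("not a URL: {endpoint}"))?;
    match scheme {
        // The PostgreSQL JDBC driver expects "jdbc:postgresql://...".
        "postgres" | "postgresql" => Ok(format!("jdbc:postgresql://{rest}")),
        "mysql" => Ok(format!("jdbc:mysql://{rest}")),
        // Rejected on purpose: concurrent updates from meta and iceberg
        // are the concern raised in the review thread.
        "sqlite" => Err("SQLite is not supported as an iceberg catalog backend".to_string()),
        other => Err(format!("unsupported meta backend scheme: {other}")),
    }
}
```

The sketch leaves the rest of the URL untouched. Credentials embedded in the endpoint would likely need to be passed to the catalog separately, which it does not attempt.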

@chenzl25 chenzl25 mentioned this pull request Dec 9, 2024
16 tasks
8 participants