Add IcebergDocument as one implementation of VirtualDocument #3147

Open
wants to merge 67 commits into
base: master

@bobbai00 bobbai00 commented Dec 10, 2024

Implement Apache Iceberg for Result Storage

How to Enable Iceberg Result Storage

  1. Update storage-config.yaml:
    • Set result-storage-mode to iceberg.

Major Changes

  • Introduced IcebergDocument: A thread-safe VirtualDocument implementation for storing and reading results in Iceberg tables.
  • Introduced IcebergTableWriter: Append-only writer for Iceberg tables with configurable buffer size.
  • Catalog and Data storage for Iceberg: Uses a local file system (file:/) via HadoopCatalog and HadoopFileIO. This ensures Iceberg operates without relying on external storage services.
  • Added a new workerId parameter to ProgressiveSinkOpExec. Each result-storage writer takes this workerId as a parameter.

Dependencies

  • Added Apache Iceberg-related libraries.
  • Introduced Hadoop-related libraries to support Iceberg's HadoopCatalog and HadoopFileIO. These libraries are used only for placeholder configuration and do not impose a runtime dependency on HDFS.

Overview of Iceberg Components

IcebergDocument

  • Manages reading and organizing data in Iceberg tables.
  • Supports iterator-based incremental reads with thread-safe operations for reading and clearing data.
  • Initializes or overrides the Iceberg table during construction.

IcebergTableWriter

  • Writes data as immutable Parquet files in an append-only manner.
  • Each writer uniquely prefixes its files to avoid conflicts (workerIndex_fileIndex format).
  • Not thread-safe—single-thread access is recommended.
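The buffering and file-naming behavior described above can be sketched as follows. This is a minimal in-memory illustration, not the PR's implementation: BufferedFileWriter and its methods (putOne, flush, close, files) are hypothetical names, and "files" are kept in a buffer instead of being written as Parquet.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for an append-only writer (not the PR's IcebergTableWriter).
// It buffers records and emits each full buffer as one immutable "file"
// named workerIndex_fileIndex, so writers with distinct worker indexes never collide.
final class BufferedFileWriter[T](workerIndex: Int, bufferSize: Int) {
  require(bufferSize > 0, "bufferSize must be positive")

  private val buffer = ArrayBuffer.empty[T]
  private var fileIndex = 0
  // Flushed files are kept in memory here; a real writer would write Parquet files.
  private val flushed = ArrayBuffer.empty[(String, Vector[T])]

  // Buffers one record and flushes once the buffer is full.
  def putOne(item: T): Unit = {
    buffer += item
    if (buffer.size >= bufferSize) flush()
  }

  // Emits the buffered records as a new file; existing files are never modified.
  def flush(): Unit = {
    if (buffer.nonEmpty) {
      flushed += ((s"${workerIndex}_${fileIndex}", buffer.toVector))
      fileIndex += 1
      buffer.clear()
    }
  }

  // Flushes any remaining records.
  def close(): Unit = flush()

  def files: Vector[(String, Vector[T])] = flushed.toVector
}
```

Like the writer described above, this sketch is not thread-safe: each instance is meant to be used by a single worker thread.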

Data Storage via Iceberg Tables

  • Write:
    • Tables are created per storage key.
    • Writers append Parquet files to the table, ensuring immutability.
  • Read:
    • Readers use IcebergDocument.get to fetch data via an iterator.
    • The iterator reads data incrementally while ensuring data order matches the commit sequence of the data files.

Data Reading Using File Metadata

  • Data files are read using getUsingFileSequenceOrder, which:
    • Retrieves the table's file scan tasks (FileScanTask) and sorts them by the sequence numbers of their data files.
    • Reads records sequentially, skipping files or records as needed.
    • Supports range-based reading (from, until) and incremental reads.
  • Sorting by sequence number ensures that records are returned in commit order, so repeated reads see the same ordering.
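The range-based read can be illustrated with a small model that does not depend on Iceberg. This is a sketch under stated assumptions: DataFileStub and SequenceOrderReader.readRange are hypothetical names, and each file is modeled as a sequence number plus its records, whereas the PR works with FileScanTask objects.

```scala
// Hypothetical model of a committed data file: its sequence number and its records.
final case class DataFileStub[T](sequenceNumber: Long, records: Vector[T])

object SequenceOrderReader {
  // Returns the records whose global index lies in [from, until),
  // where the global order is defined by sorting files by sequence number.
  def readRange[T](files: Seq[DataFileStub[T]], from: Int, until: Int): Iterator[T] = {
    require(from >= 0 && until >= from, "expected 0 <= from <= until")
    val sorted = files.sortBy(_.sequenceNumber)
    // Global index of the first record of each file (one extra trailing entry is ignored by zip).
    val starts = sorted.scanLeft(0)(_ + _.records.size)
    sorted.iterator.zip(starts.iterator).flatMap { case (file, start) =>
      val end = start + file.records.size
      if (end <= from || start >= until) Iterator.empty // skip files entirely outside the range
      else file.records.iterator.slice(math.max(from - start, 0), math.min(until, end) - start)
    }
  }
}
```

In this model, files committed later have higher sequence numbers and therefore sort after existing files, so a reader that remembers how many records it has consumed can resume from that index to pick up only the new records.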

Hadoop Usage Without HDFS

  • The HadoopCatalog uses an empty Hadoop configuration, defaulting to the local file system (file:/).
  • This enables efficient management of Iceberg tables in local or network file systems without requiring HDFS infrastructure.
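A minimal sketch of this setup, assuming iceberg-core and hadoop-common are on the classpath. The object name, the createOrOverride helper, the temporary warehouse directory, the one-column schema, and the results.demo identifier are all illustrative assumptions and are not taken from the PR. The helper mirrors the "initializes or overrides the table" behavior described for IcebergDocument.

```scala
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.iceberg.{PartitionSpec, Schema, Table}
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hadoop.HadoopCatalog
import org.apache.iceberg.types.Types

object LocalIcebergCatalogSketch {
  // Creates the table, dropping any existing table with the same identifier first.
  def createOrOverride(catalog: HadoopCatalog, id: TableIdentifier, schema: Schema): Table = {
    if (catalog.tableExists(id)) catalog.dropTable(id)
    catalog.createTable(id, schema, PartitionSpec.unpartitioned())
  }

  def main(args: Array[String]): Unit = {
    // A default Hadoop Configuration resolves to the local file system, so no HDFS is needed.
    val warehouse = Files.createTempDirectory("iceberg-warehouse").toUri.toString
    val catalog = new HadoopCatalog(new Configuration(), warehouse)

    // Illustrative schema and identifier (assumptions for this sketch).
    val schema = new Schema(Types.NestedField.required(1, "id", Types.LongType.get()))
    val id = TableIdentifier.of("results", "demo")

    val table = createOrOverride(catalog, id, schema)
    println(table.location())
  }
}
```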

@bobbai00 bobbai00 self-assigned this Dec 10, 2024
@bobbai00 bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 6522779 to a83d779 Compare December 14, 2024 00:14
@bobbai00 bobbai00 force-pushed the jiadong-add-file-result-storage branch 2 times, most recently from 1edb551 to cef347b Compare December 21, 2024 02:56
@bobbai00 bobbai00 changed the title Add PartitionDocument and ItemizedFileDocument Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results Dec 22, 2024
@bobbai00 bobbai00 changed the title Add IcebergDocument as one implementation of VirtualDocument that can be used to store operator results Add IcebergDocument as one implementation of VirtualDocument Dec 22, 2024
@transient lazy val catalog: Catalog = IcebergCatalogInstance.getInstance()

// During construction, create or override the table
synchronized {

@shengquan-ni shengquan-ni Jan 3, 2025

this synchronized is unnecessary as it only locks this instance.

Collaborator Author

removed
