[FEATURE] Push mode streaming support #29

Open
dai-chen opened this issue Aug 30, 2023 · 0 comments

Is your feature request related to a problem?

Currently, refreshing a Flint index depends on "polling" inside the Spark FileStreamSource operator. This can cause performance issues, especially when the source table has a large number of partitions and files, because the file list has to be scanned again to discover what is new.
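
To illustrate why polling gets expensive, here is a minimal sketch of polling-style file discovery. It is a simplified model on a local directory, not Spark's actual FileStreamSource code; `poll_new_files` and the `seen` set are hypothetical names used only for illustration.

```python
import os


def poll_new_files(root, seen):
    """Walk the whole tree under `root` and return files not seen before.

    Returns (new_files, listed), where `listed` is how many files had to be
    listed in this poll. Every poll pays for the full listing, even when
    only one file was added since the previous poll.
    """
    listed = 0
    new_files = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            listed += 1
            path = os.path.join(dirpath, name)
            if path not in seen:
                seen.add(path)
                new_files.append(path)
    return sorted(new_files), listed
```

In this model the cost of each poll grows with the total number of files in the table rather than with the number of changed files, which is the problem described above.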

What solution would you like?

The proposal is to let users provide an SNS topic for an S3 data source. This way, the streaming execution can find the "delta" (the list of changed files) efficiently instead of listing the whole table.
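
As a rough sketch of what "finding the delta" from a notification could look like: the snippet below extracts the changed file paths from one S3 event notification. It assumes the event arrives wrapped in a standard SNS envelope (the S3 event JSON is a string in the `Message` field), and it only handles `ObjectCreated:*` events. The function name `changed_files` is hypothetical, and how Flint would actually consume the topic is not decided in this issue.

```python
import json
from urllib.parse import unquote_plus


def changed_files(sns_envelope):
    """Return s3:// paths of newly created objects in one SNS-wrapped S3 event."""
    envelope = json.loads(sns_envelope)
    event = json.loads(envelope["Message"])
    paths = []
    # S3 test events carry no "Records" field, so default to an empty list.
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types in this sketch
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths
```

With this approach the work per refresh is proportional to the number of notifications received, not to the total number of partitions and files.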

Questions to think about:

  1. Is this option provided on the source table or on the Flint index DDL statement?
  2. Do we only handle new changes via notifications, or can we also load cold data?

What alternatives have you considered?

Provide some way for users to refresh the source table metadata periodically. However, it is not yet clear how to do this because:

  1. Spark Hive table: the MSCK REPAIR TABLE statement works for this purpose, but Hive tables don't support Spark Structured Streaming
  2. Spark data source table: as mentioned above, FileStreamSource polls the S3 file list

Do you have any additional context?

N/A
