Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-46641][SS] Add maxBytesPerTrigger threshold #44636

Conversation

MaxNevermind
Copy link
Contributor

@MaxNevermind MaxNevermind commented Jan 9, 2024

What changes were proposed in this pull request?

This PR adds Input Streaming Source's option maxBytesPerTrigger for limiting the total size of input files in a streaming batch. Semantics of maxBytesPerTrigger is very close to already existing one maxFilesPerTrigger option.

How a feature was implemented?

Because maxBytesPerTrigger is semantically close to maxFilesPerTrigger I used all the maxFilesPerTrigger usages in the whole repository as a potential places that requires changes, that includes:

  • Option paramater definition
  • Option related logic
  • Option related ScalaDoc and MD files
  • Option related test

I went over the usage of all usages of maxFilesPerTrigger in FileStreamSourceSuite and implemented maxBytesPerTrigger in the same fashion as those two are pretty close in their nature. From the structure and elements of ReadLimit I've concluded that current design implies only one simple rule for ReadLimit, so I openly prohibited the setting of both maxFilesPerTrigger and maxBytesPerTrigger at the same time.

Why are the changes needed?

This feature is useful for our and our sister teams and we expect it will find a broad acceptance among Spark users. We have a use-case in a few of the Spark pipelines we support when we use Available-now trigger for periodic processing using Spark Streaming. We use maxFilesPerTrigger threshold for now, but this is not ideal as Input file size might change with the time which requires periodic configuration adjustment of maxFilesPerTrigger. Computational complexity of the job depends on the event count/total size of the input and maxBytesPerTrigger is a better predictor of that than maxFilesPerTrigger.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

New unit tests were added or existing maxFilesPerTrigger test were extended. I searched maxFilesPerTrigger related test and added new tests or extended existing ones trying to minimize and simplify the changes.

Was this patch authored or co-authored using generative AI tooling?

No.

@MaxNevermind
Copy link
Contributor Author

MaxNevermind commented Jan 9, 2024

@viirya
Please review this PR.

@viirya
Copy link
Member

viirya commented Jan 9, 2024

@MaxNevermind Could you create a Jira ticket and replace SPARK-XXXX with Jira ticket number in the PR title? Thanks.

@MaxNevermind
Copy link
Contributor Author

@MaxNevermind Could you create a Jira ticket and replace SPARK-XXXX with Jira ticket number in the PR title? Thanks.

Will do, but I didn't have a ASF JIra account, I've just requested it using a ASF self service, it says that it will take few days to review my request.

@viirya
Copy link
Member

viirya commented Jan 9, 2024

I think your Jira account was created now.

@MaxNevermind MaxNevermind changed the title [SPARK-XXXX][SS] Add maxBytesPerTrigger threshold [SPARK-46641][SS] Add maxBytesPerTrigger threshold Jan 10, 2024
@MaxNevermind
Copy link
Contributor Author

@viirya
Fixed the issues, tests are passing now.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most issues are addressed internally so I don't see other issues now. I took the tests now and looks okay to me.

cc @HeartSaVioR

@dongjoon-hyun
Copy link
Member

Sorry for being late. I'm review now, @MaxNevermind .

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (except a few minor style issues), @MaxNevermind .

@MaxNevermind
Copy link
Contributor Author

@viirya @dongjoon-hyun
For PR to be merged is it supposed to reach a certain count of approves? Just trying to understand the next steps.

@dongjoon-hyun
Copy link
Member

To @MaxNevermind , this is still under review technically because the previous approval is not for the last commit.

@MaxNevermind
Copy link
Contributor Author

@dongjoon-hyun
Should I re-request reviews?

@dongjoon-hyun
Copy link
Member

We are already on this PR. Do you want to bring someone-else?

@MaxNevermind
Copy link
Contributor Author

No, just want to confirm if there is no actin point to me here right now.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To @viirya and @MaxNevermind . What is the conclusion for this discussion? I reviewed this PR Today once more. IIUC, this is the last remaining issue on this PR, isn't it? Did you make any conclusion on this?

Screenshot 2024-02-06 at 8 48 20 AM

@viirya
Copy link
Member

viirya commented Feb 6, 2024

Replied the comment. I also replied one issue on truncating BigInt value when comparing it with a long: #44636 (comment)

@MaxNevermind
Copy link
Contributor Author

Pushed another commit. One issue was resolved.
@viirya
please check out a solution for the last remaining proposed issue above

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM, too.

Feel free to merge if CI passes.

@MaxNevermind
Copy link
Contributor Author

@dongjoon-hyun
I think I don't have a write access to merge PR, see a screenshot bellow. Is there anything I need to do to attain it or maybe you can do it for me? Btw I guess commits need to be squashed.

Screenshot 2024-02-08 at 9 37 47 PM

@dongjoon-hyun
Copy link
Member

Hi, @MaxNevermind . It was a comment for @viirya .

Let me merge this. :)

@dongjoon-hyun
Copy link
Member

Merged to master for Apache Spark 4.0.0.

Congratulations for your first commit, @MaxNevermind .

I added you to the Apache Spark contributor group and assigned SPARK-46641 to you, @MaxNevermind .

@viirya
Copy link
Member

viirya commented Feb 9, 2024

Thank you @MaxNevermind @dongjoon-hyun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants