Generate CSV data per Spark version for tools [databricks] #10440

Merged (2 commits) on Feb 21, 2024

Conversation

@jlowe (Member) commented Feb 16, 2024

Fixes #10424. Adds CSV generation per Apache Spark version under tools/generated_files. The existing CSV files for tools data have been preserved to avoid breaking downstream pipelines that use this data. Once those pipelines have been updated to consume the new per-Spark-version files, we can remove the old tools CSV files.
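
For illustration only (the version directories and CSV file names below are assumptions about the resulting layout, not taken verbatim from this change), the per-Spark-version output would look roughly like:

    tools/generated_files/supportedExprs.csv        # legacy location, preserved for existing pipelines
    tools/generated_files/330/supportedExprs.csv    # hypothetical: data generated against Spark 3.3.0
    tools/generated_files/340/supportedExprs.csv    # hypothetical: data generated against Spark 3.4.0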

@jlowe added the build label (Related to CI / CD or cleanly building) on Feb 16, 2024
@jlowe self-assigned this on Feb 16, 2024
@jlowe (Member Author) commented Feb 16, 2024

build

@jlowe (Member Author) commented Feb 16, 2024

build

@gerashegalov (Collaborator) left a comment

LGTM with optional nits

Review comment (Collaborator):

Should we name the dir tools-support to differentiate from the tools repo and to match the Maven module name?

Reply (Member Author):

Sounds good, but we probably want to address this as a follow-up. I don't want to break downstream pipelines that expect to find things under tools/generated_files/; those would need to be updated to check the new location if the old one is missing.

@@ -264,7 +264,7 @@ function build_single_shim() {
     -Drat.skip="$SKIP_CHECKS" \
     -Dmaven.scaladoc.skip \
     -Dmaven.scalastyle.skip="$SKIP_CHECKS" \
-    -pl aggregator -am > "$LOG_FILE" 2>&1 || {
+    -pl tools -am > "$LOG_FILE" 2>&1 || {
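
For context on the flags in this hunk: -pl selects the Maven modules to build and -am also builds the modules they depend on, so pointing the reactor at tools instead of aggregator still builds aggregator (as a dependency) while additionally producing the tools CSV data. A minimal standalone sketch of the same idea, assuming the usual spark-rapids buildver property (the version value here is hypothetical):

    # Build the tools module plus every module it depends on for one shim
    mvn install -pl tools -am -Dbuildver=330 -DskipTests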
Review comment (Collaborator):

Previously the idea was to build the minimum set of artifacts for dist? Should buildall have a separate option for including tools?

Reply (Member Author):

Should there be an argument to not build tools? I would expect buildall to be used by developers as an easy way to build all the Spark versions, and more often than not they want to build the tools CSV data to know whether they need to check in updated files as part of their PR. The build of these files is quick, so we're not saving much by making the default skip the tools data, and I think building the tools data is a better default. Thoughts?

Reply (Collaborator):

It adds less than 4 seconds to the parallel part of the build, so (totalShimsInProfile / numParallelShims) * 4 sec is acceptable. Agreed, there is no pressing need to add a skip option now.
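
To make that estimate concrete with hypothetical numbers (not from this discussion): a profile with 20 shims built 5 at a time would add roughly (20 / 5) * 4 sec = 16 seconds of wall-clock time.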

@gerashegalov (Collaborator) commented

build

Labels: build (Related to CI / CD or cleanly building)

Linked issue: Generated supported files per Spark version

3 participants