Run promotion jobs in parallel #4747

Merged: 2 commits, Jun 6, 2024

Member

@gaiksaya gaiksaya commented Jun 6, 2024

Description

The release promotion job currently takes 1-2 hours because all of its jobs run serially. This PR converts those jobs to run in parallel, reducing the run time by 75%. The OpenSearch tarball promotion must be the last job to execute, because its artifacts are promoted to Maven Central. Hence the OpenSearch x64 promotion is triggered serially, as the final job, after all the parallel jobs have completed.
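A minimal sketch of the layout described above, assuming a declarative pipeline: one stage fans the promotion jobs out with `parallel`, and a final serial stage promotes the OpenSearch x64 tarball only after every branch has succeeded. The job names and parameter names are taken from the snippets quoted later in this conversation; the outer stage name, `agent any`, and the choice of only two parallel branches are assumptions made for brevity, not the contents of the merged file.

```groovy
// Sketch only: the real pipeline has more branches and uses Docker agents.
pipeline {
    agent any
    options {
        timeout(time: 4, unit: 'HOURS')
    }
    parameters {
        string(name: 'RELEASE_VERSION', description: 'Release version, e.g. x.y.z')
        string(name: 'OPENSEARCH_RC_BUILD_NUMBER', description: 'OpenSearch release candidate build number')
        string(name: 'OPENSEARCH_DASHBOARDS_RC_BUILD_NUMBER', description: 'OpenSearch Dashboards release candidate build number')
    }
    stages {
        // Hypothetical stage name: every branch inside runs concurrently.
        stage('Promote in parallel') {
            parallel {
                stage('OpenSearch Yum promotion') {
                    steps {
                        build job: 'distribution-promote-repos', wait: true, parameters: [
                            string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-opensearch'),
                            string(name: 'DISTRIBUTION_REPO_TYPE', value: 'yum'),
                            string(name: 'DISTRIBUTION_BUILD_NUMBER', value: params.OPENSEARCH_RC_BUILD_NUMBER),
                            string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/opensearch-${params.RELEASE_VERSION}.yml"),
                        ]
                    }
                }
                stage('OpenSearch Dashboards Linux tar x64') {
                    steps {
                        build job: 'distribution-promote-artifacts', wait: true, parameters: [
                            string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-opensearch-dashboards'),
                            string(name: 'DISTRIBUTION_PLATFORM', value: 'linux'),
                            string(name: 'DISTRIBUTION_NAME', value: 'tar'),
                            string(name: 'DISTRIBUTION_ARCHITECTURE', value: 'x64'),
                            string(name: 'DISTRIBUTION_BUILD_NUMBER', value: params.OPENSEARCH_DASHBOARDS_RC_BUILD_NUMBER),
                            string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/opensearch-dashboards-${params.RELEASE_VERSION}.yml"),
                        ]
                    }
                }
            }
        }
        // Runs serially, and only if every parallel branch above succeeded.
        stage('OpenSearch Linux tar x64') {
            steps {
                build job: 'distribution-promote-artifacts', wait: true, parameters: [
                    string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-opensearch'),
                    string(name: 'DISTRIBUTION_PLATFORM', value: 'linux'),
                    string(name: 'DISTRIBUTION_NAME', value: 'tar'),
                    string(name: 'DISTRIBUTION_ARCHITECTURE', value: 'x64'),
                    string(name: 'DISTRIBUTION_BUILD_NUMBER', value: params.OPENSEARCH_RC_BUILD_NUMBER),
                    string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/opensearch-${params.RELEASE_VERSION}.yml"),
                ]
            }
        }
    }
}
```

Because a failed branch fails the parallel stage, the stages after it are skipped by default, which is what keeps the Maven-bound tarball promotion from running after a partial failure.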

Issues Resolved

closes #4748

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Sayali Gaikawad <[email protected]>
@gaiksaya gaiksaya changed the title from "Run promotion jobs parallely" to "Run promotion jobs in parallel" on Jun 6, 2024
Member

@getsaurabh02 getsaurabh02 left a comment

Thanks for the changes, @gaiksaya. LGTM, with a few minor questions/comments.

The release promotion job today takes 1-2 hours to run as all jobs run parallel.

Did we mean to say that all jobs run serially (as of today)?


pipeline {
options {
timeout(time: 4, unit: 'HOURS')
Member

Does a 4-hour timeout still remain relevant when jobs running in parallel should finish much faster?

Member Author

It can be reduced to 2 now, just to be safe. Bringing up new agents sometimes takes time, depending on availability and on other jobs running in parallel, so this leaves a buffer.

Member

Sure. Is there a way to make it configurable (dynamic) for better tuning too?
I am thinking about how we can reach a sweet spot where we learn of failures early enough without triggering false positives. This can be followed up separately in its own issue if you think it's worth pursuing.

Member Author

The timeout is the maximum time allowed in case the job hangs for some reason. Apart from that, it has no use. It was mainly added due to infrastructure constraints.
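One way the configurable timeout raised in this thread could look is sketched below. This is an assumption, not part of the PR: the `PROMOTION_TIMEOUT_HOURS` parameter is hypothetical, and the sketch wraps a single downstream trigger in the `timeout` step rather than changing the pipeline-level `options` block.

```groovy
// Sketch only: PROMOTION_TIMEOUT_HOURS is a hypothetical parameter, not one defined in this PR.
pipeline {
    agent any
    parameters {
        string(name: 'PROMOTION_TIMEOUT_HOURS', defaultValue: '2', description: 'Maximum hours to wait for a promotion job')
    }
    stages {
        stage('OpenSearch Yum promotion') {
            steps {
                script {
                    // String parameters arrive as text, so convert before passing to timeout.
                    int hours = params.PROMOTION_TIMEOUT_HOURS.toInteger()
                    timeout(time: hours, unit: 'HOURS') {
                        build job: 'distribution-promote-repos', wait: true
                    }
                }
            }
        }
    }
}
```

A per-trigger timeout like this would abort only the hung branch's wait, while the pipeline-level timeout would remain the overall safety net described above.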

}

@Test
void shouldExecuteWithoutErrors() {
Member

Is there a way to also check that the jobs were actually executed in parallel? The assertions only state that they were executed.

Member Author

Not really! The end state of the job can only be mocked on our end, or observed by actually running it on Jenkins.

Member

Hmm, I wish there were a mechanism like a latch or a clock counter.

Signed-off-by: Sayali Gaikawad <[email protected]>
Comment on lines +73 to +91
stage('OpenSearch Yum promotion') {
    agent {
        docker {
            label AGENT_LINUX_X64
            image 'docker/library/alpine:3'
            registryUrl 'https://public.ecr.aws/'
            alwaysPull true
        }
    }
    steps {
        echo 'Triggering distribution-promote-repos for OpenSearch Yum'
        build job: 'distribution-promote-repos', wait: true, parameters: [string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-opensearch'),
            string(name: 'DISTRIBUTION_REPO_TYPE', value: 'yum'),
            string(name: 'DISTRIBUTION_BUILD_NUMBER', value: params.OPENSEARCH_RC_BUILD_NUMBER),
            string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/opensearch-${params.RELEASE_VERSION}.yml"),
        ]
        echo 'Promotion successful for OpenSearch yum!'
    }
}
Member

Since we are repeating the steps for each stage with some variation, I am wondering if there is a better way to do this with some code restructuring, such as defining the stages and steps first and then iterating over them.
I understand if that's not possible in scripts like this.

Member Author

@gaiksaya gaiksaya Jun 6, 2024

I believe it is possible by putting it into Groovy scripts, but we want to control each job's execution and parameters at the lower level here. Putting it all in one single script or library is prone to errors and makes debugging harder. There is definitely scope for improvement once we are sure of the process.

Member

nice, maybe worth exploring this separately?
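For reference, the restructuring discussed in this thread could be sketched as below: describe each promotion as data, build a map of branch closures from it, and hand the map to the scripted `parallel` step. This is an assumption about one possible shape, not something the PR implements. The `promotions` list, its keys, and the stage name are hypothetical; the job and parameter names are taken from the snippets in this conversation.

```groovy
// Sketch only: the promotions list and its keys are hypothetical.
pipeline {
    agent any
    stages {
        stage('Promote yum repos in parallel') {
            steps {
                script {
                    def promotions = [
                        [product: 'opensearch', buildNumber: params.OPENSEARCH_RC_BUILD_NUMBER],
                        [product: 'opensearch-dashboards', buildNumber: params.OPENSEARCH_DASHBOARDS_RC_BUILD_NUMBER],
                    ]
                    def branches = [:]
                    for (entry in promotions) {
                        // Copy the loop variable so each closure captures its own value.
                        def item = entry
                        branches[item.product + ' yum promotion'] = {
                            build job: 'distribution-promote-repos', wait: true, parameters: [
                                string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-' + item.product),
                                string(name: 'DISTRIBUTION_REPO_TYPE', value: 'yum'),
                                string(name: 'DISTRIBUTION_BUILD_NUMBER', value: item.buildNumber),
                                string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/${item.product}-${params.RELEASE_VERSION}.yml"),
                            ]
                        }
                    }
                    // Scripted parallel step: runs every closure in the map concurrently.
                    parallel branches
                }
            }
        }
    }
}
```

The trade-off matches the concern raised above: the loop removes repetition, but each job's parameters are no longer visible as a separate, individually editable stage.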

Member Author

gaiksaya commented Jun 6, 2024

Did we mean to say all jobs run serially (as of today)

Yes! That is correct. It was kind of a POC that worked (thanks to @prudhvigodithi).

assertCallStack().contains("release-promotion-parallel.string({name=DISTRIBUTION_NAME, value=tar})")
assertCallStack().contains("release-promotion-parallel.string({name=DISTRIBUTION_ARCHITECTURE, value=x64})")

// OpenSearch Linux tar x64
Member

How do we know we have covered all the required steps in a workflow like this? Is there a separate workflow model or state machine that can be the source of truth?

Member Author

The wait: true parameter is responsible for returning the status of the triggered job. It propagates back the state.
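As a small illustration of the behaviour described above: with `wait: true`, the `build` step blocks until the downstream job finishes, and its `propagate` option (enabled by default) makes the calling stage take on the downstream result. The sketch below turns `propagate` off only to show that the returned object exposes the result explicitly; the stage name and the error message are assumptions.

```groovy
// Sketch only: shows how the downstream result can be inspected by hand.
pipeline {
    agent any
    stages {
        stage('OpenSearch Yum promotion') {
            steps {
                script {
                    // propagate: false keeps this step from failing automatically,
                    // so the result can be checked explicitly below.
                    def downstream = build job: 'distribution-promote-repos', wait: true, propagate: false
                    echo "distribution-promote-repos finished with result: ${downstream.result}"
                    if (downstream.result != 'SUCCESS') {
                        error "Promotion failed with result ${downstream.result}"
                    }
                }
            }
        }
    }
}
```

With the default `propagate: true`, as in the stages of this PR, the explicit check is unnecessary because a failed downstream job fails the triggering stage directly.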

echo 'Promotion successful for OpenSearch Dashboards Linux tar arm64!'
}
}
stage('OpenSearch Dashboards Linux tar x64') {
Collaborator

Hey @gaiksaya, I think the idea is to run both the OS and OSD tar x64 jobs at the end.

Member Author

I see! Is there any reason to wait until the end for Dashboards too? I know OpenSearch goes last for Maven publishing reasons.

@prudhvigodithi
Member

Hey @gaiksaya, can we move OpenSearch Dashboards Linux tar x64 and OpenSearch Linux tar x64 to post success? As a rule of thumb today, we trigger the x64 tar for OS and OSD last, once all other promotion jobs have completed. Since this is now parallel, we should let all other promotions run in parallel and, once they succeed, run the two x64 tar jobs in parallel. Please check something like:

    post {
        success {
            stage('Triggering X64 Promotion Jobs') { 
                parallel {
                    stage('OpenSearch Dashboards Linux tar x64') {
                        agent {
                            docker {
                                label AGENT_LINUX_X64
                                image 'docker/library/alpine:3'
                                registryUrl 'https://public.ecr.aws/'
                                alwaysPull true
                            }
                        }
                        steps {
                            echo 'Triggering distribution-promote-artifacts for OpenSearch Dashboards Linux tar x64'
                            build job: 'distribution-promote-artifacts', wait: true, parameters: [string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-opensearch-dashboards'), 
                                                                                                string(name: 'DISTRIBUTION_PLATFORM', value: 'linux'),
                                                                                                string(name: 'DISTRIBUTION_NAME', value: 'tar'),
                                                                                                string(name: 'DISTRIBUTION_ARCHITECTURE', value: 'x64'),
                                                                                                string(name: 'DISTRIBUTION_BUILD_NUMBER', value: params.OPENSEARCH_DASHBOARDS_RC_BUILD_NUMBER),
                                                                                                string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/opensearch-dashboards-${params.RELEASE_VERSION}.yml"),
                                                                                            ]
                            echo 'Promotion successful for OpenSearch Dashboards Linux tar x64!'
                        }
                    }
                    stage('OpenSearch Linux tar x64') {
                        agent {
                            docker {
                                label AGENT_LINUX_X64
                                image 'docker/library/alpine:3'
                                registryUrl 'https://public.ecr.aws/'
                                alwaysPull true
                            }
                        }
                        steps {
                            echo 'Triggering distribution-promote-artifacts for OpenSearch Linux tar x64'
                            build job: 'distribution-promote-artifacts', wait: true, parameters: [string(name: 'DISTRIBUTION_JOB_NAME', value: 'distribution-build-opensearch'), 
                                                                                                string(name: 'DISTRIBUTION_PLATFORM', value: 'linux'),
                                                                                                string(name: 'DISTRIBUTION_NAME', value: 'tar'),
                                                                                                string(name: 'DISTRIBUTION_ARCHITECTURE', value: 'x64'),
                                                                                                string(name: 'DISTRIBUTION_BUILD_NUMBER', value: params.OPENSEARCH_RC_BUILD_NUMBER),
                                                                                                string(name: 'INPUT_MANIFEST', value: "${params.RELEASE_VERSION}/opensearch-${params.RELEASE_VERSION}.yml"),
                                                                                            ]
                            echo 'Promotion successful for OpenSearch Linux tar x64!'
                        }
                    }
                }
            }
        }
        always {
            node(AGENT_LINUX_X64) {
                checkout scm
                script {
                    postCleanup()
                }
            }
        }
    }

@prudhvigodithi
Member

Thanks for this change, @gaiksaya; it should really speed up the release promotion. You can directly modify the existing release-promotion.jenkinsfile file, right?

Member

@peterzhuamazon peterzhuamazon left a comment

We only need the OpenSearch x64 tar to be last, as its native plugins would override in the end. Everything else needs to happen before that run.

I am good with this parallel switch.

Thanks.

Member Author

gaiksaya commented Jun 6, 2024

Hey @gaiksaya, can we move OpenSearch Dashboards Linux tar x64 and OpenSearch Linux tar x64 to post success? As a rule of thumb today, we trigger the x64 tar for OS and OSD last, once all other promotion jobs have completed. Since this is now parallel, we should let all other promotions run in parallel and, once they succeed, run the two x64 tar jobs in parallel. Please check something like:

I believe the current setup will do the same. I am worried that using the post stage may cause issues in the run, and I am trying to keep post stages for side activities rather than the main workflow. In the serial run, the x64 job won't trigger unless all the parallel jobs succeed.

Member Author

gaiksaya commented Jun 6, 2024

Thanks for this change, @gaiksaya; it should really speed up the release promotion. You can directly modify the existing release-promotion.jenkinsfile file, right?

We can. I just wanted to keep that as a backup in case this workflow causes issues. Once we know it works, we can replace the original one and deprecate this. WDYT?

@gaiksaya gaiksaya merged commit 04be2b9 into opensearch-project:main Jun 6, 2024
10 checks passed
@gaiksaya gaiksaya deleted the add-parallel branch June 6, 2024 16:21
Development

Successfully merging this pull request may close these issues.

Run all promotion jobs in parallel
5 participants