
🎉 Migrate OSS to temporal scheduler #12757

Merged — 22 commits merged into master on May 19, 2022

Conversation

lmossman
Contributor

@lmossman lmossman commented May 11, 2022

What

See issue comment for context: #10021 (comment)
Resolves #10021

Migrates OSS to the new scheduler.

How

This PR sets the usesNewScheduler feature flag to true for everyone regardless of the environment variable value.
I also added logic that sets a database metadata flag to true after the migration is performed, and skips the migration if the flag is already true, so that this server logic does not need to run on every startup. I borrowed this idea from Benoit's PR here.

I have tested this out locally and it seems to properly migrate my deployment from the old scheduler to the new one.
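The run-once guard described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual Airbyte code: the class and field names are hypothetical, and an in-memory map stands in for the jobs database metadata table.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the run-once migration guard: check a persisted metadata
// flag before migrating, and set it afterwards so later startups skip the work.
public class SchedulerMigrationGuard {

  public static final String SCHEDULER_MIGRATION_STATUS = "schedulerMigration";

  // Stand-in for the jobs database metadata table.
  public final Map<String, String> metadata = new HashMap<>();
  public int migrationRuns = 0;

  // Mirrors the metadata-flag check: has the migration already been performed?
  public boolean isSchedulerMigrated() {
    return Boolean.parseBoolean(metadata.getOrDefault(SCHEDULER_MIGRATION_STATUS, "false"));
  }

  // Runs the (stand-in) migration at most once, then records completion.
  public void migrateIfNeeded() {
    if (isSchedulerMigrated()) {
      return; // a previous startup already migrated; nothing to do
    }
    migrationRuns++; // placeholder for the actual migration work
    metadata.put(SCHEDULER_MIGRATION_STATUS, "true");
  }
}
```

Calling `migrateIfNeeded()` repeatedly performs the migration only on the first call, which is the idempotence property the PR relies on at server startup.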

Recommended reading order

  1. ServerApp.java
  2. EnvVariableFeatureFlags.java
  3. DefaultJobPersistence.java

🚨 User Impact 🚨

The impact to the user is that they will be migrated to the new temporal-based scheduler upon upgrade, which should improve OSS Airbyte deployments' ability to handle a large number of connections.

A smooth migration requires users to spin down their existing deployment and spin it up on the upgraded version, as described in our docs here, which should prevent the old scheduler from running during the migration.
If any jobs were actively running at the time of the upgrade, then the next time those connections perform a sync, those zombie jobs will be marked as failed.

@github-actions github-actions bot added area/platform issues related to the platform area/scheduler area/server labels May 11, 2022
@@ -233,6 +233,10 @@ public static ServerRunnable getServer(final ServerFactory apiFactory,
final Flyway jobsFlyway = FlywayFactory.create(jobsDataSource, DbMigrationHandler.class.getSimpleName(), JobsDatabaseMigrator.DB_IDENTIFIER,
JobsDatabaseMigrator.MIGRATION_FILE_LOCATION);

// It is important that the migration to the temporal scheduler is performed before the server accepts any requests.
// This is why this migration is performed here instead of in the bootloader - so that the server blocks on this.
migrateExistingConnectionsToTemporalScheduler(configRepository, jobPersistence, eventRunner);
lmossman (Contributor, Author):

Recall that, as I described in this comment, the logic to clean up non-terminal jobs is only executed just before a new job is created by the temporal workflow.

One consequence of this is that if a user upgrades while a job is running, then that job will continue to have that RUNNING status until the next time that the connection is synced. So in the worst case, this means 24 hours could pass before that zombie job is marked as FAILED and a new job is created.

This isn't an ideal situation, but I think it is tolerable as a one-time occurrence for this migration.

lmossman (Contributor, Author):

Just to note, this was fixed in that other PR by also cleaning up jobs at the beginning of the temporal workflow, so this should no longer be an issue

jdpgrailsdev (Contributor) left a comment:

:shipit:

Follow up question: is there a plan/scheduled work to come back and a) remove the environment variable that represented the feature flag and b) remove/clean up the code so that there is no mention of "new" scheduler, now that there is only one?

@jdpgrailsdev
Copy link
Contributor

@lmossman Looks like some files in the PR need formatting.

private final String SCHEDULER_MIGRATION_STATUS = "schedulerMigration";

@Override
public boolean isSchedulerMigrated() throws IOException {
Contributor:

Is this something that we can remove the next time we do a major version bump? I don't think we should do it just for this, but if we can remove it then, it would be good to make sure we track or label it somehow.

lmossman (Contributor, Author):

Yeah we should be able to remove this on the next major version bump. I will add a comment in the code and also make a ticket for tracking

Contributor:

@lmossman Do you have a ticket for tracking this? I have some elements I want to add to the list of things we want to do for the next major bump.

lmossman (Contributor, Author):

Yes here is the ticket: #12823

LOGGER.info("Migration to temporal scheduler has already been performed");
return;
}

LOGGER.info("Start migration to the new scheduler...");
final Set<UUID> connectionIds =
configRepository.listStandardSyncs().stream()
.filter(standardSync -> standardSync.getStatus() == Status.ACTIVE || standardSync.getStatus() == Status.INACTIVE)
.map(standardSync -> standardSync.getConnectionId()).collect(Collectors.toSet());
eventRunner.migrateSyncIfNeeded(connectionIds);
Contributor:

Should we add a test that checks that we don't run anything if the jobPersistence returns true for isSchedulerMigrated?

lmossman (Contributor, Author):

Yeah I can add that test

lmossman (Contributor, Author):

I have added tests for both cases
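A test of that guard might look something like the following sketch, which uses a hand-rolled fake and plain assertions instead of the project's actual persistence, event-runner, and test classes. All names here are illustrative, not the real Airbyte test code.

```java
// Hypothetical sketch: verify the migration is skipped when the persistence
// layer reports it has already run, and executed exactly once otherwise.
public class MigrateIfNeededSketch {

  // Minimal stand-in for the event runner; counts how often migration work runs.
  public static class FakeEventRunner {
    public int migrateCalls = 0;

    public void migrateSyncIfNeeded() {
      migrateCalls++;
    }
  }

  // Stand-in for the server-side guard: consult the migrated flag first.
  public static void migrateExistingConnections(final boolean alreadyMigrated,
                                                final FakeEventRunner eventRunner) {
    if (alreadyMigrated) {
      return; // a previous startup already migrated; do nothing
    }
    eventRunner.migrateSyncIfNeeded();
  }
}
```

The two cases mentioned above correspond to calling the guard once with the flag set (expecting zero migration calls) and once without it (expecting exactly one).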

@lmossman
Contributor Author

> :shipit:
>
> Follow up question: is there a plan/scheduled work to come back and a) remove the environment variable that represented the feature flag and b) remove/clean up the code so that there is no mention of "new" scheduler, now that there is only one?

@jdpgrailsdev yes this work is planned, we briefly discussed it this morning during sprint planning. This is the ticket: #8445 (still needs to be fleshed out but it is basically what you described)

Base automatically changed from lmossman/start-temporal-from-clean-state to lmossman/repair-unexpected-temporal-state May 13, 2022 00:42
@lmossman
Contributor Author

lmossman commented May 13, 2022

I merged the base branch of this PR into master without realizing that would auto-close this PR. Reopening this and rebasing it back on master

@lmossman lmossman reopened this May 13, 2022
@lmossman lmossman changed the base branch from lmossman/repair-unexpected-temporal-state to master May 13, 2022 00:56
@github-actions github-actions bot added area/api Related to the api area/worker Related to worker labels May 13, 2022
@lmossman lmossman force-pushed the lmossman/migrate-oss-to-temporal-scheduler branch from 2ffe5f8 to e452515 Compare May 13, 2022 01:00
@github-actions github-actions bot removed area/worker Related to worker area/api Related to the api labels May 13, 2022
@github-actions github-actions bot added the area/worker Related to worker label May 18, 2022
@github-actions github-actions bot removed the area/worker Related to worker label May 18, 2022
@lmossman lmossman changed the title Migrate OSS to temporal scheduler ✨ Migrate OSS to temporal scheduler May 19, 2022
@lmossman lmossman merged commit 26ed385 into master May 19, 2022
@lmossman lmossman deleted the lmossman/migrate-oss-to-temporal-scheduler branch May 19, 2022 00:05
@marcosmarxm
Member

@lmossman Is there any impact for users? What improvements can OSS users expect with the new scheduler? Asking because this could be a good feature release.
@Amruta-Ranade fyi.

@Amruta-Ranade
Contributor

If it is a user-facing feature, can you add the 🎉 emoji in the PR title so I will know to add it to the changelog?

@lmossman lmossman changed the title ✨ Migrate OSS to temporal scheduler 🎉 Migrate OSS to temporal scheduler May 19, 2022
@lmossman
Contributor Author

@marcosmarxm @Amruta-Ranade I described the user impact in the PR description. The migration to the temporal scheduler will happen without any user action required, so users won't necessarily see any change in the user interface. But now that scheduling is handled by Temporal, OSS deployments should be able to handle a much larger number of active connections.

And for OSS deployments on Kubernetes, after upgrading to the next version that includes this change, users should be able to perform high-availability upgrades of Airbyte without causing job failures. That is, in future upgrades, kube users can just run kubectl apply -k kube/overlays/stable without first needing to scale down Airbyte.

I updated the PR title with the correct emoji, but that won't change the commit history on master since this PR was already merged in.

suhomud pushed a commit that referenced this pull request May 23, 2022
* Migrate OSS to temporal scheduler

* add comment about migration being performed in server

* add comments about removing migration logic

* formatting and add tests for migration logic

* rm duplicated test

* remove more duplicated build task

* remove retry

* disable acceptance tests that call temporal directly when on kube

* set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent

* set default value of container orchestrator enabled to true

* Revert "set default value of container orchestrator enabled to true"

This reverts commit 21b3670.

* Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent"

This reverts commit 6dd2ec0.

* Revert "Revert "set NEW_SCHEDULER and CONTAINER_ORCHESTRATOR_ENABLED env vars to true to be consistent""

This reverts commit 2f40f9d.

* Revert "Revert "set default value of container orchestrator enabled to true""

This reverts commit 26068d5.

* fix sync workflow test

* remove defunct cancellation tests due to internal temporal error

* format - remove unused imports

* revert changes that set container orchestrator enabled to true everywhere

* remove NEW_SCHEDULER feature flag from .env files, and set CONTAINER_ORCHESTRATOR_ENABLED flag to true for kube .env files

Co-authored-by: Benoit Moriceau <[email protected]>