Design for provisioning alternative nodes on failure

First Proposed: 2021-08-25
Authors:
- Raj Dharwadkar
Relevant Team(s): Product Engineering, Provisioning
Status: [ Draft ]

Summary

Provisioning failures are primarily caused by hardware, system, or service failures. This design aims to make the provisioning workflow reliable, even in the face of an unreliable environment.

Background

Users see more provisioning failures in unreliable environments. Individual provisioning events have no timeout associated with them, so on any provisioning failure the user has to wait 25 minutes before the failure is reported.

Scope

Improve the provisioning success rate by retrying provisioning on failure.
Improve the user's provisioning experience and provisioning performance by returning a faster response from each provisioning event.

Goals

Increase provisioning success rate.
Make provisioning performance better.
Changes in provisioning success rate and performance can be tracked in the Grafana provisioning dashboard.

Non-Goals

Handling repeated provisioning failures with the same configuration. If provisioning continues to fail on the new hardware after the switch, the provisioning workflow does not continue on to further hardware.

Detailed design

Provision one server at a time. If provisioning fails, automatically switch to new hardware, start provisioning there with the same configuration, and deprovision the failed server. Note that users will be aware of the provisioning failure and of the switch to new hardware, because they have access to the server IP and console during the initial phase of provisioning.
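
The switch-on-failure flow can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual workflow code: `provision_with_failover`, `ProvisionResult`, and the `provision` / `deprovision` callables are hypothetical names standing in for whatever the provisioning service exposes. The `max_switches=1` default reflects the non-goal above: after one switch, the workflow stops instead of moving on to further hardware.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ProvisionResult:
    """Outcome of one provisioning request (hypothetical type)."""
    server_id: Optional[str]  # hardware that was provisioned, or None on failure
    attempted: List[str] = field(default_factory=list)
    deprovisioned: List[str] = field(default_factory=list)


def provision_with_failover(
    config: dict,
    candidates: List[str],
    provision: Callable[[str, dict], bool],
    deprovision: Callable[[str], None],
    max_switches: int = 1,
) -> ProvisionResult:
    """Provision one server at a time, switching hardware on failure.

    `provision` and `deprovision` are assumed callables; `provision`
    returns True on success. At most `max_switches` alternative servers
    are tried after the first one fails.
    """
    result = ProvisionResult(server_id=None)
    # First attempt plus at most `max_switches` alternatives.
    for server_id in candidates[: max_switches + 1]:
        result.attempted.append(server_id)
        if provision(server_id, config):
            result.server_id = server_id
            return result
        # Release the failed hardware before moving on.
        deprovision(server_id)
        result.deprovisioned.append(server_id)
    return result
```

The same `config` object is passed to every attempt, so the alternative server is provisioned with exactly the configuration the user requested.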

Additional: To prevent users from waiting 25 minutes on a provisioning failure, ensure that every provisioning event has a wait timeout associated with it, so that the user sees the failure early. To complement the wait timeout, each event should retry 2 or 3 times before returning or exiting.
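
A per-event wait timeout combined with a bounded number of retries might look like the sketch below. It assumes each provisioning event can be modeled as a `start` action plus a `check` poll that reports completion; those two callables, `EventTimeout`, `wait_for_event`, and `run_with_retries` are all hypothetical names, and the timeout and polling values are placeholders.

```python
import time
from typing import Callable


class EventTimeout(Exception):
    """Raised when a provisioning event does not complete in time."""


def wait_for_event(check: Callable[[], bool], timeout_s: float,
                   poll_s: float = 0.01) -> None:
    """Poll `check` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return
        if time.monotonic() >= deadline:
            raise EventTimeout(f"event not complete after {timeout_s}s")
        time.sleep(poll_s)


def run_with_retries(start: Callable[[], None], check: Callable[[], bool],
                     timeout_s: float, max_attempts: int = 3) -> int:
    """Start an event and wait for it, retrying on timeout.

    Returns the number of attempts used. Raises EventTimeout once
    `max_attempts` attempts have all timed out.
    """
    for attempt in range(1, max_attempts + 1):
        start()
        try:
            wait_for_event(check, timeout_s)
            return attempt
        except EventTimeout:
            continue  # retry the event
    raise EventTimeout(f"event did not complete after {max_attempts} attempts")
```

With this shape, the longest a single event can hold up the workflow is roughly `max_attempts * timeout_s`, so the per-event timeout should be chosen with the retry count in mind.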

Drawbacks

Additional hardware must be available in case of failure.
Provisioning may take longer on failure, because we switch to other hardware and retry provisioning.

Alternatives Considered

To decrease provisioning time relative to the first approach, we could provision two servers at a time with the same configuration. Once either of the two servers is provisioned successfully, the failed server can be deprovisioned. The deprovision workflow should be modified to make it faster, to limit the additional hardware usage.
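
For comparison, the two-at-a-time alternative could be sketched like this. It is an illustration only: `provision_pair` and the `provision` / `deprovision` callables are hypothetical, and the sketch assumes that when both servers succeed, the later one is treated as redundant and released as well. It does not cancel an attempt that is still in flight; the second server is released only once its own attempt returns.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence


def provision_pair(
    config: dict,
    servers: Sequence[str],
    provision: Callable[[str, dict], bool],
    deprovision: Callable[[str], None],
) -> Optional[str]:
    """Provision all `servers` concurrently and keep the first success.

    Every other server (failed, or successful but redundant) is
    deprovisioned. Returns the kept server, or None if all failed.
    """
    winner: Optional[str] = None
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = {pool.submit(provision, s, config): s for s in servers}
        for fut in as_completed(futures):
            server = futures[fut]
            try:
                ok = fut.result()
            except Exception:
                ok = False  # treat an exception as a failed attempt
            if ok and winner is None:
                winner = server
            else:
                deprovision(server)
    return winner
```

Both servers are held for the full duration of the slower attempt, which is the hardware cost this alternative trades for lower latency.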

Drawbacks

Almost twice as much hardware is required in the facilities, because two servers are provisioned at a time over a short period.
Provisioning failures may increase if there is a shortage of hardware in the facilities.

Unresolved questions

This section is optional but recommended for first drafts. What parts of this design have not yet been finalized?
