Added changes to handle dependency check in FdbSyncd and FpmSyncd for warm-boot #1556
warmrestart/warmRestartAssist.cpp
Outdated
@@ -117,6 +117,16 @@ AppRestartAssist::cache_state_t AppRestartAssist::getCacheEntryState(const std::
throw std::logic_error("cache entry state is invalid");
}

void AppRestartAssist::appDataReplayed()
{
WarmStart::setWarmStartState(m_appName, WarmStart::REPLAYED);
Too much indentation. #Closed
Done
fdbsyncd/fdbsync.cpp
Outdated
else
{
SWSS_LOG_INFO("Module %s NOT Replayed or Reconciled %d",module.c_str(), (int) state);
//return false;
Remove unused code #WontFix
This is stubbed code so that the basic functionality will not fail until all the warm-reboot changes are available in the code base. Once all the warm-reboot changes are available, the actual code will be uncommented and the stub will be deleted
I'm not sure I follow. What other changes are needed to support warm-reboot? If this PR depends on them, just mark it as depending on those PRs; it won't get merged before they are available. That makes the dependencies easier to track, and automated testing can guard correctness.
Actually, this is the last PR to be merged; the other dependent PRs are already merged. Based on these comments, I will delete the stub and activate the actual code in this PR.
fdbsyncd/fdbsync.cpp
Outdated
{
SWSS_LOG_INFO("Module %s NOT Replayed or Reconciled %d",module.c_str(), (int) state);
//return false;
//Return true till all the dependant code is checked in
Add a blank after //
#Closed
Will do.
fdbsyncd/fdbsync.cpp
Outdated
{
vector<string> required_modules = {
"orchagent",
};
Too much indentation #Closed
will fix
fdbsyncd/fdbsync.cpp
Outdated
{
SWSS_LOG_INFO("Module %s NOT Reconciled %d",module.c_str(), (int) state);
//return false;
//Return True untill the dependant orchagent code is commited
The same #Closed
This code and comment will be deleted when all the dependent warm-reboot code is available in the code base. I will add the space, though.
sync.getRestartAssist()->readTablesToMap();

while (!sync.isIntfRestoreDone())
CPU is wasted on waiting. Could you subscribe to Redis instead? #WontFix
Since this is not a continuous busy wait (a sleep is present), it should not keep the CPU continuously busy. Also, there is nothing for fdbsyncd to do until the interface info is populated to the kernel after a system warm-reboot, so it needs to wait until then.
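For readers following this thread, here is a minimal sketch of the sleep-based wait described above. It is not the fdbsyncd code: `waitUntilRestored`, its parameters, and the error message are hypothetical, and the timeout is assumed to raise an exception, which is the behaviour the PR adopts later in this conversation.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical helper, not part of sonic-swss: polls `isDone` once per
// `pollInterval` and throws if `maxWait` elapses before it returns true.
inline void waitUntilRestored(const std::function<bool()>& isDone,
                              std::chrono::milliseconds pollInterval,
                              std::chrono::milliseconds maxWait)
{
    const auto start = std::chrono::steady_clock::now();
    while (!isDone())
    {
        if (std::chrono::steady_clock::now() - start > maxWait)
        {
            throw std::runtime_error("timed out waiting for interface restore");
        }
        // Sleeping yields the CPU between checks, so this is periodic
        // polling rather than a continuous busy wait.
        std::this_thread::sleep_for(pollInterval);
    }
}
```

A caller would pass something like `[&] { return sync.isIntfRestoreDone(); }` as the predicate. Subscribing to Redis, as the reviewer suggests, would remove the polling altogether at the cost of more wiring.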
replayCheckTimer.start();
s.addSelectable(&replayCheckTimer);

}
Remove extra blank line #Closed
will do
retest this please |
fdbsyncd/fdbsync.cpp
Outdated
else
{
SWSS_LOG_INFO("Module %s NOT Reconciled %d",module.c_str(), (int) state);
//return false;
Remove unused code. #WontFix
This is stubbed code so that the basic functionality will not fail until all the warm-reboot changes are available in the code base. Once all the warm-reboot changes are available, the actual code will be uncommented and the stub will be deleted
Same point here: the code should be correct, dependencies should be marked at the PR-description level, and the tests should indicate whether the PR is ready.
Will update the actual code flow as this is the last dependent PR.
retest this please |
retest this please |
retest this please |
retest vs please |
retest this please |
retest this please |
retest this please |
From the PR, it seems the orchagent dependency is pending and the function simply returns true. Given the priority of getting this PR into the release branch, I suggest removing the orchagent part from this PR and adding it later with the dependent orchagent changes.
fpmsyncd/fpmsyncd.cpp
Outdated
{
bool readyToReconcile = false;
Alignment issue. Please fix
Will Fix this
fpmsyncd/fpmsyncd.cpp
Outdated
if (temps == &warmStartTimer)
{
readyToReconcile = true;
Alignment issue. Please fix
Will Fix this
fpmsyncd/fpmsyncd.cpp
Outdated
else
{
readyToReconcile = sync.isReadyToReconcile();
Alignment issue. Please fix
fpmsyncd/routesync.cpp
Outdated
{
SWSS_LOG_INFO("Module %s NOT Reconciled %d",module.c_str(), (int) state);
//return false;
//Return true untill dependent module code is commited
what is the dependent module code? Are there any further changes expected?
Dependent code means all the PRs that add this dependency check. Since these are committed as separate PRs, I have stubbed out the actual dependency check. This ensures that if one of the PRs is not present, the dependency check will not fail. Once all the PRs related to this dependency check are merged, these stubs will be deleted and the actual check will be enabled.
The same applies to the above comment regarding the orchagent code. The orchagent code is already present. Once all the PRs (now only 1556 remains) are merged, the actual check will be activated.
Will commit the actual code and remove the stubbed code, as the other PRs are already merged and this is the last PR.
Fixing alignment issue
fdbsyncd/fdbsync.cpp
Outdated
SWSS_LOG_INFO("Module %s NOT Replayed or Reconciled %d",module.c_str(), (int) state);
//return false;
// Return true till all the dependant code is checked in
return true;
Why return true here? If we do nothing here, the function will return true at the end anyway, so why bother returning true here?
fpmsyncd/fpmsyncd.cpp
Outdated
else
{
readyToReconcile = sync.isReadyToReconcile();
Can you replace tab with white spaces?
@prsunny and team - can this one advance now? Thx. |
waiting on @zhenggen-xu to sign-off |
fdbsyncd/fdbsync.h
Outdated
@@ -10,7 +10,16 @@
 #include "warmRestartAssist.h"

 // The timeout value (in seconds) for fdbsyncd reconcilation logic
-#define DEFAULT_FDBSYNC_WARMSTART_TIMER 30
+#define DEFAULT_FDBSYNC_WARMSTART_TIMER 600
I am not sure I understand why we increase this timer to such a large value. This is the same timer value used before for the fdb table reconciliation logic itself, so do we expect fdb reconciliation to take much longer because of the dependencies? Now we have more timers: based on the code, we wait at least FDBSYNC_RECON_WAIT_TIME after replay before checking the orchagent reconciliation state, and if it is not ready, we check every second. Why is FDBSYNC_RECON_WAIT_TIME 120? This is also considerably large; we need some data to support this value. Also, if orchagent never reconciles, should we abort instead, i.e. treat the warm restart as failed? We need to define the behaviour of the timers mentioned above and document it.
if (pasttime > INTF_RESTORE_MAX_WAIT_TIME)
{
SWSS_LOG_INFO("timed-out before all interface data was replayed to kernel!!!");
What happens if intf is not restored after max_wait_time? Shouldn't we abort to avoid more issues?
The system will proceed further. Some MAC programming to the kernel might fail because the underlying interface is not yet created.
If we could not restore the interfaces, why should we proceed further and get into a limbo state that may or may not have critical issues? I would suggest we abort to bring this to the user's attention.
The interfaces will eventually be restored. The only impact is that warm-reboot might not be hitless and some traffic loss will be seen. I'm not sure we need a full abort that impacts everything and all traffic. Requesting @prsunny to comment on this too.
"Some mac programming to kernel might fail because underlying interface is not yet created." So will this condition be recovered by someone later? Again, if it is a critical condition, we should raise/abort so we don't get into a limbo state.
fpmsyncd/fpmsyncd.cpp
Outdated
{
if (sync.isReadyToReconcile())
{
reconcileHoldTimer.setInterval(timespec{2, 0});
What is the hold-timer for? Why do we need to wait another 2 seconds before reconcile happens?
Once the fpmsyncd reconcile is done, this timer checks whether orchagent is reconciled and we are ready to start updating the reconciled fpmsyncd entries into the APP-DB. Since we did not want to check too frequently, the timer is kept at 2 seconds.
reconcileHoldTimer is a one-shot timer in the code; it just means we wait 2 seconds once and then reconcile. The question was: why do we need to wait another 2 seconds? What are we waiting for?
reconcileHoldTimer is a one-shot timer that is fired once the control plane reconciliation is done. At that time, if orchagent is reconciled, we continue with fpmsyncd reconciliation. However, if orchagent is still not reconciled when reconcileHoldTimer expires, we fire reconcileHoldTimer for another 2 seconds and re-check whether orchagent has reconciled. This continues until orchagent has reconciled. The idea is to wait for at least the control plane (BGP) reconciliation time (default 120 seconds) and also check that orchagent is reconciled before reconciling fpmsyncd. Both conditions must be met. Orchagent reconcile can occur earlier or later depending on system scale, but we need to wait for the minimum control plane (BGP) reconciliation time before reconciling fpmsyncd.
A correction to the above explanation: reconcileCheckTimer (not reconcileHoldTimer) is fired again for 2 seconds to re-check whether orchagent has reconciled. Hence reconcileHoldTimer does not need 2 seconds. Will change it to 1 second, like the eoiu hold timer.
fpmsyncd/fpmsyncd.cpp
Outdated
else
{
readyToReconcile = sync.isReadyToReconcile();
This will basically invalidate the eoiu design for fast reconciliation. Understood that we probably have to rely on the orchagent reconciliation status; in that case DEFAULT_ROUTING_RECON_CHECK_INTERVAL should be reduced to a much smaller value to take advantage of eoiu.
DEFAULT_ROUTING_RECON_CHECK_INTERVAL is only a fallback timer. It only comes into action when the replay/reconcile does not happen in the given time. Hence it should not affect eoiu handling.
eoiuHoldTimer is a one-shot timer in the code as it stands. It could be triggered very early, e.g. after a few seconds, at which point readyToReconcile is likely false; the code then falls back to reconcileCheckTimer (if the timer is not configured, DEFAULT_ROUTING_RECON_CHECK_INTERVAL = 120 seconds). This is what I meant by invalidating the eoiu design. Example: if eoiu takes 10 seconds to finish and orchagent takes 15 seconds to reach the readyToReconcile state, we still need to wait at least 120 seconds to reconcile.
Got your point. When eoiuHoldTimer expires, the control plane (BGP) has converged, so we just need to wait for orchagent to reconcile and proceed once that is done. To implement this, I will restart the reconCheck timer with 2 seconds when eoiuHoldTimer expires, to check whether orchagent has reconciled.
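The reviewer's example (eoiu done at 10 seconds, orchagent ready at 15 seconds) can be worked through with a small timing model. This is only a sketch of the two behaviours discussed in this thread, not fpmsyncd code; both function names and the simplification that the fallback fires exactly at the check interval are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// All times are seconds since fpmsyncd started. Names are illustrative only.

// Behaviour the reviewer objects to: if orchagent is not ready when the
// one-shot eoiu hold timer fires, reconcile waits for the fallback interval.
inline std::uint32_t reconcileTimeFallbackOnly(std::uint32_t eoiuDone,
                                               std::uint32_t orchReady,
                                               std::uint32_t checkInterval)
{
    if (orchReady <= eoiuDone)
    {
        return eoiuDone; // both conditions already hold when eoiu completes
    }
    return std::max(checkInterval, orchReady);
}

// Proposed behaviour: after eoiu completes, re-check orchagent every
// `recheckPeriod` seconds (must be > 0) and reconcile on the first success.
inline std::uint32_t reconcileTimeWithRecheck(std::uint32_t eoiuDone,
                                              std::uint32_t orchReady,
                                              std::uint32_t recheckPeriod)
{
    std::uint32_t t = eoiuDone;
    while (t < orchReady)
    {
        t += recheckPeriod;
    }
    return t;
}
```

With the numbers from the comment above, the fallback-only model gives 120 seconds, while a 2-second re-check gives 16 seconds.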
fpmsyncd/fpmsyncd.cpp
Outdated
if (temps == &warmStartTimer)
{
readyToReconcile = true;
In case of orchagent still not ready, should we just abort?
Reconcile should take care of cleaning up stale entries, so I'm not sure an abort is necessary.
There was a reason we wait for "sync.isReadyToReconcile"; I assume it is a required condition for reconcile to work as expected. Again, if that condition is broken, we should make it visible to the user. This probably won't happen in the normal case, but if it does, we should give the user information to debug, so a raise or abort could help.
Again, same as earlier: orchagent will eventually reconcile. If it does not, it would have its own mechanism to recover/abort. Here we are making an effort to reconcile the system to its pre-warm-reboot state. If we continue, we would reconcile prematurely and warm-reboot may not be hitless. However, if we abort, we would impact everything and all traffic would be hit.
fpmsyncd/fpmsyncd.cpp
Outdated
@@ -16,7 +16,9 @@ using namespace swss;
  * Default warm-restart timer interval for routing-stack app. To be used only if
  * no explicit value has been defined in configuration.
  */
-const uint32_t DEFAULT_ROUTING_RESTART_INTERVAL = 120;
+const uint32_t DEFAULT_ROUTING_RESTART_INTERVAL = 600;
Same as above: we should clearly define these two timers, DEFAULT_ROUTING_RESTART_INTERVAL and DEFAULT_ROUTING_RECON_CHECK_INTERVAL, in documentation/code comments.
@zhenggen-xu , could you please check the updated code? |
/AzurePipelines run |
Azure Pipelines successfully started running 1 pipeline(s). |
@prsunny - what needs to happen to merge this? Are we just waiting for @zhenggen-xu approval? |
Yes @ben-gale . @zhenggen-xu and @yxieca |
When eoiuHoldTimer expires, the control plane (BGP) has converged, so we just need to wait for orchagent to reconcile and proceed once that is done. To implement this, restart the reconCheck timer when eoiuHoldTimer expires, to check whether orchagent has reconciled.
Based on discussion with @qiluo-msft, in the current warmboot design it is not required for fdbsyncd (or any application) to wait for orchagent reconciliation. Once orchagent reads the APP_DB data during warmboot start, applications can write to the DB, and orchagent treats those writes as normal operations post bake. Also, from a design perspective, it is not recommended for applications to wait for orchagent; they should be able to operate independently. Suggest removing the orchagent reconcile section from the PR. Let us know if you have any further questions.
Sure .. will make the change and re-submit |
Removed the dependency on orchagent reconciliation as per the review discussion and conclusion. Also added an exception when interfaces are not replayed to the kernel in the given time.
@zhenggen-xu , the PR has been restructured. Could you please take a look? |
if (sync.getFdbStateTable()->empty() && sync.getCfgEvpnNvoTable()->empty())
{
sync.getRestartAssist()->appDataReplayed();
SWSS_LOG_NOTICE("FDB Replay Complete");
removeSelectable for replayCheckTimer?
Also, since the replayCheckTimer and the reconciliation timer run in parallel, what is the consequence if the reconciliation timer expires but we haven't replayed? If replay is a must but is not yet done when the reconciliation timer expires, we should log the error and raise.
Will removeSelectable replayCheckTimer and start the reconciliation timer after replay is done.
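The ordering agreed here (replay first, then reconciliation) can be pictured as a small state model. This is a sketch under assumptions, not the fdbsyncd event loop: `Phase`, `FdbWarmStartModel`, and its members are made-up names, and the two booleans only stand in for the real timer and select calls.

```cpp
// Illustrative model only; these names are not taken from fdbsyncd.
enum class Phase { WaitingForReplay, Reconciling, Done };

struct FdbWarmStartModel
{
    Phase phase = Phase::WaitingForReplay;
    bool replayTimerSelectable = true;   // stands in for s.addSelectable(&replayCheckTimer)
    bool reconcileTimerRunning = false;

    // Called each time the replay-check timer fires.
    void onReplayCheck(bool pendingTablesEmpty)
    {
        if (phase != Phase::WaitingForReplay || !pendingTablesEmpty)
        {
            return; // keep polling until replay has finished
        }
        replayTimerSelectable = false;  // stands in for s.removeSelectable(&replayCheckTimer)
        reconcileTimerRunning = true;   // the reconciliation timer starts only now
        phase = Phase::Reconciling;
    }

    // Called when the reconciliation timer fires.
    void onReconcileTimer()
    {
        if (phase != Phase::Reconciling)
        {
            return; // ignored before replay: the timer has not been started yet
        }
        reconcileTimerRunning = false;
        phase = Phase::Done;
    }
};
```

Because the reconciliation timer is only started once replay completes, the case the reviewer worries about (reconciliation timer expiring before replay) cannot arise in this model.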
/*
* Default warm-restart timer interval for routing-stack app
*/
#define DEFAULT_FDBSYNC_WARMSTART_TIMER 120
As mentioned earlier, how does fdb depend on the routing stack? If the user configures the routing stack warm-restart timer to a larger value and the routing stack actually takes that long to reconcile, what is the consequence?
If the dependency is a must, we should probably also read the routing stack reconciliation status before we reconcile fdb here.
fdbsyncd reconciliation depends on BGP convergence time. Will change it to use the BGP warm-restart timer config value instead of a hardcoded value. That way the reconcile is tied to control plane convergence, the same way it's done in fpmsyncd. To further optimise the reconciliation time, the EOIU feature is implemented for fpmsyncd to check for actual protocol convergence. This is not yet validated for fdbsyncd and will be implemented later. For now fdbsyncd will only use the BGP warm-restart timer config value, as in fpmsyncd.
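A minimal sketch of the timer selection described in this reply, assuming the configured BGP warm-restart timer is available as an integer and that 0 means "not configured". The helper name and that convention are hypothetical; only the 120-second default comes from the diff under review.

```cpp
#include <cstdint>

// Value of DEFAULT_FDBSYNC_WARMSTART_TIMER in the diff under review (seconds).
constexpr std::uint32_t kDefaultFdbsyncWarmstartTimer = 120;

// Hypothetical helper: `configuredBgpTimer` is the BGP warm-restart timer read
// from configuration, with 0 assumed to mean "not configured".
inline std::uint32_t pickFdbReconcileTimer(std::uint32_t configuredBgpTimer)
{
    return configuredBgpTimer != 0 ? configuredBgpTimer
                                   : kDefaultFdbsyncWarmstartTimer;
}
```

Tying fdbsyncd to the same configured value means a user who lengthens the BGP warm-restart timer also lengthens the fdb reconcile wait, which addresses the reviewer's question about mismatched timers.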
fdbsyncd/fdbsync.h
Outdated
// The timeout value (in seconds) for fdbsyncd reconcilation logic
#define DEFAULT_FDBSYNC_WARMSTART_TIMER 30
/*
* Default warm-restart timer interval for routing-stack app
change the comment to default timer for fdb reconciliation.
Sure .. will do
SWSS_LOG_NOTICE("FDB Replay Complete");
s.removeSelectable(&replayCheckTimer);

/* Obtain warm-restart timer defined for routing application */
Add comments here for:
- fdb dependencies on routing application
- We should have TBD comment for addressing optimization of EOIU etc later. IMO, checking the bgp reconciliation state is a better way to handle the dependency. If we really have to do it next, let's track it with an issue, and add it to the code comment.
Sure will add the comment
Submitted issue #1657 to track the eoiu implementation for EVPN AF
Commenter does not have sufficient privileges for PR 1556 in repo Azure/sonic-swss |
retest this please |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
… warm-boot (sonic-net#1556) Added changes to handle dependency check in FpmSyncd and FdbSyncd for warmreboot. This was done to ensure for EVPN warm-reboot the order of data replay to kernel is maintained across various submodules and the kernel programming will be successful.
What I did Added platform pre check support in reboot script. Checking platform based changes before stopping dockers and sonic services. Porting changes in master from 201911 branch sonic-net#1472 How I did it On branch reboot_pre_check_master Changes not staged for commit: (use "git add ..." to update what will be committed) (use "git checkout -- ..." to discard changes in working directory) modified: scripts/reboot How to verify it Write a platform pre check script(platform_reboot_pre_check) and place it in /usr/share/sonic/device// directory. If the script exit with status 0, reboot will be proceeded. If script exit with non-zero status, the reboot script gets stopped.
What I did
Added changes to handle dependency check in FpmSyncd and FdbSyncd for warmreboot
Why I did it
This was done to ensure for EVPN warm-reboot the order of data replay to kernel is maintained across various submodules and the kernel programming will be successful.
How I verified it
Verified with EVPN warmreboot
Details if related
More details in warmreboot section of EVPN VXLAN HLD
sonic-net/SONiC#437