Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add counters enabling redesign document #918

Closed
wants to merge 15 commits into from

Conversation

noaOrMlnx
Copy link
Contributor

@noaOrMlnx noaOrMlnx commented Jan 4, 2022

enable_counters.py script sleeps for 3 minutes before enabling the counters.
This design purpose is to change the sleep mechanism to be event driven.

After the change, the script logic will be in flexCounterOrch and we will wait for the system to be stable and just then enable the counters.


### Ports Daemon

enable_counters.py script will be refactored to be running as a daemon.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be merged into the orchagent

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also add config db schame and yang schema for this feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lguohan
Updated the design according to your request.
regarding the config db schema & yang model - why do we need to add config db schema? we don't add anything new to DB - it's all internal.
If you mean the delay_status attributes, we already have schema & yang for them.
https://github.com/Azure/sonic-swss-common/blob/master/common/schema.h#L225
https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/yang-models/sonic-flex_counter.yang#L49

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in the config db, we need to add something to set the delay time, for example 300 seconds which allow user to specify how long they want to delay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lguohan
This PR is related to the fact we are moving from sleep mechanism to be event driven.
I agree that adding an option of how much delay the user wants is good, but I don't think it's related to this design.

If you want, our team will take this feature, but in another design and plan.
what do you think?

@bingwang-ms
Copy link
Contributor

Will this change increase the time cost of reconciliation after fast/warm reboot since the counter is expected to be enabled earlier after this change?

@yxieca yxieca requested a review from vaibhavhd January 10, 2022 17:23
@yxieca
Copy link
Contributor

yxieca commented Jan 10, 2022

@vaibhavhd please review from the fast/warm reboot perspective.

@liat-grozovik
Copy link
Collaborator

@vaibhavhd any feedback on warm/fast boot? we wish to close the coding part and would like to get your feedback soon please.
@lguohan HLD was updated following your comments. Could you please take further look?

@vaibhavhd
Copy link
Contributor

For fast/warm reboot -
I don't see concerns for warmboot.
For fastboot - do we have any prototype to be evaluated for the design? It will be interesting to measure CPU consumption at the time when counters are enabled much earlier than 3 minutes after reboot.
AFAIK, high CPU consumption due to enabled counters was the impacting factor in fastboot reconciliation. This caused the dataplane downtime to be > 30s (which is not accepted).

@vaibhavhd
Copy link
Contributor

More for fast-reboot:

@shlomibitton with this new enable_counters re-design, do we still need FLEX_COUNTER_DELAY_STATUS and its related changes?
FLEX_COUNTER_DELAY_STATUS was added to prevent FlexCountersOrch from enabling counters until enable_counters script is called.

Since enable_counters.py script is now not needed, and the delay mechanism is local to FlexCountersOrch, we probably not need the changes that were added to support new flag - FLEX_COUNTER_DELAY_STATUS.:

sonic-net/sonic-buildimage#8500
sonic-net/sonic-swss-common#523
sonic-net/sonic-swss#1877
sonic-net/sonic-utilities#1768

@shlomibitton
Copy link
Contributor

@vaibhavhd Yes, we don't need it anymore.
But the new change need to handle this internally.
Means, ignore all events from Config DB and enable after all ports are up or the timeout reached.

doc/counters_enabling_redesign.md Outdated Show resolved Hide resolved
## Requirements

Remove enable_counters.py script and merge the logic into FlexCountersOrch.
FlexCountersOrch will wait for system to be up using events from APP DB, then enable the counters using the logic written in enable_counters.py.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FlexCountersOrch will wait for system to be up

Please define/specify what is meant by system to be up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

doc/counters_enabling_redesign.md Outdated Show resolved Hide resolved
doc/counters_enabling_redesign.md Show resolved Hide resolved

- The daemon will also wait for a timer in order to be able to enable counters even if one of the ports is not stable.

If after 3 minutes (180 seconds), the counters were not enabled yet, enable counters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the offset for After 3 minutes? Is it reboot? Please specify that.

Also, in the present design we wait for 3m if the uptime was less than 5mins. I think you are covering this.

However we also have a wait for 1m in cases other than post-reboot. Do you not want to cover this scenario, and why not?

https://github.com/Azure/sonic-buildimage/blob/a0376a6e5972a7dde8810c6108bd9bf7a4c29e08/dockers/docker-orchagent/enable_counters.py#L57

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

The daemon will create a SelectableTimer and wait for the it to poke.

When doTask(SelectableTimer*) function will be triggered, it will check if the counters are already enabled.
If not, enable them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chances of race-condition here? What happens if counters enabling is in progress and the SelectableTimer is also triggered?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, after the enabling of counters, what happens to the new data structure defined in FlexCountersOrch?
There can be two cases to be covered here - event-driven mechanism (new design) and timeout (old design).

Copy link
Contributor Author

@noaOrMlnx noaOrMlnx Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chances of race-condition here? What happens if counters enabling is in progress and the SelectableTimer is also triggered?

If counters enabling is in progress, it means the daemon first changed a boolean to True. so when the timer will expire, it will check the boolean and will not enable the counters as well.

Also, after the enabling of counters, what happens to the new data structure defined in FlexCountersOrch? There can be two cases to be covered here - event-driven mechanism (new design) and timeout (old design).

The data structure will be in use only for the event-driven mechanism.
If we got to a point we need to enable the counters - we will enable the counters for all ports as it was in previous design.
Note: We are using the same behavior as in enable_counters.py script, but here we will trigger the counters enabling when all ports and LAGs are in their expected state.
If the system is not stable, we will do a fallback to the previous design - wait for the timer to expire.

@noaOrMlnx
Copy link
Contributor Author

@vaibhavhd Yes, we don't need it anymore. But the new change need to handle this internally. Means, ignore all events from Config DB and enable after all ports are up or the timeout reached.

@vaibhavhd @shlomibitton
In current design, FLEX_COUNTER_DELAY_STATUS is an optional flag that is being set only when we do fast-boot.
If I will implement your suggestion, we will delay counters enabling after all boot types.
do you think this is the right behavior?

When doTask(SelectableTimer*) function will be triggered, it will check if the counters are already enabled.
If not, enable them.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a section for fast/cold/warm reboot and dynamic port breakout scenarios.

@liat-grozovik
Copy link
Collaborator

@noaOrMlnx
Please align the HLD according to the HLD template. In additional

  1. Please add a clear table for each type of reboot what will be the max time. please include: reboot, fastboot, warmboot, config reload, swss service restart, etc. It will be helpful to understand the system behaviour
  2. Please add a section for testing. Lets cover the entire flow including the time expired as well as counters start before and timer is stopped.
    3.Following the community review feedback need to find a simpler way to declare 'fastboot' completed. based on that we can start counters/stop the max timer.

@yxieca yxieca force-pushed the master branch 2 times, most recently from 8498931 to 8837dc2 Compare April 15, 2022 16:51
@liat-grozovik
Copy link
Collaborator

Following offline discussion we will improve fastboot flow and wont move forward with this HLD. thus it is closed.
Please refer to #980 for more information

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants