Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vrf]: Fix freezing during interface binding (#6209) #1325

Merged
merged 1 commit into from
Feb 2, 2021

Conversation

ghost
Copy link

@ghost ghost commented Dec 22, 2020

  • Replacing using 'get_all' with 'exists' in port state checking
    procedure inside bind function to avoid freezing in the while loop,
    what caused by absence of related record in STATE_DB.

Signed-off-by: Maksym Belei [email protected]

- What I did
Resolves sonic-net/sonic-buildimage#6209
Freezing of "sudo config interface vrf bind EthernetXX Vrf_XX" operation has resolved.

- How I did it
As currently the system completely deletes "INTERFACE_TABLE|EthernetXX" record in STATE_DB on unbinding or on preparing stage of bind, there is not need to check presence of any keys of the record.
Checking of keys of "INTERFACE_TABLE|EthernetXX" record in STATE_DB, what caused freezing due to absence of the record, has changed to checking for existence of entire record.

- How to verify it
Perform the next steps:

  1. sudo config vrf add Vrf_custom
  2. sudo config interface vrf bind Ethernet36 Vrf_custom

After the steps above command "show vrf" should display Ethernet36 is bound to Vrf_custom.

@lgtm-com
Copy link

lgtm-com bot commented Dec 22, 2020

This pull request introduces 1 alert when merging 7d6cf0466cb756fd9428935655120ba2c624d76f into becf5b5 - view on LGTM.com

new alerts:

  • 1 for Unused import

@ghost ghost force-pushed the fix_bind_to_vrf branch 2 times, most recently from e779726 to 0f433de Compare December 22, 2020 11:45
@ghost ghost marked this pull request as ready for review December 22, 2020 14:09
@prsunny
Copy link
Contributor

prsunny commented Jan 11, 2021

@tylerlinp , can you take a look at the PR?. Is it caused due to the change in get_all introduced as part of #1119

@tylerlinp
Copy link
Contributor

@tylerlinp , can you take a look at the PR?. Is it caused due to the change in get_all introduced as part of #1119

@prsunny , I think it should be the return of get_all changed, you know the while loop is just to break at record absence (to ensure deleted completely), but now it seems not to return None. Why? I guess it results from py upgrade?

@ghost
Copy link
Author

ghost commented Jan 18, 2021

Dear @prsunny, @tylerlinp,

I have already investigated get_all function and there is no difference between new and old Python. In #6209 I have commented a real situation with the records in STATE_DB, which I faced while execution of bind command. It seems "INTERFACE_TABLE|EthernetXX" record is being completely deleted while binding, what causes the issue. I think the handling procedure of config_db.set_entry(table_name, interface_name, None) has changed, so "INTERFACE_TABLE|EthernetXX" record has begun to be deleted.

@tylerlinp
Copy link
Contributor

@maksymbelei95 I cannot understand you completely, you mean set_entry with None now is not del but add? Here is the workflow of bind operation,please help to get status of config_db and state_db.

config_db.set_entry(table_name, interface_name, None) was to delete entry if it exists in config_db.
There are two cases:
a) not exists,config_db not be changed
in this case, the interface hasn't been bond, it is not an L3 interface, and it doesn't exist in state_db,
while loop break directly, and then config_db.set_entry with vrf to bind,
this is the general case, I think your test is this one.
b) exists,entry will be deleted from config_db
in this case, the interface has been bond to a vrf or configured ip, to bind a new vrf that is vrf-move,
since config_db del event, intfmgrd handles DEL_COMMAND with state_db.del in handler,
and CLI while loop until interface deleted from state_db, that is intfmgrd has handled,
and then config_db.set_entry with vrf to bind.
After config_db.set_entry(table_name, interface_name, {"vrf_name": vrf_name}), intfmgrd handles SET_COMMAND when adds entry in state_db.

@ghost
Copy link
Author

ghost commented Jan 20, 2021

Dear @tylerlinp,

Currently, the situation with config_db and state_db while preforming config_db.set_entry(table_name, interface_name, None) is next:

  1. the command above execute the next command in config_db(output of redis-cli MONITOR):
    1610894268.243723 [4 unix:/var/run/redis/redis.sock] "DEL" "INTERFACE|Ethernet8"
  2. the action above is being handled by intfmgrd and the daemon performs deleting of the ethernet interface, including INTERFACE_TABLE|EthernetXX table key in state_db:
1610894268.287504 [0 unix:/var/run/redis/redis.sock] "EVALSHA" "88ba6312b8de850b3506966425174d8899aadd93" "4" "INTF_TABLE_CHANNEL" "INTF_TABLE_KEY_SET" "_INTF_TABLE:Ethernet8" "INTF_TABLE_DEL_SET" "G" "Ethernet8" "''" "''"
1610894268.287571 [0 lua] "SADD" "INTF_TABLE_KEY_SET" "Ethernet8"
1610894268.287626 [0 lua] "SADD" "INTF_TABLE_DEL_SET" "Ethernet8"
1610894268.287657 [0 lua] "DEL" "_INTF_TABLE:Ethernet8"
1610894268.287668 [0 lua] "PUBLISH" "INTF_TABLE_CHANNEL" "G"
1610894268.287876 [0 unix:/var/run/redis/redis.sock] "EVALSHA" "88270a7c5c90583e56425aca8af8a4b8c39fe757" "3" "INTF_TABLE_KEY_SET" "INTF_TABLE:" "INTF_TABLE_DEL_SET" "8192" "_"
1610894268.287916 [0 lua] "SPOP" "INTF_TABLE_KEY_SET" "8192"
1610894268.288275 [0 lua] "SREM" "INTF_TABLE_DEL_SET" "Ethernet8"
1610894268.288328 [0 lua] "DEL" "INTF_TABLE:Ethernet8"
1610894268.288356 [0 lua] "HGETALL" "_INTF_TABLE:Ethernet8"
1610894268.288370 [0 lua] "DEL" "_INTF_TABLE:Ethernet8"
1610894268.288874 [6 unix:/var/run/redis/redis.sock] "DEL" "INTERFACE_TABLE|Ethernet8"

I am sorry, but I am not sure what exactly consumer/producer handles those changes and executes redis scripts. It has past a lot of time since I have been investigating the issue.

Previously, the scripts from step 2 has not being performed by the system, so, INTERFACE_TABLE|EthernetXX table key in state_db has not being deleted. But now, as INTERFACE_TABLE|EthernetXX table does not exist after execution of config_db.set_entry(table_name, interface_name, None), while state_db.get_all(state_db.STATE_DB, _hash) != None: could not exit, because get_all operation always returns not None result. To exit from this while loop INTERFACE_TABLE|EthernetXX key should be present, but it should not contain any data inside.
Please, see the logs below. There are outputs of redis-cli MONITOR on a different builds(9fda295 - old build, af357f3 - new build, on which the defect has occurred):
redis_monitor_af357f3e.txt
redis_monitor_9fda295c.txt
As I mentioned in the issue, I have commented the places in the logs, in which the operations above are performed.

I though that the while loop has made as workaround to check that the system handled config_db.set_entry(table_name, interface_name, None) operation, because the system lose DEL operation when delete and add entry with the same name or something else:
# When config_db del entry and then add entry with same key, the DEL will lost.

Maybe, for example, intfmgrd daemon became to handle changes in config_db sequentially, so, currently it does not lose DEL operation?

@tylerlinp
Copy link
Contributor

Dear @tylerlinp,

Currently, the situation with config_db and state_db while preforming config_db.set_entry(table_name, interface_name, None) is next:

  1. the command above execute the next command in config_db(output of redis-cli MONITOR):
    1610894268.243723 [4 unix:/var/run/redis/redis.sock] "DEL" "INTERFACE|Ethernet8"
  2. the action above is being handled by intfmgrd and the daemon performs deleting of the ethernet interface, including INTERFACE_TABLE|EthernetXX table key in state_db:

I think you'd better read db status via redis-cli before bind command. Previously, interface Ethernet8 is not existence in config_db and state_db. What about now?

1610894268.287504 [0 unix:/var/run/redis/redis.sock] "EVALSHA" "88ba6312b8de850b3506966425174d8899aadd93" "4" "INTF_TABLE_CHANNEL" "INTF_TABLE_KEY_SET" "_INTF_TABLE:Ethernet8" "INTF_TABLE_DEL_SET" "G" "Ethernet8" "''" "''"
1610894268.287571 [0 lua] "SADD" "INTF_TABLE_KEY_SET" "Ethernet8"
1610894268.287626 [0 lua] "SADD" "INTF_TABLE_DEL_SET" "Ethernet8"
1610894268.287657 [0 lua] "DEL" "_INTF_TABLE:Ethernet8"
1610894268.287668 [0 lua] "PUBLISH" "INTF_TABLE_CHANNEL" "G"
1610894268.287876 [0 unix:/var/run/redis/redis.sock] "EVALSHA" "88270a7c5c90583e56425aca8af8a4b8c39fe757" "3" "INTF_TABLE_KEY_SET" "INTF_TABLE:" "INTF_TABLE_DEL_SET" "8192" "_"
1610894268.287916 [0 lua] "SPOP" "INTF_TABLE_KEY_SET" "8192"
1610894268.288275 [0 lua] "SREM" "INTF_TABLE_DEL_SET" "Ethernet8"
1610894268.288328 [0 lua] "DEL" "INTF_TABLE:Ethernet8"
1610894268.288356 [0 lua] "HGETALL" "_INTF_TABLE:Ethernet8"
1610894268.288370 [0 lua] "DEL" "_INTF_TABLE:Ethernet8"
1610894268.288874 [6 unix:/var/run/redis/redis.sock] "DEL" "INTERFACE_TABLE|Ethernet8"

I am sorry, but I am not sure what exactly consumer/producer handles those changes and executes redis scripts. It has past a lot of time since I have been investigating the issue.

Previously, the scripts from step 2 has not being performed by the system, so, INTERFACE_TABLE|EthernetXX table key in state_db has not being deleted. But now, as INTERFACE_TABLE|EthernetXX table does not exist after execution of config_db.set_entry(table_name, interface_name, None), while state_db.get_all(state_db.STATE_DB, _hash) != None: could not exit, because get_all operation always returns not None result. To exit from this while loop INTERFACE_TABLE|EthernetXX key should be present, but it should not contain any data inside.

Yes, as you said, there are two changes.
One is the step 2 being performed, that means intfmgrd received DEL event. Do notify DEL even though entry absence now? It doesn't matter, intfmgrd just executes extra useless operations, to del absent entry in app_db and state_db.
The other is get_all returns not None now, just as previously mentioned. This is the key of the problem, the reason should be found out. Previously, it return None in SonicV2Connector.get_all->DBInterface.get_all->blockable.wrapped at file sonic-py-swsssdk/src/swsssdk/interface.py line 53.

Please, see the logs below. There are outputs of redis-cli MONITOR on a different builds(9fda295 - old build, af357f3 - new build, on which the defect has occurred):
redis_monitor_af357f3e.txt
redis_monitor_9fda295c.txt
As I mentioned in the issue, I have commented the places in the logs, in which the operations above are performed.

I cannot download log files at the moment, but I think it maybe the DEL operations by intfmgrd.

I though that the while loop has made as workaround to check that the system handled config_db.set_entry(table_name, interface_name, None) operation, because the system lose DEL operation when delete and add entry with the same name or something else:
# When config_db del entry and then add entry with same key, the DEL will lost.

No, your thinking is beyond it was. The while loop was to wait real DEL op was executed over. It is used to handle case b) mentioned before, that it has nothing to do with this case - or case a). The system is never lose real DEL operations. The workaround is to handle rebind (unbind->bind) in bind command, it is to del existent entry just like unbind. The lost in notes is because two continuous operations were merged when addToSync.

Maybe, for example, intfmgrd daemon became to handle changes in config_db sequentially, so, currently it does not lose DEL operation?

I think intfmgrd hasn't been changed yet, while SyncMap/addToSync changed for holding DEL and SET, so your conclusion should be right.

@ghost
Copy link
Author

ghost commented Jan 27, 2021

Dear @tylerlinp

Thank you for clarification.
Could I ask you to confirm that deletion of INTERFACE_TABLE|EthernetXX from STATE_DB, while unbinding or preparing the interface to bind to some VRF is expected behavior? In older builds(9fda295c in my case), the record has not being deleted by the system, it just cleaned up the table, so, get_all operation was returning None and bind command was working fine.
If the behavior is expected, I think the issue could be solved by using operation exists for INTERFACE_TABLE|EthernetXX, and if does not exist, the interface is currently unbound.
If the behavior is not expected, I will continue investigating the issue in this direction.

@tylerlinp
Copy link
Contributor

tylerlinp commented Jan 28, 2021

@maksymbelei95 Yes, we expect INTERFACE_TABLE|EthernetXX doesn't exist in state_db when to set INTERFACE|EthernetXX to config_db, and get_all != None is just same as exists, it is more clear to use exists. Does get_all return not None or block itself? I haven't found codes changed.
By the way, in redis_monitor_af357f3e.txt interface Ethernet8 had been bond before this bind command? If not, that is say now DEL is notified even though entry absence.

To verify it, more test is needed:

  1. sudo config vrf add Vrf_custom
  2. sudo config interface vrf bind Ethernet36 Vrf_custom
  3. sudo config vrf add Vrf_custom1
  4. sudo config interface vrf bind Ethernet36 Vrf_custom1

@ghost
Copy link
Author

ghost commented Jan 28, 2021

@tylerlinp,
Thank you for confirmation.

get_all != None is just same as exists. Does get_all return not None or block itself?

Not completely so. exists checks whether the key exists or not, but get_all checks what the key contains. If the keys does not exist, get_all will cause exception throwing with according log in syslog, like Feb 28 18:20:01.457447 sonic WARNING sniffer: :- operator(): Key '{INTERFACE_TABLE|Ethernet36}' unavailable in database '{STATE_DB}', and return some trash(print result of this trash: <swsscommon.swsscommon.FieldValueMap; proxy of <Swig Object of type 'swss::TableMap *' at 0x7f731246a030> >).

I haven't found codes changed.

Me too. I think the reason, why bind operation was working is that the record in STATE_DB never has not being deleted before, what the recorded logs confirm.

By the way, in redis_monitor_af357f3e.txt interface Ethernet8 had been bond before this bind command? If not, that is say now
DEL is notified even though entry absence.

Yes, Ethernet8 was not bound. Seems your assumption is right. I have tested with rebinding to another VRF, the behavior is the same, bind operation stucks on while loop.

Any way, I will update the PR with exists, test it and ping you, when it will be done.

* Replacing using 'get_all' with 'exists' in port state checking
  procedure inside bind function to avoid freezing in the while loop,
  what caused by absence of related record in STATE_DB.

Signed-off-by: Maksym Belei <[email protected]>
@ghost
Copy link
Author

ghost commented Jan 29, 2021

@tylerlinp, @prsunny,

I have updated the PR according to result of discussion with @tylerlinp. Could you review it?
I have test a couple of cases, like, binding of non-bound interface and rebinding of bound interface to another VRF. The result looks good.

@ghost
Copy link
Author

ghost commented Feb 1, 2021

retest this please

@tylerlinp
Copy link
Contributor

It becomes more clear now, just like my first guess - the result of get_all changed, now it returns trash with throwing exception, but before the exception was caught by @blockable.wrapped() and it returns None. I think it is a general issue that needs to fix. As for this case, the new commit is perfect.

@ghost
Copy link
Author

ghost commented Feb 1, 2021

@tylerlinp, thank you for review. I will check execution stack of get_all function and will try to restore possibility of catching an exceptions.

@prsunny, as @tylerlinp has approved the PR, could I ask you to merge it?

@prsunny prsunny merged commit a6ae218 into sonic-net:master Feb 2, 2021
yxieca pushed a commit that referenced this pull request Feb 23, 2021
* Replacing using 'get_all' with 'exists' in port state checking procedure inside bind function to avoid freezing in the while loop, what caused by absence of related record in STATE_DB.

Signed-off-by: Maksym Belei <[email protected]>
anand-kumar-subramanian pushed a commit to anand-kumar-subramanian/sonic-utilities that referenced this pull request Mar 2, 2021
* Replacing using 'get_all' with 'exists' in port state checking procedure inside bind function to avoid freezing in the while loop, what caused by absence of related record in STATE_DB.

Signed-off-by: Maksym Belei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Could not bind Ethernet interface to VRF
4 participants