-
Notifications
You must be signed in to change notification settings - Fork 29
Conversation
Unit tests passed:
DISABLED_grpc_client_connectivity_test result:
|
Future improvement: Right now, the on-demand engine, after receiving the operation status from NCM, sleeps for a period of time, with the hope of it waking up with the needed goalstate(s) received and applied. This leads to, in many cases, a long time for the first packet to go out(around 1000ms). In some cases, when only sending 1 packet to the destination, the packet got timed out eventually; in other lucky case, all first packets were sent out in less than 2 ms (futurewei-cloud/alcor#560 (comment)). I believe this highly depends on the connection between ACAs and the NCM, and how fast NCM sends back data. If it is fast enough, then on-demand engine wakes up with needed goalstates, it sends out the packet really quickly; otherwise, it is slow. Possible solutions:
|
Implemented a "Smart" sleep here, which checks if the target arp entry exists, Posting results of 10 ping tests:
|
|
||
do { | ||
found_arp_entry = | ||
aca_arp_responder::ACA_ARP_Responder::get_instance().does_arp_entry_exist(stData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zzxgzgz do we know how long it will take for this call? if it takes too much time, we better increase wait_time
. If it takes very short, we can reduce wait_time
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I am keeping the max wait time to be one second.
If the call takes more than one second to reply, the reply is useless, even if the reply is SUCCESS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zzxgzgz Sorry, I mean the check_frequency_us
. That will affect how often we need to make this call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. No worries, it is configurable in include/aca_config.h
.
For now it is set to 10 seconds, we can adjust it how ever we prefer.
data_for_on_demand_call *data = new data_for_on_demand_call; | ||
// auto size_of_packet_void_pointer = sizeof(packet); | ||
void *packet_copy = malloc(packet_size); | ||
memcpy(packet_copy, packet, packet_size); | ||
data->in_port = in_port; | ||
data->packet = packet_copy; | ||
data->packet_size = packet_size; | ||
data->protocol = _protocol; | ||
request_uuid_on_demand_data_map[uuid_str] = data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zzxgzgz So you put every unknown packet into a map including it's packet_payload. I am afraid of you will run out of memory if the unknown request never got processed by NCM and there are many unknown packets coming in.
We need to figure out how to handle this exception. For example, set up a threshold for each unknown packet. If threshold time-up, just drop packet and remove it from the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can clear the map periodically, like every one minute.
A few scenarios to think about:
- Packet comes in, only a few entries in the map -> no problem.
- Packet comes in, a lot of entries in the map -> means that a lot of entries haven't been taken care of, probably NCM is down, then we clear the map if the time exceeds the threshold.
- Packet doesn't come in, we don't touch the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added another thread to clean the request_uuid_on_demand_data_map
periodically.
…INE to handle cleaning expired entries
…dded time measurement for map writes
Added time measurement for
Also, a new variable called |
Did a comparison with the ACA on 100 ports were generated on both hosts, before 100 ping commands are issued from Sync call output:
Async call output:
One thing to notice is that, when testing with the |
… gs sent->received&&programmed operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@zzxgzgz This is very good thinking. You've made quite a bit of optimization in this PR, this is very positive way of approaching an optimized solution. Also, for any planed items or ideas you plan to do, you could create issue(s) in this repo so that you could continue tracking it. Thanks! |
This PR tries to address the following:
Current results:
ACA is able to call NCM asynchronously, and so far it looks like that the on-demand workflow goes as usual.
The above screenshot shows port_1, which is on host_one and has an IP of 10.0.0.2, was able to ping 10.0.0.3, which is another port inside of another host, after ACA on host_one requested the needed GoalStates from the NCM.
Next:
Need to test scenarios in which multiple ports are involved, to see if the change really works.