Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition in ConntrackConnectionStore and FlowExporter #3655

Merged
merged 1 commit into from
Apr 25, 2022

Conversation

heanlan
Copy link
Contributor

@heanlan heanlan commented Apr 18, 2022

Conntrack connection store's polling go routine and flow exporter both access to conntrack connection store, and there's a race condition error.

In func (cs *ConntrackConnectionStore) Poll(), two things happen in sequence:

  1. in deleteIfStaleOrResetConn, we acquire the lock, reset conn.IsPresent = false for all the connections in connection map, and then release the lock (conn.IsPresent is used to describe whether the connection exist in conntrack table or not)
  2. if a connection is verified to exist in the conntrack table, in AddOrUpdateConn, we acquire the lock, set conn.IsPresent = true, then release the lock

It is likely to happen, when flow exporter's timer is triggered between 1 and 2, it will grab the lock, and read a connection with IsPresent set to false. In the corresponding exported flow record, flowEndReason will be set to 3, representing the flow has ended. Here's an antrea-agent log to verify the existence of this error: log

We fix it by holding the lock until we finish 1&2.

The observation comes from the error log of flowaggregator e2e test. In the record, flowEndReason was set to 3, so the test treated the record as the last record. Pointer to test code. It computed the throughput value by totalByteCount/iperfTimeSec, without reading from the throughput field in the record, which has the correct value.

Fixes: #3650

Signed-off-by: heanlan [email protected]

@heanlan heanlan added area/flow-visibility/exporter Issues or PRs related to the Flow Exporter functions in the Agent kind/bug Categorizes issue or PR as related to a bug. labels Apr 18, 2022
@heanlan heanlan marked this pull request as draft April 18, 2022 21:41
@heanlan heanlan force-pushed the flaky-aggregator-e2e branch from e00f8c1 to a6093c3 Compare April 18, 2022 22:01
@heanlan heanlan marked this pull request as ready for review April 18, 2022 22:03
@codecov-commenter
Copy link

codecov-commenter commented Apr 18, 2022

Codecov Report

Merging #3655 (323ba53) into main (bd82ef6) will decrease coverage by 7.54%.
The diff coverage is 67.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3655      +/-   ##
==========================================
- Coverage   64.65%   57.10%   -7.55%     
==========================================
  Files         278      392     +114     
  Lines       39363    54823   +15460     
==========================================
+ Hits        25449    31306    +5857     
- Misses      11939    21060    +9121     
- Partials     1975     2457     +482     
Flag Coverage Δ
integration-tests 38.39% <ø> (?)
kind-e2e-tests 52.07% <67.56%> (-0.22%) ⬇️
unit-tests 43.84% <29.72%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/agent/flowexporter/connections/connections.go 75.75% <44.44%> (-2.27%) ⬇️
.../flowexporter/connections/conntrack_connections.go 78.92% <75.00%> (-0.39%) ⬇️
pkg/agent/openflow/service.go 72.09% <0.00%> (-11.63%) ⬇️
pkg/agent/controller/networkpolicy/reject.go 77.05% <0.00%> (-8.83%) ⬇️
pkg/controller/networkpolicy/status_controller.go 68.14% <0.00%> (-3.63%) ⬇️
pkg/controller/traceflow/controller.go 74.72% <0.00%> (-2.53%) ⬇️
pkg/log/log_file.go 70.17% <0.00%> (-1.76%) ⬇️
pkg/agent/route/route_linux.go 48.22% <0.00%> (-1.58%) ⬇️
pkg/controller/ipam/antrea_ipam_controller.go 77.55% <0.00%> (-1.54%) ⬇️
...gent/controller/noderoute/node_route_controller.go 56.15% <0.00%> (-0.97%) ⬇️
... and 123 more

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the helpful PR description.
A couple of typos there, e.g. "We fix it by hold the lock until finish 1&2." instead of "We fix it by holding the lock until we finish 1&2."

pkg/agent/flowexporter/connections/connections.go Outdated Show resolved Hide resolved
@heanlan heanlan force-pushed the flaky-aggregator-e2e branch from 44d2030 to fd726be Compare April 19, 2022 01:14
antoninbas
antoninbas previously approved these changes Apr 19, 2022
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@heanlan
Copy link
Contributor Author

heanlan commented Apr 19, 2022

/test-all

@heanlan heanlan force-pushed the flaky-aggregator-e2e branch 2 times, most recently from 6737581 to c4e0568 Compare April 19, 2022 19:55
@heanlan
Copy link
Contributor Author

heanlan commented Apr 19, 2022

Squash the commits and rewrite the commit message...

/test-all

Copy link
Contributor

@dreamtalen dreamtalen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except a nit.

test/e2e/framework.go Outdated Show resolved Hide resolved
@heanlan
Copy link
Contributor Author

heanlan commented Apr 19, 2022

/test-all

antoninbas
antoninbas previously approved these changes Apr 19, 2022
@heanlan
Copy link
Contributor Author

heanlan commented Apr 20, 2022

/test-networkpolicy

@heanlan
Copy link
Contributor Author

heanlan commented Apr 20, 2022

/test-e2e

Conntrack connection store's polling go routine and flow exporter both access
to conntrack connection store, and there's a race condition error.

In the polling go routine, `deleteIfStaleOrResetConn` and `AddOrUpdateConn`
both grab the lock, modify `conn.IsPresent` field, and release the lock. Between
the execution of these two functions, it is likely that FlowExporter's timer is
triggered and it reads the wrong `conn.IsPresent` value in an intermidiate state.
We fix it by holding the lock until we finish the execution of both two functions.

Fixes: antrea-io#3650

Signed-off-by: heanlan <[email protected]>
@heanlan
Copy link
Contributor Author

heanlan commented Apr 21, 2022

/test-all

@heanlan
Copy link
Contributor Author

heanlan commented Apr 21, 2022

/test-e2e

@heanlan
Copy link
Contributor Author

heanlan commented Apr 25, 2022

/test-e2e
/test-networkpolicy

@heanlan heanlan requested a review from antoninbas April 25, 2022 18:46
@antoninbas antoninbas merged commit 052987a into antrea-io:main Apr 25, 2022
@heanlan heanlan deleted the flaky-aggregator-e2e branch April 28, 2022 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/flow-visibility/exporter Issues or PRs related to the Flow Exporter functions in the Agent kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky test] TestFlowAggregator/IPv4/InterNodeFlows
5 participants