Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSI: fix data race in plugin manager #12553

Merged
merged 1 commit into from
Apr 12, 2022
Merged

CSI: fix data race in plugin manager #12553

merged 1 commit into from
Apr 12, 2022

Conversation

tgross
Copy link
Member

@tgross tgross commented Apr 12, 2022

The plugin manager for CSI hands out instances of a plugin for callers
that need to mount a volume. The MounterForPlugin method accesses
the internal instances map without a lock, and can be called
concurrently from outside the plugin manager's main run-loop.

The original commit for the instances map included a warning that it
needed to be accessed only from the main loop but that comment was
unfortunately ignored shortly thereafter, so this bug has existed in
the code for a couple years without being detected until we ran tests
with -race in #12098. Lesson learned here: comments make for lousy
enforcement of invariants!

test outputs with -race

client RPC tests:

$ NOMAD_TEST_LOG_LEVEL=OFF go test -race -v ./client -run TestCSI
=== RUN   TestCSIController_AttachVolume
=== PAUSE TestCSIController_AttachVolume
=== RUN   TestCSIController_ValidateVolume
=== PAUSE TestCSIController_ValidateVolume
=== RUN   TestCSIController_DetachVolume
=== PAUSE TestCSIController_DetachVolume
=== RUN   TestCSIController_CreateVolume
=== PAUSE TestCSIController_CreateVolume
=== RUN   TestCSIController_DeleteVolume
=== PAUSE TestCSIController_DeleteVolume
=== RUN   TestCSIController_ListVolumes
=== PAUSE TestCSIController_ListVolumes
=== RUN   TestCSIController_CreateSnapshot
=== PAUSE TestCSIController_CreateSnapshot
=== RUN   TestCSIController_DeleteSnapshot
=== PAUSE TestCSIController_DeleteSnapshot
=== RUN   TestCSIController_ListSnapshots
=== PAUSE TestCSIController_ListSnapshots
=== RUN   TestCSINode_DetachVolume
=== PAUSE TestCSINode_DetachVolume
=== CONT  TestCSIController_AttachVolume
=== RUN   TestCSIController_AttachVolume/returns_plugin_not_found_errors
=== RUN   TestCSIController_AttachVolume/validates_volumeid_is_not_empty
=== RUN   TestCSIController_AttachVolume/validates_nodeid_is_not_empty
=== RUN   TestCSIController_AttachVolume/validates_AccessMode
=== RUN   TestCSIController_AttachVolume/validates_attachmentmode_is_not_empty
=== RUN   TestCSIController_AttachVolume/returns_transitive_errors
=== RUN   TestCSIController_AttachVolume/handles_nil_PublishContext
=== RUN   TestCSIController_AttachVolume/handles_non-nil_PublishContext
--- PASS: TestCSIController_AttachVolume (0.74s)
    --- PASS: TestCSIController_AttachVolume/returns_plugin_not_found_errors (0.10s)
    --- PASS: TestCSIController_AttachVolume/validates_volumeid_is_not_empty (0.10s)
    --- PASS: TestCSIController_AttachVolume/validates_nodeid_is_not_empty (0.09s)
    --- PASS: TestCSIController_AttachVolume/validates_AccessMode (0.09s)
    --- PASS: TestCSIController_AttachVolume/validates_attachmentmode_is_not_empty (0.09s)
    --- PASS: TestCSIController_AttachVolume/returns_transitive_errors (0.10s)
    --- PASS: TestCSIController_AttachVolume/handles_nil_PublishContext (0.08s)
    --- PASS: TestCSIController_AttachVolume/handles_non-nil_PublishContext (0.09s)
=== CONT  TestCSINode_DetachVolume
=== RUN   TestCSINode_DetachVolume/returns_plugin_not_found_errors
=== RUN   TestCSINode_DetachVolume/validates_volumeid_is_not_empty
=== RUN   TestCSINode_DetachVolume/validates_nodeid_is_not_empty
=== RUN   TestCSINode_DetachVolume/returns_transitive_errors
--- PASS: TestCSINode_DetachVolume (0.36s)
    --- PASS: TestCSINode_DetachVolume/returns_plugin_not_found_errors (0.09s)
    --- PASS: TestCSINode_DetachVolume/validates_volumeid_is_not_empty (0.08s)
    --- PASS: TestCSINode_DetachVolume/validates_nodeid_is_not_empty (0.09s)
    --- PASS: TestCSINode_DetachVolume/returns_transitive_errors (0.10s)
=== CONT  TestCSIController_ListSnapshots
=== RUN   TestCSIController_ListSnapshots/returns_plugin_not_found_errors
=== RUN   TestCSIController_ListSnapshots/returns_transitive_errors
=== RUN   TestCSIController_ListSnapshots/returns_volumes
--- PASS: TestCSIController_ListSnapshots (0.27s)
    --- PASS: TestCSIController_ListSnapshots/returns_plugin_not_found_errors (0.09s)
    --- PASS: TestCSIController_ListSnapshots/returns_transitive_errors (0.08s)
    --- PASS: TestCSIController_ListSnapshots/returns_volumes (0.09s)
=== CONT  TestCSIController_CreateSnapshot
=== RUN   TestCSIController_CreateSnapshot/returns_plugin_not_found_errors
=== RUN   TestCSIController_CreateSnapshot/returns_transitive_errors
=== RUN   TestCSIController_CreateSnapshot/returns_snapshot_on_success
--- PASS: TestCSIController_CreateSnapshot (0.26s)
    --- PASS: TestCSIController_CreateSnapshot/returns_plugin_not_found_errors (0.09s)
    --- PASS: TestCSIController_CreateSnapshot/returns_transitive_errors (0.09s)
    --- PASS: TestCSIController_CreateSnapshot/returns_snapshot_on_success (0.08s)
=== CONT  TestCSIController_DeleteSnapshot
=== RUN   TestCSIController_DeleteSnapshot/returns_plugin_not_found_errors
=== RUN   TestCSIController_DeleteSnapshot/returns_transitive_errors
--- PASS: TestCSIController_DeleteSnapshot (0.17s)
    --- PASS: TestCSIController_DeleteSnapshot/returns_plugin_not_found_errors (0.08s)
    --- PASS: TestCSIController_DeleteSnapshot/returns_transitive_errors (0.08s)
=== CONT  TestCSIController_DeleteVolume
=== RUN   TestCSIController_DeleteVolume/returns_plugin_not_found_errors
=== RUN   TestCSIController_DeleteVolume/returns_transitive_errors
--- PASS: TestCSIController_DeleteVolume (0.17s)
    --- PASS: TestCSIController_DeleteVolume/returns_plugin_not_found_errors (0.08s)
    --- PASS: TestCSIController_DeleteVolume/returns_transitive_errors (0.08s)
=== CONT  TestCSIController_ListVolumes
=== RUN   TestCSIController_ListVolumes/returns_plugin_not_found_errors
=== RUN   TestCSIController_ListVolumes/returns_transitive_errors
=== RUN   TestCSIController_ListVolumes/returns_volumes
--- PASS: TestCSIController_ListVolumes (0.26s)
    --- PASS: TestCSIController_ListVolumes/returns_plugin_not_found_errors (0.09s)
    --- PASS: TestCSIController_ListVolumes/returns_transitive_errors (0.09s)
    --- PASS: TestCSIController_ListVolumes/returns_volumes (0.09s)
=== CONT  TestCSIController_DetachVolume
=== RUN   TestCSIController_DetachVolume/returns_plugin_not_found_errors
=== RUN   TestCSIController_DetachVolume/validates_volumeid_is_not_empty
=== RUN   TestCSIController_DetachVolume/validates_nodeid_is_not_empty
=== RUN   TestCSIController_DetachVolume/returns_transitive_errors
--- PASS: TestCSIController_DetachVolume (0.35s)
    --- PASS: TestCSIController_DetachVolume/returns_plugin_not_found_errors (0.08s)
    --- PASS: TestCSIController_DetachVolume/validates_volumeid_is_not_empty (0.08s)
    --- PASS: TestCSIController_DetachVolume/validates_nodeid_is_not_empty (0.09s)
    --- PASS: TestCSIController_DetachVolume/returns_transitive_errors (0.09s)
=== CONT  TestCSIController_ValidateVolume
=== RUN   TestCSIController_ValidateVolume/validates_volumeid_is_not_empty
=== RUN   TestCSIController_ValidateVolume/returns_plugin_not_found_errors
=== RUN   TestCSIController_ValidateVolume/validates_attachmentmode
=== RUN   TestCSIController_ValidateVolume/validates_AccessMode
=== RUN   TestCSIController_ValidateVolume/returns_transitive_errors
--- PASS: TestCSIController_ValidateVolume (0.44s)
    --- PASS: TestCSIController_ValidateVolume/validates_volumeid_is_not_empty (0.08s)
    --- PASS: TestCSIController_ValidateVolume/returns_plugin_not_found_errors (0.08s)
    --- PASS: TestCSIController_ValidateVolume/validates_attachmentmode (0.10s)
    --- PASS: TestCSIController_ValidateVolume/validates_AccessMode (0.09s)
    --- PASS: TestCSIController_ValidateVolume/returns_transitive_errors (0.09s)
=== CONT  TestCSIController_CreateVolume
=== RUN   TestCSIController_CreateVolume/returns_plugin_not_found_errors
=== RUN   TestCSIController_CreateVolume/validates_AccessMode
=== RUN   TestCSIController_CreateVolume/validates_attachmentmode_is_not_empty
=== RUN   TestCSIController_CreateVolume/returns_transitive_errors
--- PASS: TestCSIController_CreateVolume (0.37s)
    --- PASS: TestCSIController_CreateVolume/returns_plugin_not_found_errors (0.11s)
    --- PASS: TestCSIController_CreateVolume/validates_AccessMode (0.09s)
    --- PASS: TestCSIController_CreateVolume/validates_attachmentmode_is_not_empty (0.08s)
    --- PASS: TestCSIController_CreateVolume/returns_transitive_errors (0.09s)
PASS
ok      github.com/hashicorp/nomad/client       3.520s

plugin manager tests:

$ NOMAD_TEST_LOG_LEVEL=warn go test -race -v ./client/pluginmanager/csimanager/
=== RUN   TestBuildBasicFingerprint_Node
=== RUN   TestBuildBasicFingerprint_Node/Minimal_successful_response
=== RUN   TestBuildBasicFingerprint_Node/Successful_response_with_capabilities_and_topologies
=== RUN   TestBuildBasicFingerprint_Node/PluginGetCapabilities_Failed
=== RUN   TestBuildBasicFingerprint_Node/NodeGetInfo_Failed
--- PASS: TestBuildBasicFingerprint_Node (0.00s)
    --- PASS: TestBuildBasicFingerprint_Node/Minimal_successful_response (0.00s)
    --- PASS: TestBuildBasicFingerprint_Node/Successful_response_with_capabilities_and_topologies (0.00s)
    --- PASS: TestBuildBasicFingerprint_Node/PluginGetCapabilities_Failed (0.00s)
    --- PASS: TestBuildBasicFingerprint_Node/NodeGetInfo_Failed (0.00s)
=== RUN   TestBuildControllerFingerprint
=== RUN   TestBuildControllerFingerprint/Minimal_successful_response
=== RUN   TestBuildControllerFingerprint/Successful_response_with_capabilities
=== RUN   TestBuildControllerFingerprint/ControllerGetCapabilities_Failed
--- PASS: TestBuildControllerFingerprint (0.00s)
    --- PASS: TestBuildControllerFingerprint/Minimal_successful_response (0.00s)
    --- PASS: TestBuildControllerFingerprint/Successful_response_with_capabilities (0.00s)
    --- PASS: TestBuildControllerFingerprint/ControllerGetCapabilities_Failed (0.00s)
=== RUN   TestBuildNodeFingerprint
=== RUN   TestBuildNodeFingerprint/Minimal_successful_response
=== RUN   TestBuildNodeFingerprint/Successful_response_with_capabilities_and_topologies
=== RUN   TestBuildNodeFingerprint/NodeGetCapabilities_Failed
--- PASS: TestBuildNodeFingerprint (0.00s)
    --- PASS: TestBuildNodeFingerprint/Minimal_successful_response (0.00s)
    --- PASS: TestBuildNodeFingerprint/Successful_response_with_capabilities_and_topologies (0.00s)
    --- PASS: TestBuildNodeFingerprint/NodeGetCapabilities_Failed (0.00s)
=== RUN   TestInstanceManager_Shutdown
--- PASS: TestInstanceManager_Shutdown (0.02s)
=== RUN   TestManager_RegisterPlugin
--- PASS: TestManager_RegisterPlugin (0.01s)
=== RUN   TestManager_DeregisterPlugin
--- PASS: TestManager_DeregisterPlugin (0.02s)
=== RUN   TestManager_MultiplePlugins
--- PASS: TestManager_MultiplePlugins (0.03s)
=== RUN   TestManager_ConcurrentPlugins
=== RUN   TestManager_ConcurrentPlugins/replacement_races_on_host_restart
=== RUN   TestManager_ConcurrentPlugins/interleaved_register_and_deregister
--- PASS: TestManager_ConcurrentPlugins (0.06s)
    --- PASS: TestManager_ConcurrentPlugins/replacement_races_on_host_restart (0.03s)
    --- PASS: TestManager_ConcurrentPlugins/interleaved_register_and_deregister (0.02s)
=== RUN   TestUsageTracker
=== RUN   TestUsageTracker/Register_and_deregister_all_allocs
=== RUN   TestUsageTracker/Register_all_and_deregister_partial_allocs
--- PASS: TestUsageTracker (0.00s)
    --- PASS: TestUsageTracker/Register_and_deregister_all_allocs (0.00s)
    --- PASS: TestUsageTracker/Register_all_and_deregister_partial_allocs (0.00s)
=== RUN   TestVolumeManager_ensureStagingDir
=== PAUSE TestVolumeManager_ensureStagingDir
=== RUN   TestVolumeManager_stageVolume
=== PAUSE TestVolumeManager_stageVolume
=== RUN   TestVolumeManager_unstageVolume
=== PAUSE TestVolumeManager_unstageVolume
=== RUN   TestVolumeManager_publishVolume
=== PAUSE TestVolumeManager_publishVolume
=== RUN   TestVolumeManager_unpublishVolume
=== PAUSE TestVolumeManager_unpublishVolume
=== RUN   TestVolumeManager_MountVolumeEvents
=== PAUSE TestVolumeManager_MountVolumeEvents
=== CONT  TestVolumeManager_stageVolume
=== RUN   TestVolumeManager_stageVolume/Returns_an_error_when_an_invalid_AttachmentMode_is_provided
=== RUN   TestVolumeManager_stageVolume/Returns_an_error_when_an_invalid_AccessMode_is_provided
=== RUN   TestVolumeManager_stageVolume/Returns_an_error_when_the_plugin_returns_an_error
=== RUN   TestVolumeManager_stageVolume/Happy_Path
--- PASS: TestVolumeManager_stageVolume (0.01s)
    --- PASS: TestVolumeManager_stageVolume/Returns_an_error_when_an_invalid_AttachmentMode_is_provided (0.00s)
    --- PASS: TestVolumeManager_stageVolume/Returns_an_error_when_an_invalid_AccessMode_is_provided (0.00s)
    --- PASS: TestVolumeManager_stageVolume/Returns_an_error_when_the_plugin_returns_an_error (0.00s)
    --- PASS: TestVolumeManager_stageVolume/Happy_Path (0.00s)
=== CONT  TestVolumeManager_MountVolumeEvents
--- PASS: TestVolumeManager_MountVolumeEvents (0.00s)
=== CONT  TestVolumeManager_unpublishVolume
=== RUN   TestVolumeManager_unpublishVolume/Returns_an_error_when_the_plugin_returns_an_error
=== RUN   TestVolumeManager_unpublishVolume/Happy_Path
--- PASS: TestVolumeManager_unpublishVolume (0.00s)
    --- PASS: TestVolumeManager_unpublishVolume/Returns_an_error_when_the_plugin_returns_an_error (0.00s)
    --- PASS: TestVolumeManager_unpublishVolume/Happy_Path (0.00s)
=== CONT  TestVolumeManager_unstageVolume
=== RUN   TestVolumeManager_unstageVolume/Returns_an_error_when_the_plugin_returns_an_error
=== RUN   TestVolumeManager_unstageVolume/Happy_Path
--- PASS: TestVolumeManager_unstageVolume (0.00s)
    --- PASS: TestVolumeManager_unstageVolume/Returns_an_error_when_the_plugin_returns_an_error (0.00s)
    --- PASS: TestVolumeManager_unstageVolume/Happy_Path (0.00s)
=== CONT  TestVolumeManager_publishVolume
=== RUN   TestVolumeManager_publishVolume/Returns_an_error_when_the_plugin_returns_an_error
=== RUN   TestVolumeManager_publishVolume/Happy_Path
=== RUN   TestVolumeManager_publishVolume/Mount_options_in_the_volume
=== RUN   TestVolumeManager_publishVolume/Mount_options_override_in_the_request
--- PASS: TestVolumeManager_publishVolume (0.00s)
    --- PASS: TestVolumeManager_publishVolume/Returns_an_error_when_the_plugin_returns_an_error (0.00s)
    --- PASS: TestVolumeManager_publishVolume/Happy_Path (0.00s)
    --- PASS: TestVolumeManager_publishVolume/Mount_options_in_the_volume (0.00s)
    --- PASS: TestVolumeManager_publishVolume/Mount_options_override_in_the_request (0.00s)
=== CONT  TestVolumeManager_ensureStagingDir
=== RUN   TestVolumeManager_ensureStagingDir/Creates_a_directory_when_one_does_not_exist
=== RUN   TestVolumeManager_ensureStagingDir/Does_not_fail_because_of_a_pre-existing_directory
=== RUN   TestVolumeManager_ensureStagingDir/Returns_negative_mount_info
=== RUN   TestVolumeManager_ensureStagingDir/Returns_positive_mount_info
    volume_test.go:92: TODO: Skipped because we don't detect bind mounts
--- PASS: TestVolumeManager_ensureStagingDir (0.00s)
    --- PASS: TestVolumeManager_ensureStagingDir/Creates_a_directory_when_one_does_not_exist (0.00s)
    --- PASS: TestVolumeManager_ensureStagingDir/Does_not_fail_because_of_a_pre-existing_directory (0.00s)
    --- PASS: TestVolumeManager_ensureStagingDir/Returns_negative_mount_info (0.00s)
    --- SKIP: TestVolumeManager_ensureStagingDir/Returns_positive_mount_info (0.00s)
PASS
ok      github.com/hashicorp/nomad/client/pluginmanager/csimanager      0.235s

@tgross tgross requested review from shoenig and lgfa29 April 12, 2022 14:23
The plugin manager for CSI hands out instances of a plugin for callers
that need to mount a volume. The `MounterForPlugin` method accesses
the internal instances map without a lock, and can be called
concurrently from outside the plugin manager's main run-loop.

The original commit for the instances map included a warning that it
needed to be accessed only from the main loop but that comment was
unfortunately ignored shortly thereafter, so this bug has existed in
the code for a couple years without being detected until we ran tests
with `-race` in #12098. Lesson learned here: comments make for lousy
enforcement of invariants!
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments make for lousy enforcement of invariants!

Wise words 🙂

client/pluginmanager/csimanager/manager.go Show resolved Hide resolved
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tgross
Copy link
Member Author

tgross commented Apr 12, 2022

Bah, I accidentally pushed another commit onto here that should be in its own PR (#12554). I've backed that out. Waiting for CI to be green again.

@tgross tgross merged commit 9d5b3bc into main Apr 12, 2022
@tgross tgross deleted the b-csi-data-race branch April 12, 2022 16:18
@lgfa29 lgfa29 added backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line labels Apr 20, 2022
lgfa29 pushed a commit that referenced this pull request Apr 20, 2022
The plugin manager for CSI hands out instances of a plugin for callers
that need to mount a volume. The `MounterForPlugin` method accesses
the internal instances map without a lock, and can be called
concurrently from outside the plugin manager's main run-loop.

The original commit for the instances map included a warning that it
needed to be accessed only from the main loop but that comment was
unfortunately ignored shortly thereafter, so this bug has existed in
the code for a couple years without being detected until we ran tests
with `-race` in #12098. Lesson learned here: comments make for lousy
enforcement of invariants!
lgfa29 pushed a commit that referenced this pull request Apr 20, 2022
The plugin manager for CSI hands out instances of a plugin for callers
that need to mount a volume. The `MounterForPlugin` method accesses
the internal instances map without a lock, and can be called
concurrently from outside the plugin manager's main run-loop.

The original commit for the instances map included a warning that it
needed to be accessed only from the main loop but that comment was
unfortunately ignored shortly thereafter, so this bug has existed in
the code for a couple years without being detected until we ran tests
with `-race` in #12098. Lesson learned here: comments make for lousy
enforcement of invariants!
@shoenig shoenig mentioned this pull request Jun 2, 2022
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
backport/1.1.x backport to 1.1.x release line backport/1.2.x backport to 1.1.x release line type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants