Fix race in stunnel port selection #129

Merged (1 commit) on Dec 2, 2022

tsmetana
Contributor

Issue #, if available:
Issue #125

Description of changes:
To prevent the stunnel port selection race between parallel mount.efs processes, keep the probed port bound until the stunnel configuration file has been written, and check whether the config file already exists before trying to bind the stunnel port to test its availability.
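
The idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: `config_path`, its file naming scheme, and `probe_and_hold` are hypothetical names introduced here.

```python
import logging
import os
import socket


def config_path(state_dir, fs_id, port):
    # Hypothetical naming scheme, used only for this illustration.
    return os.path.join(state_dir, "stunnel-config.%s.%d" % (fs_id, port))


def probe_and_hold(state_dir, fs_id, ports):
    """Return a socket that is still bound to the first usable port, or None."""
    for port in ports:
        # An existing config file means another mount already claimed the port.
        if os.access(config_path(state_dir, fs_id, port), os.R_OK):
            logging.info("config for port %s already exists, trying another", port)
            continue
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(("127.0.0.1", port))
        except OSError:
            # The port is in use by some other process; try the next one.
            sock.close()
            continue
        # Deliberately left open: the caller closes it after writing the config.
        return sock
    return None
```

The caller is expected to write the configuration file while the returned socket is still bound, and only then close it so that stunnel can bind the port.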

@tsmetana
Contributor Author

I'll fix the failing tests (so far I keep running into pytest-dev/pytest#8539 due to python-3.10)...

@elimumford

@lshigupt you appear to be the primary maintainer. This PR would be very helpful in addressing issues in the kubernetes-sigs/aws-efs-csi-driver (kubernetes-sigs/aws-efs-csi-driver#695)

@tsmetana
Contributor Author

I actually don't think the CSI driver issue kubernetes-sigs/aws-efs-csi-driver#695 can be fixed from within efs-utils: this would however help with other problems in the CSI driver (like deleting multiple unused volumes).

@lshigupt
Contributor

lshigupt commented Jul 22, 2022

Thanks a lot @tsmetana for the PR. I am testing it and can see that some of the tests are failing on our end. I am trying to debug them and will post comments on where they fail.

These are the tests that are failing:

[CPython37-release] =========================== short test summary info ============================
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_state_file_dir_exists
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_state_file_nonexistent_dir
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_cert_created
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_non_default_port
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_non_default_verify_level
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_ocsp_option
[CPython37-release] FAILED test/mount_efs_test/test_bootstrap_tls.py::test_bootstrap_tls_noocsp_option
[CPython37-release] FAILED test/mount_efs_test/test_choose_tls_port.py::test_choose_tls_port_first_try
[CPython37-release] FAILED test/mount_efs_test/test_choose_tls_port.py::test_choose_tls_port_second_try
[CPython37-release] FAILED test/mount_efs_test/test_choose_tls_port.py::test_choose_tls_port_never_succeeds
[CPython37-release] FAILED test/mount_efs_test/test_choose_tls_port.py::test_choose_tls_port_option_specified

@tsmetana
Contributor Author

tsmetana commented Aug 2, 2022

Hello. I've tried to run the tests with python-3.7.13 and I'm not able to reproduce the failure. Do you have some more detailed logs?

@tsmetana
Contributor Author

@lshigupt Hi. I wouldn't really want the issue and PR to rot completely... Is there anything I can do to help get it moving forward?

@RyanStan
Member

Hi @tsmetana, sorry for the delay. We've picked this back up and will look into it.

@tsmetana
Contributor Author

tsmetana commented Dec 1, 2022

@RyanStan any news? I mean... The patch is not that big and (I hope) quite understandable.

Contributor

@Cappuccinuo left a comment

Thanks for the PR. We will do a quick release for this fix tomorrow morning, so we will include this commit along with another urgent fix we have.

The fix is not permanent, since there is no guarantee against a race in the window between closing the socket and launching stunnel, though that window is very short. We will work on a long-term fix for this.

src/mount_efs/__init__.py (outdated)
@@ -944,13 +944,13 @@ def choose_tls_port(config, options):
assert len(tls_ports) == len(ports_to_try)

if "netns" not in options:
-tls_port = find_tls_port_in_range(ports_to_try)
+sock = find_tls_port_in_range(state_file_dir, fs_id, mountpoint, ports_to_try)
Contributor

nit: tls_port_sock

Contributor Author

Renamed.

Comment on lines +970 to +985
mount_filename = get_mount_specific_filename(fs_id, mountpoint, tls_port)
config_file = get_stunnel_config_filename(state_file_dir, mount_filename)
if os.access(config_file, os.R_OK):
logging.info("confifguration for port %s already exists, trying another port", tls_port)
continue
Contributor

Is this necessary? If the port is already in use, the bind will fail anyway.

Contributor Author

It's not necessary, but when the port binding fails it helps to tell whether we're clashing with other processes. It makes debugging easier.

@@ -1430,7 +1436,8 @@ def bootstrap_tls(
state_file_dir=STATE_FILE_DIR,
fallback_ip_address=None,
):
-tls_port = choose_tls_port(config, options)
+sock = choose_tls_port(state_file_dir, fs_id, mountpoint, config, options)
+tls_port = sock.getsockname()[1]
Contributor

nit: Wrap this socket.getsockname() call in a function so that the unit tests can use that function directly.

Contributor Author

Since we're always interested in both the socket and the port, I changed the function to return a tuple instead.
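
A tuple-returning helper along these lines might look like the sketch below. `bind_port` is a hypothetical name and is not taken from the patch; it only illustrates handing back the socket together with the port it is actually bound to.

```python
import socket


def bind_port(port):
    """Bind to `port` on localhost and return (sock, bound_port).

    Passing 0 lets the OS pick a free port; getsockname() then reports
    the port that was really assigned.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
    except OSError:
        # Do not leak the descriptor when the bind fails.
        sock.close()
        raise
    return sock, sock.getsockname()[1]
```

With this shape, a unit test can replace the helper with a stub that returns a fake socket object and a fixed port number, without having to fake `getsockname()` separately.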

Comment on lines 1516 to 1517
# close the socket now, so the stunnel process can bind to the port
sock.close()
Contributor

You should put the close in a finally block, so that if any step fails between creating the socket and closing it, the socket is still closed cleanly.

Contributor Author

Good point. Done.
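
For readers following along, the suggested pattern could be sketched like this. The function name and its parameters are assumptions made for the illustration; the real bootstrap code does considerably more work between binding and closing.

```python
import socket


def write_config_then_release(sock: socket.socket, config_file: str, config_text: str) -> None:
    """Write the stunnel config while the port is still held, then free the port."""
    try:
        with open(config_file, "w") as f:
            f.write(config_text)
    finally:
        # Runs on success and on failure, so the probed port is never leaked.
        sock.close()
```

If writing the file raises, the exception still propagates to the caller, but the socket has already been closed by then.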

@tsmetana
Contributor Author

tsmetana commented Dec 2, 2022

> The fix is not permanent, since there is no guarantee against a race in the window between closing the socket and launching stunnel, though that window is very short. We will work on a long-term fix for this.

Could you elaborate on this, please? The important part here is the configuration file: it has to be written before the socket is closed and the stunnel process is launched. No other efs-mount can choose the same port, since it would either find the existing configuration file (which essentially also serves as a lock for the chosen port) or fail to bind the port if another efs-mount is still creating its configuration (i.e. the file is not written yet, but the other efs-mount has already chosen the port). This is why the check for the config file's existence is there, and why I wanted to explicitly log when another efs-mount is trying to use the same port.

@Cappuccinuo
Contributor

> No other efs-mount can choose the same port since it would either find the existing configuration file

I think that only applies to the same file system, right? The configuration file name is built from the fs, the mountpoint, and the tlsport, so the check would not catch a mount of another file system.

Anyway, I will merge this change, push another commit, and bump the release version. We can continue the discussion here in the thread; I will add more detail to it. Thanks for the PR!

@Cappuccinuo merged commit 478f009 into aws:master on Dec 2, 2022

@tsmetana
Contributor Author

tsmetana commented Dec 2, 2022

> I think that only applies to the same file system, right? The configuration file name is built from the fs, the mountpoint, and the tlsport, so the check would not catch a mount of another file system.

True. So adding a separate lock file named after just the port should be sufficient. Or remove the fs and mount point from the config file name if they're not used for anything else (I would have to check), or at worst parse the filename and check just the port part...
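
A port-only lock file of the kind mentioned here could be sketched as below. The lock file name and both helper functions are hypothetical; the sketch relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, so only one process can create the lock for a given port.

```python
import os


def port_lock_path(state_dir, port):
    # Hypothetical name that depends on the port only, not on fs or mountpoint.
    return os.path.join(state_dir, "port-%d.lock" % port)


def try_lock_port(state_dir, port):
    """Atomically claim `port`; return True if this process now owns it."""
    try:
        fd = os.open(port_lock_path(state_dir, port),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        # Another process created the lock first.
        return False
    os.close(fd)
    return True


def unlock_port(state_dir, port):
    """Release the claim, for example when the mount goes away."""
    os.remove(port_lock_path(state_dir, port))
```

A real implementation would also need some way to clean up stale lock files left behind by a crashed process; that part is not covered by this sketch.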

@tsmetana
Contributor Author

tsmetana commented Dec 2, 2022

@Cappuccinuo I think that by removing the config file existence check in the latest patch you actually introduced the race you described, even for the single fsid/mountpoint case: now we rely on the (uncertain) assumption that stunnel binds the socket before another efs-mount tries it out... If we checked the config file, the time between closing the socket and stunnel binding the port would not matter; the port wouldn't even be probed.
