Implement connection retries #10
*Force-pushed from 621540f to 03ed2bd*
**Coveralls:** Pull Request Test Coverage Report for Build 115 💛
@dagwieers Once merged, #12 will fix the Python 2.6 issues.
Thanks for this @dagwieers, do you have an easy-to-reproduce script/playbook that I can use to test this out?
pypsrp/wsman.py (Outdated)

```python
connect=self.reconnection_retries,
read=self.reconnection_retries,
backoff_factor=self.reconnection_backoff,
status_forcelist=[425, 429, 503],
```
Why these status codes?
All three of these are HTTP status codes that allow for retrying.
In fact, HTTP 413, HTTP 429 and HTTP 503 can return a Retry-After header that influences the backoff value. This way the server can tell the client to come back after some interval (because it hasn't finished starting, or is handling too many connections).
Previously, our implementation would ignore those. (TBH I don't know whether Microsoft has implemented any of this, but it doesn't hurt to support it anyway, because a proxy or gateway could be in use.)
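For reference, such a retry policy can be sketched with urllib3's `Retry` class. The parameter values below mirror the defaults discussed in this PR, not pypsrp's actual code, and urllib3 already honours Retry-After for 413, 429 and 503 out of the box:

```python
from urllib3.util.retry import Retry

# Sketch only: values mirror the defaults discussed in this PR.
retry = Retry(
    total=4,
    connect=4,
    read=0,  # do not retry reads: the request may not be idempotent
    backoff_factor=2.0,
    status_forcelist=[413, 425, 429, 503],
)

# urllib3 honours a Retry-After header for these codes by default.
print(sorted(Retry.RETRY_AFTER_STATUS_CODES))  # [413, 429, 503]
```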
pypsrp/wsman.py (Outdated)

```diff
@@ -126,6 +126,7 @@ def __init__(self, server, max_envelope_size=153600, operation_timeout=20,
                  path="wsman", auth="negotiate", cert_validation=True,
                  connection_timeout=30, encryption='auto', proxy=None,
                  no_proxy=False, locale='en-US', data_locale=None,
+                 reconnection_retries=4, reconnection_backoff=2.0,
```
What's your reasoning for making this the default? I'm leaning towards keeping the existing behaviour rather than defaulting to retry on newer versions.
I think reliability is more important than keeping the existing behavior, especially since the new behavior is identical to the existing behavior when everything works fine. Only when things don't work immediately (Connection Refused, Service Unavailable, Too Early, Too Many Requests) do we retry, which is exactly what we should be doing (knowing what we know now).
Choosing 4 retries with a backoff factor of 2 seconds means we retry 4 times over no more than 30 seconds in total, which is exactly what we have specified for the connection timeout. So I think this is acceptable, if you accept that an unresponsive system will also take 30 seconds by default before it fails.
Not retrying in the above cases is the wrong behavior IMO.
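The 30-second figure follows from urllib3's exponential backoff formula, which (in recent urllib3 releases) sleeps `backoff_factor * (2 ** (retry_number - 1))` before each retry; note that older releases skip the sleep before the first retry, so the real total can be a bit lower:

```python
# Sketch of urllib3's backoff schedule with the defaults proposed here.
backoff_factor = 2.0
retries = 4

sleeps = [backoff_factor * (2 ** (n - 1)) for n in range(1, retries + 1)]
print(sleeps)       # [2.0, 4.0, 8.0, 16.0]
print(sum(sleeps))  # 30.0 -> matches the default 30 s connection timeout
```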
So with this, when do `connection_timeout` and the retries apply? It sounds like if there's no response after 30 seconds then the connection timeout will be reached. Will it try to reconnect again, or do the reconnects only apply to issues that happen after the connection is established?
The documentation is not clear on this, but normally the connection timeout is per connection (not the total time). This should be apparent when you have slow DNS lookups or DNS problems; in that case the DNS latency is added on top of the connection timeout as well.
If you use retry=4, each connection attempt has the same timeout. For the successful case there's no difference in behavior, and for the case where the connection was established (but faulty) there's also no difference. But for the case where the server does not respond and retries are attempted (not sure if it retries in this case, I would assume not), the total connection timeout could be 4x as long. It's easy to test this behavior though.
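To make the interaction concrete: with requests, the retry policy lives on the transport adapter while the timeout is passed per request and in principle applies to each attempt, so retries can multiply the worst-case wait. A minimal sketch (the server URL is a placeholder, not anything from this PR):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch: retry policy on the adapter, timeout per request.
session = requests.Session()
adapter = HTTPAdapter(max_retries=Retry(connect=4, read=0, backoff_factor=2.0))
session.mount("https://", adapter)

# e.g. session.get("https://server:5986/wsman", timeout=30)  # 30 s per attempt
print(session.get_adapter("https://server:5986/wsman").max_retries.connect)  # 4
```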
Let's try it out. I'm wondering whether this will affect the logic in win_reboot and whether we should override some values in there.
Yeah, we may need to look at win_reboot, although the default of 4 retries and a backoff value of 2 seconds works fine in our tests for win_reboot too. So it's probably not really needed.
So in one of my tests, connecting with WinRM to a non-existing server timed out after 30 seconds. That means it doesn't time out for every retry; the timeout appears to cover the total time (including retries). Not sure whether this is because of urllib3, requests or a lower-level mechanism in this case.
@jborean93 This is an example where we need to recover gracefully:

```yaml
- name: Mount System-Center ISO image
  win_disk_image:
    image_path: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    state: present
  register: iso

# NOTE: Debug logs are available from: C:\ProgramData\VMMLogs\SetupWizard.log
# NOTE: The SCVMM installer does not support a config file in UTF-8 (with BOM).
# FIXME: This action requires that pywinrm can survive 30+ seconds of WinRM service unavailability
#        and needs a patch for winrm/transport.py, see: https://github.com/ansible/ansible/issues/25532
- name: Run System-Center installer
  win_command: '{{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini" /SqlDBAdminDomain "{{ windows_domain }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}" /VmmServiceDomain "{{ windows_domain }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}" /IACCEPTSCEULA'
  args:
    creates: C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe
  vars:
    ansible_user: '{{ windows_domain }}\{{ windows_admin_user }}'
  async: 1000
  poll: 15

- name: Unmount System-Center ISO image
  win_disk_image:
    image_path: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    state: absent
```

But it is not the easiest example. BTW, async is needed here to avoid:

```
WSManFaultError: Received a WSManFault message. (Code: 995, Machine: 38.38.51.3, Reason: The I/O operation has been aborted because of either a thread exit or an application request.)
```
*Force-pushed from 03ed2bd to 72b3809*
No worries, I'll try it out and see how I go.
*Force-pushed from 72b3809 to f8a5e35*
I added HTTP 413, which can also return a Retry-After header. And I forced read=0 as well.
*Force-pushed from f8a5e35 to e5bb040*
I did some tests:
Just did my first test; I will add any other tests/issues I come across to this post:
Did you look at the logs? Because with our defaults that seems impossible.
Sorry, 4 times, not seconds; a total of 30 seconds.
*Force-pushed from e5bb040 to b7f539b*
@jborean93 Right, I guess SSL errors are also connection errors, just like "Connection refused" and other connection errors. We cannot distinguish one from another within the requests Retry framework, and the intention is clearly that we do want to retry on "Connection refused" and other connection errors. I expect other HTTP clients using requests with Retry to have the same behavior; I'm not sure whether we can or want to influence this behavior internally. Note that the following link actually recommends using the Retry framework over building your own, especially for SSL-related issues: https://www.peterbe.com/plog/best-practice-with-retries-with-requests
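This grouping is visible in requests' own exception hierarchy, where a TLS failure is modelled as a kind of connection error, so a connect-retry policy cannot tell the two apart:

```python
import requests.exceptions as exc

# requests folds TLS failures into its connection-error hierarchy, so
# an SSL failure and "Connection refused" both surface as ConnectionError.
print(issubclass(exc.SSLError, exc.ConnectionError))  # True
```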
Hmm, all these edge cases and the compounding of the connection timeout and backoff_factor are really pushing me against making retries the default; let's just keep it configurable by an end user if they really want this feature. I also question the validity of the status-code retries: I'm trying to come up with a scenario where we would want to retry on a non-200 response. For example, a 413 error is due to us not setting the max envelope size properly, so that we send a packet that's too big for the server to handle. No amount of retries will change this behaviour; they would just be a waste of time. Have you come across a situation that requires these retries?
@jborean93 https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413 It can return a Retry-After header, which we wouldn't support if we don't implement Retry. This is one of the use-cases of the mechanism. But I'm not married to HTTP 413.
Retries are now off by default; please re-review.
Looks like requests has issued an update that is not compatible with the tests; I will fix that in a separate commit.
This implements the same solution for retrying connections as in pywinrm: diyan/pywinrm#174