Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement connection retries #10

Merged
merged 3 commits into from
Oct 21, 2018
Merged

Conversation

dagwieers
Copy link
Contributor

This implements the same solution for retrying connections as in pywinrm diyan/pywinrm#174

@coveralls
Copy link

coveralls commented Oct 7, 2018

Pull Request Test Coverage Report for Build 115

  • 5 of 5 (100.0%) changed or added relevant lines in 1 file are covered.
  • 3 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.09%) to 98.358%

Files with Coverage Reduction New Missed Lines %
pypsrp/negotiate.py 1 78.38%
pypsrp/serializer.py 1 99.76%
pypsrp/spgnego.py 1 87.75%
Totals Coverage Status
Change from base Build 112: -0.09%
Covered Lines: 3354
Relevant Lines: 3410

💛 - Coveralls

@jborean93
Copy link
Owner

@dagwieers once merged #12 will fix the Python 2.6 issues.

Copy link
Owner

@jborean93 jborean93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @dagwieers, do you have a easy to reproduce script/playbook that I can use to test this out?

pypsrp/wsman.py Outdated
connect=self.reconnection_retries,
read=self.reconnection_retries,
backoff_factor=self.reconnection_backoff,
status_forcelist=[425, 429, 503],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these status codes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three of these are HTTP status codes that allow for retrying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact HTTP 413, HTTP 429 and HTTP 503 return a Retry-After header that influences the backoff value. This way the server can tell the client to come back at some later interval (because it hasn't finished starting, or is handling too many connections).

Before our implementation would ignore those. (TBH I don't know if Microsoft has any of this implemented, but it doesn't hurt to do it anyway because a proxy or gateway could be used)

pypsrp/wsman.py Show resolved Hide resolved
pypsrp/wsman.py Show resolved Hide resolved
pypsrp/wsman.py Outdated
@@ -126,6 +126,7 @@ def __init__(self, server, max_envelope_size=153600, operation_timeout=20,
path="wsman", auth="negotiate", cert_validation=True,
connection_timeout=30, encryption='auto', proxy=None,
no_proxy=False, locale='en-US', data_locale=None,
reconnection_retries=4, reconnection_backoff=2.0,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your reasoning for making this the default? I'm leaning towards keeping the existing behaviour rather than defaulting to retry on newer versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reliability is more important than keeping the existing behavior, especially if the new behavior is identical to the existing behavior when everything works fine. Only when things don't work fine immediately (Connection Refused, Service Unavailable, Too Early, Too Many Requests) we retry, which is exactly what we should be doing (knowing what we know now).

The reason for choosing 4 retries with a backoff of 2 seconds means we will retry 4 times no longer than 30 seconds. Which is exactly the same as we have specified for a connection timeout. So I think this acceptable, if you accept that an unresponsive system will also take 30 seconds by default before it fails.

Not retrying in the above cases is the wrong behavior IMO.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So with this, when does connection_timeout apply and retry apply? It sounds like if there's no response after 30 seconds then the connection timeout will be reached? Is it going to try to reconnect again or is the reconnects only on issues that happen post connection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation is not clear on this, but normally the connection timeout is per connection (not the total time) this should be apparent when you have slow DNS lookups or DNS problems, in this case the DNS latency is added to the connection timeout as well.

If you use retry=4, each connection has the same timeout. For the successful case there's no difference in behavior, for the case where the connection was established (but faulty) there's also no difference in behavior. But for the case where the server does not respond, and retries are attempted (not sure if it retries in this case, I would assume not) the connection timeout could be 4x as long. But it's easy to test this behavior in fact.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try it out, I'm wondering whether this will affect the logic in win_reboot and we should override some values in there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we may need to look at win_reboot, although the default 4 retries and backoff value of 2 seconds works fine in our tests for win_reboot too. But it's probably not really needed.

Copy link
Contributor Author

@dagwieers dagwieers Oct 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in one of my tests connecting with WinRM to a non-existing server, it timed out after 30 seconds. Which means that it doesn't timeout for every retry, the timeout appears to cover the total time (including retries). Not sure whether this is because of urllib3, requests or a lower-level mechanism in this case.

@dagwieers
Copy link
Contributor Author

dagwieers commented Oct 8, 2018

@jborean93 This is an example where we need to recover gracefully:

- name: Mount System-Center ISO image
  win_disk_image:
    image_path: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    state: present
  register: iso

# NOTE: Debug logs are available from: C:\ProgramData\VMMLogs\SetupWizard.log
# NOTE: The SCVMM installer does not support a config file in UTF-8 (with BOM).
# FIXME: This action requires that pywinrm can survive 30+ seconds of WinRM service unavailability
#        And needs a patch for winrm/transport.py, see: https://github.com/ansible/ansible/issues/25532
- name: Run System-Center installer
  win_command: '{{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini" /SqlDBAdminDomain "{{ windows_domain }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}" /VmmServiceDomain "{{ windows_domain }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}" /IACCEPTSCEULA'
  args:
    creates: C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe
  vars:
    ansible_user: '{{ windows_domain }}\{{ windows_admin_user }}'
  async: 1000
  poll: 15

- name: Unmount System-Center ISO image
  win_disk_image:
    image_path: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    state: absent

But it is not the easiest example.

BTW Async is needed here to avoid: WSManFaultError: Received a WSManFault message. (Code: 995, Machine: 38.38.51.3, Reason: The I/O operation has been aborted because of either a thread exit or an application request.)

@jborean93
Copy link
Owner

No worries, I'll try it out and see how I go.

@dagwieers
Copy link
Contributor Author

I added HTTP 413 which can also return a Retry-After header. And I forced read=0 as well.

@dagwieers
Copy link
Contributor Author

I did some tests:

  • It takes exactly 30 seconds (with a ConnectTimeoutError) when the system is unreachable. (System was powered off). So here clearly the connect_timeout won from the retries.
  • It takes about 5 seconds (with a NewConnectionError) when the system has "No route to host" (System was not routeable)
  • When I use another HTTP server to do WinRM communication (httpbin.org port 80), I was expecting the transport to fail early and fast, but it takes 4 minutes. Which is peculiar because I wouldn't expected more than 2 minutes (4 x 30 seconds). This needs some investigation.

@jborean93
Copy link
Owner

jborean93 commented Oct 13, 2018

Just did my first test, will add any other tests/issues I come across to this post;

  • When using SSL, if the certificates aren't validated then the retries will occur instead of failing straight away it will retry a total of 4 seconds times. We shouldn't retry on these types of errors.

@dagwieers
Copy link
Contributor Author

dagwieers commented Oct 13, 2018

it will retry a total of 4 seconds

Did you look at the logs, because with our defaults that seems impossible.

@jborean93
Copy link
Owner

jborean93 commented Oct 13, 2018

Sorry 4 times not seconds, total of 30 seconds

@dagwieers
Copy link
Contributor Author

dagwieers commented Oct 13, 2018

@jborean93 Right, I guess SSL errors are also connection errors, just like "Connection refused" and other connection errors. We cannot distinguish one from another within the requests Retry framework. And the intention is clearly that we do want to retry for "Connection refused" and other connection errors.

I expect other HTTP clients using requests with Retry will have the same behavior. Not sure if we can/want to influence this behavior internally.

Note that the following link actually recommends using the Retry framework over build-your-own especially for SSL-related issues: https://www.peterbe.com/plog/best-practice-with-retries-with-requests

@jborean93
Copy link
Owner

Hmm all these edge cases and the compounding of the connection timeout and backoff_factor is really pushing me against making retries the default and just keep it configurable by an end user if they really want this feature.

I also question the validity of the status code retries, I'm trying to come up with a scenario where we would want to try on a non-200 error message. For example a 413 error message is due to us not setting the max envelope size properly and we send a packet that's too big for the server to handle. No amount of retries will change this behaviour and would just be a waste of time. Have you come across this before that requires these retries?

@dagwieers
Copy link
Contributor Author

@jborean93 https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/413

It can return a Retry-After header, which we wouldn't be supporting if we don't implement Retry. This is one of the use-cases of the mechanism. But I'm not married to HTTP 413.

@dagwieers
Copy link
Contributor Author

Now retries is off by default, please re-review.

@jborean93
Copy link
Owner

Looks like requests has issued an update that is not compatible with the tests, I will fix that in a separate commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants