SSM connection plugin doesnt properly close connections #494

Closed
Chilinot opened this issue Mar 23, 2021 · 5 comments · Fixed by #542
Labels
bug, has_pr, needs_triage, python3

Comments

@Chilinot
Contributor

Chilinot commented Mar 23, 2021

SUMMARY

When I run a large playbook using the SSM connection plugin, it randomly hangs partway through. I am very rarely able to run the entire playbook without issues.

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ssm connection plugin

ANSIBLE VERSION
ansible 2.10.5
  config file = /Users/xxx/.ansible.cfg
  configured module search path = ['/Users/xxx/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.8/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.8.8 (default, Feb 21 2021, 10:35:39) [Clang 12.0.0 (clang-1200.0.32.29)]
CONFIGURATION
DEFAULT_HOST_LIST(env: ANSIBLE_INVENTORY) = ['/Users/xxx/ansible_hosts']
DEFAULT_VAULT_IDENTITY_LIST(/Users/xxx/.ansible.cfg) = ['xx@~/.vault_pass.txt', 'xxx@~/.vault_pass_xxx.txt', 'xxx@~/.vault_pass_xxx.txt']

Ansible variables used in the playbook for configuring the SSM plugin:

  vars:
    ansible_connection: community.aws.aws_ssm
    ansible_aws_ssm_bucket_name: xxx-ansible-ssm
    ansible_aws_ssm_region: eu-west-1
OS / ENVIRONMENT

Target OS: Amazon-Linux 2

STEPS TO REPRODUCE

I don't have exact steps to reproduce this issue. It seems to affect larger playbooks and happens randomly: sometimes the run dies immediately, sometimes in the middle or near the end, and very rarely it completes without issues.

EXPECTED RESULTS

To complete the playbook without hanging.

ACTUAL RESULTS

When running in verbose mode, these are the last lines printed. I left the playbook running for 10 minutes with no further output, after which I stopped it manually:

....

<i-xxx> ESTABLISH SSM CONNECTION TO: i-xxx
<i-xxx> SSM CONNECTION ID: xxx-0a55f9c52a37613a0
<i-xxx> EXEC echo ~
^C [ERROR]: User interrupted execution

If I SSH to the server, there appear to be a lot of connections left hanging. This is the output of ps -e --forest -o ppid,pid,user,command:
[image: output of ps command]

This has been an issue for me for several releases of the ssm connection plugin.

@ansibullbot

Files identified in the description:

If these files are inaccurate, please update the component name section of the description or use the !component bot command.


@ansibullbot added the bug, needs_triage, and python3 labels on Mar 23, 2021
@Chilinot
Contributor Author

Chilinot commented Mar 23, 2021

Looking more into the issue, it seems that if you run a task with a loop, only the last SSM connection is actually terminated. The rest are left hanging. When running with -vvvv debug output, only one TERMINATE SSM SESSION line is printed, even though a new SSM connection is established for each iteration of the loop.

Once the task is done, a number of connections are left hanging on the server even though the Ansible playbook has finished.

@Chilinot
Contributor Author

Running strace on the hanging connections on the server shows that they appear to be stuck in the read syscall, presumably waiting for input from a dead connection.
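
To illustrate the symptom, here is a minimal sketch (not taken from the plugin) of a process that stays blocked in read until the other end of its stdin pipe is closed. The child command and variable names are hypothetical stand-ins; the leftover processes on the server may be waiting for a different reason.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a leftover session process: it reads stdin until EOF.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdin.read()"],
    stdin=subprocess.PIPE,
)

# Give the child a moment to start. The parent has written nothing and has not
# closed the pipe, so the child cannot finish its read and is still running.
time.sleep(0.5)
still_running = child.poll() is None

# Closing the write end delivers EOF: the blocked read returns and the child exits.
child.stdin.close()
exit_code = child.wait(timeout=10)

print(f"running before close: {still_running}, exit code after close: {exit_code}")
```

If nothing ever closes the writing side, a reader like this waits indefinitely, which would match processes that linger after the playbook has finished.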

@Chilinot changed the title from "ssm connection plugin hangs randomly" to "SSM connection plugin doesnt properly close connections" on Mar 25, 2021
@hgrgic

hgrgic commented Apr 6, 2021

I am experiencing a similar issue. In my case, I have a large set of playbooks with around 200 tasks that are executed across several private EC2 instances. At some point, my tasks start timing out on a particularly "hot" instance.

In the middle of the execution, the Systems Manager console shows over 100 connections with status "connected" in my AWS account.

A sample of the log associated with the failing command is similar to this:
failed: [i-11111222223333344444] (item={u'name': u'some-name', u'value': u'some-value'}) => {"ansible_loop_var": "item", "item": {"name": "some-name", "value": "some-value"}, "msg": "SSM exec_command timeout on host: i-11111222223333344444", "unreachable": true}
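
For anyone who wants to check this without the console, here is a rough sketch that counts active Session Manager sessions per target instance. It assumes a boto3 SSM client and uses its describe_sessions call; the function name count_active_sessions is made up for this example, and this is not part of the plugin.

```python
from collections import Counter


def count_active_sessions(ssm_client):
    """Return a Counter mapping each target instance ID to its number of active sessions.

    ssm_client is assumed to behave like boto3.client("ssm").
    """
    counts = Counter()
    kwargs = {"State": "Active"}
    while True:
        page = ssm_client.describe_sessions(**kwargs)
        for session in page.get("Sessions", []):
            counts[session["Target"]] += 1
        # Results are paginated; keep going until no NextToken is returned.
        token = page.get("NextToken")
        if not token:
            return counts
        kwargs["NextToken"] = token


# Usage (assumes boto3 is installed and AWS credentials and region are configured):
#   import boto3
#   for target, n in count_active_sessions(boto3.client("ssm")).most_common():
#       print(target, n)
```

A count that keeps growing for a single instance while a playbook runs would be consistent with sessions being opened but never terminated.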

@Chilinot
Contributor Author

After debugging the function calls and learning how connection plugins work in Ansible, I was able to determine that a simple destructor was all that was missing to properly clean up the connections after they are used.
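
As a rough illustration of the idea (a sketch only, not the actual change made to the plugin), a destructor that calls close() lets a connection object clean up its session when it is discarded. The class FakeSsmConnection and its open_sessions set are hypothetical stand-ins for the plugin and the server-side sessions, and the example relies on CPython's reference counting to run the destructor promptly.

```python
import gc


class FakeSsmConnection:
    """Hypothetical stand-in for a connection plugin object."""

    # Session IDs that are still open; stands in for sessions left on the server.
    open_sessions = set()

    def __init__(self, session_id):
        self.session_id = session_id
        self._closed = False
        FakeSsmConnection.open_sessions.add(session_id)

    def close(self):
        # Safe to call more than once: only the first call releases the session.
        if not self._closed:
            FakeSsmConnection.open_sessions.discard(self.session_id)
            self._closed = True

    def __del__(self):
        # Destructor: make sure the session is released when the object goes away.
        self.close()


# Simulate a looped task: each iteration creates a new connection object and
# drops the reference to the previous one without calling close() on it.
for item in range(3):
    conn = FakeSsmConnection(f"session-{item}")

del conn      # drop the last reference as well
gc.collect()  # not strictly needed on CPython, included for good measure

remaining = sorted(FakeSsmConnection.open_sessions)
print(remaining)
```

Without the __del__ method, the first two sessions in this simulation would stay in open_sessions, which mirrors the behaviour described above where only the last connection of a loop was terminated.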
