
[BUG] Rerunning epicli may fail for clustered Postgres #1702

Closed
sunshine69 opened this issue Sep 25, 2020 · 13 comments

@sunshine69
Contributor

sunshine69 commented Sep 25, 2020

Describe the bug
Re-running the deployment causes failures in the postgres role.

Seen in version 0.6.0; the same behaviour is expected in 0.7.0 because the postgres role has not changed.

To Reproduce
Steps to reproduce the behavior:

Build a cluster with Postgres on two nodes

Observe that vm-1 is set as primary and vm-0 as hot standby

Re-run the deployment with nothing changed

The epicli run fails when applying the postgres role.

Expected behavior
There should be no errors.

Config files

OS (please complete the following information):

  • OS: Ubuntu 18.04

Cloud Environment (please complete the following information):

  • Cloud Provider: MS Azure

Additional context

The reason is that the role uses the condition groups['postgresql'][0] == inventory_hostname to decide which host is the primary. On the first run the condition resolved to vm-1.

However, on the second run it resolved to vm-0, and because vm-0 is already set up as a standby, the task failed.

Below are the relevant logs.

First run

https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/433/console

It picks vm-1 as the master:

02:23:07 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
02:23:07 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-0]
02:23:07 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-1]

The epicli postgres role in replication-repmgr-Debian.yml:

# Master:
- name: Check if master is already registered in repmgr
  become_user: postgres
  shell: >-
    set -o pipefail &&
    {{ repmgr_bindir[ansible_os_family] }}/repmgr cluster show -f {{ repmgr_config_dir[ansible_os_family] }}/repmgr.conf | grep -i {{ inventory_hostname }} | grep -v standby
  changed_when: false
  register: is_master_already_registered
  failed_when: is_master_already_registered.rc == 2
  args:
    executable: /bin/bash
  when:
    - groups['postgresql'][0] == inventory_hostname

Now re-run it.

https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/434/console

06:15:30 INFO cli.engine.ansible.AnsibleCommand - TASK [postgresql : Check if master is already registered in repmgr] ************
06:15:30 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:31 INFO cli.engine.ansible.AnsibleCommand - ok: [de-stdbase-postgresql-vm-0]

As you can see, it picks vm-0 now, and then it fails because vm-0 is not the primary; vm-1 is.

06:15:35 INFO cli.engine.ansible.AnsibleCommand - skipping: [de-stdbase-postgresql-vm-1]
06:15:36 INFO cli.engine.ansible.AnsibleCommand - fatal: [de-stdbase-postgresql-vm-0]: FAILED! => {"changed": true, "cmd": "/usr/bin/repmgr primary register -f /etc/postgresql/10/main/repmgr.conf --force --superuser=epi_repmgr_admin", "delta": "0:00:00.044363", "end": "2020-09-25 06:15:36.171462", "msg": "non-zero return code", "rc": 1, "start": "2020-09-25 06:15:36.127099", "stderr": "ERROR: server is in standby mode and cannot be registered as a primary", "stderr_lines": ["ERROR: server is in standby mode and cannot be registered as a primary"], "stdout": "", "stdout_lines": []}

There is a 50% chance that a rerun is fine, if groups['postgresql'][0] happens to point to vm-1 again.

Thus the issue is not 100% reproducible and is easily skipped/ignored.

Suggestion to fix:

We need a stable mechanism for selecting nodes, especially for roles such as postgres that depend on the order of nodes to make a decision. I believe the kafka role will suffer the same issue when it derives the node_id.

For Azure it may be easy by using the VM-name host pattern (the last segment is a number), but it might not be portable across providers such as AWS; I don't know what the hostnames look like in AWS.

Looking at the code in AnsibleInventoryCreator.py that builds the groups, I found it is a bit harder to fix there because of how the Python code returns results while iterating. So for now I don't have a good way to deal with this.

I may need to look more into the Terraform templates to see the hostname rules they generate and maybe rely on consistent hostname pattern matching.
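
For illustration only, below is a minimal sketch (a hypothetical helper, not part of epicli) of what sorting by the VM-name pattern could look like, assuming every hostname ends in -<number> as on Azure:

    import re

    def vm_name_sort_key(hostname):
        """Sort hosts by the numeric suffix of names like 'de-stdbase-postgresql-vm-10',
        so 'vm-10' does not sort before 'vm-2' as it would lexicographically."""
        match = re.search(r'^(.*)-(\d+)$', hostname)
        # Fall back to plain lexicographic order when there is no numeric suffix.
        return (match.group(1), int(match.group(2))) if match else (hostname, -1)

    hosts = ['de-stdbase-postgresql-vm-10', 'de-stdbase-postgresql-vm-2', 'de-stdbase-postgresql-vm-1']
    print(sorted(hosts, key=vm_name_sort_key))
    # ['de-stdbase-postgresql-vm-1', 'de-stdbase-postgresql-vm-2', 'de-stdbase-postgresql-vm-10']

Note that such a key only makes index-based selection deterministic; it does not reconcile a cluster whose primary was chosen under a different ordering on the first run.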

@sunshine69 sunshine69 changed the title [BUG] Re run the dpeloyment causing fail in postgres roles [BUG] Re run the deployment causing fail in postgres role Sep 25, 2020
@sunshine69
Contributor Author

The log below is an example of the Azure CLI command not returning results in a consistent order.

stevek@ecdfa3b752e8:/work$ az vm list-ip-addresses --ids $(az resource list --query "[?type=='Microsoft.Compute/virtualMachines' && tags.kafka == '' && tags.cluster == 'de-stdbase'].id" --output tsv)
[
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-2",
        "network": {
          "privateIpAddresses": [
            "10.160.19.21"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ],
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-1",
        "network": {
          "privateIpAddresses": [
            "10.160.19.20"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ],
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-3",
        "network": {
          "privateIpAddresses": [
            "10.160.19.23"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ],
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-0",
        "network": {
          "privateIpAddresses": [
            "10.160.19.22"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ]
]

As you can see, the order is not consistent.
However, fetching only the IDs now seems to return them in order:

stevek@ecdfa3b752e8:/work$ az resource list --query "[?type=='Microsoft.Compute/virtualMachines' && tags.kafka == '' && tags.cluster == 'de-stdbase'].id" --output tsv
/subscriptions/d7198735-a0aa-4871-b234-5def4356a9be/resourceGroups/de-stdbase-rg/providers/Microsoft.Compute/virtualMachines/de-stdbase-kafka-vm-0
/subscriptions/d7198735-a0aa-4871-b234-5def4356a9be/resourceGroups/de-stdbase-rg/providers/Microsoft.Compute/virtualMachines/de-stdbase-kafka-vm-1
/subscriptions/d7198735-a0aa-4871-b234-5def4356a9be/resourceGroups/de-stdbase-rg/providers/Microsoft.Compute/virtualMachines/de-stdbase-kafka-vm-2
/subscriptions/d7198735-a0aa-4871-b234-5def4356a9be/resourceGroups/de-stdbase-rg/providers/Microsoft.Compute/virtualMachines/de-stdbase-kafka-vm-3

And if you re-run the first command again, you now get the correct order:

stevek@ecdfa3b752e8:/work$ az vm list-ip-addresses --ids $(az resource list --query "[?type=='Microsoft.Compute/virtualMachines' && tags.kafka == '' && tags.cluster == 'de-stdbase'].id" --output tsv)
[
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-0",
        "network": {
          "privateIpAddresses": [
            "10.160.19.22"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ],
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-1",
        "network": {
          "privateIpAddresses": [
            "10.160.19.20"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ],
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-2",
        "network": {
          "privateIpAddresses": [
            "10.160.19.21"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ],
  [
    {
      "virtualMachine": {
        "name": "de-stdbase-kafka-vm-3",
        "network": {
          "privateIpAddresses": [
            "10.160.19.23"
          ],
          "publicIpAddresses": []
        },
        "resourceGroup": "de-stdbase-rg"
      }
    }
  ]
]

Alas, we depend on the function get_ips_for_feature in APIProxy.py to return the list, which is then appended to form the groups.

@sunshine69
Contributor Author

Let's try sorting it and see if that resolves the problem for me; any comments are welcome.

APIProxy.py

    def get_ips_for_feature(self, component_key):
        look_for_public_ip = self.cluster_model.specification.cloud.use_public_ips
        cluster = cluster_tag(self.cluster_prefix, self.cluster_name)
        running_instances = self.run(self, f'az vm list-ip-addresses --ids $(az resource list --query "[?type==\'Microsoft.Compute/virtualMachines\' && tags.{component_key} == \'\' && tags.cluster == \'{cluster}\'].id" --output tsv)')
        result = []
        for instance in running_instances:
            if isinstance(instance, list):
                instance = instance[0]
            name = instance['virtualMachine']['name']
            if look_for_public_ip:
                ip = instance['virtualMachine']['network']['publicIpAddresses'][0]['ipAddress']
            else:
                ip = instance['virtualMachine']['network']['privateIpAddresses'][0]
            result.append(AnsibleHostModel(name, ip))
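        # Sort by VM name so that the resulting Ansible group order (and groups[<feature>][0]) is deterministic across runs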
        result.sort(key=lambda x: x.name, reverse=False)
        return result
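
For what it's worth, a quick sanity check of the sorting idea (hypothetical snippet, reusing names and IPs from the az output above; the namedtuple is only a stand-in for AnsibleHostModel):

    from collections import namedtuple

    # Minimal stand-in for AnsibleHostModel; only .name matters for the sort.
    Host = namedtuple('Host', ['name', 'ip'])

    run_1 = [Host('de-stdbase-kafka-vm-2', '10.160.19.21'),
             Host('de-stdbase-kafka-vm-1', '10.160.19.20'),
             Host('de-stdbase-kafka-vm-0', '10.160.19.22')]
    run_2 = [Host('de-stdbase-kafka-vm-0', '10.160.19.22'),
             Host('de-stdbase-kafka-vm-1', '10.160.19.20'),
             Host('de-stdbase-kafka-vm-2', '10.160.19.21')]

    # After sorting, index 0 is the same host regardless of the order az returned.
    assert sorted(run_1, key=lambda x: x.name)[0] == sorted(run_2, key=lambda x: x.name)[0]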

sunshine69 pushed a commit to sunshine69/epiphany that referenced this issue Sep 26, 2020
This is related to issue hitachienergy#1702.

In short, epiphany returns the Ansible host groups (for a feature such as
postgres) in a non-deterministic order between runs. For example, on the first
run groups['postgres'][0] is mapped to vm-1 and on the second run it may map
to vm-0.

This causes some roles (like postgres or kafka), which depend on the order to
select the primary node or to derive the node_id, to fail when we re-run the deployment.

This change attempts to solve it by sorting the result returned by `get_ips_for_feature`.
@atsikham
Contributor

atsikham commented Sep 29, 2020

I can confirm the same behavior in #1577.

@seriva
Collaborator

seriva commented Sep 29, 2020

There is an outstanding pull request for his suggestion: #1706. I think this should work but we need to test it.

@sk4zuzu
Contributor

sk4zuzu commented Sep 29, 2020

I'm afraid that sorting hostnames in AWS is not a complete solution. It will work only under the assumption that nobody will add new nodes to, or remove old nodes from, the cluster. It should rather be sorted using the timestamp when the VM was created, or something similar, but not the hostname 🤔
Referring to this line: https://github.com/epiphany-platform/epiphany/pull/1706/files#diff-20056616cbf0a609d4a1ac1d280b8eeaR26
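
A tiny illustration of that concern (hypothetical AWS-style hostnames, sorted lexicographically as in the proposed fix):

    # With ASG-generated names the new VM's hostname is effectively random, so after
    # a scale-out it can sort in front of the existing hosts and every index-based
    # role assignment shifts.
    before = sorted(['ip-10-0-1-17', 'ip-10-0-3-250'])
    after = sorted(['ip-10-0-1-17', 'ip-10-0-3-250', 'ip-10-0-0-42'])

    print(before[0])  # ip-10-0-1-17 -> treated as the primary on the first run
    print(after[0])   # ip-10-0-0-42 -> a brand-new, empty VM would now be "the primary"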

@to-bar to-bar changed the title [BUG] Re run the deployment causing fail in postgres role [BUG] Rerunning epicli may fail for clustered Postgres Sep 29, 2020
@sunshine69
Contributor Author

I already thought about that, but I decided to use the hostname for readability over odd cases, and to try to declare the odd cases as unsupported activities.

We should aim for naming consistency when creating VMs and discourage users from handling a mixture of manual and automated provisioning.

I will look into the AWS hostname creation pattern again soon and see if we can impose the rules (maybe using tags) based on the timestamp.

IMHO we should aim for:

Not allowing users to add VMs manually into the system in the hope that only running the epicli Ansible stage will scale it. Users should:

  • Add more nodes in the cluster config
  • Run epicli again with full stages

I admit that sometimes there are exception cases. Terraform often fails because of the Azure API. Usually I fixed it by manually removing things and re-running, which should work for our case here.

For the on-prem case this still works if users know what they are doing when naming their cluster; as a best practice they should do it anyway for clarity and readability. We just need to document it clearly.

So in short, in this case I favor readability and complete documentation over hidden behaviour.

@sk4zuzu
Contributor

sk4zuzu commented Sep 29, 2020

OK, to be completely transparent: we already faced this issue when implementing the Kubernetes HA code.

In AWS, Epiphany uses autoscaling groups to create virtual machines. The Terraform AWS provider does not control
specific VMs inside the ASG; all we can do with it is increase or decrease the size of the ASG. In other
words, we have no control over the ordering of the VMs inside the ASG.

So let's say we add a VM (increase the size); then the hostname (and inventory_hostname as well) will be completely
random and the ordering inside the inventory will change. Sorting hostnames in that case will not help (in general) because
the new VM could land between 2 other machines in the group.

OK, so in the case of Postgres, where we have 1 master and 1 replica and it's static forever, that does not matter at all, but I'm referring to the bigger picture. Do we want to fix just the Postgres in Azure or do we want to make things right for everybody forever? 🤔

There are two ways to make things "right" IMHO:

  1. Find a way to order inventory hostnames in such a manner that adding new ones will NOT cause sorting procedure to produce different ordering.
  2. Remove ASGs completely as we don't use any of the advanced features they provide and just stick to "count/for" loops in Terraform (we do that in Azure).

BTW, (because of the ASGs) removing vms in AWS currently is a potential suicide, because we have no control over which vms will be destroyed.

I may be wrong though, please prove me wrong guys. 🤗

@sunshine69
Contributor Author

sunshine69 commented Oct 6, 2020

Do we want to fix just the Postgres in Azure or do we want to make things right for everybody forever?

  • Surely we want to make things right for everybody forever.

we have no control over which vms will be destroyed.

You can set the behavior for which hosts the ASG will kill off when downsizing.

There are complexities we need to address in order to make it work for most cases (not all, as you may never be able to achieve that anyway). The AWS ASG one is an example.

However, using hostname sorting we can still do it. What I did before was to modify the Launch Configuration and add a small userdata script to the ASG that applies some logic and then creates the hostname (or whatever condition we might end up with) based on, for example, the timestamp and a naming-convention template. That script updates the AWS VM hostname to satisfy the sort-order requirements.

I have not looked over the AWS implementation thoroughly, but IMHO the use of ASGs is dangerous in most deployment settings (postgres, kafka) except for scaling out Kubernetes nodes. And I think auto-scaling Kubernetes nodes is the most important feature we should keep if we can. For Azure I don't see that we have implemented it yet, but if AWS already supports it there is no reason to remove it.

So in short:

  • Use the Launch Configuration's user data to name the host based on the launch time in AWS (more details needed on which spec is stable for the ASG to determine the host launch history).
  • Make sure ASGs are not used for unsupported servers.

@sunshine69
Contributor Author

Let me know if the approach I propose for the ASG Launch Configuration is OK, so I can go ahead and implement it.

@sunshine69
Contributor Author

Team, any directions on this? Please let me know so I can proceed to get it fixed and merged.

@seriva
Collaborator

seriva commented Oct 23, 2020

Moving forward, it's probably best to remove the ASGs for AWS so it works the same as Azure. There are more issues related to ASGs, and we don't need them as we are not going to support the kind of scaling they are meant for.

@to-bar
Contributor

to-bar commented Jul 30, 2021

This issue seems to be solved by the changes made to upgrade PostgreSQL to version 13. We should test it after merging the feature/upgrade-postgresql-13 branch to develop.

@przemyslavic
Collaborator

✔️ Fixed in #2501. Added functionality to automatically detect the master node for an already configured cluster.
