[BUG] Reruning epicli may fail for clustered Postgres #1702
Comments
The log below is an example of an Azure CLI command not returning results in a consistent order.
As you can see, the order is not consistent.
And if you re-run the first command again, you get the correct order.
Alas, we depend on the function.
Let's try to sort it and see if that resolves the problem for me; any comments are welcome. APIProxy.py
This is related to issue hitachienergy#1702. In short, Epiphany returns the Ansible host groups (for features such as postgres) in a nondeterministic order between runs. For example, on the first run groups['postgres'][0] is mapped to vm-1, while on the second run it may map to vm-0. As a result, roles (like postgres or kafka) that depend on the order to select the primary node or to derive a node_id will fail if we re-run them. This change attempts to solve it by sorting the result returned by `get_ips_for_feature`.
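A minimal sketch of the idea (the return shape and key names are assumptions, not the actual APIProxy.py code):

```python
# Hypothetical sketch: make the host group order deterministic by sorting the
# machines returned for a feature by hostname before building the inventory.
# The dict keys and function name are assumptions for illustration only.
def get_ips_for_feature_sorted(machines):
    """Return (hostname, ip) pairs in a stable, hostname-sorted order."""
    pairs = [(m["hostname"], m["ip"]) for m in machines]
    return sorted(pairs, key=lambda pair: pair[0])

# Two runs returning the same machines in a different order now agree:
run_1 = [{"hostname": "vm-1", "ip": "10.0.0.11"},
         {"hostname": "vm-0", "ip": "10.0.0.10"}]
run_2 = list(reversed(run_1))

assert get_ips_for_feature_sorted(run_1) == get_ips_for_feature_sorted(run_2)
```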
I can confirm the same behavior in #1577.
Outstanding pull request for this suggestion: #1706. I think this should work, but we need to test it.
I'm afraid that sorting hostnames in AWS is not a complete solution. It will work only under the assumption that nobody will add new nodes or remove old ones from the cluster. It should rather be sorted by the timestamp when the VM was created, or something similar, but not by the hostname 🤔
I already thought about that, but I decided to use the hostname for readability over odd cases, and to try to declare those odd cases as unsupported activities. We should aim for consistent naming when creating VMs and discourage users from handling a mixture of manual and automated changes. I will look into the AWS hostname creation pattern again soon and see if we can impose the rules (using a tag, maybe) based on the timestamp. IMHO we should aim for: not allowing users to add VMs manually into the system in the hope that just running the epicli Ansible stage will scale it. Users should:
I admit that sometimes there is an exception case. Terraform often fails because of the Azure API. Usually I fix this by manually removing things and re-running, which should work for our case here. For the on-prem case this still works if users know what they are doing when naming their cluster, and as a best practice they should do it for clarity and readability anyway. We just need to document it clearly. So in short, in this case I favor readability and complete documentation over hidden behavior.
OK, to be completely transparent: we already faced this issue when implementing the Kubernetes HA code. In AWS, Epiphany uses autoscaling groups to create the virtual machines, and the Terraform AWS provider does not control the instances inside the ASG. So let's say we add a VM (increase the size); then the hostname (and inventory_hostname as well) will be completely different. OK, so in the case of Postgres, where we have 1 master and 1 replica and it's static forever, that does not matter at all, but I'm referring to the bigger picture. Do we want to fix just Postgres in Azure, or do we want to make things right for everybody forever? 🤔 There are two ways to make things "right" IMHO:
BTW (because of the ASGs), removing VMs in AWS is currently a potential suicide, because we have no control over which VMs will be destroyed. I may be wrong though, please prove me wrong, guys. 🤗
You can set the behavior for which hosts the ASG will kill off when downsizing. There are complexities we need to address in order to make this work for most cases (not all, as you may never be able to achieve that anyway); the AWS ASG one is an example. However, using a hostname sort we can still do it. What I did before was modify the Launch Configuration and add a bit of userdata scripting in the ASG, so that it parses some logic and then creates the hostname (or whatever condition we might end up with) based on, for example, the timestamp and a naming-convention template. That script then updates the AWS VM. I have not looked over the AWS implementation thoroughly, but IMHO the use of ASGs is dangerous in most deployment settings (postgres, kafka) except for scaling out Kubernetes nodes. And I think autoscaling Kubernetes nodes is the most important feature we should keep if we can. For Azure I don't see that we have implemented it yet, but if AWS already supports it there is no reason to remove it. So in short:
Let me know if the way I propose for the ASG LC is OK so I can go ahead and implement it.
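Roughly, the tagging part might look like the sketch below (boto3-based; the tag key, name template, and region are assumptions for illustration, not an actual Epiphany implementation):

```python
# Hypothetical sketch of the LC/userdata idea: derive a stable, sortable name
# for an ASG instance from its launch time and tag it, so inventories can be
# ordered deterministically even when instances are replaced.
import boto3

def tag_instance_with_stable_name(instance_id,
                                   prefix="epiphany-postgres",
                                   region="eu-west-1"):
    ec2 = boto3.client("ec2", region_name=region)
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    # Use the launch timestamp so the ordering survives scale-out/scale-in.
    launched = instance["LaunchTime"].strftime("%Y%m%d%H%M%S")
    stable_name = f"{prefix}-{launched}"
    ec2.create_tags(Resources=[instance_id],
                    Tags=[{"Key": "Name", "Value": stable_name}])
    return stable_name
```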
Team, any directions on this? Please let me know so I can proceed to get it fixed and merged. |
Moving forward, it's probably best to remove the ASG for AWS so it works the same as Azure. There are more issues related to ASGs, and we don't need them since we are not going to support the kind of scaling they are meant for.
This issue seems to be solved by changes made to upgrade PostgreSQL to version 13. We should test after merging |
✔️ Fixed in #2501. Added functionality to automatically detect the master node for an already configured cluster. |
Describe the bug
Re-running the deployment causes the postgres role to fail.
Seen in version 0.6.0; the same behaviour is expected in 0.7.0 because the postgres role has not changed.
To Reproduce
Steps to reproduce the behavior:
Build a cluster with postgres on two nodes.
Observe that vm-1 is set as primary and vm-0 as hot standby.
Re-run the deployment with no changes.
The epicli run fails when applying the postgres role.
Expected behavior
There should be no errors.
Config files
OS (please complete the following information):
Cloud Environment (please complete the following information):
Additional context
The reason is that the role uses the condition
groups['postgresql'][0] == inventory_hostname
to decide which host is the primary. On the first run the condition resolves to vm-1. However, on the second run it resolves to vm-0, and because vm-0 is already set up as a standby, the task fails.
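A toy illustration of that failure mode in plain Python (just a model of how the inventory order flips the choice, not Epiphany code):

```python
# Toy illustration: the same two hosts in a different order flip which node
# the "primary" condition selects, which is exactly what breaks a re-run.
def select_primary(postgresql_group):
    # Mirrors groups['postgresql'][0] == inventory_hostname: the first host
    # in the group is treated as the primary.
    return postgresql_group[0]

first_run  = ["vm-1", "vm-0"]   # the API happened to return vm-1 first
second_run = ["vm-0", "vm-1"]   # ...and vm-0 first on the next run

print(select_primary(first_run))   # vm-1 -> configured as primary
print(select_primary(second_run))  # vm-0 -> but vm-0 is already a standby
```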
Below are the logs for this case.
First run
https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/433/console
It picks vm-1 as the master.
epicli postgres role in replication-repmgr-Debian.yml
Now re-run it.
https://abb-jenkins.duckdns.org:8080/view/Development/job/DEPLOY-de-cluster/434/console
As you can see, it picks vm-0 now.
And then it fails because vm-0 is not the primary; vm-1 is.
There is a 50% chance it is OK, if
groups['postgresql'][0] points to vm-1
Thus the issue is not 100% reproducible and is easily skipped or ignored.
Suggestion to fix.
We need a stable mechanism for selecting nodes, especially for roles such as postgres that depend on node order to make a decision. I believe the kafka role will suffer from the same issue when generating the node_id.
For Azure it may be easy by using the VM-name pattern (the last part is a number), but it might not be portable across providers such as AWS. I don't know what the hostnames look like in AWS.
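If we did rely on such a pattern, a natural sort (ordering by the numeric suffix instead of lexically) would keep the group order stable; a rough sketch, assuming a `<prefix>-<number>` naming scheme:

```python
# Hypothetical natural sort for hostnames like "vm-0", "vm-1", "vm-10",
# so that "vm-10" does not sort before "vm-2".
import re

def natural_key(hostname):
    # Split into text and number chunks: "vm-10" -> ["vm-", 10, ""]
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", hostname)]

hosts = ["vm-10", "vm-2", "vm-0", "vm-1"]
print(sorted(hosts))                    # ['vm-0', 'vm-1', 'vm-10', 'vm-2'] (lexical)
print(sorted(hosts, key=natural_key))   # ['vm-0', 'vm-1', 'vm-2', 'vm-10']
```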
Looking in the code where
AnsibleInventoryCreator.py
adds the group, I found that it is a bit harder to fix from there because of how the values are returned while iterating in Python. So for now I don't have a good way to deal with this. I may need to look more into the Terraform templates to see the hostname rules they generate, and maybe rely on consistent hostname pattern matching.