
Allow to give more than one address for an Endpoint #5976

Open
widhalmt opened this issue Jan 14, 2018 · 17 comments · May be fixed by #8433
Labels
area/distributed Distributed monitoring (master, satellites, clients) enhancement New feature or request queue/wishlist

Comments

@widhalmt
Member

Hi,

While I know split brain is not as critical for Icinga 2 instances as it is for other highly available software, there are scenarios where you definitely don't want two nodes writing different status information into your IDO database.

Many large enterprises and some well-designed smaller networks have multiple networks connecting their servers, e.g. an administrative network, a backup network, a direct link between nodes, and so on.

It would be nice to make use of such extra connections to more precisely determine the status of other endpoints.

Cheers,
Thomas

@widhalmt
Member Author

Since I was asked in private messages I want to add some more information to clarify what this is all about.

Say your Icinga 2 masters are multihomed hosts. They have a NIC in the production LAN to connect to satellites and agents, one in an "admin LAN" used by ops people to connect to the nodes, and one in a "cluster LAN" used for cluster communication.

Say further the firewall people are a bit hung over and cut the connection between your data centers on the production LAN. There you go: you have a split-brain scenario where both masters try to check your hosts and write their findings into the IDO and a grapher, which will lead to a messed-up history.

What I want is to be able to add the IP addresses from all 3 networks to the endpoints, so that when the main link is broken Icinga 2 will communicate over the cluster network to avoid split brain. Icinga 2 will work like nothing happened, but can alert you that one of the three configured connections went down.

@widhalmt
Member Author

Basically, what I want is:

object Endpoint "master02" {
    host = [ "192.168.23.15", "192.168.69.15" ]
}

So, when one link goes down the nodes can still communicate over the other.
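The connection logic this array implies could look like the following minimal Python sketch (hypothetical helper, not Icinga 2 code; 5665 is Icinga 2's default cluster port, the addresses passed in are whatever the config lists): try each configured address in order and use the first one that accepts a TCP connection.

```python
import socket

def connect_endpoint(addresses, port=5665, timeout=5.0):
    """Try each endpoint address in order; return the first working socket."""
    last_error = None
    for addr in addresses:
        try:
            # create_connection() handles IPv4 and IPv6 literals and hostnames
            return socket.create_connection((addr, port), timeout=timeout)
        except OSError as err:
            last_error = err  # remember the failure, fall through to the next address
    raise ConnectionError(f"no configured address reachable: {last_error}")
```

Trying the addresses strictly in array order would also give a natural way to treat the first entry as "primary" and the rest as fallbacks.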

@Crunsher Crunsher added enhancement New feature or request area/distributed Distributed monitoring (master, satellites, clients) labels Jan 15, 2018
@Crunsher
Contributor

Just to be certain, both IPs point to the same Icinga 2 instance? I'm worried about the fallout of one endpoint being two different installations, which would be awkward to keep in sync.

@lazyfrosch
Contributor

Hm, we basically should have the same behavior for dual-stack v4/v6 (only then based on DNS).

@dnsmichi
Contributor

This will be really hard to debug. I would rather use DNS and round-robin returned addresses based on routing hops.

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Jan 15, 2018
@widhalmt
Member Author

@Crunsher : Yes, both to the same Icinga 2 instance. I'd use the certificate's DN to verify that we connected to the same instance on all connections, maybe some more information just to be really sure. This could be a use for the cluster or cluster-zone check: to show that we lost one connection, or that all connections are active but, because of a typo, connected to different instances.

@dnsmichi : I'd really not want to use DNS round robin for this because I don't have any control over which connection to use. A detail that would help with this issue would be a way to mark one connection as primary and the others as fallback. Use the first one in the array? Or a separate option. Besides, I can't imagine users creating DNS entries that map to hosts in both a production and a cluster network.

@dnsmichi
Contributor

dnsmichi commented Jan 15, 2018

I don't think that replacing the current string attribute with an array of strings is a possible migration route either (even if this gets implemented somehow). IP addresses have another problem: if they're renumbered, who checks the monitoring system and updates them? In my opinion, you'll end up with many retries against unreachable addresses, and your cluster won't really work then.

In previous issues we had multiple connections opened per endpoint/client, and we removed that to make the socket I/O as performant as possible. This feature request would slow that down and make it complicated again.

@widhalmt
Member Author

Then we could have another option for endpoints. Say:

object Endpoint "master02" {
    host = "192.168.23.15"
    fallback_hosts = [ "192.168.66.13", "192.168.69.15" ]
}

If you don't want to rely on IP addresses, just use hostnames.

object Endpoint "master02" {
    host = "monitor02.example.com"
    fallback_hosts = [ "monitor02.adm.example.com", "monitor02.bkp.example.com" ]
}

Besides this very issue: I tend to tell users to use IP addresses for Endpoints and Hosts. While it might be more common to change IPs than to have DNS fail completely, changing IPs is usually planned, while a DNS outage might hit you unexpectedly; a major problem like that calls for monitoring to keep an eye on what is still working.

@dnsmichi
Contributor

I still think this overly complicates everything, and solves a niche problem where other tools are doing better. A clear 'no' from my side.

@Crunsher
Contributor

The more I think about this the more problems I see with this idea :/

What about recovery, for example: once we've switched to a working fallback, how do we know when to change back? And how would you know your main master failed, unless it's monitored by the fallbacks (and the fallbacks monitor each other), which would then lead to a complicated zone configuration on the nodes?

I think this is a much bigger feature with harsh implications for our code and users; a rabbit hole, so to say.

@lazyfrosch
Contributor

Please consider:

  1. We need to support dual-stack (we should try AAAA and A on connections)
  2. We should allow using IP addresses
  3. Other tools like carbon-relay allow multiple destination addresses per route

Implementation is relatively simple: try to connect to each address until a connection can be established.

This is similar to what dual-stack usually does.

Config should be straightforward: host = [ "1.1.1.1", "fe80::111" ]
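The dual-stack behavior described in point 1 can be illustrated with a Python sketch (hedged, not Icinga 2 code; `connect_dual_stack` is a made-up name, 5665 is the usual Icinga 2 cluster port): `getaddrinfo` with `AF_UNSPEC` returns both AAAA and A results, and each is tried until one connects.

```python
import socket

def connect_dual_stack(host, port=5665, timeout=5.0):
    """Resolve host to all its addresses (IPv6 and IPv4) and try each in turn."""
    last_error = None
    # AF_UNSPEC asks for both address families, in resolver order
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_error = err  # this address failed; close and try the next one
            sock.close()
    raise ConnectionError(f"{host}: no resolved address reachable ({last_error})")
```

This is essentially a simplified form of what dual-stack clients ("happy eyeballs" aside) already do when a name has both AAAA and A records.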

@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Apr 5, 2018
@dnsmichi dnsmichi added discussion TBD To be defined - We aren't certain about this yet queue/wishlist and removed TBD To be defined - We aren't certain about this yet discussion labels May 7, 2019
@Al2Klimov
Member

IMAO an admin can use DNS or an HAProxy LB.

@N-o-X What do you think?

@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label May 19, 2020
@N-o-X
Contributor

N-o-X commented May 20, 2020

@Al2Klimov yes, although I like the idea of having such a feature, I don't think this is currently worth our time. We can still keep this on our wish list though.

@N-o-X N-o-X removed their assignment May 20, 2020
@N-o-X N-o-X removed the needs feedback We'll only proceed once we hear from you again label May 20, 2020
Al2Klimov added a commit that referenced this issue Nov 3, 2020
@Al2Klimov Al2Klimov removed the TBD To be defined - We aren't certain about this yet label Nov 3, 2020
@Al2Klimov Al2Klimov self-assigned this Nov 3, 2020
@Al2Klimov Al2Klimov linked a pull request Nov 3, 2020 that will close this issue
Al2Klimov added a commit that referenced this issue Nov 3, 2020
Al2Klimov added a commit that referenced this issue Nov 3, 2020
Al2Klimov added a commit that referenced this issue Dec 14, 2020
... for the case that #host isn't reachable.

refs #5976
@Al2Klimov
Member

I'd really not want to use DNS round robin for this because I don't have any control over which connection to use.

@widhalmt Oh, you DO have control. Recently I've set up an authoritative NS. ($ dig NS kli.mov) E.g. BIND, IIRC, even has an option for whether to shuffle answers or not. Icinga should try the addresses in the order they appear.

Have I already mentioned iptables can also do LB?
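The BIND option alluded to above is presumably `rrset-order` in `named.conf`, which controls whether multiple address records in an answer are rotated between responses or returned in a fixed order (a config sketch, not taken from this thread):

```
options {
    # Return multiple A/AAAA records in a fixed order instead of
    # rotating (round-robining) them between responses.
    rrset-order { order fixed; };
};
```

With a fixed order, clients that try addresses in the order received would consistently prefer the first record, matching the "primary plus fallbacks" idea discussed earlier.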

@gvde

gvde commented Jun 7, 2023

Icinga should try the addresses in the order they appear.

Does it try all addresses (i.e. all A and AAAA RRs, directly or via CNAME), or does it try only the first one? I have seen many applications which take the first address they find and, if that isn't working, fail...

@Al2Klimov
Member

It tries all of them, admittedly even "too fanatically":

@Al2Klimov
Member

@julianbrost Shall we just put a full stop on this discussion and close this issue + PR? DNS can provide the LB, and IMAO the LB should shuffle/rotate the addresses in its responses.

7 participants