
Allow to give more than one address for an Endpoint #5976

Open
widhalmt opened this issue Jan 14, 2018 · 17 comments · May be fixed by #8433
Labels
area/distributed Distributed monitoring (master, satellites, clients) enhancement New feature or request queue/wishlist

Comments

@widhalmt
Member

Hi,

While I know split brain is not as critical for Icinga 2 instances as it is for other highly available software, there are scenarios where you definitely don't want two nodes writing different status information into your IDO database.

Many large enterprises and some well-designed smaller networks have multiple networks connecting their servers, e.g. an administrative network, a backup network, a direct link between nodes, and so on.

It would be nice to make use of such extra connections to more precisely determine the status of other endpoints.

Cheers,
Thomas

@widhalmt
Member Author

Since I was asked in private messages I want to add some more information to clarify what this is all about.

Say your Icinga 2 masters are multihomed hosts. They have a NIC in the production LAN to connect to satellites and agents, one in an "admin LAN" used by ops people to connect to the nodes, and one in a "cluster LAN" used for cluster communication.

Say further the firewall people are a bit hung over and cut the connection between your data centers on the production LAN. There you go: you have a split-brain scenario where both masters try to check your hosts and write their findings into the IDO and a grapher, which will lead to a messed-up history.

What I want is to be able to add the IP addresses from all 3 networks to the endpoints, so that when the main link is broken Icinga 2 will communicate over the cluster network to avoid split brain. Icinga 2 will work like nothing happened, but can alert you that one of the three configured connections went down.

@widhalmt
Member Author

Basically, what I want is:

object Endpoint "master02" {
    host = [ "192.168.23.15", "192.168.69.15" ]
}

So, when one link goes down the nodes can still communicate over the other.
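The connection logic this array implies could look like the following minimal Python sketch (hypothetical helper, not Icinga 2 code; 5665 is Icinga 2's default cluster port, the addresses passed in are whatever the config lists): try each configured address in order and use the first one that accepts a TCP connection.

```python
import socket

def connect_endpoint(addresses, port=5665, timeout=5.0):
    """Try each endpoint address in order; return the first working socket."""
    last_error = None
    for addr in addresses:
        try:
            # create_connection() handles IPv4 and IPv6 literals and hostnames
            return socket.create_connection((addr, port), timeout=timeout)
        except OSError as err:
            last_error = err  # remember the failure, fall through to the next address
    raise ConnectionError(f"no configured address reachable: {last_error}")
```

Trying the addresses strictly in array order would also give a natural way to treat the first entry as "primary" and the rest as fallbacks.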

@Crunsher Crunsher added enhancement New feature or request area/distributed Distributed monitoring (master, satellites, clients) labels Jan 15, 2018
@Crunsher
Contributor

Just to be certain, both IPs point to the same Icinga 2 instance? I'm worried about the fallout of one endpoint being two different installations, which would be awkward to keep in sync.

@lazyfrosch
Contributor

Hm, we basically should have the same behavior for dual-stack v4/v6 (only then based on DNS).

@dnsmichi
Contributor

This will be really hard to debug. I would rather use DNS and round-robin returned addresses based on routing hops.

@dnsmichi dnsmichi added the needs feedback We'll only proceed once we hear from you again label Jan 15, 2018
@widhalmt
Member Author

@Crunsher : Yes, both to the same Icinga 2 instance. I'd use the certificate's DN to verify that we connected to the same instance on all connections, maybe some more information just to be really sure. This could be a use for the cluster or cluster-zone check: to show that we lost one connection, or that all connections are active but, because of a typo, connected to different instances.

@dnsmichi : I'd really not want to use DNS round robin for this because I don't have any control over which connection to use. A detail that would help with this issue would be a way to mark one connection as primary and the others as fallback. Use the first one in the array? Or a separate option. Besides, I can't imagine users creating DNS entries that map to hosts in both a production and a cluster network.

@dnsmichi
Contributor

dnsmichi commented Jan 15, 2018

I don't think that replacing the current string attribute with an array of strings is a possible migration route either (even if this gets implemented somehow). IP addresses have another problem: if they're renumbered, who checks the monitoring system and updates them? In my opinion, you'll end up with many retries against unreachable addresses, and your cluster won't really work then.

In previous issues we had multiple connections opened per endpoint/client, and we removed that to make the socket I/O as performant as possible. This feature request would slow that down and make it complicated again.

@widhalmt
Member Author

Then we could have another option for endpoints. Say:

object Endpoint "master02" {
    host = "192.168.23.15"
    fallback_hosts = [ "192.168.66.13", "192.168.69.15" ]
}

If you don't want to rely on IP addresses, just use hostnames.

object Endpoint "master02" {
    host = "monitor02.example.com"
    fallback_hosts = [ "monitor02.adm.example.com", "monitor02.bkp.example.com" ]
}

Besides this very issue: I tend to tell users to use IP addresses for Endpoints and Hosts. While it might be more common to change IPs than to have DNS fail completely, changing IPs is usually planned, while a DNS outage might hit you unexpectedly; a major problem like that calls for monitoring to keep an eye on what is still working.

@dnsmichi
Contributor

I still think this overly complicates everything, and solves a niche problem where other tools are doing better. A clear 'no' from my side.

@Crunsher
Contributor

The more I think about this the more problems I see with this idea :/

What about recovery, for example: once we've switched to a working fallback, how do we know when to change back? And how would you know your main master failed, unless it's monitored by the fallbacks (and the fallbacks monitor each other), which would then lead to a complicated zone configuration on the nodes?

I think this is a much bigger feature with harsh implications for our code and users; a rabbit hole, so to say.

@lazyfrosch
Contributor

Please consider:

  1. We need to support dual-stack (we should try AAAA and A on connections)
  2. We should allow using IP addresses
  3. Other tools like carbon-relay allow multiple destination addresses per route

Implementation is relatively simple: try to connect to each address until a connection can be established.

This is similar to what dual-stack usually does.

Config should be straightforward: host = [ "1.1.1.1", "fe80::111" ]
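The dual-stack behavior described in point 1 can be illustrated with a Python sketch (hedged, not Icinga 2 code; `connect_dual_stack` is a made-up name, 5665 is the usual Icinga 2 cluster port): `getaddrinfo` with `AF_UNSPEC` returns both AAAA and A results, and each is tried until one connects.

```python
import socket

def connect_dual_stack(host, port=5665, timeout=5.0):
    """Resolve host to all its addresses (IPv6 and IPv4) and try each in turn."""
    last_error = None
    # AF_UNSPEC asks for both address families, in resolver order
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as err:
            last_error = err  # this address failed; close and try the next one
            sock.close()
    raise ConnectionError(f"{host}: no resolved address reachable ({last_error})")
```

This is essentially a simplified form of what dual-stack clients ("happy eyeballs" aside) already do when a name has both AAAA and A records.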

@dnsmichi dnsmichi removed the needs feedback We'll only proceed once we hear from you again label Apr 5, 2018
@dnsmichi dnsmichi added discussion TBD To be defined - We aren't certain about this yet queue/wishlist and removed TBD To be defined - We aren't certain about this yet discussion labels May 7, 2019
@Al2Klimov
Member

IMAO an admin can use DNS or an HAProxy LB.

@N-o-X What do you think?

@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label May 19, 2020
@N-o-X
Contributor

N-o-X commented May 20, 2020

@Al2Klimov yes, although I like the idea of having such a feature, I don't think this is currently worth our time. We can still keep this on our wish list though.

@N-o-X N-o-X removed their assignment May 20, 2020
@N-o-X N-o-X removed the needs feedback We'll only proceed once we hear from you again label May 20, 2020
Al2Klimov added a commit that referenced this issue Nov 3, 2020
@Al2Klimov Al2Klimov removed the TBD To be defined - We aren't certain about this yet label Nov 3, 2020
@Al2Klimov Al2Klimov self-assigned this Nov 3, 2020
@Al2Klimov Al2Klimov linked a pull request Nov 3, 2020 that will close this issue
Al2Klimov added a commit that referenced this issue Nov 3, 2020
Al2Klimov added a commit that referenced this issue Nov 3, 2020
Al2Klimov added a commit that referenced this issue Dec 14, 2020
... for the case that #host isn't reachable.

refs #5976
@Al2Klimov
Member

I'd really not want to use DNS round robin for this because I don't have any control over which connection to use.

@widhalmt Oh, you DO have control. Recently I've set up an authoritative NS. ($ dig NS kli.mov) E.g. BIND, IIRC, even has an option for whether to shuffle answers or not. Icinga should try the addresses in the order they appear.

Have I already mentioned iptables can also do LB?
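The BIND option alluded to above is presumably `rrset-order` in `named.conf`, which controls whether multiple address records in an answer are rotated between responses or returned in a fixed order (a config sketch, not taken from this thread):

```
options {
    # Return multiple A/AAAA records in a fixed order instead of
    # rotating (round-robining) them between responses.
    rrset-order { order fixed; };
};
```

With a fixed order, clients that try addresses in the order received would consistently prefer the first record, matching the "primary plus fallbacks" idea discussed earlier.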

@gvde

gvde commented Jun 7, 2023

Icinga should try the addresses in the order they appear.

Does it try all addresses (i.e. all A and AAAA RRs, directly or via CNAME), or does it try only the first one? I have seen many applications which take the first address they find and, if that isn't working, fail...

@Al2Klimov
Member

It tries all of them, admittedly even "too fanatically":

@Al2Klimov
Member

@julianbrost Shall we just put a full stop on this discussion and close this issue + PR? DNS can provide the LB, and IMAO the LB should shuffle/rotate the addresses in its responses.

7 participants