Using the Statistics info in etcd to support using all nodes of the cluster #9

drusellers opened this issue May 1, 2014 · 27 comments


@drusellers
Owner

The goal here is that if server A is down and we know that there are also servers B and C, we can fall back to those.

@smalldave
Contributor

Just looking at this.
So get the members of the cluster from the statistics?
How / when does the node list get updated?

@drusellers
Owner Author

Thoughts right now.

  • Expose a method that will allow us to 'store' a list of these nodes - one that users and we ourselves can call when needed to update this internal 'registry'
  • work to get our code to connect 'based' on this 'registry'
  • use this for a bit, see how it plays

Later

If we can figure out where we are putting our calls to this new 'method', use that to inform us of where to put it in the codez. I'm guessing when we get a 'cannot connect' type exception and we start connecting to a new node; once we have, we re-up the list.

But I really think step one is just keeping the initial list of options and seeing how that goes. We can make it smarter once this is baked and we are happy.
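
Something like this is the rough shape I have in mind for the 'registry' (just a sketch - NodeRegistry and its members are made-up names, not actual Etcetera code):

```csharp
// Illustrative sketch only - NodeRegistry and its members are made-up names,
// not part of the actual Etcetera API.
using System;
using System.Collections.Generic;

public class NodeRegistry
{
    readonly object _sync = new object();
    List<Uri> _nodes;

    public NodeRegistry(IEnumerable<Uri> seedNodes)
    {
        _nodes = new List<Uri>(seedNodes);
    }

    // Users (or the client itself, e.g. after reading the statistics endpoint)
    // call this to refresh the known cluster members.
    public void Update(IEnumerable<Uri> nodes)
    {
        lock (_sync)
            _nodes = new List<Uri>(nodes);
    }

    // Connection code works off a snapshot of the current registry.
    public IList<Uri> Snapshot()
    {
        lock (_sync)
            return new List<Uri>(_nodes);
    }
}
```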

@smalldave
Contributor

Ok. So list of nodes in the constructor and the ability to adjust that list afterwards?
To start with should we just round robin through all nodes in list order regardless of previous success or failure?
This won't be particularly efficient if the first node in the list is out of the cluster.
There is also load balancing to consider but I guess we should leave that for now.
A simple implementation is likely to put all the load on one node though.

@drusellers
Owner Author

My initial concern is less about 'load' and more about when a failure is detected -> shift gears to another node. Keep it simple and wait for actual issues to come up before we get all fancy on it. :)

@smalldave
Contributor

Agreed. I'm some of the way to this already. Needs a bit of work.

@drusellers
Owner Author

:)

@smalldave
Contributor

Any thoughts on testing this?

@drusellers
Owner Author

Not yet. I will as I write the code, but the big tests would be something like:

  • given 2 servers A and B
  • start at A
  • should connect to A
  • make A fail
  • should try A - get error
  • should shift to B
  • connect to B
  • get success
  • future calls go to B
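
As a sketch (xunit-style, with made-up EtcdClient / StopNode helpers rather than the real fixtures), that scenario might look like:

```csharp
// Sketch of the failover scenario above. EtcdClient, StopNode and the
// key API used here are stand-ins, not the actual test fixtures.
using System;
using Xunit;

public class FailoverTests
{
    [Fact]
    public void Falls_back_to_node_B_when_node_A_goes_down()
    {
        var nodeA = new Uri("http://127.0.0.1:4001");
        var nodeB = new Uri("http://127.0.0.1:4002");
        var client = new EtcdClient(nodeA, nodeB);            // given 2 servers A and B

        client.Set("failover-test", "value");                 // start at A, should connect to A

        StopNode(nodeA);                                       // make A fail

        var value = client.Get("failover-test");              // should try A, get error, shift to B
        Assert.Equal("value", value);                          // connect to B, get success

        Assert.Equal("value", client.Get("failover-test"));   // future calls go to B
    }

    static void StopNode(Uri node)
    {
        // stand-in: however the test environment kills the etcd process at 'node'
    }
}
```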

@smalldave
Contributor

Current tests require one etcd node. Will this require multiple?

@smalldave
Contributor

or do something more mocky?

@smalldave
Contributor

The response back from a server that is down is slow. If that is first in the list then the response is very slow every time - worse if multiple servers are down.
Unless we can speed up that initial response I guess we need something more sophisticated than a round robin every time.
Maybe put failed nodes to the bottom of the list?

@drusellers
Owner Author

So we could have a 'composite' EtcdClient to start with - something like a 'FailOverEtcdClient'. You can seed it with multiple nodes, but it could keep an ordered list or queue. We can peek at the top one and use that. If it ever fails we keep popping and re-queueing until we find one that works. Then whenever we find new ones we just 'enqueue' them. Later we can add more sophisticated policies like 'slowness' and such. That keeps the core EtcdClient code nice and simple too.
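
A bare-bones sketch of that queue idea (illustrative names only, not the real client):

```csharp
// Illustrative sketch of the ordered-queue idea; names here are made up.
using System;
using System.Collections.Generic;

public class FailOverEtcdClient
{
    readonly Queue<Uri> _nodes;

    public FailOverEtcdClient(params Uri[] seedNodes)
    {
        _nodes = new Queue<Uri>(seedNodes);
    }

    // Newly discovered nodes (e.g. from the statistics endpoint) just get enqueued.
    public void AddNode(Uri node)
    {
        _nodes.Enqueue(node);
    }

    // Peek the head node and use it; on failure pop it, re-queue it at the back,
    // and try the next one until something works.
    public T Execute<T>(Func<Uri, T> call)
    {
        for (var attempts = 0; attempts < _nodes.Count; attempts++)
        {
            var node = _nodes.Peek();
            try
            {
                return call(node);
            }
            catch (Exception)   // e.g. a 'cannot connect' type failure
            {
                _nodes.Enqueue(_nodes.Dequeue());
            }
        }

        throw new InvalidOperationException("No etcd node responded successfully.");
    }
}
```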

@smalldave
Contributor

I've got something kind of working and I'm looking at unit testing.
I've got a cluster going. When all the nodes in the cluster are up the unit tests seem to pass.
However when I take the first node out of the cluster, tests start failing randomly (not consistently).
I've checked the node retry and that seems to work fine. In fact I've taken out the retry and hard coded the node and that had the same problem.
It looks like etcd is returning before actually having written to the node.
I realise writes to the cluster aren't consistent but I would have thought that writes and reads to the same node were consistent?
I guess I'm wrong?
Also the unit tests rely on this.
I'll have a read but do you know anything about this?

@drusellers
Owner Author

Nope. You are ahead of me here, but my guess is that the gossip protocol takes a second to sync everything. Might add a wait in the test to see how big it has to be to confirm. ???

@smalldave
Contributor

Yep. Thread.Sleep does it again. I'll commit that now :)
I'll have a read. Hope this is by design.
May need to take a different approach with the tests, or perhaps it's fine to assume it will appear consistent with a single node.
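
One possible different approach for the tests is to poll until the value shows up rather than sleep for a fixed time - a sketch, assuming a hypothetical client.Get:

```csharp
// Sketch only: poll for the expected value instead of a fixed Thread.Sleep.
// EtcdClient.Get is a stand-in for whatever read call the tests use.
using System;
using System.Threading;

static class EtcdTestWait
{
    public static void WaitForValue(EtcdClient client, string key, string expected, TimeSpan timeout)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                if (client.Get(key) == expected)
                    return;                 // the write has become visible on this node
            }
            catch (Exception)
            {
                // node may still be catching up or briefly unreachable
            }
            Thread.Sleep(100);              // short pause between polls
        }
        throw new TimeoutException("Key '" + key + "' never reached the expected value.");
    }
}
```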

My understanding is Raft isn't a gossip protocol?

@drusellers
Owner Author

Derp. Just that it's distributed, and it may need time to percolate. :)

@smalldave
Contributor

Answered by the coreos guys.
Writes to the leader are consistent, otherwise not. I wasn't writing to the leader, hence the inconsistent reads.
The current tests are fine as long as we are testing against a single node.
Still not really got my head around how to run tests against a cluster.
If you have some spare time this is worth watching
https://www.youtube.com/watch?v=XiXZOF6dZuE
I'm guessing this is why the leadership and lock modules have been deprecated.

@smalldave
Contributor

I'm looking at spinning up etcd instances as part of testing.
I need to know where the etcd binary is.
I'm wondering about creating a tools directory and putting the binary in there.
I could download it as part of the build but I'm not really sure how that would work (could do it easily with rake, but VS not so much).
Also I've just been looking at building in mono. It all builds nicely after I change the ToolsVersion in the proj files. I'd like to have the option to run tests from rake but again I need to know where the xunit console is.

@smalldave
Contributor

I forgot to reply about the composite EtcdClient. The approach I am taking is to have EtcdClient accept a params array of URIs in its constructor. I've got a Cluster class that EtcdClient initialises. This has a list of nodes to iterate over and a method to demote (needs a less loaded name) a node if you think it's iffy. makeXRequest then just loops over the nodes until it succeeds, demoting any nodes that fail.
If there are no successful responses it throws an error.
The makeXRequest methods would need consolidating but it seems fairly straightforward so far.
I'll commit to a branch of my fork shortly so you can have a look.
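
Roughly the shape of what I mean (a sketch with approximate names, not the code I'll actually commit):

```csharp
// Rough sketch of the Cluster idea described above; approximate names only.
using System;
using System.Collections.Generic;

public class Cluster
{
    readonly List<Uri> _nodes;

    public Cluster(params Uri[] uris)
    {
        _nodes = new List<Uri>(uris);
    }

    // 'Demote' (name needs work) pushes a suspect node to the back of the list.
    public void Demote(Uri node)
    {
        if (_nodes.Remove(node))
            _nodes.Add(node);
    }

    // makeXRequest-style calls loop over the nodes until one succeeds,
    // demoting any node that fails; if none succeed, throw.
    public T Execute<T>(Func<Uri, T> request)
    {
        foreach (var node in new List<Uri>(_nodes))
        {
            try
            {
                return request(node);
            }
            catch (Exception)
            {
                Demote(node);
            }
        }

        throw new InvalidOperationException("No successful response from any node in the cluster.");
    }
}
```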

@smalldave
Contributor

Here is what I have so far
https://github.com/smalldave/etcetera/commit/b038f219ce774a17bab4c18664f8a2b6cdab2d1b
Ideally there'd be something in EtcdClient very much like makeKeyRequest but with the key / lock path already appended. This could be shared between the EtcdClient and the EtcdLockModule. I'm not sure how to achieve that though without adding something public to EtcdClient.
I've chosen to force users of the stats module to specify the node they are talking about - does that seem reasonable?

@drusellers
Owner Author

re: etcd binary - I'd be ok with that. A tools dir seems appropriate.
re: rake 👍 (msbuild 👎 )

@drusellers
Owner Author

Put another interface on the EtcdClient and have it implement it 'explicitly', then have the 'modules' take the client in as that interface. That way they will see the method but nothing else will.
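
Something along these lines (sketch only - the interface and method names are made up):

```csharp
// Sketch of the explicit interface idea; names here are illustrative.
public interface IRawEtcdRequests
{
    string MakeRawRequest(string path);
}

public class EtcdClient : IRawEtcdRequests
{
    // Explicit implementation: only visible when the client is referenced
    // through IRawEtcdRequests, so it doesn't show up on EtcdClient itself.
    string IRawEtcdRequests.MakeRawRequest(string path)
    {
        // ... shared key / lock request plumbing would live here ...
        throw new System.NotImplementedException();
    }
}

public class EtcdLockModule
{
    readonly IRawEtcdRequests _client;

    // The module takes the client in as the narrow interface.
    public EtcdLockModule(IRawEtcdRequests client)
    {
        _client = client;
    }
}
```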

@smalldave
Contributor

nice. like that.

@smalldave
Contributor

Problem with the tools directory is that the binary needed is dependent on OS.

@drusellers
Owner Author

I find that Windows people tend to need more help than the Unix folks. Maybe just check in one for Windows? Or just don't worry about it.

@smalldave
Contributor

My new plan is to use vagrant to spin up 3 machines with known IP addresses and then manipulate them over ssh (where necessary) as part of the tests.
The Vagrantfile can go in source control and will just download the released version of etcd for now.
That is OS independent and allows me to test the cluster.

@drusellers
Owner Author

+1'd
