Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleetctl: optimize stop/unload commands #1344

Closed
wants to merge 1 commit into from

Conversation

iamruinous
Copy link

Use the Unit API instead of Units API to speed up lookups on
environments with a large number of distinct services.

Fixes #1343

    Use the Unit API instead of Units API to speed up lookups on
    environments with a large number of distinct services.

    Fixes coreos#1343
@wuqixuan
Copy link
Contributor

wuqixuan commented Sep 6, 2015

Cool modification.

@mischief
Copy link
Contributor

lgtm, @jonboulle ?

@jonboulle
Copy link
Contributor

I'm slightly wary of this because presumably there's some threshold where calling Units() is more efficient than all the round trips of individual look ups - perhaps we could have a sanity limit?

@wuqixuan
Copy link
Contributor

@jonboulle , yes cAPI.Units are very useful in some scenarios. So we should not remove cAPI.Units. We just remove findUnits. In this case, absolutely, no need to call cAPI.Units.
So, lgtm for this solution.

@jonboulle
Copy link
Contributor

@wuqixuan I don't really understand; my point is that in the case of someone running, say, fleetctl stop my_service{1..200}, this will now involve 200 x RPCs to fleet instead of one.

@wuqixuan
Copy link
Contributor

@jonboulle , I got what you mean. Maybe you guess, if the number of units is bigger than a certain one, the individual look ups method will be slower than the SINGLE Units() RPC. That's why you suggested to have a limitation check. Some case use Units(), some case use individual look ups.

I think some stress testing can be tested. If it's ok for the individual look ups method, that's fine. If the case is like what you said, maybe we can have a better algorithm.

@jonboulle
Copy link
Contributor

Right, that is what I meant. I also wonder how this will be affected by #1260 - perhaps that will obviate it?

@jonboulle jonboulle added this to the v0.13.0 milestone Feb 10, 2016
@tixxdz
Copy link
Contributor

tixxdz commented Feb 29, 2016

@jonboulle @wuqixuan @iamruinous @mischief yes this needs more thoughts, for me this doesn't look the best way to go. It seems that this is the same logic that was used in this patch that was merged: #1341 , that patch removed two extra loops that were iterating over all units! So was it really an optimization of multiple "Unit()" calls vs one "Units()" call or is it due to those extra loops ?

The right thing to do is to fetch all units at the beginning and filter them, with one RPC call "Units()" then have one loop do the real operation, state check or whatever and you have to deal with races in case a unit disappear or does not have the expected state and so on... that's part of fleet design so just check or ignore those errors and go on, and later if you want to wait for the desired state you will have another loop that waits for all previous units.

So in this case removing that Units() call and replacing it with hard coded cAPI.Unit() inside a loop does not seem a real optimization, unless I'm missing something ? could we please have some benchmark numbers ? In that case yes we have to change it and merge the patches.

Thank you!

@kayrus
Copy link
Contributor

kayrus commented Feb 29, 2016

@jonboulle I agree with @tixxdz. There should be one loop for the real operations.

@jonboulle
Copy link
Contributor

So in this case removing that Units() call and replacing it with hard coded cAPI.Unit() inside a loop does not seem a real optimization, unless I'm missing something ?

Clearly it's not this simple, or this patch would have never been submitted. Per #1344 (comment) there is presumably a point at which single calls become more efficient.

@tixxdz can you try some basic benchmarking to get some more information?

@tixxdz
Copy link
Contributor

tixxdz commented Feb 29, 2016

@jonboulle yes I'm on it, I've identified also some call sites, we clearly need to reduce of course the number of loops by hooking some callback functions to directly operate on single units that were retrieved with one single Units() call. Yes I'll do some benchmarking and try to write that as generic as possible. This needs to be done before the overwrite #1295 stuff since with that we have to operate on different units that are in different states, so for the moment I'm on this one. Thanks

@tixxdz
Copy link
Contributor

tixxdz commented Feb 29, 2016

One thing also if we do that single Unit() call this just moves the real problem from the client to the agent which in return add more processing and load on and so on... we clearly need a better way here. The trade off should be on the client side

@antrik
Copy link
Contributor

antrik commented Feb 29, 2016

@tixxdz although I haven't tried actually profiling this, I'm fairly confident the performance has got nothing to do with how many loops the client does -- the cost is most likely pretty much entirely in the RPCs to the server. So the actual question at hand is when it's cheaper to do a single Units() RPC that retrieves all units (including those we are not interested in), and when it's cheaper to do individual Unit() RPCs for each of the units we are interested in.

The problem this PR is trying to address is that if there is a huge amount of units present, the Units() call will always be very expensive, even if we are actually interested in only one of them. If on the other hand we indeed want to operate on a large percentage of all units at once, retrieving all say 1000 units in one call might still be cheaper than getting only the 500 we are interested in one by one.

However, creating a heuristic for that would be hairy; and quite frankly, the latter situation seems like a niche case -- I presume that in an overwhelming majority of cases, people will only operate on few units, in which case this patch provides a huge boost; while for the rare cases where a request really covers a large percentage of all units, it would be slow anyways, and the extra slowdown doesn't really change the overall picture all that much... So IMHO that should be fine to merge as is.

If we really want to provide for all eventualities, the right solution seems introducing a new RPC that takes a list of units, and retrieves just those in the list. However, until someone actually has a use case where this really matters, I'd tend to consider it overkill...

@kayrus
Copy link
Contributor

kayrus commented Mar 1, 2016

Here is the script which submits several units into fleet.

@tixxdz
Copy link
Contributor

tixxdz commented Apr 6, 2016

@iamruinous would it be possible if you could provide some benchmark comparisons ? currently Units() performs only one RPC call to retrieve recursively the list of all units, same thing for job entries, so the pressure will be a little bit on etcd to do extra I/O operations... the difference is not that high only if etcd I/O operations take longer...
Having a comparison here may help, and ideally as noted by others something smart enough to chose which path to take based on the number of units to query will be the best solution.

@tixxdz
Copy link
Contributor

tixxdz commented Apr 8, 2016

@jonboulle

Clearly it's not this simple, or this patch would have never been submitted. Per #1344 (comment) there is presumably a point at which single calls become more efficient.
@tixxdz can you try some basic benchmarking to get some more information?

So for Unit() vs Units() basic benchmarking with 500 units from a client perspective fleetctl status (printing to output + ssh connection...) showed up:
real 0m1.621s
user 0m0.026s
sys 0m0.013s

real 0m0.721s
user 0m0.021s
sys 0m0.012s

Not sure if this is enough... but Units() API currently performs only one RPC for all units, it instructs etcd to perform a recursive call into /unit/ , same thing for /job/
https://github.com/coreos/fleet/blob/master/registry/job.go#L92
https://github.com/coreos/fleet/blob/master/registry/job.go#L104

Why I think just using Units() is ok for now: it allows us to have the same consistency through all code paths, same error checks, and it doesn't take that much unless etcd disk I/O operations are heavy and there is lot of information to transfer (data fragmentation... etc) which in that case using Unit() to query a small number of units would be fine.

For me and as all the comments here have suggested: the approach is to have concluding benchmark results, do the change in a high level function (code consistency) and make it smart enough so it can switch from Unit() to Units() based on the supplied units.

@antrik
Copy link
Contributor

antrik commented Apr 8, 2016

For the record, I stand by my earlier analysis: I think this is a good change, heavily optimising the common case in exchange for a smaller regression in the less common case. (And a case distinction seems problematic, as it depends on the number of existing units, not only the number of units affected by the current command...)

Regarding consistency, I do agree in a way: it should be changed to use Unit for all the commands where this consideration applies.

@tixxdz
Copy link
Contributor

tixxdz commented Apr 8, 2016

@antrik could you please confirm with some benchmark results etc... ?

@kayrus
Copy link
Contributor

kayrus commented Apr 19, 2016

I suppose optimization was already implemented within these PRs: #1376, #1433, #1520

@antrik
Copy link
Contributor

antrik commented Apr 19, 2016

@kayrus well, from a quick glance, it seems #1376 somewhat mitigates the cost of doing a Units() call, by avoiding making individual calls for each unit in addition to doing the initial lookup of all units? This means that this PR here would have been an even clearer win before (not a tradeoff like it is now) -- but the discussions above apply to the code as it is now.

@tixxdz tixxdz added this to the v1.0.0 milestone Apr 25, 2016
@dongsupark dongsupark force-pushed the master branch 2 times, most recently from 14580b0 to 63b20dc Compare June 22, 2016 10:22
@dongsupark dongsupark force-pushed the master branch 3 times, most recently from 23d4b13 to 6fb1256 Compare July 1, 2016 11:07
@dongsupark dongsupark force-pushed the master branch 2 times, most recently from 54409ab to a1a21b8 Compare July 8, 2016 11:38
@dongsupark dongsupark force-pushed the master branch 3 times, most recently from 150d30c to 4002bf5 Compare August 16, 2016 08:54
@dongsupark dongsupark force-pushed the master branch 7 times, most recently from 3b60e93 to 875d938 Compare August 23, 2016 11:00
@dongsupark dongsupark force-pushed the master branch 2 times, most recently from bdc94e8 to fa5aa3a Compare August 30, 2016 12:26
@dongsupark dongsupark force-pushed the master branch 3 times, most recently from 20a3e96 to 3aaa1ab Compare November 10, 2016 15:24
@dongsupark dongsupark force-pushed the master branch 2 times, most recently from eb6872f to 365565e Compare November 24, 2016 15:35
@dongsupark
Copy link
Contributor

I did some benchmarks. (time duration, in seconds)

  1. number of units == 100

1.1. stop

  • Without the patch : 3.46
  • With the patch : 2.97

1.2. unload

  • Without the patch : 2.78
  • With the patch : 2.83

With the patch it's better.

  1. number of units == 300

2.1. stop

  • Without the patch : 11.00
  • With the patch : 9.89

2.2. unload

  • Without the patch : 7.04
  • With the patch : 6.91

It's nearly the same, with the patch it's slightly better.

  1. number of units == 500

3.1. stop

  • Without the patch : 36.31
  • With the patch : 41.70

3.2. unload

  • Without the patch : 11.78
  • With the patch : 12.01

Without the patch it's better.

When the number of units is higher than 500, it's always better without the patch. So I'd assume, we need to get data only up to 500 units.
So the turning point would be around 300 units. Technically speaking, it's better to change to Unit() when it's less than 300, while it's better to keep Units() when it's more than 300.
However, most of scalability issues occur when there are huge number of units, more than 1k, or even 10k units. So in the long run, it's definitely better to keep Units() call.

Do we need to set a threshold to toggle a some kind of Units() mode, around 300 units? Doable, but I don't think it's worth doing that.

So I suggest to not merge this PR. Let's just close.

(I also tried to introduce an internal cache for units, to optimize performance in every case. That didn't bring anything, but other unexpected side-effects. Let's not go that way..)

@dongsupark
Copy link
Contributor

I'll close it. If anyone has other ideas, please let me know.

@jonboulle
Copy link
Contributor

Thanks for the detailed analysis!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants