Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[performance] Avoid linear walk of graph children. #2959

Merged
merged 8 commits into from
Feb 6, 2020

Conversation

ejgallego
Copy link
Collaborator

@ejgallego ejgallego commented Dec 9, 2019

When calling Dune in scenarios where targets have a large number of
deps, Dune will take a long time to start. A common case is when
depending on (package coq), which brings into the DAG a few thousand
files as children of a single node.

perf data show this is due to the linear walk in Dag.is_child;
indeed, doing a naive replacement of the list for a more efficient
access structure solves the problem. For example a dummy target that depends on (package coq) does:

with this PR:

real	0m1,684s
user	0m1,552s
sys	0m0,128s

with master:

real	0m11,450s
user	0m10,587s
sys	0m0,264s

This PR is just a proof-of-concept and it must not be merged yet; I've
opened it to track the issue and to devise what would the best
solution be.

As far as I can see there are several options, I am not familiar
enough with the cycle detection algo as to propose something now.

@ejgallego ejgallego requested a review from a user December 9, 2019 16:27
@ejgallego
Copy link
Collaborator Author

@corwin-of-amber when this is ready it should also resolve the bad build latency you were suffering in jscoq.

@ejgallego
Copy link
Collaborator Author

cc @aalekseyev @snowleopard

@ghost
Copy link

ghost commented Dec 11, 2019

@ejgallego is there more to do on this PR? Appart from updating the tests.

@ejgallego
Copy link
Collaborator Author

@ejgallego is there more to do on this PR? Appart from updating the tests.

I am not sure the current fix is the right one; it was just done as a POC.

Note that we are casting from the map to a list often so this PR as-is could introduce performance regressions in other areas.

IMHO these castings should be avoided, I can take care of that but it will take me some time as I need to understand well that part of the code; feel free to take over.

@aalekseyev
Copy link
Collaborator

I thought that maybe this doesn't affect asymptotic time complexity if the call sites are already linear-time. Well, it's not clear that they're linear time: the list returned by get_outgoing is used in interruptible_fold which interrupts after expending some amount of "fuel".

An easy way to get rid of this potential complexity increase is to keep both the list and the map.

@ejgallego
Copy link
Collaborator Author

An easy way to get rid of this potential complexity increase is to keep both the list and the map.

That's what I was thinking as a first fix; but I didn't go that route as maybe it would be worth to look into this with more detail and avoid such duplication if possible.

@ghost
Copy link

ghost commented Dec 12, 2019

What about implementing interruptible_fold with a Map.fold and raise_notrace Exit?

@aalekseyev
Copy link
Collaborator

@diml, yeah, that should work.
(by the way, a paranoid concern is that the asymptotic analysis might rely on the order and we're changing the order, but I'm happy to say that seems unlikely)

@ghost
Copy link

ghost commented Dec 12, 2019

BTW, incremental_cycles is a vendored library. I suggest to submit a PR upstream and discuss this with the author.

@ejgallego
Copy link
Collaborator Author

ejgallego commented Feb 4, 2020

Hi @Armael , after some discussion we were thinking of changing the type of get_outgoing to an id-indexed map instead of a list; that would mean the interruptible_fold in

interruptible_fold
(fun y stack ->
if is_marked g y visited then
(* We found a path to a marked vertex *)
Break y
else
let y_level = get_level g y in
set_parent g y x;
if y_level < new_level then (
set_level g y new_level;
clear_incoming g y;
add_incoming g y x;
Continue (y :: stack)
) else if y_level = new_level then (
add_incoming g y x;
Continue stack
) else
(* y_level > new_level *)
Continue stack)
(get_outgoing g x) stack
would be a fold over a map [instead of the current list, which is walked in reverse addition order]

@Armael , do you think that would matter for incremental cycles? Is the order of the walk important? Thanks!

@ejgallego
Copy link
Collaborator Author

Ok folks, I've reworked the patch; this could be closer to something we could merge. Main point is maybe what to do with get_deps_from_graph_exn (cc: @snowleopard ) , also @Armael may confirm that by changing the fold in incremental_cycles we are not breaking any invariant.

I've reproduced the perf numbers on my own setup [building jsCoq] , would be interesting to understand better the possible performance implications if anyone has a large project.

@Armael
Copy link
Member

Armael commented Feb 4, 2020

Thanks for pinging me.

This discussion is in fact related to a shortcoming of the interface provided by incremental cycles, that is on my todo-list of things to improve, but that I haven't addressed yet.

More concretely, the current specification for the algorithm requires the get_outgoing function provided by the client to return a list of all successors, and do so in constant time. This is quite demanding: this basically requires the client to already have the list at hand.

Would get_outgoing have a complexity linear in the output list, then we would get an incorrect asymptotic complexity for the overall algorithm. This is because the algorithm might need to explore only a part of the list; and constructing the complete list beforehand might be already too expensive. (in particular in the case of a node with many children.)

What I want to implement, and which I believe is the best solution in the long run, is to require get_outgoing to return an iterator ('a Seq.t). This allows both the producer of the iterator and interruptible_fold to be implemented efficiently.

I'll have a look, but it will probably take me a bit of time to get through the proofs.

In the short term, I can see 3 options:

  • Having get_outgoing return a map would work for this particular application; this is quite ad-hoc however, and doesn't solve the general issue: get_outgoing would still be tied to a specific return datatype.
  • Construct and maintain lists of successors along the map, in the graph datatype. This allows you to produce the list in constant time, and satisfies the existing interface for incremental cycles. The downside is the extra bookkeeping required to make sure the map and the list stay in sync.
  • Implement the Seq.t solution. If you want, you can temporarily do the modification in dune (including in the vendored code), with the hope that I eventually prove that it is correct and re-synchronize the vendored code with the upstream.

@Armael
Copy link
Member

Armael commented Feb 4, 2020

A variation on point 3: you could submit the implementation using Seq.t as a PR on the upstream repo, targeting a feature branch. Then use that branch as vendoring target in dune.

@ejgallego
Copy link
Collaborator Author

Hi @Armael , thanks a lot for your quick turnaround,

More concretely, the current specification for the algorithm requires the get_outgoing function provided by the client to return a list of all successors, and do so in constant time.

I think this is (fortunately) the case in dune master and in this PR, we already have the list of successors at hand, the performance problem happens in a different path due to the way we store the successors.

What I want to implement, and which I believe is the best solution in the long run, is to require get_outgoing to return an iterator ('a Seq.t). This allows both the producer of the iterator and interruptible_fold to be implemented efficiently.

It seems the main client is indeed interruptible_fold, so I'm going to try to provide an abstract interface for that; I think that's better than Seq.t, as it introduces a bit more of overhead that it is not needed for quite a few datatypes; the clients could use indeed to implement the interruptible fold using Seq.t instead.

@Armael
Copy link
Member

Armael commented Feb 4, 2020

I think this is (fortunately) the case in dune master and in this PR, we already have the list of successors at hand, the performance problem happens in a different path due to the way we store the successors.

Ah, you do? Currently, the PR seems to be using NodeMap.values which I suspect is linear.

It seems the main client is indeed interruptible_fold, so I'm going to try to provide an abstract interface for that; I think that's better than Seq.t, as it introduces a bit more of overhead that it is not needed for quite a few datatypes; the clients could use indeed to implement the interruptible fold using Seq.t instead.

Ah, I see, we could indeed simply represent the sequence by its elimination principle (!), i.e. interruptible_fold. A few comments:

  • for consistency, we need to mirror whatever change we do for get_outgoing in get_incoming as well (you would get the same issue if you decided to stop representing rev deps as lists);
  • I'm unsure whether using interruptible_fold as specification instead of Seq.t is actually better. It makes the specification very higher-order (of order 3), i.e. the client has to provide a function that adheres to a higher-order specification (of order 2). Thinking how to specify side-effects, etc.
    That's not to say that specifying a Seq.t is easy, though François Pottier has some previous work about that (cf http://gallium.inria.fr/~fpottier/publis/fpottier-hashtable.pdf). But we should think about it carefully.
  • If you were thinking about performance, is Seq.t really inefficient? It might be worth getting some actual measurements.
  • I think it would be a bit nicer if we could move the discussion about API changes in incremental_cycles to its own repo, and do the changes there first (you can do a PR targeting a feature branch; we don't need to think about proofs right away).

@Armael
Copy link
Member

Armael commented Feb 4, 2020

(if you're reluctant using incremental cycles' repo because it's on inria's gitlab, then we could also discuss moving it to github. I don't mind doing that, it's just a bit annoying.)

@ejgallego
Copy link
Collaborator Author

Ah, you do? Currently, the PR seems to be using NodeMap.values which I suspect is linear.

Not in get_outgoing, see

    let get_outgoing _ v = v.info.deps
  • I'm unsure whether using interruptible_fold as specification instead of Seq.t is actually better. It makes the specification very higher-order (of order 3), i.e. the client has to provide a function that adheres to a higher-order specification (of order 2). Thinking how to specify side-effects, etc.

I think the specification would should say the traversal order is not specified; thus side effects are not safe there.

  • If you were thinking about performance, is Seq.t really inefficient? It might be worth getting some actual measurements.

I have scenarios on the order of 100.000 children [and will grow more] so indeed I'd like to avoid reifying into Seq.

See #3081 for a first try; indeed INRIA gitlab is hard to use [I used not to be able to submit a pull request there], the change seems pretty simple to me tho, maybe we can have the discussion here and you cherry pick the commit once we converge?

I'm OK to submit a pull request there tho.

@ejgallego
Copy link
Collaborator Author

PR updated, note that now we pay a bit more of cost when adding a vertex [insertion in the tree vs list cons] we should indeed measure this patch; maybe it is not worth to modify get_incoming if that incurs in a slowdown [given the performance impact of this code we can live with the asymmetry]

@ejgallego ejgallego force-pushed the children_walk branch 3 times, most recently from 02540bd to 359ef70 Compare February 4, 2020 15:41
@Armael
Copy link
Member

Armael commented Feb 4, 2020

I would also be interested in benchmarks of client-provided-interruptible_fold vs Seq.t+generic interruptible_fold. (sure 100.000 is a big number, but we should be wary of micro optimizations)

maybe it is not worth to modify get_incoming if that incurs in a slowdown [given the performance impact of this code we can live with the asymmetry]

Well, I can't ;). That sounds like a micro-optimization at the cost of an inconsistent API tbh (note that you can keep using lists for rev deps while using a more generic API) In any case I think it would be good to have some benchmarks.

@ejgallego
Copy link
Collaborator Author

I need help w.r.t. dune benchmarking, I'm not sure what the current situation is.

@ejgallego
Copy link
Collaborator Author

I think the field should be the set of outgoing edges on the node. I don't think there are any known disadvantages.

Oh yes I misread, thanks for the clarification. I can go this route indeed, I didn't do in the first place just to avoid the memory overhead.

@ejgallego
Copy link
Collaborator Author

After some discussion with @aalekseyev I have added reverted the interface change and added an extra dep set. Will also add some documentation summarizing the discussion and the invariants of the vendored lib.

Note that this PR does modify the complexity of add_raw_edge, @Armael , is that OK?

@rgrinberg , maybe this should go into 2.2.1 ?

@Armael
Copy link
Member

Armael commented Feb 6, 2020

I'm not seeing the changes, did you forget to push, or are they somewhere else?

@ejgallego
Copy link
Collaborator Author

Internet is pretty sketchy here, sorry @Armael

ejgallego and others added 2 commits February 6, 2020 21:01
When calling Dune in scenarios where targets have a large number of
deps, Dune will take a long time to start. A common case is when
depending on `(package coq)`, which brings into the DAG a few thousand
files.

`perf` data show this is due to the linear walk in `Dag.is_child`;
indeed, doing a naive replacement of the list for a more efficient
access structure solves the problem:

```
with this PR:

real	0m1,684s
user	0m1,552s
sys	0m0,128s

with master:

real	0m11,450s
user	0m10,587s
sys	0m0,264s
```

We fix this by adding an efficient representation of `deps` that
allows checking if an edge is already in the graph `log n` time, so
the complexity of `is_child` goes from O(n²) to O(n log(n)).

Note that `raw_add_edge` has also changed complexity from O(1) to
O(log n) due to extra map insertion.

Signed-off-by: Emilio Jesus Gallego Arias <[email protected]>
Signed-off-by: Emilio Jesus Gallego Arias <[email protected]>
Co-authored-by: Arseniy Alekseyev <[email protected]>
Co-authored-by: Armaël Guéneau <[email protected]>
@Armael
Copy link
Member

Armael commented Feb 6, 2020

Thanks, the new version looks good to me.

Wrt the complexity of raw_add_edge, are you referring to the fact that it is now O(logn) vs O(1)? This is not covered by the spec strictly speaking (the spec assumes that it is O(1)), but raw_add_edge is called exactly once per edge insertion, so in the end it is completely fine.
And anyway O(logn) vs O(1) are pretty close in practice.

@ejgallego
Copy link
Collaborator Author

Thanks a lot for all the help @Armael , I propose @aalekseyev gives this a last review and merges if he thinks we have converged. @aalekseyev , I tried to summarize some of the discussion in the README, please amend / let me know if that's enough.

Signed-off-by: Arseniy Alekseyev <[email protected]>
Copy link
Collaborator

@aalekseyev aalekseyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the detailed readme!

Signed-off-by: Arseniy Alekseyev <[email protected]>
@aalekseyev
Copy link
Collaborator

I moved the code that checks the presence of the edge into a new function Dag.add_idempotent. I feel this change is small enough and related enough to make it in the same PR. Please let me know if you disagree and I'll split it into a separate PR.

@ejgallego
Copy link
Collaborator Author

Looks good, thanks @aalekseyev ; I wonder if we should also rename add to add_nonexisting or add_unique.

Dag.add seems too tempting to me as to use without understanding the constraints of that particular fun.

Signed-off-by: Arseniy Alekseyev <[email protected]>
@aalekseyev
Copy link
Collaborator

I might just delete Dag.add because we don't use it in Dune.

@ejgallego
Copy link
Collaborator Author

I might just delete Dag.add because we don't use it in Dune.

Sounds pretty reasonable I'd say.

@aalekseyev
Copy link
Collaborator

Did you run the Coq benchmark on the version with the list+set? I'm happy to merge without that because it's clearly going to be similar, but I'd like to reword the commit message if we didn't measure this particular version.

Copy link
Collaborator

@snowleopard snowleopard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!

@ejgallego
Copy link
Collaborator Author

Did you run the Coq benchmark on the version with the list+set? I'm happy to merge without that because it's clearly going to be similar, but I'd like to reword the commit message if we didn't measure this particular version.

Similar numbers [on battery now]

real	0m43,145s
user	0m42,817s
sys	0m0,216s

vs
                       
real	0m2,823s
user	0m2,625s
sys	0m0,189s

@aalekseyev aalekseyev merged commit 588e0e9 into ocaml:master Feb 6, 2020
@ejgallego ejgallego deleted the children_walk branch February 7, 2020 00:20
rgrinberg added a commit to rgrinberg/opam-repository that referenced this pull request Feb 15, 2020
…lugin, dune-private-libs and dune-glob (2.3.0)

CHANGES:

- Improve validation and error handling of arguments to `dune init` (ocaml/dune#3103, fixes
  ocaml/dune#3046, @shonfeder)

- `dune init exec NAME` now uses the `NAME` argument for private modules (ocaml/dune#3103,
  fixes ocaml/dune#3088, @shonfeder)

- Avoid linear walk to detect children, this should greatly improve
  performance when a target has a large number of dependencies (ocaml/dune#2959,
  @ejgallego, @aalekseyev, @Armael)

- [coq] Add `(boot)` option to `(coq.theories)` to enable bootstrap of
  Coq's stdlib (ocaml/dune#3096, @ejgallego)

- [coq] Deprecate `public_name` field in favour of `package` (ocaml/dune#2087, @ejgallego)

- Better error reporting for "data only" and "vendored" dirs. Using these with
  anything else than a strict subdirectory or `*` will raise an error. The
  previous behavior was to just do nothing  (ocaml/dune#3056, fixes ocaml/dune#3019, @voodoos)

- Fix bootstrap on bytecode only switches on windows or where `-j1` is set.
  (ocaml/dune#3112, @xclerc, @rgrinberg)

- Allow `enabled_if` fields in `executable(s)` stanzas (ocaml/dune#3137, fixes ocaml/dune#1690
  @voodoos)

- Do not fail if `ocamldep`, `ocamlmklib`, or `ocaml` are absent. Wait for them
  to be used to fail (ocaml/dune#3138, @rgrinberg)

- Introduce a `strict_package_deps` mode that verifies that dependencies between
  packages in the workspace are specified correctly. (@rgrinberg, ocaml/dune#3117)
rgrinberg added a commit to rgrinberg/opam-repository that referenced this pull request Feb 15, 2020
…lugin, dune-private-libs and dune-glob (2.3.0)

CHANGES:

- Improve validation and error handling of arguments to `dune init` (ocaml/dune#3103, fixes
  ocaml/dune#3046, @shonfeder)

- `dune init exec NAME` now uses the `NAME` argument for private modules (ocaml/dune#3103,
  fixes ocaml/dune#3088, @shonfeder)

- Avoid linear walk to detect children, this should greatly improve
  performance when a target has a large number of dependencies (ocaml/dune#2959,
  @ejgallego, @aalekseyev, @Armael)

- [coq] Add `(boot)` option to `(coq.theories)` to enable bootstrap of
  Coq's stdlib (ocaml/dune#3096, @ejgallego)

- [coq] Deprecate `public_name` field in favour of `package` (ocaml/dune#2087, @ejgallego)

- Better error reporting for "data only" and "vendored" dirs. Using these with
  anything else than a strict subdirectory or `*` will raise an error. The
  previous behavior was to just do nothing  (ocaml/dune#3056, fixes ocaml/dune#3019, @voodoos)

- Fix bootstrap on bytecode only switches on windows or where `-j1` is set.
  (ocaml/dune#3112, @xclerc, @rgrinberg)

- Allow `enabled_if` fields in `executable(s)` stanzas (ocaml/dune#3137, fixes ocaml/dune#1690
  @voodoos)

- Do not fail if `ocamldep`, `ocamlmklib`, or `ocaml` are absent. Wait for them
  to be used to fail (ocaml/dune#3138, @rgrinberg)

- Introduce a `strict_package_deps` mode that verifies that dependencies between
  packages in the workspace are specified correctly. (@rgrinberg, ocaml/dune#3117)

- Make sure the `@all` alias is defined when no `dune` file is present
  in a directory (ocaml/dune#2946, fix ocaml/dune#2927, @diml)
mgree pushed a commit to mgree/opam-repository that referenced this pull request Feb 19, 2020
…lugin, dune-private-libs and dune-glob (2.3.0)

CHANGES:

- Improve validation and error handling of arguments to `dune init` (ocaml/dune#3103, fixes
  ocaml/dune#3046, @shonfeder)

- `dune init exec NAME` now uses the `NAME` argument for private modules (ocaml/dune#3103,
  fixes ocaml/dune#3088, @shonfeder)

- Avoid linear walk to detect children, this should greatly improve
  performance when a target has a large number of dependencies (ocaml/dune#2959,
  @ejgallego, @aalekseyev, @Armael)

- [coq] Add `(boot)` option to `(coq.theories)` to enable bootstrap of
  Coq's stdlib (ocaml/dune#3096, @ejgallego)

- [coq] Deprecate `public_name` field in favour of `package` (ocaml/dune#2087, @ejgallego)

- Better error reporting for "data only" and "vendored" dirs. Using these with
  anything else than a strict subdirectory or `*` will raise an error. The
  previous behavior was to just do nothing  (ocaml/dune#3056, fixes ocaml/dune#3019, @voodoos)

- Fix bootstrap on bytecode only switches on windows or where `-j1` is set.
  (ocaml/dune#3112, @xclerc, @rgrinberg)

- Allow `enabled_if` fields in `executable(s)` stanzas (ocaml/dune#3137, fixes ocaml/dune#1690
  @voodoos)

- Do not fail if `ocamldep`, `ocamlmklib`, or `ocaml` are absent. Wait for them
  to be used to fail (ocaml/dune#3138, @rgrinberg)

- Introduce a `strict_package_deps` mode that verifies that dependencies between
  packages in the workspace are specified correctly. (@rgrinberg, ocaml/dune#3117)

- Make sure the `@all` alias is defined when no `dune` file is present
  in a directory (ocaml/dune#2946, fix ocaml/dune#2927, @diml)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants