Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph method edge_disjoint_spanning_trees() can hang #38831

Closed
orlitzky opened this issue Oct 20, 2024 · 18 comments
Closed

Graph method edge_disjoint_spanning_trees() can hang #38831

orlitzky opened this issue Oct 20, 2024 · 18 comments

Comments

@orlitzky
Copy link
Contributor

I have a new machine where it looks like #32169 has returned. With the setup from https://github.com/sagemath/sage/blob/develop/src/sage/graphs/generic_graph.py#L7214,

sage: d6 = r'[E_S?_hKIH@eos[BSg???Q@FShGC?hTHUGM?IPug?JOEYCdOzdkQGo'                                                       
sage: d6 += r'@ADA@AAg?GAQW?[aIaSwHYcD@qQb@Dd?\hJTI@OHlJ_?C_OEIKoeC'                                                       
sage: d6 += r'R@_BC?Q??YBFosqITEA?IvCU_'                                                                                   
sage: G = DiGraph(d6, format='dig6')

The following complete quickly,

sage: G.edge_disjoint_spanning_trees(1)
[Digraph on 28 vertices]
sage: G.edge_disjoint_spanning_trees(2)
[Digraph on 28 vertices, Digraph on 28 vertices]
sage: G.edge_disjoint_spanning_trees(3)
[Digraph on 28 vertices, Digraph on 28 vertices, Digraph on 28 vertices]
sage: G.edge_disjoint_spanning_trees(4)
[Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices]

But the example that is doctested...

sage: G.edge_disjoint_spanning_trees(5)

hangs "indefinitely" (I stopped waiting after 12 hours). If I Ctrl-C it from an interactive session, it looks like it gets stuck while trying to solve the MIP:

^C---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[10], line 1
----> 1 G.edge_disjoint_spanning_trees(Integer(5))

File ~/tmp/usr/local/lib/python3.12/site-packages/sage/graphs/generic_graph.py:7362, in GenericGraph.edge_disjoint_spanning_trees(self, k, algorithm, root, solver, verbose, integrality_tolerance)
   7360 # We now solve this program and extract the solution
   7361 try:
-> 7362     p.solve(log=verbose)
   7363 except MIPSolverException:
   7364     raise EmptySetError("this graph does not contain the required number of trees/arborescences")

File mip.pyx:2638, in sage.numerical.mip.MixedIntegerLinearProgram.solve()

File glpk_backend.pyx:1130, in sage.numerical.backends.glpk_backend.GLPKBackend.solve()
@orlitzky
Copy link
Contributor Author

@dcoudert
Copy link
Contributor

On macOS 14.7 and sagemath 10.5.beta7, I get

sage: %time G.edge_disjoint_spanning_trees(5, solver='glpk', verbose=3)
GLPK Simplex Optimizer 5.0
1883 rows, 1355 columns, 10390 non-zeros
      0: obj =  -0.000000000e+00 inf =   6.600e+02 (415)
    443: obj =  -0.000000000e+00 inf =   1.588e-14 (0) 1
OPTIMAL LP SOLUTION FOUND
GLPK Integer Optimizer 5.0
1883 rows, 1355 columns, 10390 non-zeros
1215 integer variables, all of which are binary
Preprocessing...
140 hidden covering inequaliti(es) were detected
1864 rows, 1310 columns, 10110 non-zeros
1170 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  2.800e+01  ratio =  2.800e+01
GM: min|aij| =  9.438e-01  max|aij| =  1.060e+00  ratio =  1.123e+00
EQ: min|aij| =  8.945e-01  max|aij| =  1.000e+00  ratio =  1.118e+00
2N: min|aij| =  5.000e-01  max|aij| =  1.562e+00  ratio =  3.125e+00
Constructing initial basis...
Size of triangular part is 1859
Solving LP relaxation...
GLPK Simplex Optimizer 5.0
1864 rows, 1310 columns, 10110 non-zeros
    443: obj =  -0.000000000e+00 inf =   5.626e+02 (302)
    900: obj =  -0.000000000e+00 inf =   1.019e-13 (0) 4
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
Long-step dual simplex will be used
+   900: mip =     not found yet <=              +inf        (1; 0)
+  2634: >>>>>   0.000000000e+00 <=   0.000000000e+00   0.0% (215; 17)
+  2634: mip =   0.000000000e+00 <=     tree is empty   0.0% (0; 461)
INTEGER OPTIMAL SOLUTION FOUND
CPU times: user 1.42 s, sys: 2.12 ms, total: 1.42 s
Wall time: 1.42 s
[Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices]
sage:
sage: %time G.edge_disjoint_spanning_trees(5, solver='cplex', verbose=3)
Version identifier: 22.1.1.0 | 2022-11-28 | 9160aff4d
Tried aggregator 1 time.
MIP Presolve eliminated 184 rows and 45 columns.
MIP Presolve modified 2340 coefficients.
Reduced MIP has 1694 rows, 1310 columns, and 6540 nonzeros.
Reduced MIP has 1170 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.00 sec. (5.19 ticks)
Probing time = 0.00 sec. (1.47 ticks)
Tried aggregator 1 time.
Detecting symmetries...
Reduced MIP has 1694 rows, 1310 columns, and 6540 nonzeros.
Reduced MIP has 1170 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.00 sec. (5.03 ticks)
Probing time = 0.00 sec. (1.50 ticks)
Clique table members: 559.
MIP emphasis: balance optimality and feasibility.
MIP search method: dynamic search.
Parallel mode: deterministic, using up to 10 threads.
Root relaxation solution time = 0.00 sec. (4.58 ticks)

        Nodes                                         Cuts/
   Node  Left     Objective  IInf  Best Integer    Best Bound    ItCnt     Gap

      0     0        0.0000   263                      0.0000      372         
      0     0        0.0000    16                     Cuts: 6      387         
      0     0        0.0000    22                     Cuts: 9      410         
      0     0        0.0000     4                   Covers: 1      412         
      0     0        0.0000    16                     Cuts: 2      429         
*     0+    0                            0.0000        0.0000             0.00%
      0     0        cutoff              0.0000        0.0000      429    0.00%
Elapsed time = 0.09 sec. (155.42 ticks, tree = 0.01 MB, solutions = 1)

Cover cuts applied:  3
Flow cuts applied:  2
Mixed integer rounding cuts applied:  2
Lift and project cuts applied:  1
Gomory fractional cuts applied:  1

Root node processing (before b&c):
  Real time             =    0.09 sec. (155.50 ticks)
Parallel b&c, 10 threads:
  Real time             =    0.00 sec. (0.00 ticks)
  Sync time (average)   =    0.00 sec.
  Wait time (average)   =    0.00 sec.
                          ------------
Total (root+branch&cut) =    0.09 sec. (155.50 ticks)
CPU times: user 698 ms, sys: 17.1 ms, total: 716 ms
Wall time: 134 ms
[Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices,
 Digraph on 28 vertices]

@dcoudert
Copy link
Contributor

Clearly, one should implement a combinatorial tree-packing algorithm like https://doc.sagemath.org/html/en/reference/references/index.html#gabow1995. So far we only have the part to compute the edge connectivity https://doc.sagemath.org/html/en/reference/graphs/sage/graphs/edge_connectivity.html. But the tree-packing algorithm is not easy to implement. Help is more than welcome here.

For undirected graphs, we have the Roskind-Tarjan algorithm.

@orlitzky
Copy link
Contributor Author

With verbose=3 I can see where it gets stuck,

G.edge_disjoint_spanning_trees(5, solver='glpk', verbose=3)
GLPK Simplex Optimizer 5.0
1883 rows, 1355 columns, 10390 non-zeros
      0: obj =  -0.000000000e+00 inf =   6.600e+02 (415)
    449: obj =  -0.000000000e+00 inf =   3.453e-14 (0) 1
OPTIMAL LP SOLUTION FOUND
GLPK Integer Optimizer 5.0
1883 rows, 1355 columns, 10390 non-zeros
1215 integer variables, all of which are binary
Preprocessing...
140 hidden covering inequaliti(es) were detected
1864 rows, 1310 columns, 10110 non-zeros
1170 integer variables, all of which are binary
Scaling...
 A: min|aij| =  1.000e+00  max|aij| =  2.800e+01  ratio =  2.800e+01
GM: min|aij| =  9.438e-01  max|aij| =  1.060e+00  ratio =  1.123e+00
EQ: min|aij| =  8.945e-01  max|aij| =  1.000e+00  ratio =  1.118e+00
2N: min|aij| =  5.000e-01  max|aij| =  1.562e+00  ratio =  3.125e+00
Constructing initial basis...
Size of triangular part is 1859
Solving LP relaxation...
GLPK Simplex Optimizer 5.0
1864 rows, 1310 columns, 10110 non-zeros
    449: obj =  -0.000000000e+00 inf =   5.626e+02 (302)
    903: obj =  -0.000000000e+00 inf =   7.080e-14 (0) 4
OPTIMAL LP SOLUTION FOUND
Integer optimization begins...
Long-step dual simplex will be used
+   903: mip =     not found yet <=              +inf        (1; 0)
+  1474: mip =     not found yet <=   0.000000000e+00        (55; 1)
+  2363: mip =     not found yet <=   0.000000000e+00        (254; 31)
+  4073: mip =     not found yet <=   0.000000000e+00        (542; 207)
+  6315: mip =     not found yet <=   0.000000000e+00        (733; 488)
+  8632: mip =     not found yet <=   0.000000000e+00        (961; 712)
+ 11046: mip =     not found yet <=   0.000000000e+00        (1126; 1014)
+ 13350: mip =     not found yet <=   0.000000000e+00        (1312; 1281)
+ 15672: mip =     not found yet <=   0.000000000e+00        (1360; 1728)
+ 18084: mip =     not found yet <=   0.000000000e+00        (1479; 2066)
+ 20439: mip =     not found yet <=   0.000000000e+00        (1752; 2238)
+ 22763: mip =     not found yet <=   0.000000000e+00        (2030; 2391)
Time used: 60.0 secs.  Memory used: 5.2 Mb.
+ 24917: mip =     not found yet <=   0.000000000e+00        (2297; 2553)
+ 27072: mip =     not found yet <=   0.000000000e+00        (2562; 2689)
+ 29532: mip =     not found yet <=   0.000000000e+00        (2418; 3427)
+ 31732: mip =     not found yet <=   0.000000000e+00        (2648; 3599)
+ 33906: mip =     not found yet <=   0.000000000e+00        (2886; 3763)
+ 36033: mip =     not found yet <=   0.000000000e+00        (3097; 3956)
+ 38158: mip =     not found yet <=   0.000000000e+00        (3332; 4118)
+ 40259: mip =     not found yet <=   0.000000000e+00        (3412; 4482)
+ 42616: mip =     not found yet <=   0.000000000e+00        (3320; 5118)
+ 44386: mip =     not found yet <=   0.000000000e+00        (3319; 5417)
+ 46162: mip =     not found yet <=   0.000000000e+00        (3507; 5599)
+ 48165: mip =     not found yet <=   0.000000000e+00        (3631; 5861)
Time used: 120.0 secs.  Memory used: 6.5 Mb.
+ 50139: mip =     not found yet <=   0.000000000e+00        (3769; 6114)
+ 51991: mip =     not found yet <=   0.000000000e+00        (3942; 6287)
+ 53804: mip =     not found yet <=   0.000000000e+00        (4069; 6552)
+ 55640: mip =     not found yet <=   0.000000000e+00        (4165; 6822)
+ 57616: mip =     not found yet <=   0.000000000e+00        (4274; 7055)
+ 59425: mip =     not found yet <=   0.000000000e+00        (4472; 7208)
+ 61301: mip =     not found yet <=   0.000000000e+00        (4683; 7347)
...

GLPK upstream is probably not going to be much help if that turns out to be the issue.

@dcoudert
Copy link
Contributor

dcoudert commented Oct 20, 2024

I don't know what to do here. I have the same behavior on macOS and Fedora 39 (i.e., works well). The only solution I see, as already discussed with @dimpase long time ago is to implement a combinatorial algorithm and avoid linear programming here. We already changed the formulation (see #32169), but it's apparently not sufficient to avoid issues.

@orlitzky
Copy link
Contributor Author

Rebuilding GLPK with CFLAGS="-O0" fixes it. I'm running gcc-14.2.1_p20240921 on this system and may just be the first one to hit the issue. Clang-19.1.2 had the same problem at -O2, though, so the blame probably lies with GLPK and not the compiler.

@orlitzky
Copy link
Contributor Author

orlitzky commented Oct 21, 2024

It's OK at -O1 with my -march flag, too.

edit: OK at -O1 with GCC

@orlitzky
Copy link
Contributor Author

Clang, on the other hand, shows no improvement at -O0 or -O1.

@dcoudert
Copy link
Contributor

then this should be reported upstream.

@orlitzky
Copy link
Contributor Author

GLPK upstream is not quite dead -- the maintainer responds on the "help" list sometimes -- but there's no public source tree and no response to e.g. basic build fixes reported years ago (https://lists.gnu.org/archive/html/bug-glpk/2020-03/msg00003.html, https://lists.gnu.org/archive/html/bug-glpk/2022-08/msg00000.html).

I'll still do it, but I wouldn't get your hopes up that it will lead to a fix.

@orlitzky
Copy link
Contributor Author

In a surprise plot twist, if I write the LP to a file and then hand it to glpsol --lp, it works regardless of the compiler/flags used to build GLPK.

@dcoudert
Copy link
Contributor

does it means that there is something wrong with our interface ?

@orlitzky
Copy link
Contributor Author

Not necessarily. It could be that glpsol is using the API in a different way that bypasses a bug in the library, or it could be cython that's generating buggy code, or it could even still be a compiler issue that is compiling the cython code different than the glpsol code. Of course it could also be our interface.

I'll have to dig into the GLPK internals later to see what I can learn from inside the two "solve" routines.

@dimpase
Copy link
Member

dimpase commented Oct 24, 2024

as sage: G.edge_disjoint_spanning_trees(2,solver="ppl") gets stuck just as well (it works with 1 instead of 2)
I'd say it's most probably a problem in our interface.

@orlitzky
Copy link
Contributor Author

I've spent the better part of two days debugging this, painstakingly adding recursive printf statements to dump the entire internal representation of the LP, only to find... that write_lp() and read_lp() aren't stable. When I said,

if I write the LP to a file and then hand it to glpsol --lp, it works regardless of the compiler/flags used to build GLPK

it's because the resulting LP (while hopefully equivalent) is not identical. If I use write_mps() and glpsol --mps instead, then glpsol too gets stuck.

Since #34575 reports the same issue on ARM, and since I'm seeing it on RISC-V but only with certain compilers and optimization levels, my current guess is that it's some obscure numerical issue that gets consumed by architecture-dependent compiler optimizations.

@dimpase
Copy link
Member

dimpase commented Oct 24, 2024

it doesn't explain why ppl gets stuck too, though

@orlitzky
Copy link
Contributor Author

PPL fails consistently though, that's easier to explain.

@orlitzky
Copy link
Contributor Author

orlitzky added a commit to orlitzky/sage that referenced this issue Oct 24, 2024
One doctest in this file is "hanging" on ARM64 and RISC-V as GLPK
tries courageously to solve a MIP. A tweak to the solver options
allows this problem to be solved on those two architectures without
affecting any others. This is unlikely to solve the general problem,
but it may buy us some time.

Closes sagemath#34575
Closes sagemath#38831
orlitzky added a commit to orlitzky/sage that referenced this issue Oct 25, 2024
One doctest in this file is "hanging" on ARM64 and RISC-V as GLPK
tries courageously to solve a MIP. A tweak to the solver options
allows this problem to be solved on those two architectures without
affecting any others. This is unlikely to solve the general problem,
but it may buy us some time.

Closes sagemath#34575
Closes sagemath#38831
@vbraun vbraun closed this as completed in f592087 Oct 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants