
[TVM] Fix GatherBound to avoid allocating too much #2104

Closed
wants to merge 1 commit

Conversation

sgrechanik-h
Contributor

This is a fix for the problem mentioned here on the forum.

During compilation TVM sometimes reshapes tensors using information from their uses. This is an important feature: a schedule may introduce a new tensor (e.g. via cache_write) but use only a handful of its elements, in which case this transformation can reduce the tensor's size and remove unnecessary computations of unused elements. However, the same transformation sometimes expands tensor sizes. This happens because computing precise ranges for expressions is impossible in general and often leads to overapproximation. In the example from the forum the expression is i + (0 - i), whose range is evaluated to be [-9; 9] because the range evaluation algorithm assumes that the subexpressions i and -i are independent. Of course, this particular example could be simplified away, but it is always possible to construct an example that the simplifier won't be able to simplify.
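The overapproximation can be illustrated with a minimal interval-arithmetic sketch. This is a hypothetical Interval helper for illustration only, not TVM's actual IntSet API; it evaluates each subexpression's range independently, exactly the assumption described above.

```python
# Minimal interval-arithmetic sketch (hypothetical helper, not TVM's IntSet).
# Each operation combines bounds as if the operands were independent,
# which is what produces the [-9; 9] overapproximation for i + (0 - i).

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Sum of two *independent* intervals: add the bounds pairwise.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __neg__(self):
        return Interval(-self.hi, -self.lo)

    def __repr__(self):
        return f"[{self.lo}; {self.hi}]"

i = Interval(0, 9)     # i ranges over [0; 9]
print(i + (-i))        # -> [-9; 9], even though i + (0 - i) is always 0
```

The true range of i + (0 - i) is the single point 0, but because the two occurrences of i are treated as unrelated, the computed range is nine times wider in each direction.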

The solution I propose is to compare the range computed from the uses and the original range from the tensor declaration, and use the smaller one. There are several concerns though:

  • I am not sure if there are any passes or schedules that rely on tensor expansion. If there are, this fix may lead to out-of-bounds errors.
  • Sometimes we cannot compare the ranges, namely when variables are involved. In this case I decided to fall back to the old behavior, because otherwise the test test_bound_nest_thread from test_schedule_bound_inference.py fails (it uses a global variable m as a tensor size).
  • The test test_bound_nest_thread can also be made to fail by fixing m to any integer less than 32. The failure is due to additional checks at message_passing.cc:36 and schedule_ops.cc:184, and I'm not sure whether they are really necessary.
  • The test test_schedule_bound_condition fails because with this fix it cannot reproduce the original problem anymore ([SCHEDULE] Generate Lower Bound Conditions #1014). I removed it for now.
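The proposed policy can be sketched in a few lines of Python. This is an illustration of the idea only, not the actual GatherBound code (which operates on TVM ranges in C++); choose_range is a hypothetical helper, and symbolic extents are modeled as strings for simplicity.

```python
# Sketch of the proposed policy (hypothetical helper, not the real
# GatherBound implementation): prefer the smaller of the declared extent
# and the extent inferred from uses, and fall back to the inferred one
# when the extents are symbolic and cannot be compared.

def choose_range(declared_extent, inferred_extent):
    # Extents are ints (constant) or strings (symbolic, e.g. "m").
    if isinstance(declared_extent, int) and isinstance(inferred_extent, int):
        return min(declared_extent, inferred_extent)
    # Cannot compare symbolic extents: keep the old behavior.
    return inferred_extent

print(choose_range(10, 100510))  # -> 10 (avoid expansion)
print(choose_range(10, 3))       # -> 3  (still allow shrinking)
print(choose_range("m", 32))     # -> 32 (fallback to inferred range)
```

The fallback branch is what keeps test_bound_nest_thread passing: when the declared size is the symbolic variable m, the two extents are incomparable, so the range inferred from the uses is kept as before.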

(I also considered another solution, namely intersecting the two ranges. In theory this should be perfect, but in practice it results in overly complex expressions, which eventually lead to messages like this:
loop_partition.cc:356: Cannot prove: ((((((1 + min((c.outer.h.fused/112), 7)) - max((c.outer.h.fused/112), 0)) - 1) - (8 - max((c.outer.h.fused/112), 0))) + 1) >= 0), when generating the post doubt loop
and also to errors like this:
TVMError: [17:49:02] split_host_device.cc:116: Check failed: !use_count_.count(v) variable c.outer.h.fused has been used before definition!.
The reason is that some local variables leak into tensor bounds, which are then used to generate certain conditions.)

@tqchen Could you please provide some comments regarding this issue?

@junrushao
Member

FYI, the PR fixing issue #1014 was incorrect, and I just added a correct one #2126.

@tqchen
Member

tqchen commented Nov 19, 2018

@junrushao1994 @were can you help review this PR?

@junrushao
Member

Will do this Wednesday.

@junrushao
Member

The problem here is non-trivial.

@sergei-mironov
Contributor

sergei-mironov commented Nov 27, 2018

Hi, I'd like to share some results of an investigation into the BoundsChecker problem #2079. I think they are closely related to the current issue as well. Both examples show cases where a problem clearly exists but BoundsChecker cannot be used effectively.

Both programs below perform operations on an intermediate tensor c while computing the final tensor d:

Uninitialized data passed to computations

In this case we use c merely by copying its elements with a certain offset. Originally this example was meant to simulate a user error causing a SegFault, but the compiler treated it in a more interesting way.

def demo_array2():
  with tvm.build_config(instrument_bound_checkers=True):
    a = tvm.placeholder((10,), name='a')
    b = tvm.placeholder((10,), name='b')
    c = tvm.compute((10,), lambda i: a[i]+b[i], name='c')
    d = tvm.compute((10,), lambda i: c[i+100500], name='d')

    run_tvm(0,1,
            { a:np.ones(get_shape(a)).astype(np.float32)
            , b:np.ones(get_shape(b)).astype(np.float32)
            },d, debug=True)

In this case TVM generates the code below. Note the correct allocation and the if (likely((uint1)0)) guard in the produce c statement. As a result, c is passed to d uninitialized. The user has no chance to notice their mistake quickly.

// attr [c] storage_scope = "global"
allocate c[float32 * 10]
produce c {
  for (i, 0, 10) {
    if (likely((uint1)0)) {
      c[i] = (a[(i + 100500)] + b[(i + 100500)])
    }
  }
}
produce d {
  for (i, 0, 10) {
    d[i] = c[i]
  }
}

Uninitialized data + fallacious allocations

In this version we make the c usage pattern a bit more complex.

def demo_array3():
  with tvm.build_config(instrument_bound_checkers=True):
    a = tvm.placeholder((10,), name='a')
    b = tvm.placeholder((10,), name='b')
    c = tvm.compute((10,), lambda i: a[i]+b[i], name='c')
    d = tvm.compute((10,), lambda i: c[i]*c[i+100500], name='d')

    run_tvm(0,1,
            { a:np.ones(get_shape(a)).astype(np.float32)
            , b:np.ones(get_shape(b)).astype(np.float32)
            },d, debug=True)

As before, c is passed uninitialized, but in this case TVM decides to increase the size of c and its axis, potentially consuming a lot of memory. Resulting IR:

// attr [c] storage_scope = "global"
allocate c[float32 * 100510]
produce c {
  for (i, 0, 100510) {
    if (likely((i < 10))) {
      c[i] = (a[i] + b[i])
    }
  }
}
produce d {
  for (i, 0, 10) {
    d[i] = (c[i]*c[(i + 100500)])
  }
}

Results

These examples demonstrate how a user's typo may remain undetected and lead to bugs that are hard to debug. In both examples uninitialized data is passed to computations; it may spread NaNs or even become a security concern. In the last example TVM additionally decides to alter the size of the intermediate tensor.

In general, it is clear that the compiler should be able to shrink tensors during optimization, e.g. in the cache_write case mentioned by @sgrechanik-h. At the same time, I cannot think of a case where we need to increase the size of a tensor.

So far we have only seen negative consequences of this feature. @tqchen @junrushao1994 could you please name some cases where tensor expansion is the desired behavior?

@tqchen
Member

tqchen commented Jul 23, 2019

superseded by #3526

@tqchen tqchen closed this Jul 23, 2019
@sgrechanik-h sgrechanik-h deleted the gatherbound-pr branch November 3, 2019 17:51