
[TVM] Fix GatherBound to avoid allocating too much #2104

Closed
wants to merge 1 commit

Conversation

sgrechanik-h
Contributor

This is a fix for the problem mentioned here on the forum.

During compilation TVM sometimes reshapes tensors using information from their uses. This is an important feature: a schedule may introduce a new tensor (e.g. via cache_write) but use only a handful of its elements, in which case this transformation can reduce the tensor's size and remove unnecessary computations of unused elements. However, the same transformation sometimes expands tensor sizes. This happens because computing precise ranges for expressions is impossible in general and often leads to overapproximation. In the example from the forum the expression is i + (0 - i), whose range is evaluated to be [-9; 9] because the range evaluation algorithm assumes that the subexpressions i and -i are independent. Of course, this particular example could be simplified away, but it is always possible to construct an example that the simplifier won't be able to simplify.
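The overapproximation can be illustrated with a minimal interval-arithmetic sketch. This is a hypothetical Interval helper for illustration only, not TVM's actual IntSet API; it evaluates each subexpression's range independently, exactly the assumption described above.

```python
# Minimal interval-arithmetic sketch (hypothetical helper, not TVM's IntSet).
# Each operation combines bounds as if the operands were independent,
# which is what produces the [-9; 9] overapproximation for i + (0 - i).

class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # Sum of two *independent* intervals: add the bounds pairwise.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __neg__(self):
        return Interval(-self.hi, -self.lo)

    def __repr__(self):
        return f"[{self.lo}; {self.hi}]"

i = Interval(0, 9)     # i ranges over [0; 9]
print(i + (-i))        # -> [-9; 9], even though i + (0 - i) is always 0
```

The true range of i + (0 - i) is the single point 0, but because the two occurrences of i are treated as unrelated, the computed range is nine times wider in each direction.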

The solution I propose is to compare the range computed from the uses and the original range from the tensor declaration, and use the smaller one. There are several concerns though:

  • I am not sure if there are any passes or schedules that rely on tensor expansion. If there are, this fix may lead to out-of-bounds errors.
  • Sometimes we cannot compare the ranges, namely when variables are involved. In this case I decided to fall back to the old behavior, because otherwise the test test_bound_nest_thread from test_schedule_bound_inference.py fails (it uses a global variable m as a tensor size).
  • The test test_bound_nest_thread can also be made to fail by fixing m to any integer less than 32. The failure is due to additional checks at message_passing.cc:36 and schedule_ops.cc:184, and I'm not sure whether they are really necessary.
  • The test test_schedule_bound_condition fails because with this fix it cannot reproduce the original problem anymore ([SCHEDULE] Generate Lower Bound Conditions #1014). I removed it for now.
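The proposed policy can be sketched in a few lines of Python. This is an illustration of the idea only, not the actual GatherBound code (which operates on TVM ranges in C++); choose_range is a hypothetical helper, and symbolic extents are modeled as strings for simplicity.

```python
# Sketch of the proposed policy (hypothetical helper, not the real
# GatherBound implementation): prefer the smaller of the declared extent
# and the extent inferred from uses, and fall back to the inferred one
# when the extents are symbolic and cannot be compared.

def choose_range(declared_extent, inferred_extent):
    # Extents are ints (constant) or strings (symbolic, e.g. "m").
    if isinstance(declared_extent, int) and isinstance(inferred_extent, int):
        return min(declared_extent, inferred_extent)
    # Cannot compare symbolic extents: keep the old behavior.
    return inferred_extent

print(choose_range(10, 100510))  # -> 10 (avoid expansion)
print(choose_range(10, 3))       # -> 3  (still allow shrinking)
print(choose_range("m", 32))     # -> 32 (fallback to inferred range)
```

The fallback branch is what keeps test_bound_nest_thread passing: when the declared size is the symbolic variable m, the two extents are incomparable, so the range inferred from the uses is kept as before.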

(I also considered another solution, namely intersecting the two ranges. In theory this should be perfect, but in practice it results in overly complex expressions, which eventually lead to messages like this:
loop_partition.cc:356: Cannot prove: ((((((1 + min((c.outer.h.fused/112), 7)) - max((c.outer.h.fused/112), 0)) - 1) - (8 - max((c.outer.h.fused/112), 0))) + 1) >= 0), when generating the post doubt loop
and also to errors like this:
TVMError: [17:49:02] split_host_device.cc:116: Check failed: !use_count_.count(v) variable c.outer.h.fused has been used before definition!.
The reason is that some local variables leak into tensor bounds, which are then used to generate certain conditions.)

@tqchen Could you please provide some comments regarding this issue?

@junrushao
Member

FYI, the PR fixing issue #1014 was incorrect, and I just added a correct one #2126.

@tqchen
Member

tqchen commented Nov 19, 2018

@junrushao1994 @were can you help review this PR?

@junrushao
Member

Will do this Wednesday.

@junrushao
Member

The problem here is non-trivial.

@sergei-mironov
Contributor

sergei-mironov commented Nov 27, 2018

Hi, I'd like to share some results of an investigation into the BoundsChecker problem #2079. I think they are closely related to the current issue as well. Both examples show cases where a problem clearly exists but BoundsChecker cannot be used effectively.

Both programs below perform operations on an intermediate tensor c while computing the final tensor d:

Uninitialized data passed to computations

In this case we use c merely by copying its elements with a certain offset. Originally this example was meant to simulate a user error causing a SegFault, but the compiler treated it in a more interesting way.

def demo_array2():
  with tvm.build_config(instrument_bound_checkers=True):
    a = tvm.placeholder((10,), name='a')
    b = tvm.placeholder((10,), name='b')
    c = tvm.compute((10,), lambda i: a[i]+b[i], name='c')
    d = tvm.compute((10,), lambda i: c[i+100500], name='d')

    run_tvm(0,1,
            { a:np.ones(get_shape(a)).astype(np.float32)
            , b:np.ones(get_shape(b)).astype(np.float32)
            },d, debug=True)

In this case TVM generates the code below. Note the correct allocation and the if (likely((uint1)0)) guard in the produce c statement. As a result, c is passed to d uninitialized. The user has no chance to notice their mistake quickly.

// attr [c] storage_scope = "global"
allocate c[float32 * 10]
produce c {
  for (i, 0, 10) {
    if (likely((uint1)0)) {
      c[i] = (a[(i + 100500)] + b[(i + 100500)])
    }
  }
}
produce d {
  for (i, 0, 10) {
    d[i] = c[i]
  }
}

Uninitialized data + fallacious allocations

In this version we make the c usage pattern a bit more complex.

def demo_array3():
  with tvm.build_config(instrument_bound_checkers=True):
    a = tvm.placeholder((10,), name='a')
    b = tvm.placeholder((10,), name='b')
    c = tvm.compute((10,), lambda i: a[i]+b[i], name='c')
    d = tvm.compute((10,), lambda i: c[i]*c[i+100500], name='d')

    run_tvm(0,1,
            { a:np.ones(get_shape(a)).astype(np.float32)
            , b:np.ones(get_shape(b)).astype(np.float32)
            },d, debug=True)

As before, c is passed uninitialized, but in this case TVM decides to increase the size of c and its axis, potentially consuming a lot of memory. Resulting IR:

// attr [c] storage_scope = "global"
allocate c[float32 * 100510]
produce c {
  for (i, 0, 100510) {
    if (likely((i < 10))) {
      c[i] = (a[i] + b[i])
    }
  }
}
produce d {
  for (i, 0, 10) {
    d[i] = (c[i]*c[(i + 100500)])
  }
}

Results

These examples demonstrate how a user's typo may remain undetected and lead to bugs that are hard to debug. In both examples uninitialized data is passed to computations; it may spread NaNs or even become a security concern. In the last example TVM additionally decides to alter the size of the intermediate tensor.

In general, it is clear that the compiler should be able to shrink tensors during optimization, e.g. in the cache_write case mentioned by @sgrechanik-h. At the same time, I cannot think of a case where we need to increase the size of a tensor.

So far we have only seen negative consequences of this feature. @tqchen @junrushao1994 could you please name some cases where tensor expansion is the desired behavior?

@tqchen
Member

tqchen commented Jul 23, 2019

superseded by #3526

@tqchen tqchen closed this Jul 23, 2019
@sgrechanik-h sgrechanik-h deleted the gatherbound-pr branch November 3, 2019 17:51