
[RFC] Symbolic shape runtime #2451

Closed
antinucleon opened this issue Jan 17, 2019 · 17 comments

@antinucleon
Contributor

antinucleon commented Jan 17, 2019

Problem: In real-world workloads, not everything is static. We may want to run inference on images of different sizes, or with different batch sizes. In some workloads, we may need to concatenate a different number of embeddings for each instance.

Design: Introduce an upper_bound idea for memory allocation, and use a different view for different inputs. For example, if the inference input shape is [n, 3, 224, 224] and we give a hint that n's upper bound is 5, then the initial memory allocation is based on [5, 3, 224, 224], but we may create views of [1, 3, 224, 224] or [3, 3, 224, 224] while running. In this way we trade some memory for dynamism.
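A minimal numpy sketch of the view-over-upper-bound idea (illustrative only; the actual runtime works on its own storage pool, not numpy arrays):

import numpy as np

upper_bound = 5
# One-time allocation sized for the upper bound.
storage = np.empty((upper_bound, 3, 224, 224), dtype="float32")

def get_view(n):
    # Return a view into the pre-allocated buffer; no new allocation happens.
    assert 1 <= n <= upper_bound
    return storage[:n]

batch1 = get_view(1)   # shape (1, 3, 224, 224)
batch3 = get_view(3)   # shape (3, 3, 224, 224)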

Implementation:

  1. During memory planning, given a map of {var : upper_bound}, the memory planning process creates two columns, [storage_id, max_bytes], for runtime setup.

  2. The shape in the graph JSON is no longer an int, but an infix string expression of variables and constants.

  3. When setting up the runtime, the storage pool is allocated from the [storage_id, max_bytes] columns. The infix shape expressions are converted to postfix expressions for fast evaluation. When a variable in a shape changes, the runtime evaluates all expressions containing that variable, then updates the view if it is not already in the cache (see the small sketch below).
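A hypothetical sketch of the postfix evaluation step (the function name and token handling are illustrative, not the actual runtime code):

def eval_postfix(tokens, var_values):
    # tokens is a postfix expression, e.g. ["n", "3", "*"] for the infix "n*3"
    stack = []
    for tok in tokens:
        if tok in ("+", "*"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == "+" else a * b)
        elif tok.isdigit():
            stack.append(int(tok))         # constant
        else:
            stack.append(var_values[tok])  # shape variable, e.g. n
    return stack[0]

print(eval_postfix(["n", "3", "*"], {"n": 4}))  # 12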

PR: #2447

@tqchen
Member

tqchen commented Jan 17, 2019

cc @jroesch @dmlc/tvm-comitter

I think we also need to discuss the scope of this proposal. The current proposal seems good for doing dynamic shapes in cases like object detection; it is a simple but effective generalization of the current version.

We also need to discuss how to handle the kernels when shape changes are present.

@nhynes
Member

nhynes commented Jan 17, 2019

+1 to the general idea of this proposal. We should also consider making the max_bytes optional and (for now) adding stubs for delegating to the device allocator just in case the user needs more flexibility.

One example of when allocation might be necessary is during online learning. If the user doesn't fully anticipate the dataset, a large input could come in and crash the system (which is a Bad Thing™). This becomes even more relevant when we consider the complex programs (like loops and graphs) that can be executed by Relay.

@eqy
Contributor

eqy commented Jan 17, 2019

Does upper_bound mean that anything between 1 and upper_bound is likely to be used, or is it more often the case that we only care about a few values? If we want kernels to actually specialize for the difference in values (e.g., actually batching inference instead of running batch-1 n times), then this choice will affect how much time we need to spend optimizing different kernel shapes.

@antinucleon
Contributor Author

@eqy upper_bound is a hint for storage planning. With this hint, we can safely allocate a memory chunk large enough to avoid reallocating when the shape changes within [1, upper_bound].

So far, when I pass a `tvm.var` to generate a kernel I have no issues, because this axis is just a loop. For specialization within [1, upper_bound], I think it is not a hard problem if we have schedules and hints.

@antinucleon
Contributor Author

@nhynes In the current PR, max_bytes is optional if all shapes are static. This solution can solve a huge range of problems in an efficient way. For example, in production the bound is determined by the developer, not the users. For on-the-fly dynamic execution, we can't use the current graph runtime.

@antinucleon
Contributor Author

It has been a while and I hope to hear from you about the plan for this PR. I also suggest updating the roadmap regularly so everyone knows what is going on with all future features.

@FrozenGene
Member

As discussed with @antinucleon in PR #2488, I suggest we extend the functionality to support automatically growing the memory allocation when the input size is larger than the hint. The growth factor could be 1.5, like Folly's vector.
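A rough sketch of the suggested growth policy (illustrative only, not part of the PR): when a request exceeds the current pool size, grow it by a factor of 1.5 until it fits.

def grown_size(current_bytes, requested_bytes, factor=1.5):
    # Grow the pool geometrically so that repeated growth stays amortized.
    new_bytes = max(current_bytes, 1)
    while new_bytes < requested_bytes:
        new_bytes = int(new_bytes * factor)
    return new_bytes

print(grown_size(1024, 3000))  # 3456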

@kevinthesun
Contributor

kevinthesun commented Jan 24, 2019

Memory planning is a great starting point! To achieve dynamic batch inference, we might want to select kernels for different batch sizes to get decent performance compared to sequential inference. This depends on the way we support multi-batch kernel schedules. Setting some buckets between [1, upper_bound] might be a possible way. It's also possible to optimize for some batch sizes, such as powers of 2, and then combine them to generate other batches (see the sketch below). @yuruofeifei and I can start to look at this problem.
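An illustrative sketch of that bucketing idea (hypothetical helper names, not an actual TVM API): compile kernels only for power-of-2 batch sizes and decompose an arbitrary batch into those buckets.

def pow2_buckets(upper_bound):
    # Power-of-2 batch sizes up to the upper bound, e.g. [1, 2, 4, 8, 16, 32].
    buckets, b = [], 1
    while b <= upper_bound:
        buckets.append(b)
        b *= 2
    return buckets

def decompose_batch(n, buckets):
    # Greedily cover a batch of size n with the available bucket sizes.
    parts = []
    for b in sorted(buckets, reverse=True):
        while n >= b:
            parts.append(b)
            n -= b
    return parts

print(decompose_batch(13, pow2_buckets(32)))  # [8, 4, 1]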

@antinucleon
Contributor Author

@FrozenGene Because we don't want to introduce shape inference into the runtime, simply expanding the storage pool linearly like a vector could introduce bugs. If we want dynamic growth, we have to encode the storage pool shape expressions in the JSON file.

@antinucleon
Contributor Author

It has been two weeks. Please let us know what else we should change to land this PR. Thanks.

@tqchen
Member

tqchen commented Jan 31, 2019

Let us wait until after the 0.5 release and then revisit the proposal. The current proposal seems to be a good intermediate-term solution for some of the problems.

@antinucleon can you post some end-to-end examples using the new set of changes to give everyone a sense of what it can do? I am still not very sure whether we propose to use bucketing or generic shape sizes. Perhaps @ajtulloch @jroesch can also comment.

@ZihengJiang
Contributor

ZihengJiang commented Jan 31, 2019

The RFC addresses how to support symbolic shapes in terms of memory planning. We should also come up with a way to generate efficient dynamic kernels.

@antinucleon
Contributor Author

antinucleon commented Feb 1, 2019

Let's decompose this problem.

In order to support dynamic shape, we have to solve two problems:

  1. Memory management.

  2. Kernel support.

Memory management determines whether a program can run at all, and may also affect the efficiency of the runtime. Kernel support, however, only determines efficiency.

If you agree with this, we can further decompose what blocks us: memory management.

There are two ways to handle dynamic memory: allocate statically and create many views on the static memory, or allocate dynamically; we can call these AOT and JIT respectively. Ideally, if we have a very powerful allocator, the two ways are equivalent.

In this RFC we are going to use the AOT approach.
To know how much memory we need to allocate for each storage pool node, we also have two approaches: 1. Give a hint, then get a number. 2. Pre-compute an expression for each pool node's size. To get it working quickly, this PR uses a max_bytes hint to get a fixed size, then creates views. Note that if we can get the expression of a storage node, a JIT allocator will be much easier too.

At this point, we already have solutions to deal with memory management. Let's come back to efficiency problem: kernels.

  1. Assume we are doing tensorization + vectorization + loop partition; we will have a good default kernel (most human-written code today works this way).

  2. Assume that for each workload we have a different kernel. If during compilation we can have a map to the compiled functions, we can handle it when creating views (see view_cache_ and the small sketch below).
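A hypothetical sketch of a view cache keyed by the concrete shape, mirroring the role of view_cache_ mentioned above (class and field names are illustrative, not the runtime's actual data structures):

import numpy as np

class ViewCache:
    def __init__(self, storage):
        self.storage = storage  # pre-allocated max-size, flat buffer
        self.cache = {}         # concrete shape tuple -> view

    def get(self, shape):
        # Create the view once per concrete shape, then reuse it.
        if shape not in self.cache:
            n = int(np.prod(shape))
            self.cache[shape] = self.storage[:n].reshape(shape)
        return self.cache[shape]

pool = np.empty(5 * 3 * 224 * 224, dtype="float32")
views = ViewCache(pool)
v = views.get((2, 3, 224, 224))  # created once, reused on later calls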

----------------------------This is a separation line------------------------

End to end changes:

Python API:

b1 = tvm.var("n1")
b2 = tvm.var("n2")
b3 = tvm.var("n3")

shape_var_bounds = {
    b1 : 128,
    b2 : 32,
    b3 : 12
}



x = relay.var("x",
              shape=[b1,
                     4,
                     2, 3], dtype="float32")


y = relay.var("y",
              shape=[b2,
                     4,
                     2, 3], dtype="float32")

z = relay.op.tensor.concatenate([x, y], axis=0)


a = relay.var("a",
              shape=[b3,
                     4,
                     2, 3], dtype="float32")
b = relay.var("b",
              shape=[27,
                     4,
                     2, 3], dtype="float32")
c = relay.op.tensor.concatenate([a, b], axis=0)

out = relay.op.tensor.concatenate([z, c], axis=0)

func = relay.Function([x, y, a, b], out)


graph, mod, param = relay.build_module.build(func, target="llvm", shape_var_bounds=shape_var_bounds)

print(graph)

rt = tvm.contrib.graph_runtime.create(graph, mod, tvm.cpu())
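A hedged usage sketch of feeding differently sized batches at runtime (assuming, per the description above, that the runtime re-evaluates the shape expressions and switches views when inputs of a new shape are set; this is not taken from the PR itself):

import numpy as np

rt.set_input("x", np.random.rand(100, 4, 2, 3).astype("float32"))  # n1 = 100 <= 128
rt.set_input("y", np.random.rand(16, 4, 2, 3).astype("float32"))   # n2 = 16  <= 32
rt.set_input("a", np.random.rand(7, 4, 2, 3).astype("float32"))    # n3 = 7   <= 12
rt.set_input("b", np.random.rand(27, 4, 2, 3).astype("float32"))
rt.run()
out = rt.get_output(0)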

In graph json, three attributes are added:

"symbolic_shape": 1,
"max_bytes": [
      "list_int",
      [
        12288,
        3072,
        1152,
        2592,
        19104
      ]
    ],
    "var_upper_bound": {
      "n1": 128,
      "n2": 32,
      "n3": 12
    },

At runtime, if symbolic_shape is true, memory allocation is done according to the max_bytes field.
When a shape variable changes, a new view is created.

This change is good for dynamic batching, as long as the kernel's loop over the batch axis is not significantly slower.

@kevinthesun
Contributor

kevinthesun commented Feb 1, 2019

For dynamic kernels, one straightforward way is to include all possible kernel functions in the shared lib. However, this can result in too many functions in the lib file. We might want an approach that generates fewer kernel functions, for example by including only the more common batch sizes, while providing fair performance for the other cases.

@zhiics
Member

zhiics commented Feb 1, 2019

I am just curious whether anybody has insight into the most frequently used batch sizes for a given model. Is it just a few, or many different possible ones? And how much memory would it cost?

My concern is that it would be much harder to determine max_bytes, and the number of frequently used sizes would become significant when the shape varies as well. In addition, I am also curious how much it would affect the size of the current graph runtime.

@antinucleon
Contributor Author

@zhiics Ideally, JIT plus a very good dynamic memory allocator would be the most elegant solution. However, that plus dynamic control flow is not something we can use quickly. For server-side inference, where memory is not very sensitive, we can use the current solution.

The graph runtime size does not change much, as we only introduced a tiny expression evaluator for symbolic shapes. The new size of libtvm_runtime.dylib is 673 KB, with dynamic shape and view cache; the old one is 597 KB.

@tqchen
Member

tqchen commented Oct 8, 2019

Superseded by the Relay VM solution; closing for now.

@tqchen tqchen closed this as completed Oct 8, 2019