
[TOPI] Fast tanh #3255

Merged: 1 commit merged into apache:master on Jun 5, 2019
Conversation

hlu1 (Contributor) commented May 29, 2019

Borrowing the fast_tanh_float implementation from Eigen (https://github.com/eigenteam/eigen-git-mirror/blob/80f488a7bc9b7c64c9d0c0e8fb301fd905ad1b95/Eigen/src/Core/MathFunctionsImpl.h#L26) brings roughly a 28x speedup to tanh.
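
For reference, here is a rough NumPy sketch of what the Eigen-style approximation computes: a degree-13/6 rational polynomial in x, with the input clamped to [-9, 9]. The coefficients below are copied from the Eigen header linked above; the PR itself adds the equivalent as TVM expressions in topi/include/topi/elemwise.h, so this snippet is only an illustration.

import numpy as np

def fast_tanh_float_sketch(x):
    # Clamp to [-9, 9]; outside this range tanh saturates in float32.
    x = np.clip(np.asarray(x, dtype=np.float32), -9.0, 9.0)
    x2 = x * x
    # Odd-degree numerator coefficients (x^1 .. x^13) and even-degree
    # denominator coefficients (x^0 .. x^6), as in Eigen's fast_tanh_float.
    alpha = [4.89352455891786e-03, 6.37261928875436e-04, 1.48572235717979e-05,
             5.12229709037114e-08, -8.60467152213735e-11, 2.00018790482477e-13,
             -2.76076847742355e-16]
    beta = [4.89352518554385e-03, 2.26843463243900e-03, 1.18534705686654e-04,
            1.19825839466702e-06]
    # Evaluate both polynomials in x^2 with Horner's scheme.
    p = alpha[-1]
    for c in reversed(alpha[:-1]):
        p = p * x2 + c
    p = p * x
    q = beta[-1]
    for c in reversed(beta[:-1]):
        q = q * x2 + c
    return p / q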

benchmark:

target = "llvm -mcpu=core-avx2"
num_iter = 1000
num_cycles = 5
dtype = "float32"

def bench_tanh(func, m, n):
    a = relay.var("a", shape=(m, n))
    out = func(a)
    f = relay.ir_pass.infer_type(relay.Function([a], out))
    opt_level = 3

    with relay.build_config(opt_level=opt_level):
        graph, lib, params = relay.build(f, target, params={})
    print(graph)

    remote = tvm.rpc.LocalSession()
    tmp = tvm.contrib.util.tempdir()
    lib_fname = tmp.relpath("net.tar")
    with tvm.target.create(target):
        lib.export_library(lib_fname)

    remote.upload(lib_fname)
    lib = remote.load_module("net.tar")
    ctx = remote.cpu(0)

    module = graph_runtime.create(graph, lib, ctx)

    logging.debug(graph)

    input = {'a': np.random.uniform(low=-10, high=10, size=(m, n)).astype(np.float32)}
    module.set_input(**input)

    ftimer = module.module.time_evaluator("run", ctx, num_iter)
    for _ in range(num_cycles):
        prof_res = ftimer()
        print("TVM time: ", prof_res.mean * 1e6, " us")
        time.sleep(1)

bench_tanh(relay.tanh, 1024, 128)

Results:

before:
TVM time:  1512.9183090000001  us
TVM time:  1406.613658  us
TVM time:  1444.041799  us
TVM time:  1445.61708  us
TVM time:  1407.4704649999999  us

after:
TVM time:  49.699045999999996  us
TVM time:  57.133776999999995  us
TVM time:  57.434446  us
TVM time:  59.131979  us
TVM time:  57.127435999999996  us

speedup = 28x

The speedup is about the same on Intel Skylake.

hlu1 changed the title from [TOPI] fast tanh to [TOPI] Fast tanh on May 29, 2019
hlu1 (Contributor, Author) commented May 29, 2019

@ajtulloch, could you review pls?

jroesch (Member) commented May 30, 2019

LGTM, would be good to get a review from @ajtulloch and then we can merge. Does this have any approximation or numerical stability issues?

pavpanchekha commented May 30, 2019

@jroesch asked me to take a look accuracy-wise; it's not a review, just a quick take.

The Eigen tanh is implemented using a rational approximation on [-9, 9] and is set to ±1 outside that range. (See the comment in the source, though note that the implementation actually does this by clamping.) In GLIBC, which I assume is the currently used implementation, atanh is computed via log1p (see the comment in the source in this mirror).

Let's start with single precision. Generally speaking, I expect the Eigen implementation to be faster (evaluating two polynomials plus one division is going to be much faster than a logarithm!), and I assume the polynomials are well chosen, so the accuracy should be acceptable (the comment says it's within a few ULPs... that they don't say how many doesn't inspire confidence, but they're using a 13/6 approximation, which seems good enough). Plus, I assume you're using this implementation as an activation function, for which exact accuracy is likely unimportant. And the rational approximation is going to be monotonic, which is nice.

Now let's do double precision. Here, the Eigen implementation will only be as accurate as a single-precision computation, because it's missing terms in the polynomial. And while clamping at ±9 is appropriate in single precision, in double precision you have to clamp at ±19 (and so need a rational approximation that is accurate that far out). I don't know exactly what your users think about accuracy, but I suspect they wouldn't be happy with double precision being no more accurate than single precision.

The safe but practical thing, I think, is to use the Eigen tanh for single precision but not for double precision. If you wanted to, you could derive an analogous polynomial and get a double-precision version that way, or you could keep using the GLIBC implementation in that case and hope the higher memory bandwidth of double precision masks the additional CPU time spent computing tanh.

hlu1 force-pushed the fast_tanh branch 2 times, most recently from 4bd414a to 83eb574 on May 30, 2019
hlu1 (Contributor, Author) commented May 30, 2019

@pavpanchekha, thanks for the comment. It makes a lot of sense.
I added logic to invoke the Eigen fast_tanh_float only for fp32 and to fall back to the default GLIBC tanh implementation for all other data types. A double-precision test for tanh has also been added.
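
To make the dispatch concrete, here is a minimal Python sketch of the idea, with hypothetical names (the actual change is C++ in topi/include/topi/elemwise.h; fast_tanh_float_sketch refers to the NumPy illustration in the PR description above):

import numpy as np

def tanh_with_fast_path(x):
    # Use the fast rational approximation only for float32 inputs;
    # every other dtype falls back to the library tanh.
    if np.asarray(x).dtype == np.float32:
        return fast_tanh_float_sketch(x)
    return np.tanh(x)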

hlu1 force-pushed the fast_tanh branch 2 times, most recently from ce650de to fe05f22 on May 31, 2019
ajtulloch (Contributor) left a comment:

Looks great, excellent idea - just a suggestion for a ULP-bound test.

Review threads on topi/include/topi/elemwise.h and topi/tests/python/test_topi_math.py (outdated, resolved).
hlu1 force-pushed the fast_tanh branch 2 times, most recently from 5869797 to de7a162 on June 1, 2019
ajtulloch (Contributor) commented:

Looks like tests fail because of CUDA expf being > 1 ULP (IIRC it's something like 5 ULP max), but maybe we should just enable ULP checking for the tanh impl?

Code excerpt under review:

    high,
    shape=(20, 3),
    dtype=tvm.float32,
    maxulp=1,
Review comment (Contributor):

Maybe just make this an optional setting that only tanh uses for now?
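
One way to read that suggestion, as a hedged sketch (check_result, its parameter names, and the tolerances are hypothetical, not the helper actually used in topi/tests/python/test_topi_math.py):

import numpy as np

def check_result(actual, expected, maxulp=None, rtol=1e-5, atol=1e-5):
    # ULP checking only when an op opts in (e.g. tanh); otherwise use the
    # usual relative/absolute tolerance comparison.
    if maxulp is not None:
        np.testing.assert_array_max_ulp(actual, expected, maxulp=maxulp)
    else:
        np.testing.assert_allclose(actual, expected, rtol=rtol, atol=atol)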

hlu1 (Contributor, Author) commented Jun 1, 2019

I did a bit more testing and noticed that the error between numpy.tanh and topi.tanh can actually be pretty big, even for the original implementation. The maximum ULP difference can be as large as 194. I think using absolute error and relative error is probably fine.

import numpy as np
import tvm
import topi
import topi.testing
from topi import util

m = tvm.var('m')
l = tvm.var('l')
A = tvm.placeholder((m, l), name='A')

shape = (20, 3)
B = topi.tanh(A)

for _ in range(10):
    a_np = np.random.uniform(low=-1, high=1, size=shape).astype(A.dtype)
    b_np = np.tanh(a_np)
    device = "llvm"
    ctx = tvm.context(device, 0)

    with tvm.target.create(device):
        s = topi.generic.schedule_injective(B)
    foo = tvm.build(s, [A, B], device, name="tanh")
    a = tvm.nd.array(a_np, ctx)
    b = tvm.nd.array(np.zeros_like(b_np), ctx)
    foo(a, b)
    try:
        np.testing.assert_array_almost_equal_nulp(b.asnumpy(), b_np)
    except AssertionError as error:
        print(error)

Original:

    X and Y are not equal to 1 ULP (max is 20)
    X and Y are not equal to 1 ULP (max is 2)
    X and Y are not equal to 1 ULP (max is 194)
    X and Y are not equal to 1 ULP (max is 11)
    X and Y are not equal to 1 ULP (max is 3)
    X and Y are not equal to 1 ULP (max is 10)
    X and Y are not equal to 1 ULP (max is 8)
    X and Y are not equal to 1 ULP (max is 5)
    X and Y are not equal to 1 ULP (max is 32)
    X and Y are not equal to 1 ULP (max is 40)

Eigen:

    X and Y are not equal to 1 ULP (max is 13)
    X and Y are not equal to 1 ULP (max is 3)
    X and Y are not equal to 1 ULP (max is 2)
    X and Y are not equal to 1 ULP (max is 5)
    X and Y are not equal to 1 ULP (max is 27)
    X and Y are not equal to 1 ULP (max is 2)
    X and Y are not equal to 1 ULP (max is 26)
    X and Y are not equal to 1 ULP (max is 14)
    X and Y are not equal to 1 ULP (max is 74)
    X and Y are not equal to 1 ULP (max is 4)
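
For comparison, the relative/absolute-error style of check mentioned above would replace the ULP assertion in the snippet with something along these lines (the tolerances here are illustrative, not necessarily what the merged test uses):

# Relative/absolute tolerance comparison instead of a strict ULP bound.
np.testing.assert_allclose(b.asnumpy(), b_np, rtol=1e-5, atol=1e-5)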

ajtulloch (Contributor) commented:

Sounds good, looks great then. Thanks for digging into it.

hlu1 (Contributor, Author) commented Jun 3, 2019

@tqchen, @jroesch, it's ready to be merged.

tqchen merged commit 165aa0d into apache:master on Jun 5, 2019
tqchen (Member) commented Jun 5, 2019

Thanks @pavpanchekha @jroesch @hlu1 @ajtulloch @antinucleon, this PR is now merged.

hlu1 (Contributor, Author) commented Jun 5, 2019

Thanks @tqchen

wweic pushed a commit to wweic/tvm that referenced this pull request Jun 26, 2019
wweic pushed a commit to neo-ai/tvm that referenced this pull request Jun 27, 2019