
[Frontend][PyTorch] Cast to Long gets translated into cast to float32 #6300

Closed
interesaaat opened this issue Aug 18, 2020 · 25 comments · Fixed by #6301

Comments

@interesaaat
Contributor

I have the following PyTorch program:

# Look up the feature id used at the current node of each tree, then the
# corresponding feature value for each input row.
tree_nodes = indexes
feature_nodes = torch.index_select(self.features, 0, tree_nodes).view(-1, self.num_trees)
feature_values = torch.gather(x, 1, feature_nodes)

# Look up the threshold and the left/right child ids for the current nodes.
thresholds = torch.index_select(self.thresholds, 0, indexes).view(-1, self.num_trees)
lefts = torch.index_select(self.lefts, 0, indexes).view(-1, self.num_trees)
rights = torch.index_select(self.rights, 0, indexes).view(-1, self.num_trees)

# Take the right child if the feature value is >= the threshold, else the left
# child, then add the per-tree node offset and flatten.
indexes = torch.where(torch.ge(feature_values, thresholds), rights, lefts).long()
indexes = indexes + self.nodes_offset
indexes = indexes.view(-1)

When I compile it with TVM, I get the following interesting trace:

%0 = (%v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %1 = concatenate(%0);
  %2 = reshape(%1, newshape=[-1]);
  %3 = take(%v_operator_map.SklearnLGBMRegressor.features, %2, axis=0);
  %4 = reshape(%3, newshape=[-1, 3]);
  %5 = gather(%input, %4, axis=1);
  %6 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %2, axis=0);
  %7 = reshape(%6, newshape=[-1, 3]);
  %8 = greater_equal(%5, %7);
  %9 = take(%v_operator_map.SklearnLGBMRegressor.rights, %2, axis=0);
  %10 = reshape(%9, newshape=[-1, 3]);
  %11 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %2, axis=0);
  %12 = reshape(%11, newshape=[-1, 3]);
  %13 = where(%8, %10, %12);
  %14 = cast(%13, dtype="float32");
  %15 = add(%14, %v_operator_map.SklearnLGBMRegressor.nodes_offset) an internal invariant was violated while typechecking your program [16:15:28] /Users/mainterl/Develop/tvm/src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. int64) : 
; ;

Apparently, at %14 the cast to long gets translated into a cast to float32.

To reproduce, you can pull this branch, pip install -e .[extra], and run this test file. I can try to generate a minimal running example if it helps.
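
For what it's worth, a minimal sketch of what a standalone repro might look like (a toy module of my own, not the actual Hummingbird code; the exact from_pytorch input handling may differ slightly across TVM versions):

import torch
from tvm import relay

class Tiny(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # int64 parameters, mirroring nodes_offset / lefts / rights above.
        self.offset = torch.nn.Parameter(torch.LongTensor([[1, 2, 3]]), requires_grad=False)
        self.lefts = torch.nn.Parameter(torch.LongTensor([0, 1, 2]), requires_grad=False)
        self.rights = torch.nn.Parameter(torch.LongTensor([2, 1, 0]), requires_grad=False)

    def forward(self, x):
        indexes = torch.where(x >= 0.5, self.rights, self.lefts).long()
        # Adding the int64 result to the int64 parameter is where the
        # spurious float32 cast shows up after conversion.
        return indexes + self.offset

model = Tiny().eval()
example = torch.rand(1, 3)
traced = torch.jit.trace(model, example)
mod, params = relay.frontend.from_pytorch(traced, [("input", list(example.shape))])
print(mod["main"])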

@masahi
Member

masahi commented Aug 19, 2020

Thanks, the cast to long was missing, #6301 should fix this.

@masahi masahi closed this as completed Aug 19, 2020
@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

Thanks @masahi for the super quick PR! However, the problem still persists. I think there is something else going on, because now it casts to long and then to float32 afterwards!

  %0 = (%v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %1 = concatenate(%0);
  %2 = reshape(%1, newshape=[-1]);
  %3 = take(%v_operator_map.SklearnLGBMRegressor.features, %2, axis=0);
  %4 = reshape(%3, newshape=[-1, 3]);
  %5 = gather(%input, %4, axis=1);
  %6 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %2, axis=0);
  %7 = reshape(%6, newshape=[-1, 3]);
  %8 = greater_equal(%5, %7);
  %9 = take(%v_operator_map.SklearnLGBMRegressor.rights, %2, axis=0);
  %10 = reshape(%9, newshape=[-1, 3]);
  %11 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %2, axis=0);
  %12 = reshape(%11, newshape=[-1, 3]);
  %13 = where(%8, %10, %12);
  %14 = cast(%13, dtype="int64");
  %15 = cast(%14, dtype="float32");
  %16 = add(%15, %v_operator_map.SklearnLGBMRegressor.nodes_offset) an internal invariant was violated while typechecking your program [10:41:33] /Users/mainterl/Develop/tvm/src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. int64) : 

Perhaps the add only works over floats and is forcing a cast? No idea.

@interesaaat
Contributor Author

One thing I noticed is that I get a bunch of WARNING:root:Untyped Tensor found, assume it is float32 messages when I compile the PyTorch model using TVM. Is there any way to check which specific tensors create the problems?

@masahi
Member

masahi commented Aug 19, 2020

One thing I noticed is that I get a bunch of WARNING:root:Untyped Tensor found, assume it is float32 messages when I compile the PyTorch model using TVM. Is there any way to check which specific tensors create the problems?

Unfortunately no. This is indeed an annoying issue. Even if you are tracing, we get this warning because it seems (?) the weight and bias parameters from Torch are not typed. So most likely the warnings come from tensors corresponding to the parameters of conv, dense, batch norm, etc.

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

OK, I found a solution, but I think I am hitting some bug. So, in the __init__ of the module I have:

self.nodes_offset = torch.nn.Parameter(torch.LongTensor(nodes_offset), requires_grad=False)

and in the middle of the forward method I am doing:

indexes = indexes + self.nodes_offset

(this is what is producing the two casts and the related exception).

Interestingly, if I do:

def forward(self, x):
        self.nodes_offset.long()

(meaning: forcing a cast at the beginning of the forward method), it works. If instead I add the cast in __init__, I still get the same error. Is this related to tracing?
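
For reference, a sketch of the workaround pattern described above, assuming the cast result is the tensor that the add actually consumes (which matches the TorchScript IR shown later in this thread):

import torch

class Traversal(torch.nn.Module):
    def __init__(self, nodes_offset):
        super().__init__()
        self.nodes_offset = torch.nn.Parameter(torch.LongTensor(nodes_offset), requires_grad=False)

    def forward(self, indexes):
        # Cast explicitly at the start of forward; the add below then consumes
        # the output of an aten::to node, whose scalar type TorchScript can report.
        offset = self.nodes_offset.long()
        return indexes + offset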

@masahi
Member

masahi commented Aug 19, 2020

%v_operator_map.SklearnLGBMRegressor.nodes_offset

Could %v_operator_map.SklearnLGBMRegressor.nodes_offset be float (or incorrectly cast to float by TVM)? It seems that unless both arguments are long, one of them gets cast to float. If both lhs and rhs are long, it should work as expected.
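
As a side note, Relay broadcast ops like add require both operands to have the same dtype (there is no implicit promotion), which is why the typechecker complains. A minimal sketch, not taken from this issue, that should trigger the same check:

import tvm
from tvm import relay

x = relay.var("x", shape=(3, 3), dtype="float32")
y = relay.var("y", shape=(3, 3), dtype="int64")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.add(x, y)))
# Type inference fails with a "Check failed: t0->dtype == t1->dtype
# (float32 vs. int64)" error, like the one in the trace above.
mod = relay.transform.InferType()(mod)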

@masahi
Member

masahi commented Aug 19, 2020

%v_operator_map.SklearnLGBMRegressor.nodes_offset sounds like an argument to the Relay function. We may be making an assumption that all arguments are float.

Can you check the signature of the translated function and see if %v_operator_map.SklearnLGBMRegressor.nodes_offset is typed as float? You can use:

print(mod["main"])

@interesaaat
Contributor Author

Sure! Any suggestions on how to do that? (I am still getting familiar with TVM, to some extent :) )

@masahi
Member

masahi commented Aug 19, 2020

Yes, see my updated comment.

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

%v_operator_map.SklearnLGBMRegressor.nodes_offset: Tensor[(1, 3), int64]

So it looks like it gets it right.

@masahi
Member

masahi commented Aug 19, 2020

Yeah, the error also says Check failed: t0->dtype == t1->dtype (float32 vs. int64).

@interesaaat
Contributor Author

This is the full log, just in case:

WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
fn (%v_operator_map.SklearnLGBMRegressor.values: Tensor[(15, 1), float32], %input: Tensor[(3, 2), float32], %v_operator_map.SklearnLGBMRegressor.features: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.nodes_offset: Tensor[(1, 3), int64], %v_operator_map.SklearnLGBMRegressor.thresholds: Tensor[(15), float32], %v_operator_map.SklearnLGBMRegressor.rights: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.lefts: Tensor[(15), int64]) -> Tensor[(3, 1), float32] {
  %0 = cast(%v_operator_map.SklearnLGBMRegressor.nodes_offset, dtype="int64") /* ty=Tensor[(1, 3), int64] */;
  %1 = (%0, %0, %0);
  %2 = concatenate(%1) /* ty=Tensor[(3, 3), int64] */;
  %3 = reshape(%2, newshape=[-1]) /* ty=Tensor[(9), int64] */;
  %4 = take(%v_operator_map.SklearnLGBMRegressor.features, %3, axis=0) /* ty=Tensor[(9), int64] */;
  %5 = reshape(%4, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %6 = gather(%input, %5, axis=1) /* ty=Tensor[(3, 3), float32] */;
  %7 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %3, axis=0) /* ty=Tensor[(9), float32] */;
  %8 = reshape(%7, newshape=[-1, 3]) /* ty=Tensor[(3, 3), float32] */;
  %9 = greater_equal(%6, %8) /* ty=Tensor[(3, 3), bool] */;
  %10 = take(%v_operator_map.SklearnLGBMRegressor.rights, %3, axis=0) /* ty=Tensor[(9), int64] */;
  %11 = reshape(%10, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %12 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %3, axis=0) /* ty=Tensor[(9), int64] */;
  %13 = reshape(%12, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %14 = where(%9, %11, %13) /* ty=Tensor[(3, 3), int64] */;
  %15 = cast(%14, dtype="int64") /* ty=Tensor[(3, 3), int64] */;
  %16 = add(%15, %0) /* ty=Tensor[(3, 3), int64] */;
  %17 = reshape(%16, newshape=[-1]) /* ty=Tensor[(9), int64] */;
  %18 = take(%v_operator_map.SklearnLGBMRegressor.features, %17, axis=0) /* ty=Tensor[(9), int64] */;
  %19 = reshape(%18, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %20 = gather(%input, %19, axis=1) /* ty=Tensor[(3, 3), float32] */;
  %21 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %17, axis=0) /* ty=Tensor[(9), float32] */;
  %22 = reshape(%21, newshape=[-1, 3]) /* ty=Tensor[(3, 3), float32] */;
  %23 = greater_equal(%20, %22) /* ty=Tensor[(3, 3), bool] */;
  %24 = take(%v_operator_map.SklearnLGBMRegressor.rights, %17, axis=0) /* ty=Tensor[(9), int64] */;
  %25 = reshape(%24, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %26 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %17, axis=0) /* ty=Tensor[(9), int64] */;
  %27 = reshape(%26, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %28 = where(%23, %25, %27) /* ty=Tensor[(3, 3), int64] */;
  %29 = cast(%28, dtype="int64") /* ty=Tensor[(3, 3), int64] */;
  %30 = add(%29, %0) /* ty=Tensor[(3, 3), int64] */;
  %31 = reshape(%30, newshape=[-1]) /* ty=Tensor[(9), int64] */;
  %32 = take(%v_operator_map.SklearnLGBMRegressor.values, %31, axis=0) /* ty=Tensor[(9, 1), float32] */;
  %33 = reshape(%32, newshape=[-1, 3, 1]) /* ty=Tensor[(3, 3, 1), float32] */;
  %34 = reshape(%33, newshape=[-1, 1, 3]) /* ty=Tensor[(3, 1, 3), float32] */;
  sum(%34, axis=[2]) /* ty=Tensor[(3, 1), float32] */
}

@interesaaat
Contributor Author

So even though the tensor is apparently typed correctly, if I remove the initial cast to long the program still fails.

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

This is the log if I remove the explicit cast.

In `main`: 
#[version = "0.0.5"]
fn (%v_operator_map.SklearnLGBMRegressor.values: Tensor[(15, 1), float32], %input: Tensor[(3, 2), float32], %v_operator_map.SklearnLGBMRegressor.features: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.nodes_offset: Tensor[(1, 3), int64], %v_operator_map.SklearnLGBMRegressor.thresholds: Tensor[(15), float32], %v_operator_map.SklearnLGBMRegressor.rights: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.lefts: Tensor[(15), int64]) {
  %0 = (%v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %1 = concatenate(%0);
  %2 = reshape(%1, newshape=[-1]);
  %3 = take(%v_operator_map.SklearnLGBMRegressor.features, %2, axis=0);
  %4 = reshape(%3, newshape=[-1, 3]);
  %5 = gather(%input, %4, axis=1);
  %6 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %2, axis=0);
  %7 = reshape(%6, newshape=[-1, 3]);
  %8 = greater_equal(%5, %7);
  %9 = take(%v_operator_map.SklearnLGBMRegressor.rights, %2, axis=0);
  %10 = reshape(%9, newshape=[-1, 3]);
  %11 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %2, axis=0);
  %12 = reshape(%11, newshape=[-1, 3]);
  %13 = where(%8, %10, %12);
  %14 = cast(%13, dtype="int64");
  %15 = cast(%14, dtype="float32");
  %16 = add(%15, %v_operator_map.SklearnLGBMRegressor.nodes_offset) an internal invariant was violated while typechecking your program [15:30:33] /Develop/tvm/src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. int64) : 
; ;
  %17 = reshape(%16, newshape=[-1]);
  %18 = take(%v_operator_map.SklearnLGBMRegressor.features, %17, axis=0);
  %19 = reshape(%18, newshape=[-1, 3]);
  %20 = gather(%input, %19, axis=1);
  %21 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %17, axis=0);
  %22 = reshape(%21, newshape=[-1, 3]);
  %23 = greater_equal(%20, %22);
  %24 = take(%v_operator_map.SklearnLGBMRegressor.rights, %17, axis=0);
  %25 = reshape(%24, newshape=[-1, 3]);
  %26 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %17, axis=0);
  %27 = reshape(%26, newshape=[-1, 3]);
  %28 = where(%23, %25, %27) an internal invariant was violated while typechecking your program [15:30:33] /Develop/tvm/src/relay/op/tensor/transform.cc:1646: Check failed: condition != nullptr && x != nullptr && y != nullptr: 
; ;
  %29 = cast(%28, dtype="int64");
  %30 = cast(%29, dtype="float32");
  %31 = add(%30, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %32 = reshape(%31, newshape=[-1]);
  %33 = take(%v_operator_map.SklearnLGBMRegressor.values, %32, axis=0);
  %34 = reshape(%33, newshape=[-1, 3, 1]);
  %35 = reshape(%34, newshape=[-1, 1, 3]);
  sum(%35, axis=[2])
}

@masahi
Member

masahi commented Aug 19, 2020

Can you come up with a standalone repro script? I couldn't create a simple test that gives the same error.

@interesaaat
Contributor Author

I can try. I have pushed the code in this Hummingbird commit. The test file to run is tests\test_lightgbm_converter.py. If you pull the commit, pip install -e . should work (assuming you have PyTorch and LightGBM installed). I will also try to put together a minimal script for this, but it might not be that easy.

@masahi masahi reopened this Aug 19, 2020
@masahi
Member

masahi commented Aug 19, 2020

Thanks, I'll try Hummingbird. Compiling decision trees to TVM via PyTorch sounds very interesting!

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

Thanks! Before, we were manually translating the PyTorch implementation into Relay, but long term we want to use your PyTorch frontend. Any suggestions, help, or contributions are welcome! In our experiments with the Relay implementation of the models we were getting about a 3x performance improvement over PyTorch, and we hope to continue to see similar gains with the frontend!

@masahi
Member

masahi commented Aug 20, 2020

OK, found the problem. The nodes_offset is indeed giving a warning at

https://github.com/apache/incubator-tvm/blob/939a42b4e976a41e8513b720421d3c3678493715/python/tvm/relay/frontend/pytorch.py#L2146-L2150

Since default_type is float32, from that point on TVM thinks nodes_offset is a float32 tensor. That results in the unnecessary cast.

This is the same problem I mentioned earlier about conv weights, biases, etc. But since they are float32 anyway, the default type we pick for them doesn't do any harm. Your use case is the first one we have met where one of the parameters to a model is indeed an integer tensor.

Even though the dtype of the nodes_offset tensor is int64, when we traverse the TorchScript graph and get the input type of the rhs of the add node via inputs[1].type().scalarType(), it returns None. Surprisingly, doing an explicit self.nodes_offset.long() in the Torch module does make the scalarType() of nodes_offset above int64. Note that this is a PyTorch problem, so I don't consider this a TVM bug.
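
For anyone who wants to poke at this directly, a small sketch of querying the TorchScript-reported type (a toy module with names of my own; the exact behavior may vary across PyTorch versions):

import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.offset = torch.nn.Parameter(torch.LongTensor([1, 2, 3]), requires_grad=False)

    def forward(self, x):
        return x + self.offset

traced = torch.jit.trace(M(), torch.zeros(3, dtype=torch.int64))
for node in traced.graph.nodes():
    if node.kind() == "aten::add":
        rhs = list(node.inputs())[1]
        # For a bare prim::GetAttr output this may come back as None,
        # even though the underlying parameter tensor is int64.
        print(rhs.type().scalarType())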

I'll look a bit more into what's going on with the typing of parameter tensors at the TorchScript level. We might end up solving the "Untyped Tensor" warning problem for parameters.

@masahi
Member

masahi commented Aug 20, 2020

By comparing the two TorchScript IRs, with and without self.nodes_offset.long(), I now understand what's going on.

With the explicit cast, the rhs input of aten::add becomes the output of aten::to(...), which does the cast to int64. So when we query the type of the rhs input, it correctly returns int64.

  %29 : Tensor = prim::GetAttr[name="nodes_offset"](%3)
  %indexes.1 : Long(1:3, 3:1) = aten::to(%29, %23, %22, %22, %21)
  ...
  %indexes.5 : Long(3:3, 3:1) = aten::add(%indexes.4, %indexes.1, %17)

Without the explicit cast, the rhs input of aten::add is the output of GetAttr (%29 below). It seems TorchScript cannot return the right type when asked for the type of an output of GetAttr, so it simply returns None in this case.

  %29 : Tensor = prim::GetAttr[name="nodes_offset"](%3)
  ...
  %indexes.5 : Long(3:3, 3:1) = aten::add(%indexes.4, %29, %19)

@interesaaat
Contributor Author

Interesting. Do you suggest opening an issue against PyTorch? It looks to me like there is no clear solution (besides adding the explicit cast), because this is more of a design problem with TorchScript. Although, shouldn't tracing help? If we run records through the ops, we should be able to detect that the tensor is indeed of type long.

@masahi
Member

masahi commented Aug 20, 2020

I don't know if the Torch people would consider this a problem. Anyway, I've just opened #6311 to work around the dtype of GetAttr outputs. With that change, the explicit cast is no longer necessary. The test case I added there should demonstrate the problem you are having.

@interesaaat
Contributor Author

Fantastic! This is great, thanks. I will test this and let you know, but from your test case it looks like it solves our problem. Thanks!

@interesaaat
Contributor Author

It worked: no more casts are needed and the warnings disappeared.

@masahi
Member

masahi commented Aug 21, 2020

#6311 is merged

@masahi masahi closed this as completed Aug 21, 2020