
[Frontend][PyTorch] Cast to Long gets translated into cast to float32 #6300

Closed
interesaaat opened this issue Aug 18, 2020 · 25 comments · Fixed by #6301

Comments

@interesaaat
Contributor

I have the following PyTorch program:

# Look up the feature id used at the current node of each tree, then the
# corresponding feature value for each input row.
tree_nodes = indexes
feature_nodes = torch.index_select(self.features, 0, tree_nodes).view(-1, self.num_trees)
feature_values = torch.gather(x, 1, feature_nodes)

# Look up the threshold and the left/right child ids for the current nodes.
thresholds = torch.index_select(self.thresholds, 0, indexes).view(-1, self.num_trees)
lefts = torch.index_select(self.lefts, 0, indexes).view(-1, self.num_trees)
rights = torch.index_select(self.rights, 0, indexes).view(-1, self.num_trees)

# Take the right child if the feature value is >= the threshold, else the left
# child, then add the per-tree node offset and flatten.
indexes = torch.where(torch.ge(feature_values, thresholds), rights, lefts).long()
indexes = indexes + self.nodes_offset
indexes = indexes.view(-1)

When I compile it with TVM, I get the following interesting trace:

%0 = (%v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %1 = concatenate(%0);
  %2 = reshape(%1, newshape=[-1]);
  %3 = take(%v_operator_map.SklearnLGBMRegressor.features, %2, axis=0);
  %4 = reshape(%3, newshape=[-1, 3]);
  %5 = gather(%input, %4, axis=1);
  %6 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %2, axis=0);
  %7 = reshape(%6, newshape=[-1, 3]);
  %8 = greater_equal(%5, %7);
  %9 = take(%v_operator_map.SklearnLGBMRegressor.rights, %2, axis=0);
  %10 = reshape(%9, newshape=[-1, 3]);
  %11 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %2, axis=0);
  %12 = reshape(%11, newshape=[-1, 3]);
  %13 = where(%8, %10, %12);
  %14 = cast(%13, dtype="float32");
  %15 = add(%14, %v_operator_map.SklearnLGBMRegressor.nodes_offset) an internal invariant was violated while typechecking your program [16:15:28] /Users/mainterl/Develop/tvm/src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. int64) : 
; ;

Apparently, at %14 the cast to long gets translated into a cast to float32.

To reproduce, you can pull this branch, pip install -e .[extra], and run this test file. I can try to generate a minimal running example if it helps.
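
For what it's worth, a minimal sketch of what a standalone repro might look like (a toy module of my own, not the actual Hummingbird code; the exact from_pytorch input handling may differ slightly across TVM versions):

import torch
from tvm import relay

class Tiny(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # int64 parameters, mirroring nodes_offset / lefts / rights above.
        self.offset = torch.nn.Parameter(torch.LongTensor([[1, 2, 3]]), requires_grad=False)
        self.lefts = torch.nn.Parameter(torch.LongTensor([0, 1, 2]), requires_grad=False)
        self.rights = torch.nn.Parameter(torch.LongTensor([2, 1, 0]), requires_grad=False)

    def forward(self, x):
        indexes = torch.where(x >= 0.5, self.rights, self.lefts).long()
        # Adding the int64 result to the int64 parameter is where the
        # spurious float32 cast shows up after conversion.
        return indexes + self.offset

model = Tiny().eval()
example = torch.rand(1, 3)
traced = torch.jit.trace(model, example)
mod, params = relay.frontend.from_pytorch(traced, [("input", list(example.shape))])
print(mod["main"])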

@masahi
Member

masahi commented Aug 19, 2020

Thanks, the cast to long was missing, #6301 should fix this.

@masahi masahi closed this as completed Aug 19, 2020
@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

Thanks @masahi for the super quick PR! However, the problem still persists. I think there is something else going on, because now it casts to long and then to float32 afterwards!

  %0 = (%v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %1 = concatenate(%0);
  %2 = reshape(%1, newshape=[-1]);
  %3 = take(%v_operator_map.SklearnLGBMRegressor.features, %2, axis=0);
  %4 = reshape(%3, newshape=[-1, 3]);
  %5 = gather(%input, %4, axis=1);
  %6 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %2, axis=0);
  %7 = reshape(%6, newshape=[-1, 3]);
  %8 = greater_equal(%5, %7);
  %9 = take(%v_operator_map.SklearnLGBMRegressor.rights, %2, axis=0);
  %10 = reshape(%9, newshape=[-1, 3]);
  %11 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %2, axis=0);
  %12 = reshape(%11, newshape=[-1, 3]);
  %13 = where(%8, %10, %12);
  %14 = cast(%13, dtype="int64");
  %15 = cast(%14, dtype="float32");
  %16 = add(%15, %v_operator_map.SklearnLGBMRegressor.nodes_offset) an internal invariant was violated while typechecking your program [10:41:33] /Users/mainterl/Develop/tvm/src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. int64) : 

Perhaps the add only works over floats and is forcing a cast? No idea.

@interesaaat
Contributor Author

One thing I noticed is that I get a bunch of WARNING:root:Untyped Tensor found, assume it is float32 messages when I compile the PyTorch model using TVM. Is there any way to check which specific tensors create the problems?

@masahi
Member

masahi commented Aug 19, 2020

One thing I noticed is that I get a bunch of WARNING:root:Untyped Tensor found, assume it is float32 messages when I compile the PyTorch model using TVM. Is there any way to check which specific tensors create the problems?

Unfortunately no. This is indeed an annoying issue. Even if you are tracing, we get this warning because it seems (?) the weight and bias parameters from Torch are not typed. So most likely the warnings come from tensors corresponding to the parameters of conv, dense, batch norm, etc.

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

OK, I found a solution, but I think I am hitting some bug. So, in the __init__ of the module I have:

self.nodes_offset = torch.nn.Parameter(torch.LongTensor(nodes_offset), requires_grad=False)

and in the middle of the forward method I am doing:

indexes = indexes + self.nodes_offset

(this is what is producing the two casts and the related exception).

Interestingly, if I do:

def forward(self, x):
        self.nodes_offset.long()

(meaning: forcing a cast at the beginning of the forward method), it works. If instead I add the cast in __init__, I still get the same error. Is this related to tracing?
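
For reference, a sketch of the workaround pattern described above, assuming the cast result is the tensor that the add actually consumes (which matches the TorchScript IR shown later in this thread):

import torch

class Traversal(torch.nn.Module):
    def __init__(self, nodes_offset):
        super().__init__()
        self.nodes_offset = torch.nn.Parameter(torch.LongTensor(nodes_offset), requires_grad=False)

    def forward(self, indexes):
        # Cast explicitly at the start of forward; the add below then consumes
        # the output of an aten::to node, whose scalar type TorchScript can report.
        offset = self.nodes_offset.long()
        return indexes + offset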

@masahi
Member

masahi commented Aug 19, 2020

%v_operator_map.SklearnLGBMRegressor.nodes_offset

Could %v_operator_map.SklearnLGBMRegressor.nodes_offset be float (or incorrectly cast to float by TVM)? It seems that unless both arguments are long, one of them gets cast to float. If both lhs and rhs are long, it should work as expected.
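
As a side note, Relay broadcast ops like add require both operands to have the same dtype (there is no implicit promotion), which is why the typechecker complains. A minimal sketch, not taken from this issue, that should trigger the same check:

import tvm
from tvm import relay

x = relay.var("x", shape=(3, 3), dtype="float32")
y = relay.var("y", shape=(3, 3), dtype="int64")
mod = tvm.IRModule.from_expr(relay.Function([x, y], relay.add(x, y)))
# Type inference fails with a "Check failed: t0->dtype == t1->dtype
# (float32 vs. int64)" error, like the one in the trace above.
mod = relay.transform.InferType()(mod)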

@masahi
Member

masahi commented Aug 19, 2020

%v_operator_map.SklearnLGBMRegressor.nodes_offset sounds like an argument to the Relay function. We may be making an assumption that all arguments are float.

Can you check the signature of the translated function and see if %v_operator_map.SklearnLGBMRegressor.nodes_offset is typed as float? You can use:

print(mod["main"])

@interesaaat
Contributor Author

Sure! Any suggestions on how to do that? (I am still getting familiar with TVM, to some extent :) )

@masahi
Member

masahi commented Aug 19, 2020

Yes, see my updated comment.

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

%v_operator_map.SklearnLGBMRegressor.nodes_offset: Tensor[(1, 3), int64]

So it looks like it gets it right.

@masahi
Member

masahi commented Aug 19, 2020

Yeah, the error also says Check failed: t0->dtype == t1->dtype (float32 vs. int64).

@interesaaat
Contributor Author

This is the full log, just in case:

WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
WARNING:root:Untyped Tensor found, assume it is float32
fn (%v_operator_map.SklearnLGBMRegressor.values: Tensor[(15, 1), float32], %input: Tensor[(3, 2), float32], %v_operator_map.SklearnLGBMRegressor.features: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.nodes_offset: Tensor[(1, 3), int64], %v_operator_map.SklearnLGBMRegressor.thresholds: Tensor[(15), float32], %v_operator_map.SklearnLGBMRegressor.rights: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.lefts: Tensor[(15), int64]) -> Tensor[(3, 1), float32] {
  %0 = cast(%v_operator_map.SklearnLGBMRegressor.nodes_offset, dtype="int64") /* ty=Tensor[(1, 3), int64] */;
  %1 = (%0, %0, %0);
  %2 = concatenate(%1) /* ty=Tensor[(3, 3), int64] */;
  %3 = reshape(%2, newshape=[-1]) /* ty=Tensor[(9), int64] */;
  %4 = take(%v_operator_map.SklearnLGBMRegressor.features, %3, axis=0) /* ty=Tensor[(9), int64] */;
  %5 = reshape(%4, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %6 = gather(%input, %5, axis=1) /* ty=Tensor[(3, 3), float32] */;
  %7 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %3, axis=0) /* ty=Tensor[(9), float32] */;
  %8 = reshape(%7, newshape=[-1, 3]) /* ty=Tensor[(3, 3), float32] */;
  %9 = greater_equal(%6, %8) /* ty=Tensor[(3, 3), bool] */;
  %10 = take(%v_operator_map.SklearnLGBMRegressor.rights, %3, axis=0) /* ty=Tensor[(9), int64] */;
  %11 = reshape(%10, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %12 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %3, axis=0) /* ty=Tensor[(9), int64] */;
  %13 = reshape(%12, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %14 = where(%9, %11, %13) /* ty=Tensor[(3, 3), int64] */;
  %15 = cast(%14, dtype="int64") /* ty=Tensor[(3, 3), int64] */;
  %16 = add(%15, %0) /* ty=Tensor[(3, 3), int64] */;
  %17 = reshape(%16, newshape=[-1]) /* ty=Tensor[(9), int64] */;
  %18 = take(%v_operator_map.SklearnLGBMRegressor.features, %17, axis=0) /* ty=Tensor[(9), int64] */;
  %19 = reshape(%18, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %20 = gather(%input, %19, axis=1) /* ty=Tensor[(3, 3), float32] */;
  %21 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %17, axis=0) /* ty=Tensor[(9), float32] */;
  %22 = reshape(%21, newshape=[-1, 3]) /* ty=Tensor[(3, 3), float32] */;
  %23 = greater_equal(%20, %22) /* ty=Tensor[(3, 3), bool] */;
  %24 = take(%v_operator_map.SklearnLGBMRegressor.rights, %17, axis=0) /* ty=Tensor[(9), int64] */;
  %25 = reshape(%24, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %26 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %17, axis=0) /* ty=Tensor[(9), int64] */;
  %27 = reshape(%26, newshape=[-1, 3]) /* ty=Tensor[(3, 3), int64] */;
  %28 = where(%23, %25, %27) /* ty=Tensor[(3, 3), int64] */;
  %29 = cast(%28, dtype="int64") /* ty=Tensor[(3, 3), int64] */;
  %30 = add(%29, %0) /* ty=Tensor[(3, 3), int64] */;
  %31 = reshape(%30, newshape=[-1]) /* ty=Tensor[(9), int64] */;
  %32 = take(%v_operator_map.SklearnLGBMRegressor.values, %31, axis=0) /* ty=Tensor[(9, 1), float32] */;
  %33 = reshape(%32, newshape=[-1, 3, 1]) /* ty=Tensor[(3, 3, 1), float32] */;
  %34 = reshape(%33, newshape=[-1, 1, 3]) /* ty=Tensor[(3, 1, 3), float32] */;
  sum(%34, axis=[2]) /* ty=Tensor[(3, 1), float32] */
}

@interesaaat
Contributor Author

So even though the tensor is apparently typed correctly, if I remove the initial cast to long the program still fails.

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

This is the log if I remove the explicit cast.

In `main`: 
#[version = "0.0.5"]
fn (%v_operator_map.SklearnLGBMRegressor.values: Tensor[(15, 1), float32], %input: Tensor[(3, 2), float32], %v_operator_map.SklearnLGBMRegressor.features: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.nodes_offset: Tensor[(1, 3), int64], %v_operator_map.SklearnLGBMRegressor.thresholds: Tensor[(15), float32], %v_operator_map.SklearnLGBMRegressor.rights: Tensor[(15), int64], %v_operator_map.SklearnLGBMRegressor.lefts: Tensor[(15), int64]) {
  %0 = (%v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %1 = concatenate(%0);
  %2 = reshape(%1, newshape=[-1]);
  %3 = take(%v_operator_map.SklearnLGBMRegressor.features, %2, axis=0);
  %4 = reshape(%3, newshape=[-1, 3]);
  %5 = gather(%input, %4, axis=1);
  %6 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %2, axis=0);
  %7 = reshape(%6, newshape=[-1, 3]);
  %8 = greater_equal(%5, %7);
  %9 = take(%v_operator_map.SklearnLGBMRegressor.rights, %2, axis=0);
  %10 = reshape(%9, newshape=[-1, 3]);
  %11 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %2, axis=0);
  %12 = reshape(%11, newshape=[-1, 3]);
  %13 = where(%8, %10, %12);
  %14 = cast(%13, dtype="int64");
  %15 = cast(%14, dtype="float32");
  %16 = add(%15, %v_operator_map.SklearnLGBMRegressor.nodes_offset) an internal invariant was violated while typechecking your program [15:30:33] /Develop/tvm/src/relay/op/type_relations.cc:107: Check failed: t0->dtype == t1->dtype (float32 vs. int64) : 
; ;
  %17 = reshape(%16, newshape=[-1]);
  %18 = take(%v_operator_map.SklearnLGBMRegressor.features, %17, axis=0);
  %19 = reshape(%18, newshape=[-1, 3]);
  %20 = gather(%input, %19, axis=1);
  %21 = take(%v_operator_map.SklearnLGBMRegressor.thresholds, %17, axis=0);
  %22 = reshape(%21, newshape=[-1, 3]);
  %23 = greater_equal(%20, %22);
  %24 = take(%v_operator_map.SklearnLGBMRegressor.rights, %17, axis=0);
  %25 = reshape(%24, newshape=[-1, 3]);
  %26 = take(%v_operator_map.SklearnLGBMRegressor.lefts, %17, axis=0);
  %27 = reshape(%26, newshape=[-1, 3]);
  %28 = where(%23, %25, %27) an internal invariant was violated while typechecking your program [15:30:33] /Develop/tvm/src/relay/op/tensor/transform.cc:1646: Check failed: condition != nullptr && x != nullptr && y != nullptr: 
; ;
  %29 = cast(%28, dtype="int64");
  %30 = cast(%29, dtype="float32");
  %31 = add(%30, %v_operator_map.SklearnLGBMRegressor.nodes_offset);
  %32 = reshape(%31, newshape=[-1]);
  %33 = take(%v_operator_map.SklearnLGBMRegressor.values, %32, axis=0);
  %34 = reshape(%33, newshape=[-1, 3, 1]);
  %35 = reshape(%34, newshape=[-1, 1, 3]);
  sum(%35, axis=[2])
}

@masahi
Member

masahi commented Aug 19, 2020

Can you come up with a standalone repro script? I couldn't create a simple test that gives the same error.

@interesaaat
Contributor Author

I can try. I have pushed the code in this Hummingbird commit. The test file to run is tests\test_lightgbm_converter.py. If you pull the commit, pip install -e . should work (assuming you have PyTorch and LightGBM installed). I will also try to put together a minimal script for this, but it might not be that easy.

@masahi masahi reopened this Aug 19, 2020
@masahi
Member

masahi commented Aug 19, 2020

Thanks, I'll try Hummingbird. Compiling decision trees to TVM via PyTorch sounds very interesting!

@interesaaat
Contributor Author

interesaaat commented Aug 19, 2020

Thanks! Before, we were manually translating the PyTorch implementation into Relay, but long term we want to use your PyTorch frontend. Any suggestions, help, or contributions are welcome! In our experiments with the Relay implementation of the models we were getting about a 3x performance improvement over PyTorch, and we hope to continue to see similar gains with the frontend!

@masahi
Member

masahi commented Aug 20, 2020

OK, found the problem. The nodes_offset is indeed giving a warning at

https://github.com/apache/incubator-tvm/blob/939a42b4e976a41e8513b720421d3c3678493715/python/tvm/relay/frontend/pytorch.py#L2146-L2150

Since default_type is float32, from that point on TVM thinks nodes_offset is a float32 tensor. That results in the unnecessary cast.

This is the same problem I mentioned earlier about conv weights, biases, etc. But since they are float32 anyway, the default type we pick for them doesn't do any harm. Your use case is the first one we have met where one of the parameters to a model is indeed an integer tensor.

Even though the dtype of the nodes_offset tensor is int64, when we traverse the TorchScript graph and get the input type of the rhs of the add node via inputs[1].type().scalarType(), it returns None. Surprisingly, doing an explicit self.nodes_offset.long() in the Torch module does make the scalarType() of nodes_offset above int64. Note that this is a PyTorch problem, so I don't consider this a TVM bug.
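
For anyone who wants to poke at this directly, a small sketch of querying the TorchScript-reported type (a toy module with names of my own; the exact behavior may vary across PyTorch versions):

import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.offset = torch.nn.Parameter(torch.LongTensor([1, 2, 3]), requires_grad=False)

    def forward(self, x):
        return x + self.offset

traced = torch.jit.trace(M(), torch.zeros(3, dtype=torch.int64))
for node in traced.graph.nodes():
    if node.kind() == "aten::add":
        rhs = list(node.inputs())[1]
        # For a bare prim::GetAttr output this may come back as None,
        # even though the underlying parameter tensor is int64.
        print(rhs.type().scalarType())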

I'll look a bit more into what's going on with the typing of parameter tensors at the TorchScript level. We might end up solving the "Untyped Tensor" warning problem for parameters.

@masahi
Member

masahi commented Aug 20, 2020

By comparing the two TorchScript IRs, with and without self.nodes_offset.long(), I now understand what's going on.

With the explicit cast, the rhs input of aten::add becomes the output of aten::to(...), which does the cast to int64. So when we query the type of the rhs input, it correctly returns int64.

  %29 : Tensor = prim::GetAttr[name="nodes_offset"](%3)
  %indexes.1 : Long(1:3, 3:1) = aten::to(%29, %23, %22, %22, %21)
  ...
  %indexes.5 : Long(3:3, 3:1) = aten::add(%indexes.4, %indexes.1, %17)

Without the explicit cast, the rhs input of aten::add is the output of GetAttr (%29 below). It seems TorchScript cannot return the right type when asked for the type of an output of GetAttr, so it simply returns None in this case.

  %29 : Tensor = prim::GetAttr[name="nodes_offset"](%3)
  ...
  %indexes.5 : Long(3:3, 3:1) = aten::add(%indexes.4, %29, %19)

@interesaaat
Contributor Author

Interesting. Do you suggest opening an issue against PyTorch? It looks to me like there is no clear solution (besides adding the explicit cast), because this is more of a design problem with TorchScript. Although, shouldn't tracing help? If we run records through the ops, we should be able to detect that the tensor is indeed of type long.

@masahi
Member

masahi commented Aug 20, 2020

I don't know if the Torch people would consider this a problem. Anyway, I've just opened #6311 to work around the dtype of GetAttr outputs. With that change, the explicit cast is no longer necessary. The test case I added there should demonstrate the problem you are having.

@interesaaat
Contributor Author

Fantastic! This is great, thanks. I will test this and let you know, but from your test case it looks like it solves our problem. Thanks!

@interesaaat
Contributor Author

It worked: no more casts are needed and the warnings disappeared.

@masahi
Member

masahi commented Aug 21, 2020

#6311 is merged

@masahi masahi closed this as completed Aug 21, 2020