
[Bug Report] Binary op with scalar is very slow #13644

Closed · Tracked by #13795
dmakoviichuk-tt opened this issue Oct 9, 2024 · 8 comments
Labels: bug (Something isn't working) · op_cat: eltwise · P1

@dmakoviichuk-tt (Contributor):

Describe the bug
Every time we call a binary op with a scalar, we create a tensor from that scalar and then also call ttnn::repeat:

template <BinaryOpType binary_op_type>
Tensor BinaryOperation<binary_op_type>::invoke(
    uint8_t queue_id,
    const ttnn::Tensor &input_tensor_a,
    const float scalar,
    const std::optional<const DataType> &dtype,
    const std::optional<ttnn::MemoryConfig> &memory_config,
    const std::optional<Tensor> &optional_output_tensor,
    std::optional<unary::FusedActivations> activations,
    std::optional<unary::UnaryWithParam> input_tensor_a_activation) {
    using namespace tt::constants;
    // Cast Float Scalar to a device tensor
    auto host_buffer = owned_buffer::create<::bfloat16>(static_cast<std::size_t>(TILE_HEIGHT * TILE_WIDTH));
    host_buffer[0] = scalar;
    Tensor scalar_tensor_host = Tensor(
        OwnedStorage{host_buffer},
        ttnn::Shape(std::array<std::uint32_t, 2>{1, 1}, std::array<std::uint32_t, 2>{TILE_HEIGHT, TILE_WIDTH}),
        DataType::BFLOAT16,
        Layout::TILE);
    Tensor scalar_tensor_device = scalar_tensor_host.to(input_tensor_a.device());
    // TODO(arakhmati): #7637 pass in memory_config instead of operation::DEFAULT_OUTPUT_MEMORY_CONFIG
    return BinaryOperation::invoke(
        input_tensor_a,
        scalar_tensor_device,
        dtype,
        memory_config,
        optional_output_tensor,
        activations,
        input_tensor_a_activation);
}
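
For reference, a minimal host-side timing sketch (not from the issue) of how the per-call overhead of this path can be measured; the `op` callable is a placeholder standing in for a ttnn scalar binary op on a fixed input tensor:

// Minimal timing harness, illustration only: `op` is a stand-in for a call like
// "multiply this device tensor by 0.5f". With the current implementation every
// invocation also pays for a 32x32 host tile allocation and a host-to-device copy.
#include <chrono>
#include <cstdio>
#include <functional>

double average_us_per_call(const std::function<void()>& op, int iterations = 100) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        op();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(end - start).count() / iterations;
}

int main() {
    auto op = [] { /* placeholder: invoke the scalar binary op here */ };
    std::printf("average per call: %.2f us\n", average_us_per_call(op));
    return 0;
}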

We are using this in the optimizer step for each layer: https://github.com/tenstorrent/TT-Tron/blob/main/sources/ttml/optimizers/sgd.cpp.

SGD performance is 10 times slower than the PyTorch CPU version.
To Reproduce
Just run any binary op with a tensor and a scalar.

Expected behavior
The scalar parameter should be passed as a runtime arg to the program.
We should never create a new tensor on the host for every call.
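
As an illustration of the requested direction only (this is not the tt-metal API), the scalar could travel to the device program as a single packed runtime argument rather than as a materialized 32x32 tile; a self-contained sketch of the packing:

// Sketch only: encode a float scalar as bfloat16 bits inside one uint32_t so it
// could be passed as a kernel runtime argument instead of building a tile tensor.
#include <cstdint>
#include <cstdio>
#include <cstring>

uint32_t pack_scalar_as_runtime_arg(float scalar) {
    uint32_t bits;
    std::memcpy(&bits, &scalar, sizeof(bits));
    return bits >> 16;  // keep sign, exponent, and top 7 mantissa bits (truncated bfloat16)
}

float unpack_runtime_arg(uint32_t arg) {
    const uint32_t bits = arg << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

int main() {
    const float scalar = 0.1f;
    const uint32_t arg = pack_scalar_as_runtime_arg(scalar);
    std::printf("%f -> 0x%08x -> %f\n", scalar, static_cast<unsigned>(arg), unpack_runtime_arg(arg));
    return 0;
}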

Additional context
@eyonland I assigned this ticket to you as the elementwise owner.
My expectation is that you can drive it with the LLK team and make sure they and your team can add the needed changes at both the metal and ttnn levels. If you cannot do it for some reason, please let me know and I'll find a new owner.
This issue significantly reduces the performance of our training code.

dmakoviichuk-tt added the bug (Something isn't working) and P0 labels on Oct 9, 2024
@umadevimcw (Contributor):

@dmakoviichuk-tt Can you provide details on how you collected the performance numbers?

@eyonland (Contributor):

@dmakoviichuk-tt , my assumption is that you swapped the ttnn function for a direct PyTorch function that runs on the host and saw the overall perf difference. If a small tensor is used many times during training, I can see how CPU branch prediction would be blazingly fast compared to pushing the tensor on and off the device. Did you measure it by the overall training time?

@eyonland (Contributor):

@dmakoviichuk-tt , what was the size of the Tensor?

@dmakoviichuk-tt (Contributor, Author):

@umadevimcw with a timer.
@eyonland it doesn't matter.
As I mentioned, in the optimizer we need to multiply gradients by scalars. Gradients have the shape of the weights, so it could be something like (1, 1, 512, 1024).
But we use these ops not only with gradients; in that case the shape could be (64, 1, 256, 2048), for example.
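
To illustrate why this adds up, here is a plain C++ sketch (std::vector, not ttml code) of the SGD update an optimizer performs: there is a scalar multiply per parameter tensor, so with the current implementation every layer's update pays the host-tile-plus-transfer cost on every training step.

// Illustration only: the shape of a plain SGD update. In ttml's sgd.cpp each
// "tensor * scalar" maps to a ttnn binary op with a scalar, which today builds
// and uploads a host tile per call, once per parameter tensor per step.
#include <cstddef>
#include <vector>

void sgd_step(std::vector<float>& weights, const std::vector<float>& grads, float lr) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
        weights[i] -= lr * grads[i];  // gradient scaled by a scalar, then applied
    }
}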

@dmakoviichuk-tt (Contributor, Author) commented Oct 16, 2024:

@eyonland it is obviously really bad and slow code for a very simple operation like this.
Why ask questions like that?
I've already demonstrated the two problems that make it so slow.

> @dmakoviichuk-tt , my assumption is that you swapped out the ttnn function with a direct pytorch function that runs on host and saw the overall perf difference.

Your assumption is wrong in every possible way. How can I swap something with a PyTorch call if I don't use PyTorch?

Please be respectful to your colleagues. Right now it looks like you are trying to avoid fixing this obvious issue!

@eyonland (Contributor) commented Oct 16, 2024:

Sorry for the misunderstanding here. I was trying to figure out how you measured it originally.

We absolutely should be passing the scalar as a runtime arg and never create a tensor. My time has been stretched thin between this issue and rebuilding the eltwise ops to properly handle broadcasting, given that bcast does not do this adequately and the use of repeat is terrible because we end up making multiple calls.

eyonland assigned yan-zaretskiy and unassigned eyonland on Oct 21, 2024
@jvasilje (Collaborator):

Moving to P1.
This is not a P0 bug; it's a P0 feature / op improvement.
@yan-zaretskiy keep it a priority, but we need to keep P0 bugs labeled properly.

jvasilje added the P1 label and removed the P0 label on Oct 30, 2024
@yan-zaretskiy (Contributor):

@jvasilje We actually merged a PR (#14172) to address this issue.

ct-clmsn pushed a commit to ct-clmsn/tt-metal that referenced this issue Nov 12, 2024