How to compute MACs or FLOPs of mamba #110

wangmou21 · 2024-01-17T09:52:46Z

How to compute MACs or FLOPs of mamba?

MzeroMiko · 2024-01-30T09:53:17Z

We calc FLOPs based on the ref code, though it is very different from the real speed in practise.

def flops_selective_scan_ref(B=1, L=256, D=768, N=16, with_D=True, with_Z=False, with_Group=True, with_complex=False):
    """
    u: r(B D L)
    delta: r(B D L)
    A: r(D N)
    B: r(B N L)
    C: r(B N L)
    D: r(D)
    z: r(B D L)
    delta_bias: r(D), fp32
    
    ignores:
        [.float(), +, .softplus, .shape, new_zeros, repeat, stack, to(dtype), silu] 
    """
    import numpy as np
    
    # fvcore.nn.jit_handles
    def get_flops_einsum(input_shapes, equation):
        np_arrs = [np.zeros(s) for s in input_shapes]
        optim = np.einsum_path(equation, *np_arrs, optimize="optimal")[1]
        for line in optim.split("\n"):
            if "optimized flop" in line.lower():
                # divided by 2 because we count MAC (multiply-add counted as one flop)
                flop = float(np.floor(float(line.split(":")[-1]) / 2))
                return flop
    

    assert not with_complex

    flops = 0 # below code flops = 0
    if False:
        ...
        """
        dtype_in = u.dtype
        u = u.float()
        delta = delta.float()
        if delta_bias is not None:
            delta = delta + delta_bias[..., None].float()
        if delta_softplus:
            delta = F.softplus(delta)
        batch, dim, dstate = u.shape[0], A.shape[0], A.shape[1]
        is_variable_B = B.dim() >= 3
        is_variable_C = C.dim() >= 3
        if A.is_complex():
            if is_variable_B:
                B = torch.view_as_complex(rearrange(B.float(), "... (L two) -> ... L two", two=2))
            if is_variable_C:
                C = torch.view_as_complex(rearrange(C.float(), "... (L two) -> ... L two", two=2))
        else:
            B = B.float()
            C = C.float()
        x = A.new_zeros((batch, dim, dstate))
        ys = []
        """

    flops += get_flops_einsum([[B, D, L], [D, N]], "bdl,dn->bdln")
    if with_Group:
        flops += get_flops_einsum([[B, D, L], [B, N, L], [B, D, L]], "bdl,bnl,bdl->bdln")
    else:
        flops += get_flops_einsum([[B, D, L], [B, D, N, L], [B, D, L]], "bdl,bdnl,bdl->bdln")
    if False:
        ...
        """
        deltaA = torch.exp(torch.einsum('bdl,dn->bdln', delta, A))
        if not is_variable_B:
            deltaB_u = torch.einsum('bdl,dn,bdl->bdln', delta, B, u)
        else:
            if B.dim() == 3:
                deltaB_u = torch.einsum('bdl,bnl,bdl->bdln', delta, B, u)
            else:
                B = repeat(B, "B G N L -> B (G H) N L", H=dim // B.shape[1])
                deltaB_u = torch.einsum('bdl,bdnl,bdl->bdln', delta, B, u)
        if is_variable_C and C.dim() == 4:
            C = repeat(C, "B G N L -> B (G H) N L", H=dim // C.shape[1])
        last_state = None
        """
    
    in_for_flops = B * D * N   
    if with_Group:
        in_for_flops += get_flops_einsum([[B, D, N], [B, D, N]], "bdn,bdn->bd")
    else:
        in_for_flops += get_flops_einsum([[B, D, N], [B, N]], "bdn,bn->bd")
    flops += L * in_for_flops 
    if False:
        ...
        """
        for i in range(u.shape[2]):
            x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
            if not is_variable_C:
                y = torch.einsum('bdn,dn->bd', x, C)
            else:
                if C.dim() == 3:
                    y = torch.einsum('bdn,bn->bd', x, C[:, :, i])
                else:
                    y = torch.einsum('bdn,bdn->bd', x, C[:, :, :, i])
            if i == u.shape[2] - 1:
                last_state = x
            if y.is_complex():
                y = y.real * 2
            ys.append(y)
        y = torch.stack(ys, dim=2) # (batch dim L)
        """

    if with_D:
        flops += B * D * L
    if with_Z:
        flops += B * D * L
    if False:
        ...
        """
        out = y if D is None else y + u * rearrange(D, "d -> d 1")
        if z is not None:
            out = out * F.silu(z)
        out = out.to(dtype=dtype_in)
        """
    
    return flops


def selective_scan_flop_jit(inputs, outputs):
    # xs, dts, As, Bs, Cs, Ds (skip), z (skip), dt_projs_bias (skip)
    assert inputs[0].debugName().startswith("xs") # (B, D, L)
    assert inputs[2].debugName().startswith("As") # (D, N)
    assert inputs[3].debugName().startswith("Bs") # (D, N)
    with_Group = len(inputs[3].type().sizes()) == 4
    with_D = inputs[5].debugName().startswith("Ds")
    if not with_D:
        with_z = inputs[5].debugName().startswith("z")
    else:
        with_z = inputs[6].debugName().startswith("z")
    B, D, L = inputs[0].type().sizes()
    N = inputs[2].type().sizes()[1]
    flops = flops_selective_scan_ref(B=B, L=L, D=D, N=N, with_D=with_D, with_Z=with_z, with_Group=with_Group)
    return flops

albertfgu · 2024-01-31T16:32:27Z

The formula we used is 9 * d_state * d_model (times batch size times sequence length). This is for a forward pass, so triple that for forward + backward pass.
Note that these are the flops on top of the standard 2 * parameters * tokens FLOP count incurred from linear layers (or 6 * D^2 * tokens total for forward + backward).

This is a brief explanation:

Note that the cost of computing the input-dependent dt/B/C is baked into the linear layer FLOP counts above

We ignore batch and d_model in the calculations below, since it’s all trivially parallelized over these dimensions.

The $Bx$ and $Ch$ calculations have $2LN$ (2 * seqlen * d_state) mults and $LN$ adds (adds are for the $C$ part only)

Remaining flops are associative scan on $N$ (d_state) independent recurrences.

$2L$ associative operations
SSM op is (a1, b1) o (a2, b2) = (a1a2, a2b1 + b2)
two multiply and one add = 3 FLOPs per associative operation

$2L * 3 = 6L$ total

Summing these gives the $9LN$.

MzeroMiko · 2024-02-01T04:34:01Z

2L associative operations

Thank you for your quick reply. Can you explain that why is there 2*L associative operations, but not L?

albertfgu · 2024-02-01T06:39:36Z

If you look at the algorithm for associative scan that's how it works. See https://en.wikipedia.org/wiki/Prefix_sum for example

Also note that the above is not accounting for the expansion factor of the Mamba block. In other words the number of channels of the selective SSM scan is 2*d_model

MzeroMiko · 2024-02-01T07:03:20Z

Many thanks. I think I've got the answer.

llmexperiment · 2024-02-28T04:05:38Z

We calc FLOPs based on the ref code, though it is very different from the real speed in practise.

def flops_selective_scan_ref(B=1, L=256, D=768, N=16, with_D=True, with_Z=False, with_Group=True, with_complex=False):
    """
    u: r(B D L)
    delta: r(B D L)
    A: r(D N)
    B: r(B N L)
    C: r(B N L)
    D: r(D)
    z: r(B D L)
    delta_bias: r(D), fp32
    
    ignores:
        [.float(), +, .softplus, .shape, new_zeros, repeat, stack, to(dtype), silu] 
    """
    import numpy as np
    
    # fvcore.nn.jit_handles
    def get_flops_einsum(input_shapes, equation):
        np_arrs = [np.zeros(s) for s in input_shapes]
        optim = np.einsum_path(equation, *np_arrs, optimize="optimal")[1]
        for line in optim.split("\n"):
            if "optimized flop" in line.lower():
                # divided by 2 because we count MAC (multiply-add counted as one flop)
                flop = float(np.floor(float(line.split(":")[-1]) / 2))
                return flop
    

    assert not with_complex

    flops = 0 # below code flops = 0
    if False:
        ...
        """
        dtype_in = u.dtype
        u = u.float()
        delta = delta.float()
        if delta_bias is not None:
            delta = delta + delta_bias[..., None].float()
        if delta_softplus:
            delta = F.softplus(delta)
        batch, dim, dstate = u.shape[0], A.shape[0], A.shape[1]
        is_variable_B = B.dim() >= 3
        is_variable_C = C.dim() >= 3
        if A.is_complex():
            if is_variable_B:
                B = torch.view_as_complex(rearrange(B.float(), "... (L two) -> ... L two", two=2))
            if is_variable_C:
                C = torch.view_as_complex(rearrange(C.float(), "... (L two) -> ... L two", two=2))
        else:
            B = B.float()
            C = C.float()
        x = A.new_zeros((batch, dim, dstate))
        ys = []
        """

    flops += get_flops_einsum([[B, D, L], [D, N]], "bdl,dn->bdln")
    if with_Group:
        flops += get_flops_einsum([[B, D, L], [B, N, L], [B, D, L]], "bdl,bnl,bdl->bdln")
    else:
        flops += get_flops_einsum([[B, D, L], [B, D, N, L], [B, D, L]], "bdl,bdnl,bdl->bdln")
    if False:
        ...
        """
        deltaA = torch.exp(torch.einsum('bdl,dn->bdln', delta, A))
        if not is_variable_B:
            deltaB_u = torch.einsum('bdl,dn,bdl->bdln', delta, B, u)
        else:
            if B.dim() == 3:
                deltaB_u = torch.einsum('bdl,bnl,bdl->bdln', delta, B, u)
            else:
                B = repeat(B, "B G N L -> B (G H) N L", H=dim // B.shape[1])
                deltaB_u = torch.einsum('bdl,bdnl,bdl->bdln', delta, B, u)
        if is_variable_C and C.dim() == 4:
            C = repeat(C, "B G N L -> B (G H) N L", H=dim // C.shape[1])
        last_state = None
        """
    
    in_for_flops = B * D * N   
    if with_Group:
        in_for_flops += get_flops_einsum([[B, D, N], [B, D, N]], "bdn,bdn->bd")
    else:
        in_for_flops += get_flops_einsum([[B, D, N], [B, N]], "bdn,bn->bd")
    flops += L * in_for_flops 
    if False:
        ...
        """
        for i in range(u.shape[2]):
            x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
            if not is_variable_C:
                y = torch.einsum('bdn,dn->bd', x, C)
            else:
                if C.dim() == 3:
                    y = torch.einsum('bdn,bn->bd', x, C[:, :, i])
                else:
                    y = torch.einsum('bdn,bdn->bd', x, C[:, :, :, i])
            if i == u.shape[2] - 1:
                last_state = x
            if y.is_complex():
                y = y.real * 2
            ys.append(y)
        y = torch.stack(ys, dim=2) # (batch dim L)
        """

    if with_D:
        flops += B * D * L
    if with_Z:
        flops += B * D * L
    if False:
        ...
        """
        out = y if D is None else y + u * rearrange(D, "d -> d 1")
        if z is not None:
            out = out * F.silu(z)
        out = out.to(dtype=dtype_in)
        """
    
    return flops


def selective_scan_flop_jit(inputs, outputs):
    # xs, dts, As, Bs, Cs, Ds (skip), z (skip), dt_projs_bias (skip)
    assert inputs[0].debugName().startswith("xs") # (B, D, L)
    assert inputs[2].debugName().startswith("As") # (D, N)
    assert inputs[3].debugName().startswith("Bs") # (D, N)
    with_Group = len(inputs[3].type().sizes()) == 4
    with_D = inputs[5].debugName().startswith("Ds")
    if not with_D:
        with_z = inputs[5].debugName().startswith("z")
    else:
        with_z = inputs[6].debugName().startswith("z")
    B, D, L = inputs[0].type().sizes()
    N = inputs[2].type().sizes()[1]
    flops = flops_selective_scan_ref(B=B, L=L, D=D, N=N, with_D=with_D, with_Z=with_z, with_Group=with_Group)
    return flops

Hi @MzeroMiko , did you able to figure out how to calculate FLOPs for selective scan? I used your script, and as you noted it is larger than what I expected?

MzeroMiko · 2024-03-01T02:55:40Z

@llmexperiment
As addressed by @albertfgu , you can just return 9BLDN if you only use the core function of selective_scan.

For full script:

def flops_selective_scan_fn(B=1, L=256, D=768, N=16, with_D=True, with_Z=False, with_Group=True, with_complex=False):
    """
    u: r(B D L)
    delta: r(B D L)
    A: r(D N)
    B: r(B N L)
    C: r(B N L)
    D: r(D)
    z: r(B D L)
    delta_bias: r(D), fp32
    
    ignores:
        [.float(), +, .softplus, .shape, new_zeros, repeat, stack, to(dtype), silu] 
    """
    assert not with_complex 
    # https://github.com/state-spaces/mamba/issues/110
    flops = 9 * B * L * D * N
    if with_D:
        flops += B * D * L
    if with_Z:
        flops += B * D * L    
    return flops

def selective_scan_flop_jit(inputs, outputs):
    print_jit_input_names(inputs)
    B, D, L = inputs[0].type().sizes()
    N = inputs[2].type().sizes()[1]
    flops = flops_selective_scan_fn(B=B, L=L, D=D, N=N, with_D=True, with_Z=False, with_Group=True)
    return flops

radarFudan · 2024-03-01T03:13:12Z

I have a naive follow up question:

If we use associative scan algorithm, in the wiki (https://en.wikipedia.org/wiki/Prefix_sum) for prefix sum it shows that the work-efficient version only takes O(T) while the faster span version takes O(T log T).

May I ask whether the mamba kernel is more similar to the work-efficient version or the fast version. Because it seems to me the fast and slow version all takes forward/backward latency of scale O(\log T). But they require different number of cores to compute and have very different asymptotic growth with respect to sequence length T.

tridao · 2024-03-01T03:37:05Z

We use the work-efficient version (Blelloch's scan).

albertfgu · 2024-03-01T16:44:31Z

In a world with infinite parallelism the lower-span version may be faster by a constant. But GPUs have a lot of different constraints; we actually already max out its parallelism and the bottleneck is compute, so the work-efficient version is much faster.

llmexperiment · 2024-03-03T01:08:48Z

The formula we used is 9 * d_state * d_model (times batch size times sequence length). This is for a forward pass, so triple that for forward + backward pass. Note that these are the flops on top of the standard 2 * parameters * tokens FLOP count incurred from linear layers (or 6 * D^2 * tokens total for forward + backward).

This is a brief explanation:

Note that the cost of computing the input-dependent dt/B/C is baked into the linear layer FLOP counts above

We ignore batch and d_model in the calculations below, since it’s all trivially parallelized over these dimensions.

The Bx and Ch calculations have 2LN (2 * seqlen * d_state) mults and LN adds (adds are for the C part only)

Remaining flops are associative scan on N (d_state) independent recurrences.

2L associative operations

SSM op is (a1, b1) o (a2, b2) = (a1a2, a2b1 + b2)

two multiply and one add = 3 FLOPs per associative operation

2L∗3=6L total

Summing these gives the 9LN.

Hi @albertfgu ,

I find your response very informative, and I am trying to understand deeper. I have two quick questions.

Why there are 2L associative operations?
Does the calculation of 9LN is for scan part only?

albertfgu · 2024-03-03T18:07:51Z

If you look at the computation graph of the prefix sum, there are two passes, each of which has L-1 operations. See the "work-efficient" diagram in https://en.wikipedia.org/wiki/Prefix_sum
Yes

cyjie429 · 2024-03-05T04:33:55Z

If you look at the computation graph of the prefix sum, there are two passes, each of which has L-1 operations. See the "work-efficient" diagram in https://en.wikipedia.org/wiki/Prefix_sum

Yes

How should I calculate the FLOPS for a standard Mamba Layer, and what would be an approximate value? Thank you very much

lth456321 · 2024-04-28T12:00:17Z

We calc FLOPs based on the ref code, though it is very different from the real speed in practise.

def flops_selective_scan_ref(B=1, L=256, D=768, N=16, with_D=True, with_Z=False, with_Group=True, with_complex=False):
    """
    u: r(B D L)
    delta: r(B D L)
    A: r(D N)
    B: r(B N L)
    C: r(B N L)
    D: r(D)
    z: r(B D L)
    delta_bias: r(D), fp32
    
    ignores:
        [.float(), +, .softplus, .shape, new_zeros, repeat, stack, to(dtype), silu] 
    """
    import numpy as np
    
    # fvcore.nn.jit_handles
    def get_flops_einsum(input_shapes, equation):
        np_arrs = [np.zeros(s) for s in input_shapes]
        optim = np.einsum_path(equation, *np_arrs, optimize="optimal")[1]
        for line in optim.split("\n"):
            if "optimized flop" in line.lower():
                # divided by 2 because we count MAC (multiply-add counted as one flop)
                flop = float(np.floor(float(line.split(":")[-1]) / 2))
                return flop
    

    assert not with_complex

    flops = 0 # below code flops = 0
    if False:
        ...
        """
        dtype_in = u.dtype
        u = u.float()
        delta = delta.float()
        if delta_bias is not None:
            delta = delta + delta_bias[..., None].float()
        if delta_softplus:
            delta = F.softplus(delta)
        batch, dim, dstate = u.shape[0], A.shape[0], A.shape[1]
        is_variable_B = B.dim() >= 3
        is_variable_C = C.dim() >= 3
        if A.is_complex():
            if is_variable_B:
                B = torch.view_as_complex(rearrange(B.float(), "... (L two) -> ... L two", two=2))
            if is_variable_C:
                C = torch.view_as_complex(rearrange(C.float(), "... (L two) -> ... L two", two=2))
        else:
            B = B.float()
            C = C.float()
        x = A.new_zeros((batch, dim, dstate))
        ys = []
        """

    flops += get_flops_einsum([[B, D, L], [D, N]], "bdl,dn->bdln")
    if with_Group:
        flops += get_flops_einsum([[B, D, L], [B, N, L], [B, D, L]], "bdl,bnl,bdl->bdln")
    else:
        flops += get_flops_einsum([[B, D, L], [B, D, N, L], [B, D, L]], "bdl,bdnl,bdl->bdln")
    if False:
        ...
        """
        deltaA = torch.exp(torch.einsum('bdl,dn->bdln', delta, A))
        if not is_variable_B:
            deltaB_u = torch.einsum('bdl,dn,bdl->bdln', delta, B, u)
        else:
            if B.dim() == 3:
                deltaB_u = torch.einsum('bdl,bnl,bdl->bdln', delta, B, u)
            else:
                B = repeat(B, "B G N L -> B (G H) N L", H=dim // B.shape[1])
                deltaB_u = torch.einsum('bdl,bdnl,bdl->bdln', delta, B, u)
        if is_variable_C and C.dim() == 4:
            C = repeat(C, "B G N L -> B (G H) N L", H=dim // C.shape[1])
        last_state = None
        """
    
    in_for_flops = B * D * N   
    if with_Group:
        in_for_flops += get_flops_einsum([[B, D, N], [B, D, N]], "bdn,bdn->bd")
    else:
        in_for_flops += get_flops_einsum([[B, D, N], [B, N]], "bdn,bn->bd")
    flops += L * in_for_flops 
    if False:
        ...
        """
        for i in range(u.shape[2]):
            x = deltaA[:, :, i] * x + deltaB_u[:, :, i]
            if not is_variable_C:
                y = torch.einsum('bdn,dn->bd', x, C)
            else:
                if C.dim() == 3:
                    y = torch.einsum('bdn,bn->bd', x, C[:, :, i])
                else:
                    y = torch.einsum('bdn,bdn->bd', x, C[:, :, :, i])
            if i == u.shape[2] - 1:
                last_state = x
            if y.is_complex():
                y = y.real * 2
            ys.append(y)
        y = torch.stack(ys, dim=2) # (batch dim L)
        """

    if with_D:
        flops += B * D * L
    if with_Z:
        flops += B * D * L
    if False:
        ...
        """
        out = y if D is None else y + u * rearrange(D, "d -> d 1")
        if z is not None:
            out = out * F.silu(z)
        out = out.to(dtype=dtype_in)
        """
    
    return flops


def selective_scan_flop_jit(inputs, outputs):
    # xs, dts, As, Bs, Cs, Ds (skip), z (skip), dt_projs_bias (skip)
    assert inputs[0].debugName().startswith("xs") # (B, D, L)
    assert inputs[2].debugName().startswith("As") # (D, N)
    assert inputs[3].debugName().startswith("Bs") # (D, N)
    with_Group = len(inputs[3].type().sizes()) == 4
    with_D = inputs[5].debugName().startswith("Ds")
    if not with_D:
        with_z = inputs[5].debugName().startswith("z")
    else:
        with_z = inputs[6].debugName().startswith("z")
    B, D, L = inputs[0].type().sizes()
    N = inputs[2].type().sizes()[1]
    flops = flops_selective_scan_ref(B=B, L=L, D=D, N=N, with_D=with_D, with_Z=with_z, with_Group=with_Group)
    return flops

thanks for your work, can you explain how to use this coumpute FLOPs of mamba with a input[B,L,D]? Thank you

AndssY · 2024-05-01T15:55:53Z

@lth456321
Have you figured out this problem? I am also troubled by this issue.
I try to use it, while got error. Can anyone provide the FLOPs calculation formula for the entire Mamba Block? Including three fully connected layers, a causal conv1d, and some norms？

triton.compiler.errors.CompilationError: at 31:24:    HAS_BIAS: tl.constexpr,
):
    # Map the program id to the row of X and Y it should compute.
    row = tl.program_id(0)
    X += row * stride_x_row
    Y += row * stride_y_row
    if HAS_RESIDUAL:
        RESIDUAL += row * stride_res_row
    if STORE_RESIDUAL_OUT:
        RESIDUAL_OUT += row * stride_res_out_row
    # Compute mean and variance
    cols = tl.arange(0, BLOCK_N)
                        ^
ValueError("arange's arguments must be of type tl.constexpr")

lennart-rth · 2024-06-24T12:21:10Z

@lth456321 Have you figured out this problem? I am also troubled by this issue. I try to use it, while got error. Can anyone provide the FLOPs calculation formula for the entire Mamba Block? Including three fully connected layers, a causal conv1d, and some norms？
triton.compiler.errors.CompilationError: at 31:24:    HAS_BIAS: tl.constexpr,
):
    # Map the program id to the row of X and Y it should compute.
    row = tl.program_id(0)
    X += row * stride_x_row
    Y += row * stride_y_row
    if HAS_RESIDUAL:
        RESIDUAL += row * stride_res_row
    if STORE_RESIDUAL_OUT:
        RESIDUAL_OUT += row * stride_res_out_row
    # Compute mean and variance
    cols = tl.arange(0, BLOCK_N)
                        ^
ValueError("arange's arguments must be of type tl.constexpr")

I had the same error and solved it by changing this:

BLOCK_N = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))

at "mamba/mamba_ssm/ops/triton/layer_norm.py" line 365
to this:

BLOCK_N: tl.constexpr = int(min(MAX_FUSED_SIZE, triton.next_power_of_2(N)))
N = int(N)

During training, BLOCK_N was of <class 'int'> (which worked). But when called by fvcore, it was a torch tensor (which throws an error).
Same for N.
But I haven't found the reason for this, and I have no experience with Triton at all, so this is probably not a good solution.

dumpmemory · 2024-07-31T09:19:31Z

How about flops for mamba2 ? does any one know how to calculate it manually ?

Aristo23333 · 2024-09-18T08:11:30Z

The formula we used is 9 * d_state * d_model (times batch size times sequence length). This is for a forward pass, so triple that for forward + backward pass. Note that these are the flops on top of the standard 2 * parameters * tokens FLOP count incurred from linear layers (or 6 * D^2 * tokens total for forward + backward).

This is a brief explanation:

Note that the cost of computing the input-dependent dt/B/C is baked into the linear layer FLOP counts above

We ignore batch and d_model in the calculations below, since it’s all trivially parallelized over these dimensions.

The B x and C h calculations have 2 L N (2 * seqlen * d_state) mults and L N adds (adds are for the C part only)

Remaining flops are associative scan on N (d_state) independent recurrences.

2
L
associative operations

SSM op is (a1, b1) o (a2, b2) = (a1a2, a2b1 + b2)

two multiply and one add = 3 FLOPs per associative operation

2 L ∗ 3 = 6 L total

Summing these gives the 9 L N .

Dear Author: Thanks for your response, I notice you only consider $Bx$ and $Ch$. How about $Ah_{t-1}$ ? Does multiplying A bring some cost? Thank you!

llmexperiment mentioned this issue Mar 5, 2024

Scan vs. Flash Attention profiling on Figure 8 (a) #218

Open

JiarunLiu mentioned this issue Mar 5, 2024

How to calculate the flops of mamba ? JiarunLiu/Swin-UMamba#4

Closed

exhyy mentioned this issue Mar 26, 2024

Questions about computing FLOPs with fvcore OpenGVLab/VideoMamba#18

Closed

AZZMM mentioned this issue Mar 26, 2024

About Flops badripatro/simba#3

Closed

niuzehai mentioned this issue Apr 20, 2024

thop fails to count the flops and parameters of the custom operator Mamba #303

Open

JiarunLiu mentioned this issue Apr 30, 2024

How do you calculate FLOPs? JiarunLiu/Swin-UMamba#11

Closed

YuHengsss mentioned this issue May 28, 2024

I want to know HOW to calculate the FLOPs of mamba-related modules? YuHengsss/MSVMamba#1

Closed

AndyCao1125 mentioned this issue Jul 29, 2024

Flops calculation NVlabs/MambaVision#21

Closed

alxndrTL mentioned this issue Aug 8, 2024

flops about mamba2 alxndrTL/mamba.py#51

Open

mmm-cc mentioned this issue Sep 15, 2024

About FLOPs EnVision-Research/MTMamba#2

Closed

saarthakk-insitro mentioned this issue Oct 23, 2024

Regarding FLOP calculation in Local ViM hunto/LocalMamba#36

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to compute MACs or FLOPs of mamba #110

How to compute MACs or FLOPs of mamba #110

wangmou21 commented Jan 17, 2024

MzeroMiko commented Jan 30, 2024

albertfgu commented Jan 31, 2024 •

edited

Loading

MzeroMiko commented Feb 1, 2024

albertfgu commented Feb 1, 2024

MzeroMiko commented Feb 1, 2024

llmexperiment commented Feb 28, 2024

MzeroMiko commented Mar 1, 2024

radarFudan commented Mar 1, 2024

tridao commented Mar 1, 2024

albertfgu commented Mar 1, 2024

llmexperiment commented Mar 3, 2024

albertfgu commented Mar 3, 2024

cyjie429 commented Mar 5, 2024

lth456321 commented Apr 28, 2024

AndssY commented May 1, 2024

lennart-rth commented Jun 24, 2024

dumpmemory commented Jul 31, 2024

Aristo23333 commented Sep 18, 2024

How to compute MACs or FLOPs of mamba #110

How to compute MACs or FLOPs of mamba #110

Comments

wangmou21 commented Jan 17, 2024

MzeroMiko commented Jan 30, 2024

albertfgu commented Jan 31, 2024 • edited Loading

MzeroMiko commented Feb 1, 2024

albertfgu commented Feb 1, 2024

MzeroMiko commented Feb 1, 2024

llmexperiment commented Feb 28, 2024

MzeroMiko commented Mar 1, 2024

radarFudan commented Mar 1, 2024

tridao commented Mar 1, 2024

albertfgu commented Mar 1, 2024

llmexperiment commented Mar 3, 2024

albertfgu commented Mar 3, 2024

cyjie429 commented Mar 5, 2024

lth456321 commented Apr 28, 2024

AndssY commented May 1, 2024

lennart-rth commented Jun 24, 2024

dumpmemory commented Jul 31, 2024

Aristo23333 commented Sep 18, 2024

albertfgu commented Jan 31, 2024 •

edited

Loading