Inefficient codegen when accessing a vector with literal indices #50759

Closed
jethrogb opened this issue May 14, 2018 · 7 comments
Labels
- A-codegen (Area: Code generation)
- C-enhancement (Category: An issue proposing an enhancement or a PR with one)
- C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
- I-slow (Issue: Problems and improvements with respect to performance of generated code)
- T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)

Comments

@jethrogb (Contributor) commented May 14, 2018

use std::fs::{read, write};

fn main() -> Result<(), std::io::Error> {
    let h = read("/tmp/test")?;
    let b = [
        h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7],
        h[8], h[9], h[10], h[11], h[12], h[13], h[14], h[15],
    ];
    write("/tmp/test", &b)?;
    Ok(())
}

The let b = ... statement gets compiled to (1.26.0 stable, release mode):

	testq	%rdx, %rdx
	je	.LBB13_109
	cmpq	$1, %rdx
	je	.LBB13_110
	cmpq	$2, %rdx
	jbe	.LBB13_111
	cmpq	$3, %rdx
	je	.LBB13_112
	cmpq	$4, %rdx
	jbe	.LBB13_113
	cmpq	$5, %rdx
	je	.LBB13_114
	cmpq	$6, %rdx
	jbe	.LBB13_115
	cmpq	$7, %rdx
	je	.LBB13_116
	cmpq	$8, %rdx
	jbe	.LBB13_117
	cmpq	$9, %rdx
	je	.LBB13_118
	cmpq	$10, %rdx
	jbe	.LBB13_119
	cmpq	$11, %rdx
	je	.LBB13_120
	cmpq	$12, %rdx
	jbe	.LBB13_121
	cmpq	$13, %rdx
	je	.LBB13_122
	cmpq	$14, %rdx
	jbe	.LBB13_123
	movq	%rbp, 56(%rsp)
	cmpq	$15, %rdx
	je	.LBB13_124
	movq	208(%rsp), %r12
	movb	(%r12), %al
	movb	1(%r12), %cl
	movb	2(%r12), %dl
	movb	3(%r12), %bl
	movb	4(%r12), %sil
	movb	5(%r12), %dil
	movb	6(%r12), %r8b
	movb	7(%r12), %r9b
	movb	8(%r12), %r10b
	movb	9(%r12), %r11b
	movb	10(%r12), %bpl
	movb	11(%r12), %r14b
	movb	12(%r12), %r15b
	movb	13(%r12), %r13b
	movq	%r12, 128(%rsp)
	movzwl	14(%r12), %r12d
	movb	%al, 32(%rsp)
	movb	%cl, 33(%rsp)
	movb	%dl, 34(%rsp)
	movb	%bl, 35(%rsp)
	movb	%sil, 36(%rsp)
	movb	%dil, 37(%rsp)
	movb	%r8b, 38(%rsp)
	movb	%r9b, 39(%rsp)
	movb	%r10b, 40(%rsp)
	movb	%r11b, 41(%rsp)
	movb	%bpl, 42(%rsp)
	movb	%r14b, 43(%rsp)
	movb	%r15b, 44(%rsp)
	movb	%r13b, 45(%rsp)
	movw	%r12w, 46(%rsp)

I was really hoping the optimizer would be able to do a single length check followed by the equivalent of a memcpy instead.

@sanxiyn added the I-slow label on May 15, 2018
@JustAPerson (Contributor) commented May 15, 2018

There's not much that can be done, because Rust is preserving the semantics that, depending on which byte you access, you'll get a different panic. If you add an assert!(h.len() > 15) before the accesses, it will optimize as you expect.

100001cfd:	movq	%rdi, %r15         ; save main() Result out param
100001d00:	leaq	307925(%rip), %rsi ; load str constant
100001d07:	leaq	-96(%rbp), %rbx    ; create read() out param
100001d0b:	movl	$9, %edx           ; length of str constant
100001d10:	movq	%rbx, %rdi         ; load read() out param
100001d13:	callq	-1544 <__ZN3std2fs4read17h5248aa1330795a98E>
100001d18:	cmpq	$1, -96(%rbp)      ; check Result discriminant
100001d1d:	jne	20 <__ZN6_507594main17hb99f5070155b7525E+0x43>
100001d1f:	movq	8(%rbx), %rax      ; move Error case into out param
100001d23:	movq	16(%rbx), %rcx
100001d27:	movq	%rcx, 8(%r15)
100001d2b:	movq	%rax, (%r15)
100001d2e:	jmp	149 <__ZN6_507594main17hb99f5070155b7525E+0xD8>
100001d33:	movq	24(%rbx), %rax     ; move Vec into local stack variable h
100001d37:	movq	%rax, -32(%rbp)
100001d3b:	movq	8(%rbx), %rax
100001d3f:	movq	16(%rbx), %rcx
100001d43:	movq	%rcx, -40(%rbp)
100001d47:	movq	%rax, -48(%rbp)
100001d4b:	cmpq	$15, -32(%rbp)     ; assert length of vec (be = below/eq)
100001d50:	jbe	128 <__ZN6_507594main17hb99f5070155b7525E+0xE6>
100001d56:	movq	-48(%rbp), %r14    ; get Vec ptr
100001d5a:	movq	(%r14), %rax       ; copy 16 bytes
100001d5d:	movq	8(%r14), %rcx
100001d61:	movq	%rax, -112(%rbp)
100001d65:	movq	%rcx, -104(%rbp)
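
For completeness, a minimal sketch of the hinted version (the original reproduction with only the assert added):

use std::fs::{read, write};

fn main() -> Result<(), std::io::Error> {
    let h = read("/tmp/test")?;
    // A single up-front length check; the 16 per-index checks below are then
    // provably dead and LLVM can lower the copy to a plain 16-byte move.
    assert!(h.len() > 15);
    let b = [
        h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7],
        h[8], h[9], h[10], h[11], h[12], h[13], h[14], h[15],
    ];
    write("/tmp/test", &b)?;
    Ok(())
}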

@ishitatsuyuki (Contributor)

I'd also say that this is working as intended: given a short h, if you don't assert beforehand, the code can panic at different points, with a different panic message depending on h.len().

@jonas-schievink (Contributor)

I agree that this is somewhat expected, but LLVM could nonetheless do better.

What you get for this code is essentially a long list of ifs that are only rarely taken (and LLVM should know that the branches are rarely taken, since the panic machinery is #[cold]). It could then introduce a fast path by checking len > 15 at the start, skipping all the unlikely branches since they can no longer apply. I have no clue whether there's an optimization that does this or what it would be called, though. If there is, it's evidently not a default pass.
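
To make the proposed transformation concrete, here is a source-level sketch of the shape such a pass might produce (the function name copy16 is made up for illustration; the pass itself would operate on LLVM IR, not Rust source):

// Source-level sketch of the hypothetical fast-path transformation.
fn copy16(h: &[u8]) -> [u8; 16] {
    if h.len() > 15 {
        // Fast path: one check proves every constant index 0..=15 is in
        // bounds, so the copy lowers to a plain 16-byte load/store.
        let mut b = [0u8; 16];
        b.copy_from_slice(&h[..16]);
        b
    } else {
        // Slow path: keep the individual checks so a panic still reports
        // the exact out-of-bounds index.
        [
            h[0], h[1], h[2], h[3], h[4], h[5], h[6], h[7],
            h[8], h[9], h[10], h[11], h[12], h[13], h[14], h[15],
        ]
    }
}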

@hanna-kruppe (Contributor)

> What you get for this code is essentially a long list of ifs that are only rarely taken (and LLVM should know that the branches are rarely taken, since the panic machinery is #[cold]). It could then introduce a fast path by checking len > 15 at the start, skipping all the unlikely branches since they can no longer apply.

Interesting idea, but it seems a bit difficult to generalize this even a little without risking regressions. While each of the branches is unlikely individually, they might be statistically independent for all LLVM knows, and with a large number of independent unlikely branches it quickly becomes rather likely that at least one of them is taken. For the specific case of a bunch of constants checked against a common upper bound, you could pick the maximum of the constants and reason that, since max_const > bound is unlikely, so are the other const-vs-bound checks. However, that reasoning should take into account the probability of the max_const > bound check being reached in the first place, and that gets really complicated if other control flow is intermingled with the bounds checks. And it would still be very specific to bounds checking with constant indices.
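
To put rough numbers on the independence concern (the figures here are illustrative, not measurements): with 16 independent branches each taken with probability 1%, the chance that at least one fires is 1 - 0.99^16 ≈ 14.9%, so the fact that every branch is individually unlikely does not by itself make the combined slow path negligible.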

@XAMPPRocky added the C-enhancement, T-compiler, and A-codegen labels on Oct 2, 2018
@steveklabnik (Member)

Today's nightly gives:

        movzx   eax, word ptr [rbx + 14]
        mov     word ptr [rsp + 14], ax
        mov     al, byte ptr [rsp + 13]
        mov     byte ptr [rsp + 16], al
        mov     byte ptr [rsp + 17], cl
        mov     byte ptr [rsp + 18], dl
        mov     byte ptr [rsp + 19], sil
        mov     byte ptr [rsp + 20], dil
        mov     byte ptr [rsp + 21], r8b
        mov     byte ptr [rsp + 22], r9b
        mov     byte ptr [rsp + 23], r10b
        mov     byte ptr [rsp + 24], r11b
        mov     byte ptr [rsp + 25], bpl
        mov     byte ptr [rsp + 26], r12b
        mov     byte ptr [rsp + 27], r13b
        mov     byte ptr [rsp + 28], r14b
        mov     byte ptr [rsp + 29], r15b
        movzx   eax, word ptr [rsp + 14]
        mov     word ptr [rsp + 30], ax

https://godbolt.org/z/bStGNJ

@hanna-kruppe (Contributor)

No real difference as far as I can tell. It still has the 16 separate bounds checks at the start.

@workingjubilee added the C-optimization label on Oct 8, 2023
@Mark-Simulacrum (Member)

My sense is that this isn't likely to lead to changes. The only proposed way of improving the generated asm is to inject a length comparison that skips the individual comparisons on the happy path, but as mentioned, it's not clear how LLVM or rustc could gather the information needed to conclude that this is the right thing to do in this particular case. Users also have an easy way (a manual assert!) to get LLVM to elide the individual bounds checks, which seems preferable to depending on an inherently flaky optimization for presumably critical code (and if the code isn't critical, the predictable branches are probably close to free).

I'm going to go ahead and close.
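
(As an aside, one single-check idiom in this spirit; the helper name and the Option-returning signature here are illustrative, not from the thread:)

use std::convert::TryInto;

// One bounds check via `get(..16)`; the slice-to-array conversion then always
// succeeds, so the whole body compiles to a length test plus a 16-byte copy.
fn first_16(h: &[u8]) -> Option<[u8; 16]> {
    h.get(..16).map(|s| s.try_into().unwrap())
}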

@Mark-Simulacrum closed this as not planned on Jan 16, 2024