Performance regressions porting Jetscii from inline assembly to intrinsics #401

Closed
shepmaster opened this issue Mar 25, 2018 · 30 comments

@shepmaster
Member

shepmaster commented Mar 25, 2018

I ported Jetscii to use stdsimd, in the belief that it would be stabilized sooner 😜.

There's a stdsimd branch in case you are interested in following along at home.

The initial port is roughly 60% of the original speed:

 name                                      inline-asm ns/iter     intrinsics ns/iter     diff ns/iter  diff %  speedup
 bench::space_asciichars                   1,023,795 (5121 MB/s)  1,643,905 (3189 MB/s)       620,110  60.57%   x 0.62
 bench::space_asciichars_as_pattern        1,044,517 (5019 MB/s)  1,716,374 (3054 MB/s)       671,857  64.32%   x 0.61
 bench::space_asciichars_macro             993,105 (5279 MB/s)    1,658,466 (3161 MB/s)       665,361  67.00%   x 0.60
 bench::space_find_byte                    3,610,758 (1452 MB/s)  3,526,808 (1486 MB/s)       -83,950  -2.32%   x 1.02
 bench::space_find_char                    633,608 (8274 MB/s)    636,607 (8235 MB/s)           2,999   0.47%   x 1.00
 bench::space_find_char_set                10,600,525 (494 MB/s)  10,561,106 (496 MB/s)       -39,419  -0.37%   x 1.00
 bench::space_find_closure                 10,156,759 (516 MB/s)  10,072,882 (520 MB/s)       -83,877  -0.83%   x 1.01
 bench::space_find_string                  7,506,830 (698 MB/s)   7,507,111 (698 MB/s)            281   0.00%   x 1.00
 bench::substring_as_pattern               1,082,652 (4842 MB/s)  1,496,699 (3502 MB/s)       414,047  38.24%   x 0.72
 bench::substring_find                     1,670,638 (3138 MB/s)  1,687,034 (3107 MB/s)        16,396   0.98%   x 0.99
 bench::substring_with_cached_searcher     997,570 (5255 MB/s)    1,520,424 (3448 MB/s)       522,854  52.41%   x 0.66
 bench::substring_with_created_searcher    1,007,291 (5204 MB/s)  1,533,745 (3418 MB/s)       526,454  52.26%   x 0.66
 bench::xml_delim_3_asciichars             1,014,110 (5169 MB/s)  1,637,181 (3202 MB/s)       623,071  61.44%   x 0.62
 bench::xml_delim_3_asciichars_as_pattern  984,594 (5324 MB/s)    1,628,740 (3218 MB/s)       644,146  65.42%   x 0.60
 bench::xml_delim_3_asciichars_macro       1,023,173 (5124 MB/s)  1,623,991 (3228 MB/s)       600,818  58.72%   x 0.63
 bench::xml_delim_3_find_byte_closure      2,237,287 (2343 MB/s)  2,211,426 (2370 MB/s)       -25,861  -1.16%   x 1.01
 bench::xml_delim_3_find_char_closure      14,359,362 (365 MB/s)  14,204,971 (369 MB/s)      -154,391  -1.08%   x 1.01
 bench::xml_delim_3_find_char_set          17,588,694 (298 MB/s)  17,769,736 (295 MB/s)       181,042   1.03%   x 0.99
 bench::xml_delim_5_asciichars             1,032,586 (5077 MB/s)  1,790,343 (2928 MB/s)       757,757  73.38%   x 0.58
 bench::xml_delim_5_asciichars_as_pattern  1,034,084 (5070 MB/s)  1,612,350 (3251 MB/s)       578,266  55.92%   x 0.64
 bench::xml_delim_5_asciichars_macro       986,644 (5313 MB/s)    1,666,725 (3145 MB/s)       680,081  68.93%   x 0.59
 bench::xml_delim_5_find_byte_closure      2,257,573 (2322 MB/s)  2,408,606 (2176 MB/s)       151,033   6.69%   x 0.94
 bench::xml_delim_5_find_char_closure      8,009,474 (654 MB/s)   7,453,402 (703 MB/s)       -556,072  -6.94%   x 1.07
 bench::xml_delim_5_find_char_set          23,184,513 (226 MB/s)  23,272,996 (225 MB/s)        88,483   0.38%   x 1.00

Takeaways

  • Make sure to use #[target_feature] (and/or -C target-feature)? See the sketch below.
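
A minimal sketch of what that takeaway means in practice (not jetscii's code; the intrinsics below are plain SSE2 ones to keep the sketch short, whereas jetscii's real code wants the SSE 4.2 string instructions, and the paths are the std::arch ones that stdsimd eventually stabilized into). The function touching the intrinsics is compiled with the feature enabled, either per-function via #[target_feature] or crate-wide via RUSTFLAGS="-C target-feature=+sse4.2":

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn first_space(haystack: &[u8]) -> Option<usize> {
    use std::arch::x86_64::*;

    let needle = _mm_set1_epi8(b' ' as i8);
    for (i, chunk) in haystack.chunks_exact(16).enumerate() {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        if mask != 0 {
            return Some(i * 16 + mask.trailing_zeros() as usize);
        }
    }

    // Scalar handling of the final partial chunk.
    let tail_start = haystack.len() - haystack.len() % 16;
    haystack[tail_start..]
        .iter()
        .position(|&b| b == b' ')
        .map(|p| tail_start + p)
}
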
@BurntSushi
Member

Is it possible to look at and compare the codegen?

@shepmaster
Member Author

shepmaster commented Mar 25, 2018

Initial digging showed that I was making calls to the intrinsics:

0x100024cca <+26>: callq  0x100023a80               ; core::coresimd::x86::sse42::_mm_cmpestrm::h56dc3784ab1ef3fc (.llvm.8661840455491943110)

@alexcrichton points out that if you see calls it probably means the target feature business is mixed up: the caller doesn't have the target_feature of the callee.

I added appropriate target_features and got back up to 70%:

 name                                      inline-asm ns/iter     intrinsics-target ns/iter  diff ns/iter  diff %  speedup
 bench::space_asciichars                   1,023,795 (5121 MB/s)  1,323,342 (3961 MB/s)           299,547  29.26%   x 0.77
 bench::space_asciichars_as_pattern        1,044,517 (5019 MB/s)  1,425,165 (3678 MB/s)           380,648  36.44%   x 0.73
 bench::space_asciichars_macro             993,105 (5279 MB/s)    1,549,780 (3382 MB/s)           556,675  56.05%   x 0.64
 bench::space_find_byte                    3,610,758 (1452 MB/s)  3,464,528 (1513 MB/s)          -146,230  -4.05%   x 1.04
 bench::space_find_char                    633,608 (8274 MB/s)    1,217,729 (4305 MB/s)           584,121  92.19%   x 0.52
 bench::space_find_char_set                10,600,525 (494 MB/s)  10,622,009 (493 MB/s)            21,484   0.20%   x 1.00
 bench::space_find_closure                 10,156,759 (516 MB/s)  10,227,875 (512 MB/s)            71,116   0.70%   x 0.99
 bench::space_find_string                  7,506,830 (698 MB/s)   7,206,476 (727 MB/s)           -300,354  -4.00%   x 1.04
 bench::substring_as_pattern               1,082,652 (4842 MB/s)  1,473,809 (3557 MB/s)           391,157  36.13%   x 0.73
 bench::substring_find                     1,670,638 (3138 MB/s)  1,650,991 (3175 MB/s)           -19,647  -1.18%   x 1.01
 bench::substring_with_cached_searcher     997,570 (5255 MB/s)    1,481,513 (3538 MB/s)           483,943  48.51%   x 0.67
 bench::substring_with_created_searcher    1,007,291 (5204 MB/s)  1,431,101 (3663 MB/s)           423,810  42.07%   x 0.70
 bench::xml_delim_3_asciichars             1,014,110 (5169 MB/s)  1,454,010 (3605 MB/s)           439,900  43.38%   x 0.70
 bench::xml_delim_3_asciichars_as_pattern  984,594 (5324 MB/s)    1,360,792 (3852 MB/s)           376,198  38.21%   x 0.72
 bench::xml_delim_3_asciichars_macro       1,023,173 (5124 MB/s)  1,423,571 (3682 MB/s)           400,398  39.13%   x 0.72
 bench::xml_delim_3_find_byte_closure      2,237,287 (2343 MB/s)  2,265,758 (2313 MB/s)            28,471   1.27%   x 0.99
 bench::xml_delim_3_find_char_closure      14,359,362 (365 MB/s)  14,532,484 (360 MB/s)           173,122   1.21%   x 0.99
 bench::xml_delim_3_find_char_set          17,588,694 (298 MB/s)  18,069,001 (290 MB/s)           480,307   2.73%   x 0.97
 bench::xml_delim_5_asciichars             1,032,586 (5077 MB/s)  1,332,554 (3934 MB/s)           299,968  29.05%   x 0.77
 bench::xml_delim_5_asciichars_as_pattern  1,034,084 (5070 MB/s)  1,391,266 (3768 MB/s)           357,182  34.54%   x 0.74
 bench::xml_delim_5_asciichars_macro       986,644 (5313 MB/s)    1,436,379 (3650 MB/s)           449,735  45.58%   x 0.69
 bench::xml_delim_5_find_byte_closure      2,257,573 (2322 MB/s)  2,287,622 (2291 MB/s)            30,049   1.33%   x 0.99
 bench::xml_delim_5_find_char_closure      8,009,474 (654 MB/s)   7,305,064 (717 MB/s)           -704,410  -8.79%   x 1.10
 bench::xml_delim_5_find_char_set          23,184,513 (226 MB/s)  22,793,671 (230 MB/s)          -390,842  -1.69%   x 1.02

@shepmaster
Member Author

The next bit of comparison shows that my trait implementation functions are no longer being inlined into their call sites.

Adding #[inline] to the implementations has no effect.

Adding #[inline(always)] yields this LLVM error:

LLVM ERROR: Cannot select: intrinsic %llvm.x86.sse42.pcmpestrm128

Rewriting the call site to (a) be unsafe, (b) add #[target_feature], and (c) add a shim method that calls the new unsafe method restores most of the performance. However, it feels "leaky" that I need to perform these transformations.
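
Roughly, the shape of that transformation is the following (a hedged sketch with invented names, not jetscii's real types; the worker body is a scalar stand-in so the sketch stays self-contained, where the real code uses the SSE 4.2 string intrinsics):

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
struct ByteSearcher {
    needle: u8,
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
impl ByteSearcher {
    // (c) The safe shim that callers (and the trait impl) keep using.
    #[inline]
    fn find(&self, haystack: &[u8]) -> Option<usize> {
        // Safety: at this point in the port, SSE 4.2 support is assumed to
        // be guaranteed by the build configuration.
        unsafe { self.find_sse42(haystack) }
    }

    // (a) unsafe + (b) #[target_feature]: the body that actually touches
    // the intrinsics lives here, so they can be inlined into it.
    #[inline]
    #[target_feature(enable = "sse4.2")]
    unsafe fn find_sse42(&self, haystack: &[u8]) -> Option<usize> {
        haystack.iter().position(|&b| b == self.needle)
    }
}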

 name                                      inline-asm ns/iter     intrinsics-inline ns/iter  diff ns/iter   diff %  speedup
 bench::space_asciichars                   1,023,795 (5121 MB/s)  912,239 (5747 MB/s)            -111,556  -10.90%   x 1.12
 bench::space_asciichars_as_pattern        1,044,517 (5019 MB/s)  920,070 (5698 MB/s)            -124,447  -11.91%   x 1.14
 bench::space_asciichars_macro             993,105 (5279 MB/s)    895,638 (5853 MB/s)             -97,467   -9.81%   x 1.11
 bench::space_find_byte                    3,610,758 (1452 MB/s)  3,487,791 (1503 MB/s)          -122,967   -3.41%   x 1.04
 bench::space_find_char                    633,608 (8274 MB/s)    652,212 (8038 MB/s)              18,604    2.94%   x 0.97
 bench::space_find_char_set                10,600,525 (494 MB/s)  10,601,290 (494 MB/s)               765    0.01%   x 1.00
 bench::space_find_closure                 10,156,759 (516 MB/s)  10,221,552 (512 MB/s)            64,793    0.64%   x 0.99
 bench::space_find_string                  7,506,830 (698 MB/s)   7,347,271 (713 MB/s)           -159,559   -2.13%   x 1.02
 bench::substring_as_pattern               1,082,652 (4842 MB/s)  945,317 (5546 MB/s)            -137,335  -12.69%   x 1.15
 bench::substring_find                     1,670,638 (3138 MB/s)  1,630,799 (3214 MB/s)           -39,839   -2.38%   x 1.02
 bench::substring_with_cached_searcher     997,570 (5255 MB/s)    866,252 (6052 MB/s)            -131,318  -13.16%   x 1.15
 bench::substring_with_created_searcher    1,007,291 (5204 MB/s)  927,735 (5651 MB/s)             -79,556   -7.90%   x 1.09
 bench::xml_delim_3_asciichars             1,014,110 (5169 MB/s)  910,886 (5755 MB/s)            -103,224  -10.18%   x 1.11
 bench::xml_delim_3_asciichars_as_pattern  984,594 (5324 MB/s)    1,153,064 (4546 MB/s)           168,470   17.11%   x 0.85
 bench::xml_delim_3_asciichars_macro       1,023,173 (5124 MB/s)  1,306,580 (4012 MB/s)           283,407   27.70%   x 0.78
 bench::xml_delim_3_find_byte_closure      2,237,287 (2343 MB/s)  2,261,405 (2318 MB/s)            24,118    1.08%   x 0.99
 bench::xml_delim_3_find_char_closure      14,359,362 (365 MB/s)  14,338,068 (365 MB/s)           -21,294   -0.15%   x 1.00
 bench::xml_delim_3_find_char_set          17,588,694 (298 MB/s)  17,867,957 (293 MB/s)           279,263    1.59%   x 0.98
 bench::xml_delim_5_asciichars             1,032,586 (5077 MB/s)  928,426 (5647 MB/s)            -104,160  -10.09%   x 1.11
 bench::xml_delim_5_asciichars_as_pattern  1,034,084 (5070 MB/s)  873,274 (6003 MB/s)            -160,810  -15.55%   x 1.18
 bench::xml_delim_5_asciichars_macro       986,644 (5313 MB/s)    897,917 (5838 MB/s)             -88,727   -8.99%   x 1.10
 bench::xml_delim_5_find_byte_closure      2,257,573 (2322 MB/s)  2,211,174 (2371 MB/s)           -46,399   -2.06%   x 1.02
 bench::xml_delim_5_find_char_closure      8,009,474 (654 MB/s)   7,353,278 (712 MB/s)           -656,196   -8.19%   x 1.09
 bench::xml_delim_5_find_char_set          23,184,513 (226 MB/s)  23,235,118 (225 MB/s)            50,605    0.22%   x 1.00

Of note is that I forgot to transform all callsites at first and got this wonderful LLVM error:

   Compiling jetscii v0.3.1 (file:///Users/shep/Projects/jetscii)
LLVM ERROR: Cannot select: 0x1169337b8: i32,i32 = X86ISD::PCMPESTRI 0x1169334e0, 0x116933958, 0x116938208, 0x116938888, Constant:i8<12>
  0x1169334e0: v16i8,ch = CopyFromReg 0x1168f3c40, Register:v16i8 %5
    0x116938d68: v16i8 = Register %5
  0x116933958: i32 = AssertZext 0x1169335b0, ValueType:ch:i8
    0x1169335b0: i32,ch = CopyFromReg 0x1168f3c40, Register:i32 %4
      0x1169380d0: i32 = Register %4
  0x116938208: v16i8 = bitcast 0x116938bc8
    0x116938bc8: v2i64,ch = load<LD16[%62](noalias=<0x116871248>,<0x1168711e8>,<0x1168711b8>)> 0x1168f3c40, 0x116938f70, undef:i64
      0x116938f70: i64,ch = CopyFromReg 0x1168f3c40, Register:i64 %16
        0x116938548: i64 = Register %16
      0x1169383a8: i64 = undef
  0x116938888: i32 = truncate 0x116938068
    0x116938068: i64,ch = CopyFromReg 0x1168f3c40, Register:i64 %18
      0x116933bc8: i64 = Register %18
  0x1169386e8: i8 = Constant<12>
In function: _ZN74_$LT$jetscii..Substring$LT$$u27$a$GT$$u20$as$u20$jetscii..DirectSearch$GT$4find17h8d639020997e25d3E

@shepmaster
Member Author

I reran my benchmarks with a slightly less loaded machine and the intrinsics appear to be a net win at the moment:

 name                                      inline-asm2 ns/iter    intrinsics-inline2 ns/iter  diff ns/iter   diff %  speedup
 bench::space_asciichars                   985,646 (5319 MB/s)    886,373 (5914 MB/s)              -99,273  -10.07%   x 1.11
 bench::space_asciichars_as_pattern        972,518 (5391 MB/s)    915,875 (5724 MB/s)              -56,643   -5.82%   x 1.06
 bench::space_asciichars_macro             981,848 (5339 MB/s)    893,722 (5866 MB/s)              -88,126   -8.98%   x 1.10
 bench::space_find_byte                    3,411,612 (1536 MB/s)  3,430,566 (1528 MB/s)             18,954    0.56%   x 0.99
 bench::space_find_char                    631,117 (8307 MB/s)    618,917 (8471 MB/s)              -12,200   -1.93%   x 1.02
 bench::space_find_char_set                10,401,568 (504 MB/s)  10,149,078 (516 MB/s)           -252,490   -2.43%   x 1.02
 bench::space_find_closure                 10,025,366 (522 MB/s)  10,057,682 (521 MB/s)             32,316    0.32%   x 1.00
 bench::space_find_string                  7,452,581 (703 MB/s)   7,484,535 (700 MB/s)              31,954    0.43%   x 1.00
 bench::substring_as_pattern               997,548 (5255 MB/s)    869,230 (6031 MB/s)             -128,318  -12.86%   x 1.15
 bench::substring_find                     1,615,510 (3245 MB/s)  1,658,117 (3161 MB/s)             42,607    2.64%   x 0.97
 bench::substring_with_cached_searcher     975,915 (5372 MB/s)    865,898 (6054 MB/s)             -110,017  -11.27%   x 1.13
 bench::substring_with_created_searcher    954,724 (5491 MB/s)    877,560 (5974 MB/s)              -77,164   -8.08%   x 1.09
 bench::xml_delim_3_asciichars             973,039 (5388 MB/s)    882,884 (5938 MB/s)              -90,155   -9.27%   x 1.10
 bench::xml_delim_3_asciichars_as_pattern  968,508 (5413 MB/s)    923,631 (5676 MB/s)              -44,877   -4.63%   x 1.05
 bench::xml_delim_3_asciichars_macro       982,211 (5337 MB/s)    877,300 (5976 MB/s)             -104,911  -10.68%   x 1.12
 bench::xml_delim_3_find_byte_closure      2,144,315 (2445 MB/s)  2,101,482 (2494 MB/s)            -42,833   -2.00%   x 1.02
 bench::xml_delim_3_find_char_closure      14,054,180 (373 MB/s)  14,105,941 (371 MB/s)             51,761    0.37%   x 1.00
 bench::xml_delim_3_find_char_set          17,150,179 (305 MB/s)  17,139,688 (305 MB/s)            -10,491   -0.06%   x 1.00
 bench::xml_delim_5_asciichars             1,000,874 (5238 MB/s)  855,672 (6127 MB/s)             -145,202  -14.51%   x 1.17
 bench::xml_delim_5_asciichars_as_pattern  1,001,554 (5234 MB/s)  895,429 (5855 MB/s)             -106,125  -10.60%   x 1.12
 bench::xml_delim_5_asciichars_macro       996,100 (5263 MB/s)    855,440 (6128 MB/s)             -140,660  -14.12%   x 1.16
 bench::xml_delim_5_find_byte_closure      2,170,030 (2416 MB/s)  2,177,576 (2407 MB/s)              7,546    0.35%   x 1.00
 bench::xml_delim_5_find_char_closure      7,920,629 (661 MB/s)   7,233,341 (724 MB/s)            -687,288   -8.68%   x 1.10
 bench::xml_delim_5_find_char_set          22,317,881 (234 MB/s)  22,443,399 (233 MB/s)            125,518    0.56%   x 0.99

@shepmaster
Member Author

I do have some outstanding questions / comments:

  • It feels incorrect that adding an #[inline(always)] to a function tagged with target_feature causes an LLVM error.
  • Having a trait that has a fallback non-SIMD implementation and a SIMD implementation feels very idiomatic Rust, but...
  • It's disappointing that I have to make "too many" functions unsafe just to allow for inlining (and thus performance). A 52-line function that previously contained a single unsafe line is now entirely unsafe.

@gnzlbg
Contributor

gnzlbg commented Mar 25, 2018

It feels incorrect that adding an #[inline(always)] to a function tagged with target_feature causes an LLVM error.

This is a known Rust bug. Making a #[target_feature] function #[inline(always)] is ok. Calling it from other functions with compatible target features is also ok, because the function can be inlined into them.

However, functions cannot be inlined across incompatible target features without introducing all sorts of weird undefined behavior. So if you attempt to call an #[inline(always)] function from a function with incompatible target features, well... that sort of LLVM error is what currently gets produced (at best).

It's disappointing that I have to make "too many" functions unsafe just to allow for inlining (and thus performance). A 52-line function that previously contained a single unsafe line is now entirely unsafe.

How were you benchmarking? RUSTFLAGS=-C target-feature=+sse4.2 cargo bench?

@shepmaster
Member Author

shepmaster commented Mar 26, 2018

How were you benchmarking? RUSTFLAGS=-C target-feature=+sse4.2 cargo bench?

No, I was simply running the benchmark (cargo bench --features=unstable).

Rolling back to the version before I tweaked everything for inlining, and running with that -C flag, I see that the performance stays good. Rolling back to before I added #[target_feature] everywhere, I also get the same good performance.

This is both good (yay, I don't have to add as much cruft!) and bad (how am I going to explain to every user and downstream user that they need to compile with "magic flags"?).

@BurntSushi
Member

@shepmaster -C target-feature=+sse4.2 is a compile time flag. It's probably not what you want. My guess is that @gnzlbg is trying to help debug your problems. :-)

@gnzlbg
Contributor

gnzlbg commented Mar 26, 2018

Rolling back to the version before I tweaked everything for inlining, and running with that -C flag, I see that the performance stays good.

So does that mean that using RUSTFLAGS=-C target-feature=+sse4.2 fixes the performance, the issues with inlining, and with unsafe code everywhere? (from the code it looks like it does)

how am I going to explain to every user and downstream user that they need to compile with "magic flags"?

Let's leave that aside for now and first try to come up with a version of the code that uses the intrinsics yet has no problems with inlining or unsafe code. The quickest way there should be the RUSTFLAGS above, so it would be cool if you could confirm that it works properly (no unnecessary unsafe, inlining works, tests pass, etc.).

@alexcrichton
Member

@shepmaster if possible, can you open a dedicated issue for the LLVM error you got? That's definitely a bug in rustc and/or stdsimd; we want to hide those from users at all costs!

@shepmaster
Member Author

So does that mean that using RUSTFLAGS=-C target-feature=+sse4.2 fixes the performance, the issues with inlining, and with unsafe code everywhere? (from the code it looks like it does)

Yes, it does.

can you open a dedicated issue for the LLVM error you got

Done in #404

@gnzlbg
Contributor

gnzlbg commented Mar 27, 2018

@shepmaster

So first I'd like to explain why so many changes are required. Before, if you run jetscii on an x86/x86_64 CPU without SSE4.2, that leads to undefined behavior [0]. Arguably, it is extremely unlikely that somebody will be targeting such a CPU with jetscii nowadays, but currently we have no way of expressing that a crate requires certain features to work properly, beyond failing the build (compile_error!) if SSE4.2 is not enabled at compile time. So if you want to port jetscii to stdsimd with minimal changes, this is probably your best bet:

// crate root
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), not(target_feature = "sse4.2")))]
compile_error!("jetscii x86/x86_64 requires a target with SSE4.2 enabled. Please compile these targets with RUSTFLAGS=\"-C target-feature=+sse4.2\"");

#[inline(always)]
#[target_feature(enable = "sse4.2")]
unsafe fn foo_impl() { /* call intrinsics here */ }

// This is always safe because of the compile_error! above:
#[inline(always)] fn foo() { unsafe { foo_impl() } }

Now, if you want to make jetscii detect at run-time whether the x86 cpu has SSE4.2 and choose different algorithms depending on the answer at run-time then well... that's just going to require more changes to the library. We already talked on IRC about this so I won't repeat that here, but at this point idioms about how to do this nicely are still to be discovered. For jetscii this is hard because its algorithms are stateful, so maybe @BurntSushi can chime in and explain how teddy does it. My only advice is to put each "feature-dependent" algorithm in a different module, coupled with its own state. Then, at a higher level, combine the different "feature-dependent" algorithms of each architecture, doing run-time feature detection. At a still higher level, expose the algorithm of the architecture being targeted using cfg(target_arch). So you end up with this at the arch level:

mod myalgo {
    cfg_if! {
        if #[cfg(any(target_arch = "x86", target_arch = "x86_64"))] {
            mod x86;
            pub use self::x86::myalgo;
        } else if #[cfg(target_arch = "arm")] {
            mod arm;
            pub use self::arm::myalgo;
        } else {
            mod fallback;
            pub use self::fallback::myalgo;
        }
    }
}

and then at the x86 level you end up with:

mod x86 {
    mod sse42;
    #[path = "../fallback.rs"] // reuse fallback here
    mod fallback;

    #[inline]
    pub fn myalgo() {
        if is_x86_feature_detected!("sse4.2") {
            unsafe { sse42::myalgo() }
        } else {
            fallback::myalgo()
        }
    }
}

Obviously, for jetscii this is going to be a bit messier because of the arch/feature dependent state (and also because you want to avoid storing two states at the arch level if the feature is known at compile-time), but using a separation like this might help you keep the chaos under control, in particular if you want to use enums for the state in some archs, but just a struct in another arch or the fallback case where there aren't any relevant features.
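
To make that concrete, here is a hedged sketch (all names invented, with scalar stand-in bodies) of coupling each feature-dependent algorithm with its own state, making the run-time decision once at construction time:

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
mod searcher {
    pub struct Sse42 { needle: u8 }
    pub struct Fallback { needle: u8 }

    impl Sse42 {
        #[target_feature(enable = "sse4.2")]
        unsafe fn find(&self, haystack: &[u8]) -> Option<usize> {
            // Stand-in body; the real one would use the SSE 4.2 intrinsics.
            haystack.iter().position(|&b| b == self.needle)
        }
    }

    impl Fallback {
        fn find(&self, haystack: &[u8]) -> Option<usize> {
            haystack.iter().position(|&b| b == self.needle)
        }
    }

    // The arch-level type owns exactly one of the feature-dependent states.
    pub enum Searcher {
        Sse42(Sse42),
        Fallback(Fallback),
    }

    impl Searcher {
        pub fn new(needle: u8) -> Searcher {
            if is_x86_feature_detected!("sse4.2") {
                Searcher::Sse42(Sse42 { needle })
            } else {
                Searcher::Fallback(Fallback { needle })
            }
        }

        #[inline]
        pub fn find(&self, haystack: &[u8]) -> Option<usize> {
            match *self {
                // Safe: this variant is only constructed after detection.
                Searcher::Sse42(ref s) => unsafe { s.find(haystack) },
                Searcher::Fallback(ref s) => s.find(haystack),
            }
        }
    }
}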


[0]: the number of times one writes unsafe does not really matter much; the only thing that matters is whether the program runs into undefined behavior or not.

@BurntSushi
Member

For jetscii this is hard because its algorithms are stateful, so maybe @BurntSushi can chime in and explain how teddy does it.

The key to how I do things is this type:

/// A builder for SSSE3 empowered vectors.
///
/// This builder represents a receipt that the SSSE3 target feature is enabled
/// on the currently running CPU. Namely, the only way to get a value of this
/// type is if the SSSE3 feature is enabled.
///
/// This type can then be used to build vector types that use SSSE3 features
/// safely.
#[derive(Clone, Copy, Debug)]
pub struct SSSE3VectorBuilder(());

The only way for a consumer outside of this module to get a value with type SSSE3VectorBuilder is to call its constructor:

    /// Create a new SSSE3 vector builder.
    ///
    /// If the SSSE3 feature is not enabled for the current target, then
    /// return `None`.
    pub fn new() -> Option<SSSE3VectorBuilder> {
        if is_x86_feature_detected!("ssse3") {
            Some(SSSE3VectorBuilder(()))
        } else {
            None
        }
    }

And in turn, the only way for this constructor to return a non-None value is if the ssse3 target feature is enabled. What this means is that this type acts as a "receipt" of sorts: if you have it in hand, then you know for a fact that the right CPU target features are enabled.

In that same module, I defined my own vector type (using a macro so that things work on Rust 1.12):

// We define our union with a macro so that our code continues to compile on
// Rust 1.12.
macro_rules! defunion {
    () => {
        /// A u8x16 is a 128-bit vector with 16 single-byte lanes.
        ///
        /// It provides a safe API that uses only SSE2 or SSSE3 instructions.
        /// The only way for callers to construct a value of this type is
        /// through the SSSE3VectorBuilder type, and the only way to get a
        /// SSSE3VectorBuilder is if the `ssse3` target feature is enabled.
        ///
        /// Note that generally speaking, all uses of this type should get
        /// inlined, otherwise you probably have a performance bug.
        #[derive(Clone, Copy)]
        #[allow(non_camel_case_types)]
        pub union u8x16 {
            vector: __m128i,
            bytes: [u8; 16],
        }
    }
}

defunion!();

In particular, the only way for a consumer to get a u8x16 is by first constructing a SSSE3VectorBuilder. So the implication above holds: if you have a u8x16 from this module, then you know that SSSE3 is enabled. This in turn lets me define safe methods on u8x16 for use in higher level code. For example:

    #[inline]
    pub fn ne(self, other: u8x16) -> u8x16 {
        // Safe because we know SSSE3 is enabled.
        unsafe {
            let boolv = _mm_cmpeq_epi8(self.vector, other.vector);
            let ones = _mm_set1_epi8(0xFF as u8 as i8);
            u8x16 { vector: _mm_andnot_si128(boolv, ones) }
        }
    }

This might seem like a lot of ceremony, but these vectors are used in a fairly complex SIMD algorithm called Teddy, and this was the only way I could figure out how to "isolate" the unsafe. The key here is to realize that your only responsibility in determining whether an intrinsic is safe to call or not is whether the underlying CPU is capable of executing it or not. Therefore, we use the type system to represent this state. The alternative here would be that the entirety of Teddy would be unsafe. Despite @gnzlbg's insistence that the amount of unsafe doesn't matter much, I actually think it does, quite a bit, and I really wanted to minimize the amount of unsafe.

I don't mean to suggest you should adopt this approach for jetscii unless you feel like it will work well for you, but rather, to plant the seed of using the type system in some way to control unsafe.

One potential downside to all of this is that it can be easy to introduce performance bugs, and to some extent, it's not clear whether I'm relying on LLVM bugs to get performance correct here. In particular, I only use #[target_feature(enable = "ssse3")] on a single top-level method in the Teddy implementation. All of the intrinsics are defined using that as well, but I have a bunch of intermediate methods that don't have a target_feature attribute. Even so, LLVM seems happy to inline all of them, even when using AVX vectors.

(To be clear, compile time flags aren't really an option to me. IMO, people should be focusing on using runtime detection as much as possible, to make their optimizations apply with the lowest possible friction.)

@gnzlbg
Contributor

gnzlbg commented Mar 27, 2018

Despite @gnzlbg's insistence that the amount of unsafe doesn't matter much, I actually think it does, quite a bit, and I really wanted to minimize the amount of unsafe.

Oh sorry, this is not what I meant. What I meant is that, before porting the code to stdsimd, the code allowed undefined behavior to happen in safe Rust, and that the amount of work required to port code to stdsimd depends on whether this is the case with the code you want to port or not.

One potential downside to all of this is that it can be easy to introduce performance bugs, and to some extent, it's not clear whether I'm relying on LLVM bugs to get performance correct here. In particular, I only use #[target_feature(enable = "ssse3")] on a single top-level method in the Teddy implementation. All of the intrinsics are defined using that as well, but I have a bunch of intermediate methods that don't have a target_feature attribute.

I think that all those methods should be #[target_feature(enable = "ssse3")]. Marking all those methods with target_feature is a pain, so maybe we could allow that as a top-level module attribute so that one can do #![target_feature(...)] and have all the functions in a module get the target feature (@alexcrichton thoughts? this would be useful in stdsimd). The second problem is that doing so would force these methods to be unsafe fn. Something like rust-lang/rfcs#2212 would help with that. Coming up with solutions to this problem should IMO be higher priority.

(To be clear, compile time flags aren't really an option to me. IMO, people should be focusing on using runtime detection as much as possible, to make their optimizations apply with the lowest possible friction.)

Yes. I was just pointing out that the code did not use this before, and that the minimal path towards using stdsimd is to keep not doing it. Is it acceptable that crates fail to compile because of this? IMO it isn't. They should at least provide a compile-time-only fallback, but this is all "extra work".

@BurntSushi
Member

BurntSushi commented Mar 27, 2018

Oh sorry, this is not what I meant. What I meant is that, before porting the code to stdsimd, the code allowed undefined behavior to happen in safe Rust, and that the amount of work required to port code to stdsimd depends on whether this is the case with the code you want to port or not.

Oh I see. I thought @shepmaster was doing a CPU ID check before though. Now that I look at the code, I don't see any CPU ID check, so yeah, I see what you mean now.

I think that all those methods should be #[target_feature(enable = "ssse3")]. Marking all those methods with target_feature is a pain, so maybe we could allow that as a top-level module attribute so that one can do #![target_feature(...)] and have all the functions in a module get the target feature (@alexcrichton thoughts? this would be useful in stdsimd). The second problem is that doing so would force these methods to be unsafe fn. Something like rust-lang/rfcs#2212 would help with that. Coming up with solutions to this problem should IMO be higher priority.

The first problem is mildly annoying, but I could easily live with it. The second problem is really what drove me to my current solution.

But yes, I think it is interesting to question whether those methods should have a target_feature or not. Today, it is working as I intend it to. Is that a bug? Or is LLVM just really smart and actually Doing The Right Thing? All the call sites are actually sandwiched between compatible target_feature attributes, so it does seem to me that inlining everything is a valid thing to do?

Yes. I was just pointing out that the code did not use this before, and that the minimal path towards using stdsimd is to keep not doing it. Is it acceptable that crates fail to compile because of this? IMO it isn't. They should at least provide a compile-time-only fallback, but this is all "extra work".

Ah I see, yeah, that's true if you were using the simd crate. jetscii is using inline ASM, so today it either Just Works without any compile time flags, or results in a SIGILL/UB.

@gnzlbg
Contributor

gnzlbg commented Mar 27, 2018

Today, it is working as I intend it to. Is that a bug? Or is LLVM just really smart and actually Doing The Right Thing? All the call sites are actually sandwiched between compatible target_feature attributes, so it does seem to me that inlining everything is a valid thing to do?

Basically all these methods are inside the same crate and LLVM is just doing inlining as usual. Inlining functions into other functions that extend their target feature set is ok, so inlining non-ssse3 functions into an ssse3 function is perfectly fine.

Problems can only arise when LLVM does not inline an intermediary function that does not have the ssse3 target feature attribute. In that case, ssse3 functions won't be inlineable into it either.
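
A hedged illustration of that failure mode (invented names; _mm_abs_epi8 is used only because it is a conveniently small SSSE3 intrinsic):

#[cfg(target_arch = "x86_64")]
mod inlining_example {
    use std::arch::x86_64::*;

    #[inline]
    #[target_feature(enable = "ssse3")]
    unsafe fn leaf(v: __m128i) -> __m128i {
        _mm_abs_epi8(v)
    }

    // No target_feature here: if LLVM chooses not to inline `middle` into
    // `top`, then `leaf` cannot be inlined into `middle` either, and the
    // intrinsic ends up behind an out-of-line call.
    #[inline]
    fn middle(v: __m128i) -> __m128i {
        unsafe { leaf(v) }
    }

    #[target_feature(enable = "ssse3")]
    pub unsafe fn top(v: __m128i) -> __m128i {
        middle(v)
    }
}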

I personally think that code should express intent. In this case, your intent is clearly for all that code to be compiled with ssse3 enabled, independently of how good or bad LLVM is at inlining. The way to express that in Rust is to mark those functions with the #[target_feature] attribute; therefore I think that those functions should have it.

This requires you to make these functions unsafe fn which is unnecessary in this case because it is also your intent to only call most of those functions from other ssse3 functions, which is safe.

Currently there is no way to express this intent in Rust, but I think this is a major ergonomic issue with the current std::arch intrinsics that's worth solving. Also, you probably would prefer a compiler error here instead of a performance bug because inlining somewhere failed.

@alexcrichton
Member

@BurntSushi

Even so, LLVM seems happy to inline all of them, even when using AVX vectors.

This seemed very worrisome to me at first (bad LLVM!) but looking more into this I think that's ok. I'd guess that what's happening here is that LLVM is inlining methods without ssse3 into the ssse3 method very aggressively. At that point it calls a bunch of the intrinsics tagged with ssse3 and it's safely inlined.

I suspect that if you didn't tag the top method with ssse3 (or if LLVM didn't inline in a loop) then none of the intrinsic calls would get inlined.

This actually seems like it could be a really cool trick one day...

impl SSSE3VectorBuilder {
    #[target_feature(enable = "ssse3")]
    pub fn run<F, R>(&self, f: F) -> R
    where
        F: FnOnce() -> R,
    {
        f()
    }
}

Somehow you could probably assert to the compiler that the enable is safe, and that holding an SSSE3VectorBuilder alone is proof that the unsafe on the function isn't needed. That way, with a trick like this, you could rely on LLVM's inliner and only rarely use #[target_feature(enable = ...)].

@gnzlbg

(@alexcrichton thoughts? this would be useful in stdsimd)

I'd probably shy away from it for now, but I could see it being a possibility!

@BurntSushi
Member

I suspect that if you didn't tag the top method with ssse3 (or if LLVM didn't inline in a loop) then none of the intrinsic calls would get inlined.

Oh yes, indeed! I experimented with that when I was writing the code to check my understanding, and that was indeed the case. Performance regressed and none of the intrinsics were inlined (as expected).

@gnzlbg I basically agree with you about intent. I just happen to come down on the side of "I'd rather obscure intent a little in favor of isolating unsafe." And yes, I would definitely appreciate a compiler error if a function failed to inline. :-)

@gnzlbg
Contributor

gnzlbg commented Mar 27, 2018

@alexcrichton

I'd probably shy away from it for now, but I could see it being a possibility!

I think we should pursue a minimal step towards what @hsivonen proposed in his RFC.

Currently, we require that #[target_feature] functions be declared as unsafe fn. We could allow declaring safe #[target_feature] functions, but these functions can only be called from functions with the exact same #[target_feature]:

#[target_feature = "sse2"] unsafe fn foo() { }
#[target_feature = "sse2"] fn bar() { }

fn meow() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (meow is not sse2)
    unsafe { bar() }; // ERROR (meow is not sse2)
}

#[target_feature = "sse2"]
fn bark() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // OK (bark is sse2)
    unsafe { bar() }; // OK (bark is sse2)
}

#[target_feature = "avx"]  // avx != sse2, see [0]
fn moo() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (moo is not sse2)
    unsafe { bar() }; // ERROR (moo is not sse2)
}

This minimal step would probably solve most of @BurntSushi 's pain points.

An incremental improvement over this would be to allow calling these "safe" target feature functions from functions that do not have the same target feature by using an unsafe { } block:

#[target_feature = "sse2"] unsafe fn foo() { }
#[target_feature = "sse2"] fn bar() { }

fn meow() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (meow is not sse2)
    unsafe { bar() }; // OK - was an error before
}

#[target_feature = "sse2"]
fn bark() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // OK (bark is sse2)
    unsafe { bar() }; // OK (bark is sse2)
}

#[target_feature = "avx"]
fn moo() {
    foo(); // ERROR (unsafe block required)
    unsafe { foo() }; // OK
    bar(); // ERROR (moo is not sse2)
    unsafe { bar() }; // OK - was an error before
}

These two improvements would already deliver a lot of bang-per-buck. They would allow everybody to write safe #[target_feature] functions, and only use unsafe when they actually do something unsafe inside them, or when they want to call them from an unsafe context.

For things like trait methods and function pointers, #[target_feature] applies to the implementation/concrete functions, so one could only use them here at first if the trait method / function pointer was actually marked unsafe:

trait Foo { fn foo(); }
struct Fooish();
impl Foo for Fooish { 
    #[target_feature = "sse2"] fn foo() { }  
    // ^ ERROR: #[target_feature] on trait method impl requires 
    // unsafe fn but Foo::foo is safe
}

trait Bar { unsafe fn bar(); }
struct Barish();
impl Bar for Barish { 
    #[target_feature = "sse2"] unsafe fn bar() { }  // OK
}

#[target_feature = "sse2"] fn meow() {}

static x: fn () -> () = meow;
// ^ ERROR: meow can only be assigned to unsafe fn pointers due to 
// #[target_feature] but function pointer x with type fn()->() is safe.
static y: unsafe fn () -> () = meow; // OK

[0]: to keep things minimal and the discussion focused I'd rather stay away from feature hierarchies initially.

@shepmaster
Member Author

I am really happy that the two of you ironed out my (evidently poorly worded) concern:

fn driver() {
    let a = alpha();
    let b = beta();
}

#[inline]
#[target_feature(...)]
fn alpha() { /* some intrinsic */ }

#[inline]
#[target_feature(...)]
fn beta() { /* some intrinsic */ }

In this code, alpha and beta will not be inlined into driver, generally ruining performance. To fix this, you have to mark driver with a target_feature (or change the compilation settings, which effectively marks all functions with the features, AFAICT). However, to mark driver with target_feature, it now has to become unsafe, and that is what is ultimately distasteful.

the amount of unsafe doesn't matter much, I actually think it does, quite a bit, and I really wanted to minimize the amount of unsafe.

I concur 100%

@shepmaster
Member Author

To be clear, compile time flags aren't really an option to me

@BurntSushi I wanted the best of both worlds, so I'm pursuing something that uses the compile-time settings when it can and runtime detection when it cannot:

pub struct Searcher<F>
where
    F: Fn(u8) -> bool,
{
    // Include this implementation only when compiling for x86/x86_64,
    // as those are the only platforms we support.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    fast: Fast,

    // If we are *guaranteed* to have SSE 4.2, then there's no reason
    // to have this implementation.
    #[cfg(not(target_feature = "sse4.2"))]
    fallback: Fallback<F>,

    // Since we might not use the fallback implementation, we add this
    // to avoid unused type parameters.
    _fallback: PhantomData<F>,
}

impl<F> Searcher<F>
where
    F: Fn(u8) -> bool,
{
    pub /* const */ fn new(bytes: &[u8], fallback: F) -> Self {
        Searcher {
            #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
            fast: Fast::new(bytes),

            #[cfg(not(target_feature = "sse4.2"))]
            fallback: Fallback::new(fallback),

            _fallback: PhantomData,
        }
    }

    #[inline]
    pub fn find(&self, haystack: &[u8]) -> Option<usize> {
        // If we can tell at compile time that we have support,
        // call the optimized code directly.
        #[cfg(target_feature = "sse4.2")] {
            unsafe { self.fast.find(haystack) }
        }

        // If we can tell at compile time that we will *never* have
        // support, call the fallback directly.
        #[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))] {
            self.fallback.find(haystack)
        }

        // Otherwise, we will be run on a machine with or without
        // support, so we perform runtime detection.
        #[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
                  not(target_feature = "sse4.2")))] {
            if is_x86_feature_detected!("sse4.2") {
                unsafe { self.fast.find(haystack) }
            } else {
                self.fallback.find(haystack)
            }
        }
    }
}
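
A hypothetical call site for the sketch above (the real jetscii API may differ; Some(5) assumes find reports the byte offset of the first match):

fn example() {
    let space = Searcher::new(b" ", |b| b == b' ');
    assert_eq!(space.find(b"hello world"), Some(5));
}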

@shepmaster
Member Author

shepmaster commented Mar 27, 2018

if you run jetscii on an x86/x86_64 CPU without SSE4.2, that leads to undefined behavior

and choose different algorithms depending on the answer at run-time then well... that's just going to require more changes to the library

I thought @shepmaster was doing a CPU ID check before though

To be clear, I'm not complaining about any extra code I have to write. Please note that Jetscii was written 3 years ago using inline assembly and the entire reason that I wrote the Cupid crate was to work towards having runtime detection, I just... never got around to it 😸 . I am only attempting to provide feedback based on porting the algorithms expressed within to stdsimd.

@shepmaster
Member Author

allow calling these "safe" target feature functions from functions that do not have the same target feature by using an unsafe { } block:

In a magical world, I'd love it if #[cfg(target_feature = "X")] and if is_*_feature_detected!("X") automatically made calling a target_feature function safe. unsafe could be "The Big Hammer" when that's not enough.

@alexcrichton
Member

@gnzlbg I'd be on board with making #[target_feature] easier to use for sure! I'd also be ok adding something like #[unsafe_target_feature] which can be applied to a safe function but the programmer is unsafely asserting that they only call it appropriately locally.

@gnzlbg
Contributor

gnzlbg commented Mar 27, 2018

@shepmaster

To be clear, I'm not complaining about any extra code I have to write. Please note that Jetscii was written 3 years ago using inline assembly and the entire reason that I wrote the Cupid crate was to work towards having runtime detection, I just never got around to it. I am only attempting to provide feedback based on porting the algorithms expressed within to stdsimd.

Gotcha. Please don't misunderstand my words as criticism towards jetscii. I was just pointing out that the way the intrinsics are currently designed basically forces crates to do feature detection either at compile time or at run time, and that's something that very few crates in the Rust (and probably C and C++) ecosystem are currently doing because it wasn't required before.

In a magical world, I'd love it if #[cfg(target_feature = "X")] and if is_*_feature_detected!("X") automatically made calling a target_feature function safe

Those extensions would be nice to have as well, and I guess that if you add them you are probably really close to where the discussion in @hsivonen's RFC ended up. I think that these would be harder though. There is currently an open bug about this function (playground):

#[target_feature(enable = "avx")]
unsafe fn foo() -> bool {
   #[cfg(target_feature = "avx")] { true }
   #[cfg(not(target_feature = "avx"))] { false }
}

where per RFC2045 it should always return true, because #[target_feature] should set #[cfg(target_feature)] locally, but right now it often returns false, and fixing this is non-trivial.

@alexcrichton

I'd also be ok adding something like #[unsafe_target_feature] which can be applied to a safe function but the programmer is unsafely asserting that they only call it appropriately locally.

But this is where we currently are. Today, #[target_feature] is unsafe, and most of the code is asserting all the time that calling #[target_feature] functions is safe, and this is done by opening an unsafe { } block to call these functions.

What I was suggesting is to add a way in which the compiler asserts this for you, and reliably produce an error if the assertion fails. We could use a different attribute, e.g., #[safe_target_feature], that doesn't require the function to be unsafe, but that allows this function to only be used in contexts where doing so is safe.

@hsivonen
Member

I'd also be ok adding something like #[unsafe_target_feature] which can be applied to a safe function but the programmer is unsafely asserting that they only call it appropriately locally.

Why would that be better than the "Minimal target feature unsafe" RFC?

@shepmaster
Member Author

What is the right path to report "poor" assembly?

A tight loop of code is generating this:

+0xc0	        movdqu              (%rsi,%r10), %xmm1  ;;  0.3%
+0xc6	        movl                $1, %eax            ;; 23.1%
+0xcb	        movl                $16, %edx           ;;  0.4%;
+0xd0	        pcmpestri           $0, %xmm1, %xmm0    ;;  0.2%;
+0xd6	        jb                  "find+0x130"        ;; 64.4%
+0xd8	        incq                %rbx                ;; 11.2%
+0xdb	        addq                $16, %r10           ;;  0.3%
+0xdf	        cmpq                %r11, %rbx          ;;  0.2%
+0xe2	        jb                  "find+0xc0"

The two initial movl instructions for the constants 1 and 16 (the explicit length operands that pcmpestri reads from %eax and %edx) are inside the loop, but the values of those registers never change here, so they could presumably be hoisted out. Profiling indicates that one of the loads is taking appreciable time, but it could be a red herring.

@alexcrichton
Member

I've personally found that instruction profiling is typically off-by-one, so the 64% is probably pcmpestri and the 23% is probably the movdqu

@gnzlbg
Contributor

gnzlbg commented Apr 6, 2018

@shepmaster the best path is probably to use rust.godbolt.org to create a minimal example that shows the issue. Once you have that, save that link. Then, use --emit=llvm-ir to generate LLVM-IR for that example without and with optimizations. Save those two links as well. Finally, you can go to gcc.godbolt.org, choose llc and opt as the language, paste the LLVM-IR in there, enable optimizations and the right CPU model, and emit assembly from llc and "optimized" LLVM-IR from opt. Save those links, and post all that stuff here.

We can then take a look, clean up the IR, and answer the question: is rustc emitting sensible LLVM-IR ? And also check for the same example, which LLVM-IR does clang emit.

If the LLVM-IR is not getting optimized properly we can file an LLVM bug. If Rust is emitting poor LLVM-IR, that would actually be good news, since that is something we might be able to fix quickly on Rust's end.

An alternative is for you to provide a minimal example on rust.godbolt.org and I can take over the investigation from there if you want.

@alexcrichton
Member

I believe the outstanding issues here have since been resolved, so closing.
