Enable C11 automatically if the compiler supports it #1729

Merged
merged 18 commits into from
Aug 21, 2024

andrewhop
Contributor

@andrewhop andrewhop commented Jul 31, 2024

Issues:

Resolves #1723

Description of changes:

#1723 pointed out the default AWS-LC behavior can be very slow because we enable C99 by default. The user has to opt into the better atomics by specifying -DCMAKE_C_STANDARD=11 and they probably won't find that option. This change turns C11 on by default if the compiler supports it.

Add a new benchmark to test our ref count performance.

Call-outs:

If users do not change their CMake options, those with modern compilers will get C11 atomics automatically, while legacy customers will continue with the pthread implementation. Both legacy and modern users can continue to control this manually by setting the CMAKE_C_STANDARD option, which is only available with CMake >= 3.1.
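
One way to confirm which standard a given set of options actually produced is to have the compiler report it. The following is a minimal sketch that relies only on the standard `__STDC_VERSION__` and `__STDC_NO_ATOMICS__` macros; the helper names `sketch_c_standard_name` and `sketch_c11_atomics_possible` are hypothetical and not part of AWS-LC.

```c
#include <assert.h>

// Hypothetical helper (not part of AWS-LC): names the C standard this
// translation unit was compiled under, based on __STDC_VERSION__.
static const char *sketch_c_standard_name(void) {
#if !defined(__STDC_VERSION__)
  return "C90";  // The macro is absent before the 1995 amendment.
#elif __STDC_VERSION__ >= 201112L
  return "C11 or newer";
#elif __STDC_VERSION__ >= 199901L
  return "C99";
#else
  return "C95";
#endif
}

// Returns 1 when the language level allows <stdatomic.h> to be used at all.
// This says nothing yet about whether the atomics are lock-free.
static int sketch_c11_atomics_possible(void) {
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
  return 1;
#else
  return 0;
#endif
}
```

Compiling such a file once with `-DCMAKE_C_STANDARD=99` and once without it should show the two helpers changing their answers on a compiler that supports C11.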

Testing:

Tried different options and verified that build.ninja included the expected -std=gnu99 or -std=gnu11. Performance numbers on a Graviton 4 R8g show a 1.6 times increase for single-threaded uncontested locking performance and a 33.0 times increase in heavily contested locking performance. Depending on the use case, real-world applications will see less of an increase. Each thread attempts to increment the same CRYPTO_refcount_t 1,000 times:
C99 (previous performance):

cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_STANDARD=99 ../
./tool/bssl speed -filter CRYPTO_refcount_inc -timeout 10
Did 225000 CRYPTO_refcount_inc 1000 iterations with 1 threads operations in 10043601us (22402.3 ops/sec)
Did 78861 CRYPTO_refcount_inc 1000 iterations with 2 threads operations in 10008608us (7879.3 ops/sec)
Did 26448 CRYPTO_refcount_inc 1000 iterations with 4 threads operations in 10032376us (2636.3 ops/sec)
Did 8262 CRYPTO_refcount_inc 1000 iterations with 8 threads operations in 10035344us (823.3 ops/sec)
Did 2520 CRYPTO_refcount_inc 1000 iterations with 16 threads operations in 10048296us (250.8 ops/sec)
Did 684 CRYPTO_refcount_inc 1000 iterations with 32 threads operations in 10079403us (67.9 ops/sec)
Did 218 CRYPTO_refcount_inc 1000 iterations with 64 threads operations in 10015418us (21.8 ops/sec)

C11 (current performance):

cmake -GNinja -DCMAKE_BUILD_TYPE=Release ../ 
./tool/bssl speed -filter CRYPTO_refcount_inc -timeout 10
Did 372000 CRYPTO_refcount_inc 1000 iterations with 1 threads operations in 10002129us (37192.1 ops/sec)
Did 268000 CRYPTO_refcount_inc 1000 iterations with 2 threads operations in 10022289us (26740.4 ops/sec)
Did 168000 CRYPTO_refcount_inc 1000 iterations with 4 threads operations in 10022843us (16761.7 ops/sec)
Did 74943 CRYPTO_refcount_inc 1000 iterations with 8 threads operations in 10100478us (7419.7 ops/sec)
Did 32550 CRYPTO_refcount_inc 1000 iterations with 16 threads operations in 10037521us (3242.8 ops/sec)
Did 15246 CRYPTO_refcount_inc 1000 iterations with 32 threads operations in 10065910us (1514.6 ops/sec)
Did 7208 CRYPTO_refcount_inc 1000 iterations with 64 threads operations in 10042097us (717.8 ops/sec)
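
For context on what the two builds are measuring, the sketch below contrasts a lock-guarded increment with a C11 atomic one. It is an illustration only: the names `sketch_refcount_inc_locked` and `sketch_refcount_inc_atomic`, the saturation limit, and the use of a single `pthread_mutex_t` are assumptions, not the AWS-LC implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_REFCOUNT_MAX UINT32_MAX  // Assumed saturation value.

// Lock-based variant: every increment takes a process-wide mutex, so
// contending threads serialize on the lock.
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;

static void sketch_refcount_inc_locked(uint32_t *count) {
  pthread_mutex_lock(&sketch_lock);
  if (*count < SKETCH_REFCOUNT_MAX) {
    (*count)++;
  }
  pthread_mutex_unlock(&sketch_lock);
}

// C11 variant: a compare-and-swap loop that saturates instead of wrapping.
static void sketch_refcount_inc_atomic(_Atomic uint32_t *count) {
  uint32_t expected = atomic_load(count);
  while (expected != SKETCH_REFCOUNT_MAX) {
    uint32_t desired = expected + 1;
    // On failure, |expected| is refreshed with the current value and the
    // loop retries.
    if (atomic_compare_exchange_weak(count, &expected, desired)) {
      break;
    }
  }
}
```

The atomic variant never blocks another thread, which is consistent with the gap between the two result sets widening as the thread count grows.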

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

@codecov-commenter

codecov-commenter commented Jul 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.32%. Comparing base (b5280e3) to head (7a94220).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1729   +/-   ##
=======================================
  Coverage   78.31%   78.32%           
=======================================
  Files         580      580           
  Lines       97140    97145    +5     
  Branches    13926    13928    +2     
=======================================
+ Hits        76073    76086   +13     
+ Misses      20445    20437    -8     
  Partials      622      622           


justsmth
justsmth previously approved these changes Jul 31, 2024
@justsmth
Contributor

Should simple_main.c instead be a little more targeted for our usage of atomics? I'd like to make a similar change for aws-lc-rs, but ran into the following compile error when compiling for arm-linux-androideabi (using Clang 14.0.6).

  /home/runner/work/aws-lc-rs/aws-lc-rs/aws-lc-sys/aws-lc/crypto/refcount_c11.c:39:23: error: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Werror,-Watomic-alignment]
    uint32_t expected = atomic_load(count);
                        ^

@justsmth justsmth self-requested a review July 31, 2024 15:25
@justsmth
Contributor

justsmth commented Jul 31, 2024

I think we'll need to add a condition on ATOMIC_LONG_LOCK_FREE > 0 to the macro definition here

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#include <stdatomic.h>
// CRYPTO_refcount_t is a |uint32_t|
#define AWS_LC_ATOMIC_LOCK_FREE ATOMIC_LONG_LOCK_FREE
#else
#define AWS_LC_ATOMIC_LOCK_FREE 0
#endif

#if !defined(OPENSSL_C11_ATOMIC) && defined(OPENSSL_THREADS) &&   \
    !defined(__STDC_NO_ATOMICS__) && defined(__STDC_VERSION__) && \
    __STDC_VERSION__ >= 201112L && AWS_LC_ATOMIC_LOCK_FREE > 0
#define OPENSSL_C11_ATOMIC 
#endif
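
The `*_LOCK_FREE` macros from `<stdatomic.h>` take the value 0 (never lock-free), 1 (sometimes lock-free) or 2 (always lock-free), which is why the suggested condition compares against 0. A small sketch of how that value could be interpreted follows; `sketch_lock_free_label` and `sketch_use_c11_atomics` are hypothetical helpers, and `ATOMIC_INT_LOCK_FREE` is used here only as an example of such a macro.

```c
#include <assert.h>
#include <string.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
#include <stdatomic.h>
#define SKETCH_INT_LOCK_FREE ATOMIC_INT_LOCK_FREE
#else
#define SKETCH_INT_LOCK_FREE 0  // No C11 atomics: treat as never lock-free.
#endif

// Maps a *_LOCK_FREE macro value to the meaning C11 assigns to it.
static const char *sketch_lock_free_label(int value) {
  switch (value) {
    case 0:
      return "never lock-free";
    case 1:
      return "sometimes lock-free";
    case 2:
      return "always lock-free";
    default:
      return "unknown";
  }
}

// Mirrors the "> 0" test from the suggestion above: accept atomics unless
// the target reports that they are never lock-free.
static int sketch_use_c11_atomics(void) {
  return SKETCH_INT_LOCK_FREE > 0;
}
```

On a target like the one in the error message, where the maximum lock-free size is reported as 0 bytes, a check of this shape would select the non-atomic path.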

@andrewhop
Contributor Author

Should simple_main.c instead be a little more targeted for our usage of atomics? I'd like to make a similar change for aws-lc-rs, but ran into the following compile error when compiling for arm-linux-androideabi (using Clang 14.0.6).

  /home/runner/work/aws-lc-rs/aws-lc-rs/aws-lc-sys/aws-lc/crypto/refcount_c11.c:39:23: error: large atomic operation may incur significant performance penalty; the access size (4 bytes) exceeds the max lock-free size (0  bytes) [-Werror,-Watomic-alignment]
    uint32_t expected = atomic_load(count);
                        ^

Hmmmm, it looks like our atomic version can't be enabled with just C11 then, I think we need two checks and flags:

  1. Use C99 or C11
  2. If we're using C11 can we use atomics

@wtarreau

wtarreau commented Aug 1, 2024

Hmmmm, it looks like our atomic version can't be enabled with just C11 then, I think we need two checks and flags:

1. Use C99 or C11

2. If we're using C11 can we use atomics

Andrew, please excuse me for insisting, but as I already suggested, why not just rely on the compiler version in the source code instead of all the cmake machinery? The atomics are not dependent on C11 but on (clang || gcc >= 4.7). It's much easier IMHO to just fix the condition in the code itself rather than risk breaking stuff by adding complex version detection. It would even allow keeping -std=c99 for now if it turns out to have some uses.
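
A compiler-version gate of the kind described here could look like the sketch below. The version threshold is taken from the comment itself rather than verified independently, and the names `SKETCH_HAVE_ATOMIC_BUILTINS` and `sketch_fetch_inc` are hypothetical.

```c
#include <assert.h>

// Assumption from the comment above: the __atomic builtins exist on clang
// and on GCC 4.7 or newer, independent of the selected C standard.
#if defined(__clang__) || \
    (defined(__GNUC__) &&  \
     (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)))
#define SKETCH_HAVE_ATOMIC_BUILTINS 1
#else
#define SKETCH_HAVE_ATOMIC_BUILTINS 0
#endif

// Adds one to |*value| and returns the previous value.
static unsigned sketch_fetch_inc(unsigned *value) {
#if SKETCH_HAVE_ATOMIC_BUILTINS
  return __atomic_fetch_add(value, 1, __ATOMIC_SEQ_CST);
#else
  // Placeholder only: NOT thread-safe. A real build would fall back to a
  // lock-based path here.
  unsigned previous = *value;
  *value = previous + 1;
  return previous;
#endif
}
```

Because the check is purely a preprocessor test, it needs no CMake probing, but it also cannot tell whether the target's atomics are lock-free.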

@andrewhop
Contributor Author

The atomics are not dependent on C11 but on (clang || gcc >= 4.7).

Ahhh, I missed that part, doing an experiment locally confirms that it does compile as expected, let's see if it gets through all the CI compilers.

@andrewhop
Contributor Author

Just using atomics if the compiler supports it does not work because we enabled -Wno-c99-c11-compat/-Wno-c11-extensions to fix a pedantic error in #1608. I agree with the spirit of that error message: if we're setting the standard to C99 we shouldn't be using C11 features. I will try and think of a more sane way to enable C11 in code/cmake.

@wtarreau

wtarreau commented Aug 1, 2024

OK, I can understand, of course!

And the other option, why not simply drop the forced -std=c99? In which case exactly is it required to manually force an older version of the standard when 99% of the modern compilers will automatically default to the newer ones?

@andrewhop
Contributor Author

Hit a fun issue, GCC 4.8 actually doesn't support atomics even though it supports C11. This seems to be a known issue, you might want to increase your check to >= 4.9 which is when support and the header file got included.

@wtarreau

wtarreau commented Aug 1, 2024

I'm using them there and I can assure you they work pretty well:

-bash-4.2$ cat atomic.c
unsigned atomic_inc(unsigned *a)
{
        return __atomic_fetch_add(a, 1, __ATOMIC_SEQ_CST);
}
-bash-4.2$ gcc -Os -c atomic.c 
-bash-4.2$ objdump  -dr atomic.o

atomic.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <atomic_inc>:
   0:   b8 01 00 00 00          mov    $0x1,%eax
   5:   f0 0f c1 07             lock xadd %eax,(%rdi)
   9:   c3                      retq   
-bash-4.2$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) 

What might possibly be missing is stdatomic or some defines, but I still don't understand why these would be needed at all. I've never been aware of them and have been using atomics on gcc and clang for many years now.

I really think that simply dropping -std=c99, which breaks the test and triggers your warning, and checking on the gcc version should do the trick. I don't understand why there seem to be some complicated checks that strive to force a certain version and later check for it. This might have been inherited from older legacy code, but I really feel that there's some form of inconsistency there that is self-feeding and not necessary at all.

@wtarreau

wtarreau commented Aug 1, 2024

Or maybe if you want, for non-gcc compilers that might require stdatomic.h, you could do something like this:

# if !defined(__ATOMIC_RELAXED) && defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#include <stdatomic.h>
#endif

I.e. it will consider the C standard version for compilers that do not define the ATOMIC macros and that might require stdatomic.h (if there are any at all, but given that gcc and clang have them builtin, that would only be for possibly other ones). E.g. that's gcc 4.8 below:

$ gcc -dM -xc -E - < /dev/null |grep __ATOMIC
#define __ATOMIC_ACQUIRE 2
#define __ATOMIC_HLE_RELEASE 131072
#define __ATOMIC_HLE_ACQUIRE 65536
#define __ATOMIC_RELAXED 0
#define __ATOMIC_CONSUME 1
#define __ATOMIC_SEQ_CST 5
#define __ATOMIC_ACQ_REL 4
#define __ATOMIC_RELEASE 3
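
Extending that idea, the sketch below selects between the compiler builtins and `<stdatomic.h>` depending on whether `__ATOMIC_RELAXED` is predefined. The type `sketch_counter_t`, the `SKETCH_*` macros, and the extra `__STDC_NO_ATOMICS__` check are assumptions added for illustration.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ATOMIC_RELAXED)
// Compiler builtins are available, so no header is needed.
typedef uint32_t sketch_counter_t;
#define SKETCH_LOAD(p) __atomic_load_n((p), __ATOMIC_SEQ_CST)
#define SKETCH_ADD(p, n) __atomic_add_fetch((p), (n), __ATOMIC_SEQ_CST)
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
// No builtins detected: rely on the C11 header instead.
#include <stdatomic.h>
typedef _Atomic uint32_t sketch_counter_t;
#define SKETCH_LOAD(p) atomic_load(p)
#define SKETCH_ADD(p, n) (atomic_fetch_add((p), (n)) + (n))
#else
#error "no atomic primitives detected in this sketch"
#endif

// Atomically increments |*counter| and returns the new value.
static uint32_t sketch_counter_bump(sketch_counter_t *counter) {
  return SKETCH_ADD(counter, 1u);
}
```

Both branches expose the same two operations, so the calling code does not need to know which one was selected.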

@andrewhop
Contributor Author

Poking around we need stdatomic.h to get access to ATOMIC_LONG_LOCK_FREE to handle this issue #1729 (comment).

@andrewhop
Contributor Author

I really think that simply dropping -std=c99 which breaks the test and triggers your warning, and checking on gcc version should do the trick

We have some active customers that still rely on very old versions of GCC and C99. Trust me, I am very excited for the day when they update and we can drop a lot of these complicated build options.

@wtarreau

wtarreau commented Aug 2, 2024

Then maybe it's easier to detect older compilers and force std=c99 only for them if that's the purpose ?

@wtarreau

wtarreau commented Aug 2, 2024

Regarding stdatomic.h, on gcc you have the ATOMIC_LONG_LOCK_FREE definition by default without including anything. If stdatomic is needed for clang, what about including it only with clang, since all of its versions support C99?

justsmth pushed a commit that referenced this pull request Aug 2, 2024
### Description of changes: 
While working on #1729 I noticed the
CMake builds took a while because it was all single threaded. The GitHub
runners [have 4
cores](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories)
so this should speed it up a bit.

### Testing:
Waiting to see what happens here. 

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license and the ISC license.
@andrewhop
Contributor Author

andrewhop commented Aug 17, 2024

I figured out why I didn't see any performance improvement on some platforms with this change:

  • Thread T1 is spinning up N other threads to call CRYPTO_refcount_inc X times
  • When a new thread T2 is created it immediately starts calling CRYPTO_refcount_inc X times
  • By the time T1 had created the next thread T3, T2 was already done so there was no contention

The solution was increasing X from 100 to 1,000 so it takes longer for each thread to finish. TLDR yes C11 atomics are much faster than pthread
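
A different way to force the threads to overlap, which is not what this PR did, is to hold every worker at a start gate until all of them have been created. The sketch below shows that alternative under stated assumptions: the names `sketch_go`, `sketch_worker` and `sketch_run` are hypothetical, and a plain atomic counter stands in for CRYPTO_refcount_t.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_THREADS 64

static atomic_bool sketch_go;            // Start gate shared by all workers.
static _Atomic uint32_t sketch_counter;  // The contended counter.

static void *sketch_worker(void *arg) {
  int iterations = *(int *)arg;
  // Wait until the main thread opens the gate so all workers overlap.
  while (!atomic_load(&sketch_go)) {
    sched_yield();
  }
  for (int i = 0; i < iterations; i++) {
    atomic_fetch_add(&sketch_counter, 1);
  }
  return NULL;
}

// Runs |num_threads| workers doing |iterations| increments each. Returns the
// final counter value, or 0 if a thread could not be created.
static uint32_t sketch_run(int num_threads, int iterations) {
  pthread_t threads[SKETCH_MAX_THREADS];
  assert(num_threads > 0 && num_threads <= SKETCH_MAX_THREADS);
  atomic_store(&sketch_go, false);
  atomic_store(&sketch_counter, 0);

  int created = 0;
  for (; created < num_threads; created++) {
    if (pthread_create(&threads[created], NULL, sketch_worker,
                       &iterations) != 0) {
      break;
    }
  }
  // Open the gate only after every worker exists, then wait for them all.
  atomic_store(&sketch_go, true);
  for (int i = 0; i < created; i++) {
    pthread_join(threads[i], NULL);
  }
  return created == num_threads ? atomic_load(&sketch_counter) : 0;
}
```

Build with `-pthread`. With a gate like this the amount of contention no longer depends on how quickly threads can be spawned, although raising the iteration count, as done here, is the simpler change.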

@andrewhop andrewhop force-pushed the threads branch 2 times, most recently from efa04c7 to 9b0a2fb Compare August 17, 2024 01:23
justsmth
justsmth previously approved these changes Aug 20, 2024
smittals2
smittals2 previously approved these changes Aug 20, 2024
@andrewhop andrewhop merged commit 92654db into aws:main Aug 21, 2024
106 checks passed
Development

Successfully merging this pull request may close these issues.

Abysmal multi-thread performance due to C11 atomics not being used by default
7 participants