Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCC optimization flag incompatibility #22

Closed
mratsim opened this issue Aug 12, 2020 · 9 comments
Closed

GCC optimization flag incompatibility #22

mratsim opened this issue Aug 12, 2020 · 9 comments

Comments

@mratsim
Copy link
Contributor

mratsim commented Aug 12, 2020

FYI, BLST is incompatible with the following GCC flag: -ftree-loop-vectorize

This causes the scalar multiplication to miscompile and easily noticeable on blst_sk_to_pk_in_g1 function.

Unfortunately it is automatically activated with -O3. You might want to have a note about not compiling with -O3 or using -fno-tree-loop-vectorize to deactivate the offending flag.

Tested with GCC v10.1.0

Clang works fine (v10.0.1)

@mratsim mratsim changed the title GCC/Clang optimization flags incompatibility GCC optimization flag incompatibility Aug 12, 2020
@mratsim
Copy link
Contributor Author

mratsim commented Aug 13, 2020

One solution at your level would be something like

#define POINT_MULT_SCALAR_W5_IMPL(ptype) \
#if (__GNUC__ == 4 && __GNUC_MINOR__ == 8 && __GNUC_PATCHLEVEL__ == 5) \
__attribute__((optimize("no-tree-vectorize"))) \
#endif \
static void ptype##_gather_booth_w5(ptype *restrict p, const ptype table[16], \
                                    limb_t booth_idx) \
{ \
    size_t i; \
    limb_t booth_sign = (booth_idx >> 5) & 1; \
\
    booth_idx &= 0x1f; \
    vec_zero(p, sizeof(ptype)); /* implicit infinity at table[-1] */\
    /* ~6% with -Os, ~2% with -O3 ... */\
    for (i = 1; i <= 16; i++) \
        ptype##_ccopy(p, table + i - 1, i == booth_idx); \
\
    ptype##_cneg(p, booth_sign); \
} \

(assuming this is the problematic function and problematic compiler version)

@dot-asm
Copy link
Collaborator

dot-asm commented Aug 13, 2020

BLST is incompatible with the following GCC flag: -ftree-loop-vectorize

Why not other way around? :-):-):-) But on serious note, if specific compiler version fails to compile a piece of code, while others can, it speaks rather in favour of compiler bug. This is not to say that it necessarily means compiler bug, but it's first assumption to make.

One solution at your level would be something like

I for one am not big fan of compiler-specific workarounds, but suggestion is not the way to go. Because nested pre-processor directives don't work. But function can have separate declaration with designated attributes... Another way to solve it would be ... more assembly, so that compiler won't be in position to make the self-defeating assumptions...

On side note. Keep in mind that blst is not that dependent on optimization level, because most of the "magic" happens in assembly. In other words difference between -O2 and -O3 is effectively negligible, so you don't actually have to compile blst with -O3. Unless of course if your C code is sensitive to optimization level, and you want to compile everything in the same go...

@dot-asm
Copy link
Collaborator

dot-asm commented Aug 13, 2020

Can you confirm that compiling with -Drestrict= flag helps?

[Just in case for reference, this is not a suggested solution, just an attempt to pinpoint the problem.]

@mratsim
Copy link
Contributor Author

mratsim commented Aug 13, 2020

Why not other way around? :-):-):-) But on serious note, if specific compiler version fails to compile a piece of code, while others can, it speaks rather in favour of compiler bug. This is not to say that it necessarily means compiler bug, but it's first assumption to make.

There is a related GCC bug that has been lurking for 10 years at least, for example x264 https://mailman.videolan.org/pipermail/x264-devel/2010-June/007462.html. Clang doesn't exhibit this which also supports a GCC bug.

Yes -Drestrict= makes blst_sk_to_pk_in_g1 behave

Unless of course if your C code is sensitive to optimization level, and you want to compile everything in the same go...

Yes that's the case, I don't compile BLST as a separate DLL but compile it at the same time as the rest of the Nim/C code.

@dot-asm
Copy link
Collaborator

dot-asm commented Aug 13, 2020

Why not other way around? :-):-):-)

There is a related GCC bug that has been lurking for 10 years at least,

In other words bug is so old that it's considered a feature:-) This is exactly why I'm not fond of compiler-specific workarounds, they effectively let compiler off the hook...

Either way, could you double-check vec_select_n? It's only x86_64 for the moment...

@dot-asm
Copy link
Collaborator

dot-asm commented Aug 18, 2020

vec_select_n is merged. Closing...

@mratsim
Copy link
Contributor Author

mratsim commented Sep 18, 2020

Sorry for the late reply, I didn't have time to upgrade earlier .

Unfortunately I seem to still get wrong results with GCC -O3 with the master from yesterday (a8398ed) unless I pass fno-tree-vectorize, despite that branch being merged and f8a77bd

For now I'll keep using fno-tree-vectorize with that compiler.

mratsim added a commit to status-im/nimbus-eth2 that referenced this issue Sep 18, 2020
* Bump BLST

* Test for supranational/blst#22 regression

* Use SHA256 from BLST + bump nim-blscurve to reenable fno-tree-vectorize

* SHA256 on non-blst platforms import fixes

* import fixes again

* can't prefix with nimcrypto

* address review comment [skip ci]

* {.noInit.} on the digests
@dot-asm
Copy link
Collaborator

dot-asm commented Sep 25, 2020

Still? Hmm... Since -Drestrict= helps, I assume it still does, can you test one thing? Drop the qualifiers from ptype##_ccopy in src/point.h.

@dot-asm
Copy link
Collaborator

dot-asm commented Sep 25, 2020

Just in case for reference. restrict qualifiers were added in order to eliminate arguably unjustified branches depending on outcome of pointer comparisons. It's not really a constant-time thing, but rather this-ought-to-complicate-binary-code-validation thing. But one should expect slightly better better performance as well...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants