
Intgemm refactor #762

Merged
merged 14 commits into from
Nov 14, 2020

Conversation

emjotde
Member

@emjotde emjotde commented Nov 12, 2020

Hi,
this is a first draft of my small refactoring changes. All in all, it was easier than expected.

  • Similar to FBGEMM the compute type is now determined by the Element type in the model.bin file.
  • If that's only intgemm{8,16} it will do the hardware-specific stuff on the fly.
  • If intgemm8avx2 etc. it is already pre-packed and does no conversion, but is hardware specific. This should be possible to mmap.
  • Got rid of the code in backends (for now). Open to adding it back later.

TODO:

  • Verify I did not break FBGEMM back-compat with the new types. (regression tests for FBGEMM pass, @ykim362 you may want to do some more testing)
  • Test out mmapping (seems to work out-of-the-box, see below)
  • Take a look at the TODOs I mentioned, especially concerning aborting with wrong hardware during conversion.
  • It seems sse2 or ssse3 models do not work for me on avx2; please help verify/debug. AVX2 on AVX2 hardware works fine.
  • Verify AVX512 works on AVX512 hardware.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@emjotde
Member Author

emjotde commented Nov 12, 2020

  • Mmapping just works, how nice :) At least for intgemm{8,16}avx2.
  • Can be enabled for testing here: (just set the define to 1, and you should see a message like: Memory mapping model at 0x7f2fcae1f000).

@emjotde
Member Author

emjotde commented Nov 12, 2020

@snukky with this change the following regression test fails:

tests/models/wnmt18/test_student_small_aan_optimize.sh

This is expected as I removed the --optimize option. The correct way to adapt this test would be to run marian-conv -f model.npz -t model.bin --gemm-type intgemm16 and then point marian-decoder to the model.bin file.

This could also be extended to more regression tests, at least for --gemm-type intgemm8.

@emjotde emjotde marked this pull request as ready for review November 12, 2020 04:46
@XapaJIaMnu
Contributor

Hey,

It looks good. Glad MMAP'ing works. I would like to have the shifted version exposed (as a command line switch or something), as it is noticeably faster on VNNI.

As for SSSE/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width dependent. Potentially we might want to make it read in the type.

Cheers,

Nick

protected:
bool int16_{false};
bool int8_{false};
bool int8shift_{false};
Contributor

Any plans for alternative passing of the "shifted" parameter?

Member Author

Not against it, but can we do that in the next PR? Maybe there are better ways than having it in backend (maybe not).

Member

Should we just get rid of the unshifted version?

Contributor

It's slower on pre-VNNI without precomputed Alphas, which we don't have here.

Member

Out of curiosity, could you let me know how much the perf difference is between shifted and unshifted?

Member Author

If the difference isn't dramatic I would be in favor of being pragmatic about this :)

@@ -283,31 +282,7 @@ namespace marian {
};

if (shortlist_ && !cachedShortWt_) { // shortlisted versions of parameters are cached within one batch, then clear()ed
Contributor

How difficult would it be to pass the shortlist_ to the Affine routine? Then we could do it on the fly without extra cached nodes.

Member Author

Something to think about, but maybe not in this PR? This whole shortlist business is quite horrible already, and having to pass it even further down makes it worse IMO. This would benefit from a cleaner, better thought-out approach.

return cpu::integer::affine<Type::int8>(a, b, bias, transA, transB, scale, clipValue);
} else if(a->graph()->getBackend()->isInt16() || matchType<intgemm16>(bElementType) ) {
} else if(sizeOf(bElementType) == 2) {
Contributor

I don't understand this. Why not use matchType? It's not obvious what sizeOf(bElementType) == 2 means.

Member Author

You would need to list all 8 types here. See explanation below.

return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
} else if(a->graph()->getBackend()->isInt16() || matchType<intgemm16>(bElementType)) {
} else if(sizeOf(bElementType) == 2) {
Contributor

Again, what does sizeOf(bElementType) == 2 mean, and can we use matchType or something else more obvious instead?

Contributor

@XapaJIaMnu XapaJIaMnu left a comment

Looks good; the biggest issue is that the meaning of sizeOf(bElementType) == 2 is unclear.

@snukky
Member

snukky commented Nov 12, 2020

I will refresh regression tests we have for the intgemm integration and check with this branch later today.

@emjotde
Member Author

emjotde commented Nov 12, 2020

@XapaJIaMnu sizeOf(...) is just our equivalent for C++ sizeof that works with the Type enum. As you know the types are defined by their type classes and their size.

  • intgemm16 has isIntgemm: true, size: 2, every hardware-specific flag false;
  • intgemm8avx2 has isIntgemm: true, size: 1, isAvx2: true
  • etc.

So the size of the type is a correct distinguishing feature, no? That said, at least a comment would be helpful there, yes.
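To make the discriminator concrete, here is a minimal, self-contained sketch of the idea; the names and bit layout below are purely illustrative, not marian's actual Type encoding:

```cpp
#include <cstddef>

// Illustrative model only: each enum value packs the element size and
// class flags together, so sizeOf() and isIntgemm() need no lookup table.
enum class Type : unsigned {
  // bits 0-7: element size in bytes; bit 8: intgemm class; bit 9: AVX2 flag
  intgemm8      = 1 | (1u << 8),              // generic, converted on the fly
  intgemm16     = 2 | (1u << 8),
  intgemm8avx2  = 1 | (1u << 8) | (1u << 9),  // pre-packed, hardware-specific
  intgemm16avx2 = 2 | (1u << 8) | (1u << 9),
  float32       = 4,
};

inline std::size_t sizeOf(Type t)    { return static_cast<unsigned>(t) & 0xFFu; }
inline bool        isIntgemm(Type t) { return (static_cast<unsigned>(t) >> 8) & 1u; }
```

Under the isIntgemm() guard, sizeOf(...) == 2 then uniquely selects the 16-bit variants regardless of the hardware suffix, which is why the branch does not need to list all eight types.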

@emjotde
Member Author

emjotde commented Nov 12, 2020

As for SSSE/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width dependent. Potentially we might want to make it read in the type.

I don't get that. The code is clearly width-dependent, and so is the type, no? How is there a mismatch? Or are you saying it is using the AVX2 code?

@emjotde
Member Author

emjotde commented Nov 12, 2020

IMO the big thing missing here is now making sure that everything that's hardware-specific either runs correctly (like intgemm16sse2 on avx2) or croaks FBGEMM-style. So we can pull this into your branch and then look at the ABORTs etc. there. Or we can finish it here first. Maybe with matching regression tests.

@XapaJIaMnu
Contributor

As for SSSE/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width dependent. Potentially we might want to make it read in the type.

I don't get that. The code clearly is width-dependent, so is the type, no? How is there a mismatch? Or are you saying it is using the AVX2 code?

Sorry, "width" here just means int8 or int16. The instruction width is determined by runtime-initialised function pointers: if you want the AVX2 version, for example, you need to call intgemm::avx2_int8::Multiply. If you call intgemm::Int8::Multiply, you get dispatched to the highest architecture your machine supports.
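A small mock of that dispatch behaviour (all names below are stand-ins; the real entry points are intgemm::Int8::Multiply and intgemm::avx2_int8::Multiply): the generic call resolves through a function pointer chosen at startup, so weights pre-packed for SSE2 but multiplied through the dispatcher on an AVX2 machine would hit the wrong kernel.

```cpp
#include <string>

namespace mock {
// Architecture-pinned kernels: calling one of these directly always uses
// that instruction width, like intgemm::avx2_int8::Multiply.
std::string multiply_sse2() { return "sse2"; }
std::string multiply_avx2() { return "avx2"; }

// Runtime-initialised dispatcher: pretend CPU detection picked AVX2 as the
// best ISA this machine supports, like intgemm::Int8::Multiply does.
std::string (*multiply_best)() = &multiply_avx2;

// Generic call: goes wherever the dispatcher points, regardless of how
// the matrix was pre-packed.
std::string multiply() { return multiply_best(); }
}  // namespace mock
```

On this pretend AVX2 machine, mock::multiply() always runs the AVX2 kernel; only the pinned mock::multiply_sse2() call would match an SSE2-packed layout.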

@XapaJIaMnu
Contributor

@XapaJIaMnu sizeOf(...) is just our equivalent for C++ sizeof that works with the Type enum. As you know the types are defined by their type classes and their size.

  • intgemm16 has isIntgemm: true, size: 2, every hardware-specific flag false;
  • intgemm8avx2 has isIntgemm: true, size: 1, isAvx2: true
  • etc.

So the size of the type is a correct distinguishing feature, no? That said, at least a comment would be helpful there, yes.

Is intgemm the only 16bit one? There's GPU fp16, etc..?

@emjotde
Member Author

emjotde commented Nov 12, 2020

Is intgemm the only 16bit one? There's GPU fp16, etc..?

Of course, that's why there is a check for isIntgemm above :) I suppose this should actually have all 8 Intgemm types which would also avoid the sse2 vs avx2 issue?

   } else if(isFloat(aElementType) && isIntgemm(bElementType)) {
      // @TODO: this branch should move into cpu::integer::*
      if(sizeOf(bElementType) == 1) {
        return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
      } else if(sizeOf(bElementType) == 2) {
        return cpu::integer::dot<Type::int16>(a, b, transA, transB, scale);
      } else {
        ABORT("Wrong size for Intgemm type {}??", sizeOf(bElementType));
      }
    } 

@emjotde
Member Author

emjotde commented Nov 12, 2020

So it seems we should have a single cpu::integer::dot function (and affine) rather than two distinguished by built-in type, and that one function would dispatch the correct multiplies.
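A hedged sketch of that single entry point (the Type names and return values are placeholders, not marian's API): one dot() inspects the element type of B at runtime and routes to the matching kernel, aborting on anything unexpected.

```cpp
#include <stdexcept>
#include <string>

// Placeholder element types; sizeOf() mirrors marian's Type-aware sizeof.
enum class ElemType { intgemm8 = 1, intgemm16 = 2, float32 = 4 };
inline int sizeOf(ElemType t) { return static_cast<int>(t); }

// Single dispatching dot(): the returned strings stand in for calls to
// the 8-bit and 16-bit multiply kernels.
std::string dot(ElemType bType) {
  switch (sizeOf(bType)) {
    case 1:  return "int8 multiply";
    case 2:  return "int16 multiply";
    default: throw std::runtime_error("Wrong size for Intgemm type");
  }
}
```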

@XapaJIaMnu
Contributor

Is intgemm the only 16bit one? There's GPU fp16, etc..?

Of course, that's why there is a check for isIntgemm above :) I suppose this should actually have all 8 Intgemm types which would also avoid the sse2 vs avx2 issue?

   } else if(isFloat(aElementType) && isIntgemm(bElementType)) {
      // @TODO: this branch should move into cpu::integer::*
      if(sizeOf(bElementType) == 1) {
        return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
      } else if(sizeOf(bElementType) == 2) {
        return cpu::integer::dot<Type::int16>(a, b, transA, transB, scale);
      } else {
        ABORT("Wrong size for Intgemm type {}??", sizeOf(bElementType));
      }
    } 

No, it will not (automatically) avoid the issue of sse2 vs avx2. We need to update all of intgemm_interface.h to not use the dispatcher: intgemm::width::X needs to become intgemm::avx2_8bit::X (for example).

@emjotde
Member Author

emjotde commented Nov 12, 2020

We need to update all of intgemm_interface.h to not use the dispatcher: intgemm::width::X needs to become intgemm::avx2_8bit::X (for example)

OK, then we should do that.

Member

@ykim362 ykim362 left a comment

Looks good!

@emjotde
Member Author

emjotde commented Nov 12, 2020

BTW, on-the-fly conversion will be easy to add back once we merge my fp16 changes. There are --precision flags that operate at the model-loading level. We can just do an on-the-fly conversion there, as I do for fp16 (but to intgemm8 or intgemm16), and the corrected dispatching will take care of the rest.

@emjotde
Member Author

emjotde commented Nov 12, 2020

@ykim362 That will also enable on-the-fly conversion with fbgemm when we do that correctly. :)

@ykim362
Member

ykim362 commented Nov 12, 2020

@ykim362 That will also enable on-the-fly conversion with fbgemm when we do that correctly. :)

@emjotde That sounds good! AML is using a branch for that.

@emjotde
Member Author

emjotde commented Nov 13, 2020

OK, as far as I am concerned this is more or less done. Removes some functionality, but it will be OK to add it back after we merge with the fp16 branch which should come with a few things to make that easier.

@emjotde
Member Author

emjotde commented Nov 13, 2020

Mmapping also works now for everything that runs on my AVX2 machine. Couldn't test AVX512.

@emjotde
Member Author

emjotde commented Nov 13, 2020

@kpu any remarks before this goes into the original PR? (And then into master if compilation etc. succeeds)

@emjotde
Member Author

emjotde commented Nov 14, 2020

OK, pulling into the original branch. @snukky and @XapaJIaMnu are working on regression tests, once those are updated and tested, we can pull it in.
