
Intgemm refactor #762

Merged
merged 14 commits into from
Nov 14, 2020

Conversation

emjotde
Member

@emjotde emjotde commented Nov 12, 2020

Hi,
this is a first draft of my small refactoring changes. All in all, it was easier than expected.

  • Similar to FBGEMM the compute type is now determined by the Element type in the model.bin file.
  • If that's only intgemm{8,16} it will do the hardware-specific stuff on the fly.
  • If intgemm8avx2 etc. it is already pre-packed and does no conversion, but is hardware specific. This should be possible to mmap.
  • Got rid of the code in backends (for now). Open to adding it back later.

TODO:

  • Verify I did not break FBGEMM back-compat with the new types. (regression tests for FBGEMM pass, @ykim362 you may want to do some more testing)
  • Test out mmapping (seems to work out-of-the-box, see below)
  • Take a look at the TODOs I mentioned, especially concerning aborting with wrong hardware during conversion.
  • It seems sse2 or ssse3 models do not work for me on avx2; please help verify/debug. AVX2 on AVX2 hardware works fine.
  • Verify AVX512 works on AVX512 hardware.

Checklist

  • I have tested the code manually
  • I have run regression tests
  • I have read and followed CONTRIBUTING.md
  • I have updated CHANGELOG.md

@emjotde
Member Author

emjotde commented Nov 12, 2020

  • Mmapping just works, how nice :) At least for intgemm{8,16}avx2.
  • Can be enabled for testing here: (just set the define to 1, and you should see a message like: Memory mapping model at 0x7f2fcae1f000).

@emjotde
Member Author

emjotde commented Nov 12, 2020

@snukky with this change the following regression test fails:

tests/models/wnmt18/test_student_small_aan_optimize.sh

This is expected as I removed the --optimize option. The correct way to adapt this test would be to run marian-conv -f model.npz -t model.bin --gemm-type intgemm16 and then point marian-decoder to the model.bin file.

This could also be extended to more regression tests, at least for --gemm-type intgemm8.

@emjotde emjotde marked this pull request as ready for review November 12, 2020 04:46
@XapaJIaMnu
Contributor

Hey,

It looks good. Glad MMAP'ing works. I would like to have the shifted version exposed (as a command line switch or something), as it is noticeably faster on VNNI.

As for SSSE/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width dependent. Potentially we might want to make it read in the type.

Cheers,

Nick

protected:
bool int16_{false};
bool int8_{false};
bool int8shift_{false};
Contributor

Any plans for alternative passing of the "shifted" parameter?

Member Author

Not against it, but can we do that in the next PR? Maybe there are better ways than having it in backend (maybe not).

Member

Should we just get rid of the unshifted version?

Contributor

It's slower on pre-VNNI without precomputed Alphas, which we don't have here.

Member

Out of curiosity, could you let me know how much the perf difference is between shifted and unshifted?

Member Author

If the difference isn't dramatic I would be in favor of being pragmatic about this :)

@@ -283,31 +282,7 @@ namespace marian {
};

if (shortlist_ && !cachedShortWt_) { // shortlisted versions of parameters are cached within one batch, then clear()ed
Contributor

How difficult would it be to pass the shortlist_ to the Affine routine? Then we could do it on the fly without extra cached nodes.

Member Author

Something to think about, but maybe not in this PR? This whole shortlist business is quite horrible already, and having to pass it even further down makes it worse IMO. This would benefit from a cleaner, better thought-out approach.

return cpu::integer::affine<Type::int8>(a, b, bias, transA, transB, scale, clipValue);
} else if(a->graph()->getBackend()->isInt16() || matchType<intgemm16>(bElementType) ) {
} else if(sizeOf(bElementType) == 2) {
Contributor

I don't understand this. Why not use matchType? It's not obvious what sizeOf(bElementType) == 2 means.

Member Author

You would need to list all 8 types here. See explanation below.

return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
} else if(a->graph()->getBackend()->isInt16() || matchType<intgemm16>(bElementType)) {
} else if(sizeOf(bElementType) == 2) {
Contributor

Again, what does sizeOf(bElementType) == 2 mean, and can we use matchType or something else more obvious instead?

Contributor

@XapaJIaMnu XapaJIaMnu left a comment

Looks good; the biggest issue is that the meaning of sizeOf(bElementType) == 2 is unclear.

@snukky
Member

snukky commented Nov 12, 2020

I will refresh regression tests we have for the intgemm integration and check with this branch later today.

@emjotde
Member Author

emjotde commented Nov 12, 2020

@XapaJIaMnu sizeOf(...) is just our equivalent for C++ sizeof that works with the Type enum. As you know the types are defined by their type classes and their size.

  • intgemm16 has isIntgemm: true, size: 2, every hardware-specific flag false;
  • intgemm8avx2 has isIntgemm: true, size: 1, isAvx2: true
  • etc.

So the size of the type is a correct distinguishing feature, no? That said, at least a comment would be helpful there, yes.
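To make the discriminator concrete, here is a minimal, self-contained sketch of the idea; the names and bit layout below are purely illustrative, not marian's actual Type encoding:

```cpp
#include <cstddef>

// Illustrative model only: each enum value packs the element size and
// class flags together, so sizeOf() and isIntgemm() need no lookup table.
enum class Type : unsigned {
  // bits 0-7: element size in bytes; bit 8: intgemm class; bit 9: AVX2 flag
  intgemm8      = 1 | (1u << 8),              // generic, converted on the fly
  intgemm16     = 2 | (1u << 8),
  intgemm8avx2  = 1 | (1u << 8) | (1u << 9),  // pre-packed, hardware-specific
  intgemm16avx2 = 2 | (1u << 8) | (1u << 9),
  float32       = 4,
};

inline std::size_t sizeOf(Type t)    { return static_cast<unsigned>(t) & 0xFFu; }
inline bool        isIntgemm(Type t) { return (static_cast<unsigned>(t) >> 8) & 1u; }
```

Under the isIntgemm() guard, sizeOf(...) == 2 then uniquely selects the 16-bit variants regardless of the hardware suffix, which is why the branch does not need to list all eight types.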

@emjotde
Member Author

emjotde commented Nov 12, 2020

As for SSSE/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width dependent. Potentially we might want to make it read in the type.

I don't get that. The code is clearly width-dependent, and so is the type, no? How is there a mismatch? Or are you saying it is using the AVX2 code?

@emjotde
Member Author

emjotde commented Nov 12, 2020

IMO the big thing missing here is now making sure that everything that's hardware-specific either runs correctly (like intgemm16sse2 on avx2) or croaks FBGEMM-style. So we can pull this into your branch and then look at the ABORTs etc. there. Or we can finish it here first. Maybe with matching regression tests.

@XapaJIaMnu
Contributor

As for SSSE/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width dependent. Potentially we might want to make it read in the type.

I don't get that. The code clearly is width-dependent, so is the type, no? How is there a mismatch? Or are you saying it is using the AVX2 code?

Sorry, "width" here just means int8 or int16. The instruction width is determined by runtime-initialised function pointers: if you want the AVX2 version, for example, you need to call intgemm::avx2_int8::Multiply. If you call intgemm::Int8::Multiply, you get dispatched to the highest architecture your machine supports.
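A small mock of that dispatch behaviour (all names below are stand-ins; the real entry points are intgemm::Int8::Multiply and intgemm::avx2_int8::Multiply): the generic call resolves through a function pointer chosen at startup, so weights pre-packed for SSE2 but multiplied through the dispatcher on an AVX2 machine would hit the wrong kernel.

```cpp
#include <string>

namespace mock {
// Architecture-pinned kernels: calling one of these directly always uses
// that instruction width, like intgemm::avx2_int8::Multiply.
std::string multiply_sse2() { return "sse2"; }
std::string multiply_avx2() { return "avx2"; }

// Runtime-initialised dispatcher: pretend CPU detection picked AVX2 as the
// best ISA this machine supports, like intgemm::Int8::Multiply does.
std::string (*multiply_best)() = &multiply_avx2;

// Generic call: goes wherever the dispatcher points, regardless of how
// the matrix was pre-packed.
std::string multiply() { return multiply_best(); }
}  // namespace mock
```

On this pretend AVX2 machine, mock::multiply() always runs the AVX2 kernel; only the pinned mock::multiply_sse2() call would match an SSE2-packed layout.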

@XapaJIaMnu
Contributor

@XapaJIaMnu sizeOf(...) is just our equivalent for C++ sizeof that works with the Type enum. As you know the types are defined by their type classes and their size.

  • intgemm16 has isIntgemm: true, size: 2, every hardware-specific flag false;
  • intgemm8avx2 has isIntgemm: true, size: 1, isAvx2: true
  • etc.

So the size of the type is a correct distinguishing feature, no? That said, at least a comment would be helpful there, yes.

Is intgemm the only 16bit one? There's GPU fp16, etc..?

@emjotde
Member Author

emjotde commented Nov 12, 2020

Is intgemm the only 16bit one? There's GPU fp16, etc..?

Of course, that's why there is a check for isIntgemm above :) I suppose this should actually have all 8 Intgemm types which would also avoid the sse2 vs avx2 issue?

   } else if(isFloat(aElementType) && isIntgemm(bElementType)) {
      // @TODO: this branch should move into cpu::integer::*
      if(sizeOf(bElementType) == 1) {
        return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
      } else if(sizeOf(bElementType) == 2) {
        return cpu::integer::dot<Type::int16>(a, b, transA, transB, scale);
      } else {
        ABORT("Wrong size for Intgemm type {}??", sizeOf(bElementType));
      }
    } 

@emjotde
Member Author

emjotde commented Nov 12, 2020

So it seems we should have a single cpu::integer::dot function (and affine) rather than two distinguished by built-in type, and that one function would dispatch the correct multiplies.
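A hedged sketch of that single entry point (the Type names and return values are placeholders, not marian's API): one dot() inspects the element type of B at runtime and routes to the matching kernel, aborting on anything unexpected.

```cpp
#include <stdexcept>
#include <string>

// Placeholder element types; sizeOf() mirrors marian's Type-aware sizeof.
enum class ElemType { intgemm8 = 1, intgemm16 = 2, float32 = 4 };
inline int sizeOf(ElemType t) { return static_cast<int>(t); }

// Single dispatching dot(): the returned strings stand in for calls to
// the 8-bit and 16-bit multiply kernels.
std::string dot(ElemType bType) {
  switch (sizeOf(bType)) {
    case 1:  return "int8 multiply";
    case 2:  return "int16 multiply";
    default: throw std::runtime_error("Wrong size for Intgemm type");
  }
}
```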

@XapaJIaMnu
Contributor

Is intgemm the only 16bit one? There's GPU fp16, etc..?

Of course, that's why there is a check for isIntgemm above :) I suppose this should actually have all 8 Intgemm types which would also avoid the sse2 vs avx2 issue?

   } else if(isFloat(aElementType) && isIntgemm(bElementType)) {
      // @TODO: this branch should move into cpu::integer::*
      if(sizeOf(bElementType) == 1) {
        return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
      } else if(sizeOf(bElementType) == 2) {
        return cpu::integer::dot<Type::int16>(a, b, transA, transB, scale);
      } else {
        ABORT("Wrong size for Intgemm type {}??", sizeOf(bElementType));
      }
    } 

No, it will not (automatically) avoid the issue of sse2 vs avx2. We need to update all of intgemm_interface.h to not use the dispatcher: intgemm::width::X needs to become intgemm::avx2_8bit::X (for example).

@emjotde
Member Author

emjotde commented Nov 12, 2020

We need to update all of intgemm_interface.h to not use the dispatcher: intgemm::width::X needs to become intgemm::avx2_8bit::X (for example)

OK, then we should do that.

Member

@ykim362 ykim362 left a comment

Looks good!

@emjotde
Member Author

emjotde commented Nov 12, 2020

BTW, on-the-fly conversion will be easy to add back once we merge my fp16 changes. There are --precision flags that operate at the model-loading level. We can just do an on-the-fly conversion there, as I do for fp16 (but to intgemm8 or intgemm16), and the corrected dispatching will take care of the rest.

@emjotde
Member Author

emjotde commented Nov 12, 2020

@ykim362 That will also enable on-the-fly conversion with fbgemm when we do that correctly. :)

@ykim362
Member

ykim362 commented Nov 12, 2020

@ykim362 That will also enable on-the-fly conversion with fbgemm when we do that correctly. :)

@emjotde That sounds good! AML is using a branch for that.

@emjotde
Member Author

emjotde commented Nov 13, 2020

OK, as far as I am concerned this is more or less done. Removes some functionality, but it will be OK to add it back after we merge with the fp16 branch which should come with a few things to make that easier.

@emjotde
Member Author

emjotde commented Nov 13, 2020

Mmapping also works now for everything that runs on my AVX2 machine. Couldn't test AVX512.

@emjotde
Member Author

emjotde commented Nov 13, 2020

@kpu any remarks before this goes into the original PR? (And then into master if compilation etc. succeeds)

@emjotde
Member Author

emjotde commented Nov 14, 2020

OK, pulling into the original branch. @snukky and @XapaJIaMnu are working on regression tests, once those are updated and tested, we can pull it in.
