Intgemm refactor #762
Conversation
@snukky with this change the following regression test fails:
This is expected as I removed the
This could also be extended to more regression tests, at least for
Hey, it looks good. Glad MMAP'ing works. I would like to have the shifted version exposed (as a command-line switch or something), as it is noticeably faster on VNNI. As for SSSE3/SSE not working on avx2, this is because the multiply here: https://github.com/marian-nmt/marian-dev/blob/intgemm_refactor/src/tensors/cpu/intgemm_interface.h#L407 is width-dependent. Potentially we might want to make it read in the type. Cheers, Nick
protected:
  bool int16_{false};
  bool int8_{false};
  bool int8shift_{false};
Any plans for alternative passing of the "shifted" parameter?
Not against it, but can we do that in the next PR? Maybe there are better ways than having it in backend (maybe not).
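For illustration only, a minimal sketch of how a command-line switch could be mapped onto the backend flags shown in the diff above; the option name and the setter names are assumptions for this sketch, not something the PR defines:

```cpp
// Sketch only: "int8-shift" and the setters below are illustrative assumptions;
// the PR currently records the choice in the backend flags shown above.
void configureIntgemm(Ptr<Options> options, Ptr<Backend> backend) {
  bool useShift = options->get<bool>("int8-shift", false);
  backend->setInt8(true);          // take the 8-bit intgemm path
  backend->setInt8Shift(useShift); // shifted kernels, noticeably faster on VNNI
}
```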
Should we just get rid of the unshifted version?
It's slower on pre-VNNI without precomputed Alphas, which we don't have here.
Out of curiosity, could you let me know how much the perf diff is between shifted and unshifted?
If the difference isn't dramatic I would be in favor of being pragmatic about this :)
@@ -283,31 +282,7 @@ namespace marian {
  };

  if (shortlist_ && !cachedShortWt_) { // shortlisted versions of parameters are cached within one batch, then clear()ed
How difficult is it to have the shortlist_ passed to the Affine routine? Then we could do it on the fly without extra cached nodes.
Something to think about, but not in this PR maybe? This whole shortlist business is quite horrible already, and having to pass it even further down makes it worse IMO. This would benefit from a cleaner, better-thought-out approach.
src/graph/expression_operators.cpp
  return cpu::integer::affine<Type::int8>(a, b, bias, transA, transB, scale, clipValue);
- } else if(a->graph()->getBackend()->isInt16() || matchType<intgemm16>(bElementType) ) {
+ } else if(sizeOf(bElementType) == 2) {
I don't understand this, why not use matchType? It's not obvious what sizeOf(bElementType) == 2 means.
You would need to list all 8 types here. See explanation below.
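For context, a sketch of what listing the types would look like next to the width shorthand; matchType and intgemm16 appear in the diff above, while the other ISA-specific type names are illustrative:

```cpp
// Sketch: enumerating the 16-bit variants explicitly. Only intgemm16 (and
// intgemm16sse2, mentioned later in this thread) are names confirmed here;
// the avx2/avx512 variants are illustrative placeholders.
static bool isIntgemm16Type(Type t) {
  return matchType<intgemm16>(t)     || matchType<intgemm16sse2>(t) ||
         matchType<intgemm16avx2>(t) || matchType<intgemm16avx512>(t);
}

// The shorthand used in the new code: once t is known to be an intgemm type,
// every 16-bit variant is 2 bytes wide, so a single width check covers them all.
static bool isIntgemm16TypeByWidth(Type t) {
  return sizeOf(t) == 2;
}
```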
src/graph/expression_operators.cpp
  return cpu::integer::dot<Type::int8>(a, b, transA, transB, scale);
- } else if(a->graph()->getBackend()->isInt16() || matchType<intgemm16>(bElementType)) {
+ } else if(sizeOf(bElementType) == 2) {
Again, what does sizeOf(bElementType) == 2 mean, and can we use matchType or something else more obvious instead?
Looks good, the biggest issue is that the meaning of sizeOf(bElementType) == 2 is unclear.
I will refresh the regression tests we have for the intgemm integration and check with this branch later today.
@XapaJIaMnu So the size of the type is a correct distinguishing feature, no? That being said, at least a comment would be helpful there, yes.
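Concretely, the suggested comment could state that assumption right at the check, along these lines (a sketch against the diff above):

```cpp
// Assumes bElementType has already passed the intgemm-type check discussed in
// this thread; among the intgemm types, a 2-byte element width uniquely selects
// the 16-bit variants, so we avoid listing every ISA-specific type name here.
} else if(sizeOf(bElementType) == 2) {
```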
I don't get that. The code clearly is width-dependent, and so is the type, no? How is there a mismatch? Or are you saying it is using the AVX2 code?
IMO the big thing missing here is now making sure that everything that's hardware-specific either runs correctly (like intgemm16sse2 on avx2) or croaks FBGEMM-style. So we can pull this into your branch and then look at the ABORTs etc. there. Or we can finish it here first, maybe with matching regression tests.
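For illustration, the "croaks FBGEMM-style" behaviour could be a guard along these lines; ABORT_IF is marian's existing macro, but the capability helpers here are placeholders, not existing functions:

```cpp
// Sketch: reject ISA-specific integer types that the current CPU cannot run,
// instead of silently taking a wrong code path. isAvx512Type() and
// cpuHasAvx512() are placeholder helpers for this illustration.
static void abortIfUnsupported(Type elementType) {
  ABORT_IF(isAvx512Type(elementType) && !cpuHasAvx512(),
           "Tensor type requires AVX512, which this CPU does not support");
}
```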
Sorry, this "width" here is just int8 or int16. The instruction "width" is determined by runtime-initialised function pointers. If you want avx2, for example, you need to do
Is intgemm the only 16-bit one? There's GPU fp16, etc.?
Of course, that's why there is a check for
So it seems to me we should have a
No, it will not avoid (automatically) the issue of sse2 vs avx2. We need to update all of
OK, then we should do that.
Looks good!
BTW, on-the-fly conversion will be easy to add back once we merge with my fp16 changes. There are
@ykim362 That will also enable on-the-fly conversion with fbgemm when we do that correctly. :)
OK, as far as I am concerned this is more or less done. It removes some functionality, but it will be OK to add it back after we merge with the fp16 branch, which should come with a few things to make that easier.
Mmapping also works now for everything that runs on my AVX2 machine. Couldn't test AVX512.
@kpu any remarks before this goes into the original PR? (And then into master if compilation etc. succeeds.)
OK, pulling into the original branch. @snukky and @XapaJIaMnu are working on regression tests; once those are updated and tested, we can pull it in.
Hi,
this is a first draft of my small refactoring changes. All in all, it was easier than I thought.
TODO:
- Verify I did not break FBGEMM back-compat with the new types. (Regression tests for FBGEMM pass; @ykim362, you may want to do some more testing.)
- Test out mmapping. (Seems to work out-of-the-box, see below.)

Checklist