Add AMX support to speed up Faiss Inner-Product #535
Conversation
@mellonyou 🔍 Important: PR Classification Needed! For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”. Thanks for your efforts and contribution to the community!
Signed-off-by: Fangzheng Zhang <[email protected]>
issue: #541
I can't edit the labels; do I need any access permissions?
/kind enhancement
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
##           main     #535       +/-   ##
=========================================
+ Coverage      0   71.59%   +71.59%
=========================================
  Files         0       67       +67
  Lines         0     4446     +4446
=========================================
+ Hits          0     3183     +3183
- Misses        0     1263     +1263
BaseData::getState().store(BASE_DATA_STATE::MODIFIED);
}

void execut(float** out_f32) {
nit: execute?
Yes, it's a typo.
// inner memory bf16
bf16_md1 = dnnl::memory::desc({xrow, xcol}, dnnl::memory::data_type::bf16, dnnl::memory::format_tag::any);
bf16_md2 = dnnl::memory::desc({yrow, ycol}, dnnl::memory::data_type::bf16, dnnl::memory::format_tag::any);
Noob Q: why do we use bf16 here?
Because AMX has native support for bf16/int8 compute, which can significantly improve performance. We have tested this, and it has little impact on accuracy.
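To see why the accuracy impact is small: bf16 keeps fp32's full 8-bit exponent and truncates the mantissa to 7 bits, so the dynamic range is unchanged and the worst-case relative rounding error is about 2^-8. A minimal self-contained sketch (the conversion helper below is illustrative, not oneDNN's API):

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>

// Hypothetical helper: round-trip an fp32 value through bf16.
// bf16 keeps the fp32 sign and 8-bit exponent, plus the top 7 mantissa bits;
// here we round to nearest even before truncating the low 16 bits.
static float f32_to_bf16_to_f32(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    uint32_t rounded = bits + 0x7FFF + ((bits >> 16) & 1);  // round to nearest even
    rounded &= 0xFFFF0000u;                                  // drop low 16 mantissa bits
    float out;
    std::memcpy(&out, &rounded, sizeof(out));
    return out;
}
```

Values whose mantissa fits in 7 bits (powers of two, small integers) survive exactly; everything else picks up at most ~0.2% relative error, which inner-product accumulation tends to average out further.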
BASE_DATA_STATE expected = BASE_DATA_STATE::MODIFIED;

if (BaseData::getState().compare_exchange_strong(expected, BASE_DATA_STATE::PREPARE)) {
    pthread_rwlock_wrlock(&rwlock);
Noob Q: why do we need to lock this? Is it because only one AMX instruction can run at a time?
The lock is designed for the multi-thread scenario: if two threads operate on the same base dataset with different query datasets, the lock prevents the base dataset from being modified by one thread while the other is working on it.
dnnl::reorder(f32_mem1, bf16_mem1).execute(engine_stream, f32_mem1, bf16_mem1);
BASE_DATA_STATE expected = BASE_DATA_STATE::MODIFIED;

if (BaseData::getState().compare_exchange_strong(expected, BASE_DATA_STATE::PREPARE)) {
Plz CMIIW. In the first call, expected will be BASE_DATA_STATE::MODIFIED, changed into BASE_DATA_STATE::PREPARE in this line, and return false. Then it will loop in line 196 forever.
The state is also designed for the multi-thread scenario. The state progression is INIT -> MODIFIED -> PREPARE -> READY. When the first thread has finished the initialization, the other threads will see the READY state and skip line 196.
if (is_dnnl_enabled()) {
    float *res_arr = NULL;

    comput_f32bf16f32_inner_product(nx, d, ny, d, const_cast<float*>(x), const_cast<float*>(y), &res_arr);
Can we implement a dynamic hook like all the other SIMD hooks in Knowhere?
We also considered following the other SIMD interfaces, but due to the implementation of AMX, it may be a bit incompatible with the current interface:
- AMX prefers batch data calculation, and its library will schedule multiple threads on its own.
- The return value is an array, for batch data operation.
So if we use a dynamic hook, we may need to add a new interface for batch data operation, and call that new interface when AMX is available.
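A hypothetical sketch of what such a batch-style hook could look like (illustrative names only, not Knowhere's real hook API): unlike the usual per-pair hook that returns one distance, the batch entry point owns the whole nx-by-ny result array, so an AMX backend can schedule its own threads behind it, while a plain reference implementation satisfies the same contract elsewhere.

```cpp
#include <cstddef>

// Hypothetical batch hook signature: fills out[i * ny + j] with the inner
// product of query row i and base row j (row-major nx x ny results).
using fvec_ip_batch_t = void (*)(const float* x, size_t nx,
                                 const float* y, size_t ny,
                                 size_t d, float* out);

// Reference fallback used when the AMX kernel is unavailable.
static void ip_batch_ref(const float* x, size_t nx,
                         const float* y, size_t ny,
                         size_t d, float* out) {
    for (size_t i = 0; i < nx; ++i)
        for (size_t j = 0; j < ny; ++j) {
            float s = 0.0f;
            for (size_t k = 0; k < d; ++k) s += x[i * d + k] * y[j * d + k];
            out[i * ny + j] = s;
        }
}

// At hook-initialization time this pointer would be redirected to the
// AMX/oneDNN kernel when the CPU and environment support it.
fvec_ip_batch_t fvec_inner_product_batch = ip_batch_ref;
```

Dispatch then stays a single indirect call at the call site, matching how the existing per-pair SIMD hooks are swapped at startup.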
@liliu-z We are planning to port the code to adapt to the dynamic hook; do you have any other suggestions?
@@ -211,30 +214,59 @@ void exhaustive_inner_product_seq_impl(
    using SingleResultHandler = typename BlockResultHandler::SingleResultHandler;
    int nt = std::min(int(nx), omp_get_max_threads());

#ifdef FAISS_WITH_DNNL
The problem here is that this code is inserted into the function that computes inner products according to a filter. So, if the filter filters out 90% of the samples, then 9 out of 10 computed distances will not be used, costing quite a bit of extra memory bandwidth.
Benchmarks are needed for this PR.
@alexanderguzhva Is the filter inside Knowhere or in Milvus?
@xtangxtang An external filter (in the form of a bitset), provided by Milvus.
…nednn. Signed-off-by: Eric Zhang <[email protected]>
Added searchwithbuf and rangesearch interface implementations with AMX oneDNN. The related build config will be submitted to Milvus later.
I am trying to do a manual filter with multithreading before the AMX IP.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mellonyou. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I have tried doing a manual filter with multithreading before the AMX IP, and it has a significant impact on performance. So we only filter the results to ensure their accuracy, which has a relatively small impact on performance.
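The chosen approach can be sketched as follows (illustrative names, not the PR's code): with a batch kernel it is cheaper to compute the inner product for every base vector and drop filtered-out results afterwards than to gather the selected vectors first and break the batch, at the cost of the wasted distances noted in the review above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Compute all inner products in one batch (stand-in for the AMX kernel),
// then keep only the results whose IDs pass the bitset filter.
// bitset[j] == true means base vector j is filtered out.
std::vector<std::pair<size_t, float>>
batch_ip_then_filter(const std::vector<float>& q,
                     const std::vector<float>& base, size_t d,
                     const std::vector<bool>& bitset) {
    size_t ny = base.size() / d;
    std::vector<float> scores(ny, 0.0f);        // full batch, no gathering
    for (size_t j = 0; j < ny; ++j)
        for (size_t k = 0; k < d; ++k)
            scores[j] += q[k] * base[j * d + k];
    std::vector<std::pair<size_t, float>> out;  // filter applied to results only
    for (size_t j = 0; j < ny; ++j)
        if (!bitset[j]) out.emplace_back(j, scores[j]);
    return out;
}
```

When the filter is highly selective, the pre-gather approach could win back the wasted bandwidth, which is why benchmarks across filter selectivities were requested in the review.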
Use Intel AMX to speed up the Inner-Product algorithm of knowhere::BruteForce::Search(), which can bring more than a 10x performance boost.
Build parameter: use "-o with_dnnl=True/False" to enable/disable the AMX feature.
This feature depends on libdnnl.so.3; you can install it by running scripts/install_deps.sh.
Runtime parameter: if you want to use the AMX feature, you need to set the environment variable "DNNL_ENABLE=1" first; otherwise the AMX feature will not work.
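Putting the build and runtime switches together, a typical opt-in could look like the following sketch (the build option, env var, and install script are taken from this PR description; the build invocation and binary name are placeholders for your own setup):

```shell
# 1. Install the oneDNN runtime dependency (provides libdnnl.so.3)
./scripts/install_deps.sh

# 2. Build with the AMX/oneDNN code path compiled in; pass the PR's
#    build option to your usual build invocation (shown with conan here,
#    adjust to how you build Knowhere)
conan install .. -o with_dnnl=True

# 3. Opt in at runtime; without this the regular SIMD path is used
export DNNL_ENABLE=1
./my_search_benchmark   # hypothetical benchmark binary
```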