
Enable Intel®-AMX/oneDNN to accelerate IndexFlatIP search #3266

Open · guangzegu wants to merge 10 commits into base: main

Conversation

@guangzegu commented Feb 27, 2024

Description

Intel® AMX (Intel® Advanced Matrix Extensions) is an AI acceleration engine built into every core of 4th/5th Gen Intel® Xeon® Scalable processors; it is a set of programming extensions designed to enhance the performance of matrix operations. Intel oneAPI Deep Neural Network Library (oneDNN) is an open-source performance library designed to accelerate deep learning frameworks on Intel architectures. oneDNN can leverage the efficient matrix-computation extensions provided by AMX to speed up deep learning frameworks on Intel architectures, especially computation-intensive matrix operations.

IndexFlatIP search accelerated by oneDNN/AMX is 1.7X to 5X faster than the default inner_product in scenarios with 1 query, dimensions ranging from 64 to 1024, and 1,000,000 vectors.

IndexFlatIP search accelerated by oneDNN/AMX is up to 4X faster than the BLAS inner_product in scenarios with 1,000 queries, dimensions ranging from 64 to 1024, and 1,000,000 vectors.

How to use

When invoking CMake, add the following option:

  • -DFAISS_ENABLE_DNNL=OFF: enable support for oneDNN to accelerate IndexFlatIP search (possible values are ON and OFF)

When you want to use Intel® AMX/oneDNN to accelerate IndexFlatIP search, set FAISS_ENABLE_DNNL to ON and run on a 4th/5th Gen Intel® Xeon® Scalable processor; the exhaustive_inner_product_seq method will then be accelerated.
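For illustration, a typical configure-and-build sequence with the feature switched on might look like this (the build directory and extra options are assumptions, not taken from this PR):

cmake -B build -DFAISS_ENABLE_DNNL=ON -DCMAKE_BUILD_TYPE=Release .
make -C build -j faiss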

Co-authored-by: @xtangxtang [email protected]

@facebook-github-bot (Contributor)

Hi @guangzegu!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@alexanderguzhva (Contributor)

alexanderguzhva commented Feb 29, 2024

@guangzegu this patch is at an extremely early stage.

  1. There needs to be a description in the readme.txt file about how to set up oneAPI properly. For example, I needed to install dnnl, mkl and tbb, and then run source setvars.sh from the oneAPI root directory. Imagine that someone sets this up on a fresh machine or in a docker container.
  2. It needs to be mentioned how to set DNNL_LIB in the cmake arguments.
  3. A unit test needs to be added that activates the execution path you've added. Basically, exhaustive search for IP has many if-then-else internal conditions and execution branches for various use cases (topk=1, topk=many, many query samples, few query samples, etc.). The effect of your patch needs to be measured in milliseconds.
  4. I tried to invoke the needed path, and whenever I invoke your code on an AWS M7i machine (Intel Xeon 4th gen), I see an exception in the test: could not create a primitive descriptor for an inner product forward propagation primitive. It is completely unclear what goes wrong. The amx_bf16 capability is enabled, which can be seen in cat /proc/cpuinfo.

Thanks

@mdouze Is Intel Xeon 4th gen available for CI?

@guangzegu (Author)

(quoting @alexanderguzhva's review comments above)

@alexanderguzhva Thank you very much for your comments.

  1. I will add a description in the readme.txt file on configuring oneDNN to enable this feature; indeed, the addition of unit tests needs to be carefully considered.
  2. I didn't run into the error could not create a primitive descriptor for an inner product forward propagation primitive in my environment. I didn't set the environment variables using oneAPI, but simply installed oneDNN separately from the community version. You can try following this link: https://oneapi-src.github.io/oneDNN/dev_guide_build.html (the version is v3.3+); a minimal build sketch following that guide is below.
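
For reference, a minimal source build following that guide might look like this (the install step and options are assumptions):

git clone https://github.com/oneapi-src/oneDNN.git
cd oneDNN
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
sudo cmake --install build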

@guangzegu guangzegu marked this pull request as ready for review March 19, 2024 07:38
@guangzegu (Author)

@alexanderguzhva

  1. I suppose the unit tests for this PR can be covered by faiss/tests/test_index.py.
  2. The new commits add some installation instructions and also improve performance.
  3. You might try again with the latest changes. If there are any issues, please feel free to contact me.

@alexanderguzhva (Contributor)

@guangzegu Thanks, I'll take a look

@guangzegu (Author)

@guangzegu Thanks, I'll take a look

Great! How's it going? Have you run into any issues?

@alexanderguzhva (Contributor)

@guangzegu Hi, it is still in my plans, together with zilliztech/knowhere#535. Sorry that it is taking so long; I get constantly distracted :(

@mdouze (Contributor)

mdouze commented May 28, 2024

We are looking into compiling this in the CI
@ramilbakhshyiev

@ramilbakhshyiev (Contributor)

@guangzegu Could you please rebase this? We can try a test CI build next and go from there. Thanks!

@guangzegu (Author)

@ramilbakhshyiev Sure, I will rebase it. Thanks!

@guangzegu (Author)

@alexanderguzhva No worries, I understand. Thank you for the update and for your efforts!

@ramilbakhshyiev (Contributor)

Thanks @guangzegu! We will be trying this out soon.

@mengdilin (Contributor)

Hi @guangzegu and @ramilbakhshyiev I'm trying to build this PR on the github CI :)

@guangzegu I'm following the documentation you provided in the README to set this up: https://oneapi-src.github.io/oneDNN/dev_guide_build.html. It looks like the official doc does not point to a conda installation (this is how FAISS normally installs dependencies; folks, please correct me if I'm wrong here). I was able to find dnnl on conda and ended up setting it up as below (can you clarify whether this is the right way to install the dependency and, if so, update the README?):

conda install -y -q conda-forge::onednn

If that is the case: I managed to get everything to build on CI, but we have a C++ unit test failing, complaining about a memleak (see build log). Is this something you can reproduce locally and expect? The actual test case source code is here.

@guangzegu (Author)

@mengdilin Thank you for verifying this PR and uncovering potential issues 😄, I'm going to try to reproduce this issue in my environment.

@mengdilin (Contributor)

mengdilin commented Jul 19, 2024

Hi @guangzegu. After combing through the PR, I'm not seeing anything obvious that would cause the memory leak (besides my nit comment), but I will of course defer to you on the dnnl memory-management aspect. I ended up running the failing mem_leak test through valgrind (diffing the test result from the master commit vs. your PR), and it looks like your PR did not introduce any new leak (valgrind produced consistent analysis between your PR and the master commit).

We will look into the possibility of disabling this test or omitting it from the dnnl build to unblock merging your PR.

@mengdilin (Contributor)

mengdilin commented Jul 29, 2024

@guangzegu After omitting the memory leak test from your PR, it looks like we have encountered precision issues in several unit tests when it comes to inner product computation. Is this expected?

The source for one of the failing tests is:

np.testing.assert_array_almost_equal(Dref, Dnew, decimal=5)

The test failure stacktrace looks like

args = (<function assert_array_almost_equal.<locals>.compare at 0x741cf7102480>, array([[12.644228 , 12.541752 , 11.607426 , ...03604 ],
       [12.91586  , 12.849993 , 12.578976 , ..., 11.806257 , 11.71474  ,
        11.699309 ]], dtype=float32))
kwds = {'err_msg': '', 'header': 'Arrays are not almost equal to 5 decimals', 'precision': 5, 'verbose': True}
    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Arrays are not almost equal to 5 decimals
E           
E           Mismatched elements: 1226 / 1230 (99.7%)
E           Max absolute difference: 0.02308941
E           Max relative difference: 0.00268776
E            x: array([[12.64423, 12.54175, 11.60743, ..., 10.98963, 10.9623 , 10.89734],
E                  [ 6.23966,  6.20934,  6.11219, ...,  5.85792,  5.76734,  5.7244 ],
E                  [12.55453, 12.26167, 12.1587 , ..., 11.59533, 11.56127, 11.4444 ],...
E            y: array([[12.64321, 12.55013, 11.60776, ..., 10.98861, 10.96813, 10.89554],

You can reproduce the failure on your PR by cloning PR #3615 and running the following after compiling faiss with DNNL mode on:

cd build/faiss/python && path/to/bin/python setup.py install && pytest --junitxml=test-results/pytest/results.xml tests/test_*.py

@mengdilin (Contributor)

@asadoughi pointed out that it looks like this PR is trading off precision for speed; see https://github.com/facebookresearch/faiss/pull/3266/files#diff-9228cbbdef764c34694b0b5d637c05058ccc6c6b3279469a1b3421633e7feb3fR57

If that is the case, can you provide some tests covering the low-precision scenario? We can gate these tests behind an explicit flag.
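
A sketch of such gating, reusing the Dref/Dnew arrays from the failing test (the env-var name and tolerances here are assumptions, not from the PR):

import os
import numpy as np

# hypothetical flag: relax the tolerance only for DNNL/AMX (bf16) builds
low_precision = os.environ.get("FAISS_TEST_LOW_PRECISION") == "1"
rtol = 5e-3 if low_precision else 1e-6
np.testing.assert_allclose(Dref, Dnew, rtol=rtol)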

@mengdilin (Contributor) left a comment

Let's restructure the AMX integration with faiss so that the bulk of its complexity can live inside a dedicated folder, cppcontrib/amx/, since this feature is off by default and requires users to turn it on explicitly (trading off precision for performance). I made some suggestions on how to accomplish that.

Following up on the previous comment, do you mind adding a few dedicated low precision tests for this PR?

#ifdef ENABLE_DNNL
// use AMX to accelerate if available
if (is_amxbf16_supported()) {
float* res_arr = (float*)malloc(nx * ny * sizeof(float));

Suggested change
float* res_arr = (float*)malloc(nx * ny * sizeof(float));
float* res_arr = new float[nx * ny];

nit: delete[] res_arr must be paired with new[]; pairing it with malloc is undefined behavior.

@@ -0,0 +1,141 @@
/**

Let's reorganize the code a little and move the bulk of the AMX logic to a centralized location.

Can you move this file and the other relevant pieces of dnnl code in distance computation to faiss/cppcontrib/amx: https://github.com/facebookresearch/faiss/tree/4eeaa42b930363b7087f1ad39db8adaa8267d61a/faiss/cppcontrib


/// Getter of block sizes value for oneDNN/AMX distance computations
int faiss_get_distance_compute_dnnl_query_bs();


Is it possible for you to move these to c_api/cppcontrib/amx/distances_dnnl_c.h and, if not feasible, gate them behind a compilation flag?

void faiss_set_distance_compute_dnnl_query_bs(int value) {
faiss::distance_compute_dnnl_query_bs = value;
}


Is it possible for you to move these to c_api/cppcontrib/amx/distances_dnnl_c.h and, if not feasible, gate them behind a compilation flag?

@@ -145,26 +149,60 @@ void exhaustive_inner_product_seq(

FAISS_ASSERT(use_sel == (sel != nullptr));

#ifdef ENABLE_DNNL
// use AMX to accelerate if available
if (is_amxbf16_supported()) {

Let's move the AMX-specific inner product implementations to cppcontrib/amx/distances_dnnl.cpp.

Can you add variants of the 2 functions for DNNL, exhaustive_inner_product_seq_dnnl and exhaustive_inner_product_blas_dnnl, living inside cppcontrib/amx alongside onednn_utils.h? Then you can dispatch to these two functions here:

exhaustive_inner_product_seq(x, y, d, nx, ny, res);
} else {
exhaustive_inner_product_blas(x, y, d, nx, ny, res);
}
based on ENABLE_DNNL and is_amxbf16_supported(), as that is the only place calling these two functions.

@@ -650,6 +709,8 @@ int distance_compute_blas_threshold = 20;
int distance_compute_blas_query_bs = 4096;
int distance_compute_blas_database_bs = 1024;
int distance_compute_min_k_reservoir = 100;
int distance_compute_dnnl_query_bs = 10240;

Can you move these to cppcontrib/amx/distances_dnnl.cpp?


Is it possible to move these two extern variables to cppcontrib/amx/distances_dnnl.h?

@@ -281,6 +281,10 @@ FAISS_API extern int distance_compute_blas_threshold;
FAISS_API extern int distance_compute_blas_query_bs;
FAISS_API extern int distance_compute_blas_database_bs;

// block sizes for oneDNN/AMX distance computations
FAISS_API extern int distance_compute_dnnl_query_bs;
@mengdilin (Jul 30, 2024)

Can you extern it in a separate header file, cppcontrib/amx/distances_dnnl.h? If not, can you gate it behind the ENABLE_DNNL flag?

@mengdilin (Contributor)

Hi @guangzegu and @xtangxtang, what is the status of this PR? Let me know if you are blocked on anything :)

@guangzegu (Author)

guangzegu commented Aug 21, 2024

Hi @guangzegu and @xtangxtang, what is the status of this PR? Let me know if you are blocked on anything :)

@mengdilin Sorry, I took some time off due to family matters. We will now follow your suggestions to make some adjustments first and then ask for your help to review :).

@endomorphosis

Can I get an update on this merge?

@alexanderguzhva (Contributor)

@guangzegu, @xtangxtang, I've played a bit with AMX code. What are the advantages of using Intel libraries for AMX? I was able to write functional AMX-based code without any Intel libraries. Thanks.
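
A capability check like the PR's is_amxbf16_supported() can indeed be written without any library. A minimal detection sketch for GCC/Clang on x86-64 follows (the body here is an assumption, not the PR's code; note that on Linux the process must additionally request AMX tile state via arch_prctl(ARCH_REQ_XCOMP_PERM)):

#include <cpuid.h>

// Query CPUID.(EAX=7,ECX=0):EDX for the AMX feature bits.
static bool amx_bf16_supported() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    const unsigned AMX_BF16 = 1u << 22; // CPUID.7.0:EDX[22]
    const unsigned AMX_TILE = 1u << 24; // CPUID.7.0:EDX[24]
    return (edx & AMX_TILE) && (edx & AMX_BF16);
}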

@xtangxtang

@guangzegu, @xtangxtang, I've played a bit with AMX code. What are the advantages of using Intel libraries for AMX? I was able to write functional AMX-based code without any Intel libraries. Thanks.

@alexanderguzhva Because we need to split big matrices into appropriately sized tiles that AMX can process, and this work involves optimization methods that improve AMX performance. Also, we may start multiple threads to fully utilize AMX. All of this work is wrapped by the oneDNN library; a sketch of what that wrapping looks like is below.
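
To make the division of labor concrete, here is a sketch of routing a bf16 inner product through oneDNN so that it can pick AMX kernels on supporting CPUs (shapes and names are illustrative, error handling is omitted, and this is not the PR's actual code):

#include <dnnl.hpp>

// x: nx x d queries, y: ny x d database vectors, out: nx x ny inner products
void dnnl_inner_product_sketch(const float* x, const float* y, float* out,
                               int64_t nx, int64_t ny, int64_t d) {
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream s(eng);
    // bf16 src/weights let oneDNN select AMX-backed kernels when available;
    // format_tag::any lets the library choose its preferred blocked layout
    memory::desc src_md({nx, d}, memory::data_type::bf16, memory::format_tag::any);
    memory::desc wei_md({ny, d}, memory::data_type::bf16, memory::format_tag::any);
    memory::desc dst_md({nx, ny}, memory::data_type::f32, memory::format_tag::ab);
    inner_product_forward::primitive_desc pd(
            eng, prop_kind::forward_inference, src_md, wei_md, dst_md);
    // wrap the user f32 buffers, then reorder into the primitive's bf16 layouts
    memory x_f32({{nx, d}, memory::data_type::f32, memory::format_tag::ab}, eng,
                 const_cast<float*>(x));
    memory y_f32({{ny, d}, memory::data_type::f32, memory::format_tag::ab}, eng,
                 const_cast<float*>(y));
    memory src(pd.src_desc(), eng), wei(pd.weights_desc(), eng);
    memory dst(pd.dst_desc(), eng, out);
    reorder(x_f32, src).execute(s, x_f32, src);
    reorder(y_f32, wei).execute(s, y_f32, wei);
    inner_product_forward(pd).execute(
            s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
    s.wait(); // results land in out as f32
}

The tiling, layout blocking, and threading all happen inside the reorders and the inner-product primitive, which is the work being described above.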

@xtangxtang

Can I get an update on this merge?

@mengdilin @endomorphosis we have updated the code according to the comments, could you please review? Thanks

@endomorphosis

I am not authorized to merge the pull request. I was just working with the Intel OPEA team / Linux Foundation / LAION to optimize retrieval times on large datasets and integrate this into the OPEA project. I can run tests on different hardware platforms to help with bug testing, but I cannot dictate the design goals for this repository.

@mengdilin (Contributor)

mengdilin commented Sep 25, 2024

@xtangxtang acked, will take a look next week!

Before looking deeper: have you taken a look at the unit-test concerns from my earlier comment? Basically, when compiling with the dnnl optimization, our unit tests fail because of their high precision requirements (you should be able to reproduce this locally by running the python tests; let me know if you need help reproducing). Can you please provide some tests covering the low-precision scenario? We can dedicate these tests to covering the DNNL changes.

@xtangxtang

(quoting @mengdilin's comment above)

Yes, we will provide low-precision UTs along with this PR. Thanks for your feedback.

@mengdilin (Contributor) left a comment

Thanks for working on the change! Some code comments and a rebase request:

  1. Can you rebase off of the current main branch? Some of the code paths are out of date.
  2. Do you mind patching the changes in [ignore] Test CI AMX #3900 into your PR after you have added the low-precision tests, so that AMX CI signals show up? Once you have these tests, you will need to provide a flag such that by default we run tests in the high-precision case, but for AMX builds the flag can be toggled to cover the low-precision case.

#ifdef ENABLE_DNNL
/* Find the nearest neighbors for nx queries in a set of ny vectors using oneDNN/AMX */
template <class BlockResultHandler, bool use_sel = false>
void exhaustive_inner_product_seq_dnnl(

Let's go a step further and move exhaustive_inner_product_seq_dnnl and exhaustive_inner_product_blas_dnnl (the latter should be renamed exhaustive_inner_product_dnnl) to cppcontrib/amx/distances_dnnl.h,

so that the only DNNL logic remaining in distances.cpp is the dispatching mechanism between blas and dnnl in knn_inner_product_select, e.g. as sketched below.
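
A sketch of that dispatch, using this PR's function names (the blas-threshold condition mirrors the existing code path shown in the diff; treat the details as an assumption):

#ifdef ENABLE_DNNL
    if (is_amxbf16_supported()) {
        // single DNNL entry point; tiling and threading handled by oneDNN
        exhaustive_inner_product_dnnl(x, y, d, nx, ny, res);
    } else
#endif
    if (nx < distance_compute_blas_threshold) {
        exhaustive_inner_product_seq(x, y, d, nx, ny, res);
    } else {
        exhaustive_inner_product_blas(x, y, d, nx, ny, res);
    }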

#ifdef ENABLE_DNNL
/** Find the nearest neighbors for nx queries in a set of ny vectors using oneDNN/AMX */
template <class BlockResultHandler>
void exhaustive_inner_product_blas_dnnl(

Rename this to exhaustive_inner_product_dnnl.

@@ -650,6 +709,8 @@ int distance_compute_blas_threshold = 20;
int distance_compute_blas_query_bs = 4096;
int distance_compute_blas_database_bs = 1024;
int distance_compute_min_k_reservoir = 100;
int distance_compute_dnnl_query_bs = 10240;

Is it possible to move these two extern variables to cppcontrib/amx/distances_dnnl.h?


FAISS_ASSERT(use_sel == (sel != nullptr));

float* res_arr = (float*)malloc(nx * ny * sizeof(float));

Nit: std::unique_ptr, e.g.:
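
#include <memory>
// owns the nx*ny result buffer; freed automatically on every exit path
auto res_arr = std::make_unique<float[]>(nx * ny);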

@guangzegu (Author)

guangzegu commented Oct 22, 2024

@mengdilin

Thank you for the feedback and the helpful comments 😄! I've rebased the code onto the current main branch and addressed the code comments you provided. Could you please review it again?
Additionally, I've reproduced the issue in the high-precision tests and prepared an initial version of the fix, although I haven't pushed it yet.
One quick question: should the precision-control flag be passed through the pytest command? I want to ensure it's set up correctly for the AMX CI signals.

@endomorphosis

@guangzegu

Thank you for all the hard work.
