Change weight matrix storage to significantly speed up prediction while using less memory #33
Background
During prediction, one major bottleneck is the computation of the sparse-sparse dot product between the input feature vector and each branch/label weight vector on each node. For example, if the two vectors have M and N non-zero elements, respectively, a simple "marching pointers" implementation has O(M + N) time complexity. Note that even though both are sparse vectors, N could still be enormous given the potentially high feature dimension -- e.g., 10% of 10 million is still 1 million.
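For concreteness, here is a minimal sketch in Rust of both approaches: the "marching pointers" sparse-sparse dot product described above, and the O(M) sparse-dense variant discussed in the next paragraph. The SparseVec type and the function names are illustrative assumptions only, not the crate's actual data structures.

```rust
/// Illustrative sparse vector: parallel arrays of sorted, unique feature
/// indices and their values.
struct SparseVec {
    indices: Vec<u32>,
    values: Vec<f32>,
}

/// "Marching pointers" dot product of two sparse vectors; O(M + N), where
/// M and N are the numbers of non-zeros in `a` and `b`.
fn dot_sparse_sparse(a: &SparseVec, b: &SparseVec) -> f32 {
    let (mut i, mut j, mut sum) = (0usize, 0usize, 0f32);
    while i < a.indices.len() && j < b.indices.len() {
        match a.indices[i].cmp(&b.indices[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                sum += a.values[i] * b.values[j];
                i += 1;
                j += 1;
            }
        }
    }
    sum
}

/// The O(M) variant used when a weight vector is stored densely: only the
/// M non-zeros of the sparse input are visited.
fn dot_sparse_dense(a: &SparseVec, dense_weights: &[f32]) -> f32 {
    a.indices
        .iter()
        .zip(&a.values)
        .map(|(&idx, &v)| v * dense_weights[idx as usize])
        .sum()
}
```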
In the current implementation, we support reducing the time complexity to O(M) at the cost of using more memory, by storing some of the "relatively dense" sparse weight vectors in dense format, as determined by the --max_sparse_density option. Observe that, on the one hand, weight vectors generally get denser the closer they are to the root (since they represent a wider variety of labels); on the other hand, the number of nodes shrinks exponentially as the depth decreases (at least for a balanced tree). Therefore, setting --max_sparse_density appropriately can achieve a very noticeable speed-up without the model taking up too much memory.
Alternatively, we could also compute sparse-sparse dot products using binary search. This was introduced in 6049701, which did speed up sparse-sparse dot products in general, but it was later removed in cdc71ad because it could slightly slow down the overall prediction when --max_sparse_density is small enough. One possible reason is that when both vectors are sufficiently sparse, the cost of the cache misses caused by binary searches outweighs the improvement in asymptotic time complexity.
About this change
This pull request implements a technique very similar to what's described by Etter et al. (2021) to replace the naive "marching pointers" implementation for calculating sparse-sparse products.
The basic idea is similar to how we efficiently rank relevant documents in information retrieval. We can see the input feature vector as the query and weight vectors of all branches/labels as documents. We want to calculate the similarity between the query and each document measured by the dot product between their vectors.
Both query and document vectors are high-dimensional, but the number of non-zero elements in a query vector is typically drastically smaller. Therefore, we can build an inverted index for the documents, in which the non-zero values of all documents on the same dimension are stored together and can be retrieved quickly. Using this index, for each non-zero query dimension, we can efficiently compute how much that dimension contributes to the dot product with each document; adding up these contributions for each document gives us the final result.
In this implementation, the "inverted index" is just a list-of-lists sparse matrix that stores feature indices in sorted order, so that any given feature index can be located by binary search. Therefore, to compute the query-document dot products for all documents, we only need to do a "combined" binary search once for all documents simultaneously, as opposed to doing the expensive and cache-unfriendly binary search multiple times, once for every document. Also, we expect the number of "combined" non-zero feature indices not to be much greater than that of an individual document: since the "documents" on each node are labels from the same cluster, their weight vectors tend to share the same non-zero dimensions.
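Below is a minimal sketch of this idea; the InvertedIndex type and its layout are illustrative assumptions rather than the exact data structures introduced in this pull request. The key point is that each non-zero query feature triggers a single binary search over the combined feature-index list, and its contribution is then accumulated into a per-document score array.

```rust
/// Illustrative "inverted index" over the weight vectors of one node's
/// branches/labels ("documents").
struct InvertedIndex {
    /// Sorted union of feature indices that are non-zero in any document.
    feature_indices: Vec<u32>,
    /// postings[k] lists (document id, weight) pairs for feature_indices[k].
    postings: Vec<Vec<(u32, f32)>>,
    /// Number of documents indexed, i.e. branches/labels on this node.
    num_docs: usize,
}

impl InvertedIndex {
    /// Compute the dot product between the query (a sparse input vector given
    /// as sorted index/value pairs) and every document at once.
    fn dot_products(&self, query: &[(u32, f32)]) -> Vec<f32> {
        let mut scores = vec![0f32; self.num_docs];
        for &(feature, query_value) in query {
            // One binary search over the *combined* feature indices serves
            // all documents, instead of one search per document.
            if let Ok(k) = self.feature_indices.binary_search(&feature) {
                for &(doc, weight) in &self.postings[k] {
                    scores[doc as usize] += query_value * weight;
                }
            }
        }
        scores
    }
}
```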
Note: since this changes how weights are stored, it also breaks the compatibility of saved model files.
Benchmarks
We tested the prediction speed of the new implementation (labeled "new") on the Amazon-670K dataset, which contains 670,091 labels and 153,025 test examples with a feature dimension of 135,909. For comparison, we also ran the same test on the implementation from the current master branch (labeled "current"), as well as a slightly changed version that uses binary search to compute sparse-sparse dot products (labeled "current-bsearch").
We varied --max_sparse_density between 0.01 and 0.5 to get different trade-offs between space and time, and used only a single thread to reduce noise (--n_threads 1).
For the Parabel model (i.e., deep trees built with balanced 2-means clustering), we plot throughput (number of predictions per second) against estimated memory usage:
The graph shows that, for the same memory usage, the new implementation can easily be several times faster. In other words, the new implementation can achieve the same throughput/latency using only a fraction of the memory.
The general results also hold for the Bonsai model (i.e., shallow trees built with regular k-means clustering), except that the advantage of the new implementation is even more pronounced:
Future work
In the future, we will explore indexing the features in weight matrices using hash maps. Compared to storing the whole matrix in dense format, this might achieve a comparable speed-up while using much less extra memory.
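As a rough sketch of the idea only (hypothetical types, not a committed design), a hash-map-indexed weight vector could support O(M) expected-time dot products without storing a full dense array:

```rust
use std::collections::HashMap;

/// Illustrative weight vector whose non-zeros are indexed by a hash map,
/// giving O(1) expected lookup per query feature.
struct HashedWeightVec {
    weights: HashMap<u32, f32>,
}

impl HashedWeightVec {
    /// O(M) expected time, where M is the number of non-zeros in the query
    /// (given as sorted index/value pairs).
    fn dot(&self, query: &[(u32, f32)]) -> f32 {
        query
            .iter()
            .map(|&(idx, value)| value * self.weights.get(&idx).copied().unwrap_or(0.0))
            .sum()
    }
}
```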
Reference