Change weight matrix storage to significantly speed up prediction while using less memory #33
Background
During prediction, one major bottleneck is the computation of the sparse-sparse dot product between the input feature vector and each branch/label weight vector on each node. For example, if the two vectors have M and N non-zero elements, respectively, a simple "marching pointers" implementation has O(M + N) time complexity. Note that even though both are sparse vectors, N could still be enormous given the potentially high feature dimension -- e.g., 10% of 10 million is still 1 million.
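For concreteness, here is a minimal sketch in Rust of both approaches: the "marching pointers" sparse-sparse dot product described above, and the O(M) sparse-dense variant discussed in the next paragraph. The SparseVec type and the function names are illustrative assumptions only, not the crate's actual data structures.

```rust
/// Illustrative sparse vector: parallel arrays of sorted, unique feature
/// indices and their values.
struct SparseVec {
    indices: Vec<u32>,
    values: Vec<f32>,
}

/// "Marching pointers" dot product of two sparse vectors; O(M + N), where
/// M and N are the numbers of non-zeros in `a` and `b`.
fn dot_sparse_sparse(a: &SparseVec, b: &SparseVec) -> f32 {
    let (mut i, mut j, mut sum) = (0usize, 0usize, 0f32);
    while i < a.indices.len() && j < b.indices.len() {
        match a.indices[i].cmp(&b.indices[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                sum += a.values[i] * b.values[j];
                i += 1;
                j += 1;
            }
        }
    }
    sum
}

/// The O(M) variant used when a weight vector is stored densely: only the
/// M non-zeros of the sparse input are visited.
fn dot_sparse_dense(a: &SparseVec, dense_weights: &[f32]) -> f32 {
    a.indices
        .iter()
        .zip(&a.values)
        .map(|(&idx, &v)| v * dense_weights[idx as usize])
        .sum()
}
```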
In the current implementation, we support reducing the time complexity to O(M) at the cost of using more memory, by storing some of the "relatively dense" sparse weight vectors in dense format, as determined by the --max_sparse_density option. Observe that, on the one hand, weight vectors generally get denser the closer they are to the root (since they represent a wider variety of labels); on the other hand, the number of nodes shrinks exponentially as the depth decreases (at least for a balanced tree). Therefore, setting --max_sparse_density appropriately can achieve a very noticeable speed-up without the model taking up too much memory.
Alternatively, we could also compute sparse-sparse dot products using binary search. This was introduced in 6049701, which did speed up sparse-sparse dot products in general, but it was later removed in cdc71ad because it could slightly slow down the overall prediction when --max_sparse_density is small enough. One possible reason is that when both vectors are sufficiently sparse, the cost of the cache misses caused by binary searches outweighs the improvement in asymptotic time complexity.
About this change
This pull request implements a technique very similar to what's described by Etter et al. (2021) to replace the naive "marching pointers" implementation for calculating sparse-sparse products.
The basic idea is similar to how we efficiently rank relevant documents in information retrieval. We can see the input feature vector as the query and weight vectors of all branches/labels as documents. We want to calculate the similarity between the query and each document measured by the dot product between their vectors.
Both query and document vectors are high-dimensional, but the number of non-zero elements in a query vector is typically drastically smaller. Therefore, we can build an inverted index for the documents, in which the non-zero values of all documents on the same dimension are stored together and can be retrieved quickly. Using this index, for each non-zero query dimension, we can efficiently compute how much that dimension contributes to the dot product with each document; adding up these contributions for each document gives us the final result.
In this implementation, the "inverted index" is just a list-of-lists sparse matrix that stores feature indices in sorted order, so that any given feature index can be located by binary search. Therefore, to compute the query-document dot products for all documents, we only need to do a "combined" binary search once for all documents simultaneously, as opposed to doing the expensive and cache-unfriendly binary search multiple times, once for every document. Also, we expect the number of "combined" non-zero feature indices not to be much greater than that of an individual document: since the "documents" on each node are labels from the same cluster, their weight vectors tend to share the same non-zero dimensions.
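Below is a minimal sketch of this idea; the InvertedIndex type and its layout are illustrative assumptions rather than the exact data structures introduced in this pull request. The key point is that each non-zero query feature triggers a single binary search over the combined feature-index list, and its contribution is then accumulated into a per-document score array.

```rust
/// Illustrative "inverted index" over the weight vectors of one node's
/// branches/labels ("documents").
struct InvertedIndex {
    /// Sorted union of feature indices that are non-zero in any document.
    feature_indices: Vec<u32>,
    /// postings[k] lists (document id, weight) pairs for feature_indices[k].
    postings: Vec<Vec<(u32, f32)>>,
    /// Number of documents indexed, i.e. branches/labels on this node.
    num_docs: usize,
}

impl InvertedIndex {
    /// Compute the dot product between the query (a sparse input vector given
    /// as sorted index/value pairs) and every document at once.
    fn dot_products(&self, query: &[(u32, f32)]) -> Vec<f32> {
        let mut scores = vec![0f32; self.num_docs];
        for &(feature, query_value) in query {
            // One binary search over the *combined* feature indices serves
            // all documents, instead of one search per document.
            if let Ok(k) = self.feature_indices.binary_search(&feature) {
                for &(doc, weight) in &self.postings[k] {
                    scores[doc as usize] += query_value * weight;
                }
            }
        }
        scores
    }
}
```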
Note: since this changes how weights are stored, it also breaks the compatibility of saved model files.
Benchmarks
We tested the prediction speed of the new implementation (labeled "new") on the Amazon-670K dataset, which contains 670,091 labels and 153,025 test examples with a feature dimension of 135,909. For comparison, we also ran the same test on the implementation from the current master branch (labeled "current"), as well as a slightly changed version that uses binary search to compute sparse-sparse dot products (labeled "current-bsearch").
We varied --max_sparse_density between 0.01 and 0.5 to get different trade-offs between space and time, and used only a single thread to reduce noise (--n_threads 1).
For the Parabel model (i.e., deep trees built with balanced 2-means clustering), we plot throughput (number of predictions per second) against estimated memory usage:
The graph shows that, for the same memory usage, the new implementation can easily be several times faster. In other words, the new implementation can achieve the same throughput/latency using only a fraction of the memory.
The general results also hold for the Bonsai model (i.e., shallow trees built with regular k-means clustering), except that the advantage of the new implementation is even more pronounced:
Future work
In the future, we will explore indexing the features in weight matrices using hash maps. Compared to storing the whole matrix in dense format, this might achieve a comparable speed-up while using much less extra memory.
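As a rough sketch of the idea only (hypothetical types, not a committed design), a hash-map-indexed weight vector could support O(M) expected-time dot products without storing a full dense array:

```rust
use std::collections::HashMap;

/// Illustrative weight vector whose non-zeros are indexed by a hash map,
/// giving O(1) expected lookup per query feature.
struct HashedWeightVec {
    weights: HashMap<u32, f32>,
}

impl HashedWeightVec {
    /// O(M) expected time, where M is the number of non-zeros in the query
    /// (given as sorted index/value pairs).
    fn dot(&self, query: &[(u32, f32)]) -> f32 {
        query
            .iter()
            .map(|&(idx, value)| value * self.weights.get(&idx).copied().unwrap_or(0.0))
            .sum()
    }
}
```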
Reference