Let HashProbe keep track of memory consumption when listing join results #10652
Conversation
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed 43b1e46 to 8e118f0 (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
uint64_t totalBytes{0};
for (const auto& column : columns) {
  if (!rows_->columnTypes()[column]->isFixedWidth()) {
    totalBytes += rows_->variableSizeAt(row, column);
I am a little worried about the performance implication of this line. Usually we don't load the row container memory while listing the join results, so the memory is cold and reading from it is very slow. Is it possible to do some row size estimation based on the total size in the row container and adjust the number of listed rows smartly? It is less accurate but will perform better.
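For illustration, a minimal sketch of the kind of estimation being suggested; the inputs (total row-container bytes, row count, a per-batch byte budget) and all names are assumptions, not the actual Velox API:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch only: clamp the number of rows listed per batch using an average row
// size derived from the row container's total footprint. All names here are
// illustrative assumptions, not Velox APIs.
int32_t estimateMaxRowsToList(
    uint64_t containerBytes,  // total bytes held by the build-side row container
    uint64_t containerRows,   // number of rows stored in the container
    uint64_t maxListBytes,    // byte budget for one listing batch
    int32_t requestedMaxRows) {
  if (containerRows == 0 || maxListBytes == 0) {
    return requestedMaxRows;
  }
  const uint64_t avgRowBytes =
      std::max<uint64_t>(1, containerBytes / containerRows);
  const uint64_t budgetRows = std::max<uint64_t>(1, maxListBytes / avgRowBytes);
  return static_cast<int32_t>(
      std::min<uint64_t>(static_cast<uint64_t>(requestedMaxRows), budgetRows));
}
```

The trade-off, as discussed below, is that an average can badly under-estimate output size when build-side row sizes are skewed.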
Thanks @Yuhta, yeah, there is going to be some regression, but only in the list-results part. Right after listing results we do a memory copy, which is considerably more expensive than the listing itself, making the regression less significant overall. Row size estimation based on the total size in the row container might not work in this case because we don't know which build-side rows the probe side is going to match. It could happen that all probe-side rows match a few very large build-side rows (hence significant skew).
We can try to use the average row size of the build side. Skew on the build side is hard to solve in this case, but do we run into that in real workloads?
Yeah, a real production workload led to this improvement.
How about going over the row container to get the maximum row size and using that to adjust the max number of rows out? That way we don't need to stride through the memory twice for each probe, and string columns stay on the fast path.
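A sketch of that alternative, assuming the per-row sizes can be collected in a single pass over the build-side rows once the build finishes (the collection step and all names are placeholders, not the actual change):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch only: cap rows per output batch so that even a batch made entirely of
// maximum-size build rows stays within the byte budget. 'rowBytes' would come
// from one pass over the row container after the build finishes.
int32_t maxRowsOutFromMaxRowSize(
    const std::vector<uint64_t>& rowBytes, // per-row size of each build row
    uint64_t maxListBytes,                 // byte budget for one output batch
    int32_t requestedMaxRows) {
  uint64_t maxRowBytes = 1;
  for (const auto bytes : rowBytes) {
    maxRowBytes = std::max(maxRowBytes, bytes);
  }
  const uint64_t cap = std::max<uint64_t>(1, maxListBytes / maxRowBytes);
  return static_cast<int32_t>(
      std::min<uint64_t>(static_cast<uint64_t>(requestedMaxRows), cap));
}
```

This keeps listing itself on the fast path, at the cost of smaller batches whenever a single very large build row exists.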
velox/exec/HashTable.cpp
Outdated
}
if (varSizeColumns.empty() && !hasDuplicates_) {
There is another regression here
Could you clarify this part a bit? Is it the row count limit in the fast path?
Maybe we can run a shadow benchmark to see how big an impact it has on overall performance?
This disables the fast path when a string column is present. The overall impact is probably small, as hash join is not the majority of the computation, but on individual queries this can be bad.
You mean listJoinResultsNoDuplicates is the fast path compared with the duplicate case? Thanks!
velox/exec/HashTable.cpp
Outdated
template <bool ignoreNullKeys>
int32_t HashTable<ignoreNullKeys>::listJoinResults(
    const std::vector<vector_size_t>& listColumns,
Can we put listColumns in JoinResultIterator?
velox/exec/HashTable.cpp
Outdated
  int32_t numOut = 0;
- auto maxOut = inputRows.size();
+ auto maxOut = std::min(
const auto maxOut
@tanjialiang thanks for the update % minors.
Force-pushed bf87f1f to 02dac6a (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed b0d3392 to d6d068a (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@tanjialiang thanks for the update!
  int32_t numOut = 0;
- auto maxOut = inputRows.size();
+ const auto maxOut = std::min(
Now this function only applies to the case where (1) the table has no rows with duplicate join keys, and (2) all list columns are fixed size?
Shall we rename to
s/listJoinResultsNoDuplicates/listJoinResultsWithFixSizeColumnsAndNoDuplicateJoinKeys/
Maybe just call it listJoinResultsFastPath and let comments do the explanation.
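For reference, a rough sketch of the dispatch that name would describe; every identifier below is an assumption based on the snippets quoted in this thread, not the merged code:

```cpp
#include <cstdint>

// Stand-ins so the sketch is self-contained; the real functions list matches
// out of the hash table.
int32_t listJoinResultsFastPath() { return 0; }
int32_t listJoinResultsTrackingSize() { return 0; }

// The fast path skips per-row size accounting, so it is only taken when no
// build rows share join keys and every listed column has a fixed width.
int32_t listJoinResults(bool hasDuplicates, bool allListColumnsFixedWidth) {
  if (!hasDuplicates && allListColumnsFixedWidth) {
    return listJoinResultsFastPath();
  }
  return listJoinResultsTrackingSize();
}
```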
Force-pushed d6d068a to 4ea4855 (Compare)
Force-pushed 4ea4855 to 04dec5e (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed 04dec5e to 3920fec (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@tanjialiang thanks for the iterations. LGTM!
Force-pushed 3920fec to 4db78bd (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed a3013ee to fb39a3a (Compare)
Force-pushed fb39a3a to 89aa01e (Compare)
@tanjialiang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@tanjialiang merged this pull request in 82e5492.
Conbench analyzed the 1 benchmark run on this commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
…oin results (facebookincubator#10652)" This reverts commit 82e5492.
Our TPCDS performance regressed by nearly 5%; do we need to reconsider this feature?
@tanjialiang Can you add a check to compare the average vs. max build row size and only do the per-row size estimation if the max size is larger than, say, 2x the average size? TPC benchmarks are important externally and we want to keep them state of the art.
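A sketch of that guard, assuming the average and maximum build row sizes are already tracked at build time (the names and the 2x factor are placeholders):

```cpp
#include <cstdint>

// Sketch only: enable per-row size tracking during listing only when the build
// row sizes are skewed enough that the average badly under-estimates the
// largest rows. Names and the default factor are placeholders.
bool shouldTrackPerRowSize(
    uint64_t avgBuildRowBytes,
    uint64_t maxBuildRowBytes,
    double skewFactor = 2.0) {
  if (avgBuildRowBytes == 0) {
    return false;
  }
  return static_cast<double>(maxBuildRowBytes) >
      skewFactor * static_cast<double>(avgBuildRowBytes);
}
```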
@jinchengchenghh This is an important fix for Meta internal traffic that prevents certain queries from OOMing. I will patch a fast path for the feature to improve the performance, as @Yuhta mentioned.
Thanks very much. @tanjialiang @Yuhta
@tanjialiang @Yuhta thanks for the quick turnaround. Please also note the benchmark results are from TPCDS SF3000. We haven't done a larger-scale test yet, but based on experience the performance drop may be bigger in that case. Thanks, -yuan
…lts (facebookincubator#10652)
Summary: Hash probe currently has limited memory control when extracting results from the hash table. When a small number of large rows from the build side is frequently joined with the probe side, the total extracted size can explode, making HashProbe use a large amount of memory. The output-filling stage is not in a spillable state, so this often causes OOM. This PR computes the total size while listing join results in hash probe whenever any variable-size columns are going to be extracted from the build side. It stops listing further rows once the maximum size is reached, which confines hash probe memory usage to a set limit.
Pull Request resolved: facebookincubator#10652
Reviewed By: xiaoxmeng
Differential Revision: D60771773
Pulled By: tanjialiang
fbshipit-source-id: 2cb8c58ba795a0aa1df0485b58e4f6d0100be8f8
(cherry picked from commit 82e5492)
…lts (facebookincubator#10652) (#495)
Summary: Hash probe currently has limited memory control when extracting results from the hash table. When a small number of large rows from the build side is frequently joined with the probe side, the total extracted size can explode, making HashProbe use a large amount of memory. The output-filling stage is not in a spillable state, so this often causes OOM. This PR computes the total size while listing join results in hash probe whenever any variable-size columns are going to be extracted from the build side. It stops listing further rows once the maximum size is reached, which confines hash probe memory usage to a set limit.
Pull Request resolved: facebookincubator#10652
Reviewed By: xiaoxmeng
Differential Revision: D60771773
Pulled By: tanjialiang
fbshipit-source-id: 2cb8c58ba795a0aa1df0485b58e4f6d0100be8f8
(cherry picked from commit 82e5492)
Co-authored-by: Jialiang Tan <[email protected]>
…m-01' into 'rebase-upstream-1.2.x-vcpkg'
Let HashProbe keep track of memory consumption when listing join results (facebookincubator#10652)
Summary: Hash probe currently has limited memory control when extracting results from the hash table. When a small number of large rows from the build side is frequently joined with the probe side, the total extracted size can explode, making HashProbe use a large amount of memory. The output-filling stage is not in a spillable state, so this often causes OOM. This PR computes the total size while listing join results in hash probe whenever any variable-size columns are going to be extracted from the build side. It stops listing further rows once the maximum size is reached, which confines hash probe memory usage to a set limit.
PR Link: https://dev.sankuai.com/code/repo-detail/data/velox/pr/42
Hash probe currently has limited memory control when extracting results from the hash table. When a small number of large rows from the build side is frequently joined with the probe side, the total extracted size can explode, making HashProbe use a large amount of memory. The output-filling stage is not in a spillable state, so this often causes OOM.
This PR computes the total size while listing join results in hash probe whenever any variable-size columns are going to be extracted from the build side. It stops listing further rows once the maximum size is reached, which confines hash probe memory usage to a set limit.
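In outline, the behavior described above looks roughly like the following; the iterator callbacks, the per-row size estimate, and the byte budget are illustrative assumptions rather than the exact merged implementation.

```cpp
#include <cstdint>
#include <vector>

// Outline of size-capped listing: accumulate an estimated output size while
// collecting matches and stop early once the byte budget is reached. The
// remaining matches are picked up by the next batch.
struct JoinMatch {
  int32_t probeRow;
  const void* buildRow;
};

template <typename NextMatch, typename EstimateRowBytes>
int32_t listJoinResultsWithCap(
    NextMatch nextMatch,               // fills a JoinMatch, returns false when exhausted
    EstimateRowBytes estimateRowBytes, // size of variable-width build columns for a row
    uint64_t maxListBytes,             // byte budget for one output batch
    int32_t maxOutRows,
    std::vector<JoinMatch>& out) {
  uint64_t listedBytes = 0;
  JoinMatch match{};
  while (static_cast<int32_t>(out.size()) < maxOutRows && nextMatch(match)) {
    out.push_back(match);
    listedBytes += estimateRowBytes(match.buildRow);
    if (listedBytes >= maxListBytes) {
      break; // Stop listing; this batch is already at its memory limit.
    }
  }
  return static_cast<int32_t>(out.size());
}
```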