Avoid redundant calculation of hash data during join probe stage #8294

Closed
windtalker opened this issue Nov 2, 2023 · 0 comments · Fixed by #8297
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@windtalker
Contributor

Enhancement

In the join probe stage, each input block is wrapped in a ProbeProcessInfo. If the build side has many duplicate key entries, probing can expand the input block greatly, so in order to respect the maximum row size of an output block, ProbeProcessInfo supports probing only part of the input block at a time. In other words, a single input block may be probed multiple times, with each probe processing only part of the data in the block.

Each probe calls probeBlockImplTypeCase, and inside that function the hash data is calculated for the whole block:

    if (join_build_info.needVirtualDispatchForProbeBlock())
    {
        assert(!(join_build_info.restore_round > 0 && join_build_info.enable_fine_grained_shuffle));
        /// TODO: consider adding a virtual column in Sender side to avoid computing cost and potential inconsistency by heterogeneous envs(AMD64, ARM64)
        /// Note: 1. Not sure, if inconsistency will do happen in heterogeneous envs
        ///       2. Virtual column would take up a little more network bandwidth, might lead to poor performance if network was bottleneck
        /// Currently, the computation cost is tolerable, since it's a very simple crc32 hash algorithm, and heterogeneous envs support is not considered
        computeDispatchHash(
            rows,
            key_columns,
            collators,
            sort_key_containers,
            join_build_info.restore_round,
            build_hash);
    }

If a block is probed multiple times, the hash data for the whole block is recalculated on every probe, even though only part of the block is processed each time. The repeated calculation is redundant.
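
One possible way to avoid the recomputation is to cache the hash data alongside the block's probe progress, so it is calculated on the first probe and reused afterwards. The sketch below is only an illustration of that idea under stated assumptions: ProbeState, prepareHashOnce, probePartial, and the hash_computations counter are hypothetical names, std::hash stands in for the real dispatch hash, and none of this is taken from the TiFlash code or from the fix in #8297.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for ProbeProcessInfo: one input block plus its probe progress.
struct ProbeState
{
    std::vector<std::uint64_t> keys;    // join key for each row of the input block
    std::vector<std::size_t> hash_data; // cached per-row hash, filled once per block
    bool hash_ready = false;            // true once hash_data is valid for this block
    std::size_t start_row = 0;          // next row to probe
    std::size_t hash_computations = 0;  // number of whole-block hash passes (illustration only)

    bool allRowsProbed() const { return start_row >= keys.size(); }
};

// Calculate the hash of the whole block only on the first probe of that block.
void prepareHashOnce(ProbeState & state)
{
    if (state.hash_ready)
        return;
    state.hash_data.resize(state.keys.size());
    for (std::size_t i = 0; i < state.keys.size(); ++i)
        state.hash_data[i] = std::hash<std::uint64_t>{}(state.keys[i]);
    state.hash_ready = true;
    ++state.hash_computations;
}

// Probe at most max_rows rows, reusing the cached hash. Returns the number of rows handled.
std::size_t probePartial(ProbeState & state, std::size_t max_rows)
{
    assert(max_rows > 0);
    prepareHashOnce(state);
    const std::size_t end_row = std::min(state.keys.size(), state.start_row + max_rows);
    const std::size_t handled = end_row - state.start_row;
    // A real probe would look up hash_data[start_row, end_row) in the hash table here.
    state.start_row = end_row;
    return handled;
}
```

With a block of 10 rows and max_rows set to 3, probePartial is called four times before allRowsProbed() returns true, while the whole-block hash pass runs only once.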

@windtalker windtalker added the type/enhancement The issue or PR belongs to an enhancement. label Nov 2, 2023
@ti-chi-bot ti-chi-bot bot closed this as completed in #8297 Nov 2, 2023
ti-chi-bot bot pushed a commit that referenced this issue Nov 2, 2023
windtalker added a commit to windtalker/tiflash that referenced this issue Nov 23, 2023