Vectorization Factor patch for computeInfoC2P with Broadcast in mapped IterDomain #1625

Merged
merged 62 commits into from
Feb 1, 2024
Changes from 49 commits
Commits (62)
3aced73
quick hack to see the effect
jjsjann123 Jan 11, 2024
f66fd74
WIP to enable rfactor reorder
jjsjann123 Jan 11, 2024
3577582
fixing build
jjsjann123 Jan 11, 2024
cbb3409
fixing build
jjsjann123 Jan 11, 2024
0aed1e2
debug print
jjsjann123 Jan 12, 2024
3723352
fixing print out
jjsjann123 Jan 12, 2024
b20537e
Merge remote-tracking branch 'origin/main' into issue_1567
jjsjann123 Jan 12, 2024
95d5f04
fixing scheduling to use reordered root/rfactor
jjsjann123 Jan 12, 2024
eb057e8
quick fix
jjsjann123 Jan 12, 2024
d7fa997
Merge remote-tracking branch 'origin/main' into issue_1567
jjsjann123 Jan 12, 2024
d091c9d
clangtidy/clangformat
jjsjann123 Jan 12, 2024
890be18
more comments
jjsjann123 Jan 12, 2024
a971273
removing old code
jjsjann123 Jan 12, 2024
d57ecc1
test added
jjsjann123 Jan 12, 2024
fdb0a42
fix
jjsjann123 Jan 12, 2024
34bd074
adding test case
jjsjann123 Jan 13, 2024
a97a9a8
fixing test
jjsjann123 Jan 13, 2024
298f56f
test added
jjsjann123 Jan 13, 2024
280509b
refactoring broadcast checks in vectorization analysis
jjsjann123 Jan 15, 2024
48bde18
fixing tests
jjsjann123 Jan 15, 2024
3dcf287
Apply suggestions from code review
jjsjann123 Jan 16, 2024
9cd6610
Merge remote-tracking branch 'origin/main' into issue_1567_part2
jjsjann123 Jan 16, 2024
28389b4
modifying tests to reflect intermdiate without alloc_domain
jjsjann123 Jan 16, 2024
99d0411
Merge remote-tracking branch 'origin/main' into HEAD
jjsjann123 Jan 16, 2024
79724d6
Xiang's suggestion to remove reorder_map from heuristics, as it's a c…
jjsjann123 Jan 16, 2024
abb44fc
renaming reference
jjsjann123 Jan 16, 2024
a23ab05
remove profiling flag
jjsjann123 Jan 16, 2024
139578f
:evert "remove profiling flag"
jjsjann123 Jan 16, 2024
7d2fb11
Revert "refactoring broadcast checks in vectorization analysis"
jjsjann123 Jan 16, 2024
5c391fd
refactoring the skip
jjsjann123 Jan 16, 2024
9050cd5
fixing logic
jjsjann123 Jan 17, 2024
e2da518
Merge remote-tracking branch 'origin/issue_1567' into HEAD
jjsjann123 Jan 17, 2024
56e17c6
CLANGFORMAT
jjsjann123 Jan 17, 2024
9d6f425
using data_cache as suggested
jjsjann123 Jan 17, 2024
fb3a4d8
clangformat
jjsjann123 Jan 17, 2024
1253946
err
jjsjann123 Jan 17, 2024
543c08e
fixing build maybe?!
jjsjann123 Jan 17, 2024
6a00b22
CLANGFORMAT
jjsjann123 Jan 17, 2024
502b33d
Merge remote-tracking branch 'origin/issue_1567' into HEAD
jjsjann123 Jan 17, 2024
c46d855
Merge remote-tracking branch 'origin/main' into HEAD
jjsjann123 Jan 17, 2024
8b0beb8
more comment and renaming tests
jjsjann123 Jan 17, 2024
a21873a
Update csrc/scheduler/vectorize_helper.cpp
jjsjann123 Jan 18, 2024
c95686e
fix bug found by Naoya
jjsjann123 Jan 18, 2024
c2e4b96
clangformat
jjsjann123 Jan 18, 2024
04f6765
Merge remote-tracking branch 'origin/main' into HEAD
jjsjann123 Jan 26, 2024
800ff1a
refactor the fix
jjsjann123 Jan 26, 2024
0cb45ab
fixing build, signed/unsigned int compare
jjsjann123 Jan 26, 2024
a7396cc
adding Xiang's test suggestions
jjsjann123 Jan 26, 2024
eb3a618
force scheduling for test case0
jjsjann123 Jan 26, 2024
eb45314
Update csrc/scheduler/vectorize_helper.cpp
jjsjann123 Jan 26, 2024
221573a
fixing tests
jjsjann123 Jan 26, 2024
65bef69
Merge remote-tracking branch 'origin/issue_1567_part2' into HEAD
jjsjann123 Jan 26, 2024
34772d4
code cleanup
jjsjann123 Jan 26, 2024
d607d91
fixing logic
jjsjann123 Jan 26, 2024
e542bb2
tabs
jjsjann123 Jan 26, 2024
e5b8af4
tabs
jjsjann123 Jan 26, 2024
c502fcd
renaming tests
jjsjann123 Jan 26, 2024
0b4ad9e
something is funny with my vim setup here
jjsjann123 Jan 26, 2024
5df9c74
Merge branch 'main' into issue_1567_part2
jjsjann123 Jan 30, 2024
c8ae949
update tests
jjsjann123 Jan 30, 2024
77c24db
updating tests
jjsjann123 Jan 30, 2024
b633619
Merge branch 'main' into issue_1567_part2
jjsjann123 Feb 1, 2024
51 changes: 27 additions & 24 deletions csrc/scheduler/vectorize_helper.cpp
@@ -436,6 +436,9 @@ ContiguousInnerDimensionsMapper::computeInfoC2P(
std::shared_ptr<MaxInfoSpanningTree::Information> from_info) {
auto from_ids = std::dynamic_pointer_cast<const MappedDomain>(from_info)
->mapped_root_ids_;
// When we propagate, we should check the resolved broadcast in the order of
// mapped from_ids.
//
// If we have a case where we have a concretized broadcast that's being
// tracked in a consumer but not concretized in the producer we should break
// off the dimensions connected to the left of that dimension. So if we have:
@@ -445,29 +448,32 @@ ContiguousInnerDimensionsMapper::computeInfoC2P(
// T3[i0, i1, i2] = T1 + T2
// and we're propagating from T3 with {i0, i1, i2}
// When we go from T3 to T0, we don't have any mechanism to understand that i0
// and i2 are not contiguous in the original domain of T3. It's not ideal with
// transpose, but when this happens we'll clear all dimensions mapped left of
// the concretized broadcast.
// So if we have:
// T0[i1, i2]
// T1[b0, i1, i2] = broadcast(T0)
// T2[i1, b0, i2] = transpose(T1)
// T3[i1, i0, i2]
// T4[i1, i0, i2] = T2 + T3
// T5[i0, i1, i2] = transpose(T4)
// Then i1 and i2 are contiguous in both T0 and T5, but due to the realization
// of the broadcast on T4 we will have removed i1 from the mapped set.
// and i2 are not contiguous in the original domain of T3.
//
// Another example: if the last broadcast dimension resolved in the
// consumer's root domain is mapped for vectorization, the merge order of
// the vectorization axes matters.
//
// T0[i0, i1]
// T1[i0, i1, b2] = broadcast(T0)
// T2[i0, i1, i3]
// T3[i0, i1, i2] = T1 + T2
//
// If the mapped ids are {i0, i2, i1}, when propagating from T3 to T1, the
// resolved broadcast iterdomain is `i2`/`b2`, which would give clear_pos=1.
// So we'll skip all from_ids with index < clear_pos. see issue:
// https://github.com/NVIDIA/Fuser/issues/1567#issuecomment-1894605385
PairwiseRootDomainMap root_map(to, from);
auto c2p_map = root_map.mapConsumerToProducer();

// Id's in consumer to clear from the mapped set due to broadcast
// concretization.
std::unordered_set<IterDomain*> consumer_ids_to_clear;
int clear_pos = -1;
if (to->hasBroadcast()) {
// Find the last broadcast dimension resolved in consumers root domain
int clear_pos = -1;
for (auto i : c10::irange(from->getRootDomain().size())) {
auto c_id = from->getRootDomain()[i];
// Find the last broadcast dimension resolved in consumers through from_ids
for (auto i : c10::irange(from_ids.size())) {
Collaborator:
Would it be better to iterate from from_ids.size() to 0?

Collaborator Author:
🤕 Good catch! I don't know why I was just mindlessly copying the old behavior.
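
For reference, a minimal sketch of the reverse scan suggested above (not the merged code; the resolved-broadcast condition is an assumption taken from the file comment, i.e. the consumer id is concretized while its mapped producer id is still a broadcast):

// Hypothetical reverse-scan variant: walk from_ids from innermost to
// outermost and stop at the first resolved broadcast, so the remaining
// outer positions never need to be visited.
int clear_pos = -1;
for (int64_t i = (int64_t)from_ids.size() - 1; i >= 0; --i) {
  auto c_id = from_ids[i];
  auto c_it = c2p_map.find(c_id);
  if (c_it == c2p_map.end()) {
    continue;
  }
  // Assumed resolved-broadcast check: consumer concretized, producer broadcast.
  if (!c_id->isBroadcast() && c_it->second->isBroadcast()) {
    clear_pos = (int)i;
    break;
  }
}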

auto c_id = from_ids[i];
auto c_it = c2p_map.find(c_id);
if (c_it == c2p_map.end()) {
continue;
@@ -477,17 +483,14 @@ ContiguousInnerDimensionsMapper::computeInfoC2P(
clear_pos = (int)i;
}
}
// Clear everything to the left of the inner most resolved broadcast
// dimension, including the broadcasted domain.
if (clear_pos >= 0) {
consumer_ids_to_clear.insert(
from->getRootDomain().begin(),
from->getRootDomain().begin() + clear_pos + 1);
}
}

std::vector<IterDomain*> producer_rfactor_ids;
for (auto from_id : from_ids) {
for (int64_t i : c10::irange(from_ids.size())) {
if (i < clear_pos) {
continue;
}
auto from_id = from_ids[i];
auto c2p_it = c2p_map.find(from_id);
if (c2p_it != c2p_map.end() &&
consumer_ids_to_clear.find(c2p_it->first) ==
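
Read together, the patched propagation path in this hunk amounts to roughly the following. This is a sketch assembled from the visible diff lines, not a verbatim copy of the file: the resolved-broadcast condition elided above is filled in from the file comment, and the final collection step is simplified.

PairwiseRootDomainMap root_map(to, from);
auto c2p_map = root_map.mapConsumerToProducer();

// Innermost position in from_ids whose producer counterpart is still a
// broadcast, i.e. a broadcast concretized in the consumer only.
int clear_pos = -1;
if (to->hasBroadcast()) {
  for (auto i : c10::irange(from_ids.size())) {
    auto c_id = from_ids[i];
    auto c_it = c2p_map.find(c_id);
    if (c_it == c2p_map.end()) {
      continue;
    }
    // Assumed condition, per the comment above.
    if (!c_id->isBroadcast() && c_it->second->isBroadcast()) {
      clear_pos = (int)i;
    }
  }
}

// Everything to the left of that position cannot contribute to the
// vectorized extent, so it is skipped when collecting producer ids.
std::vector<IterDomain*> producer_rfactor_ids;
for (int64_t i : c10::irange((int64_t)from_ids.size())) {
  if (i < clear_pos) {
    continue;
  }
  auto c2p_it = c2p_map.find(from_ids[i]);
  if (c2p_it == c2p_map.end()) {
    continue;
  }
  // Simplified: the real code projects this mapped producer root id onto the
  // producer's rfactor domain before recording it.
  producer_rfactor_ids.push_back(c2p_it->second);
}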
168 changes: 168 additions & 0 deletions test/test_pointwise.cpp
@@ -201,4 +201,172 @@ TEST_F(PointwiseTest, VectorizeAllocationDomain) {
testValidate(fusion, cg_outputs, {t0}, __LINE__, __FILE__);
}

// All inputs & outputs share the same allocation domain permutation from the
// root domain, but the intermediate tv2 doesn't have a stride order
// specified. There's also a broadcast IterDomain on tv1, which makes it
// tricky for vectorization analysis to figure out which axes should be
// excluded from the computation of the vectorization factor.
TEST_F(PointwiseTest, VectorizeAllocationDomainIssue1567) {
auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

TensorView* tv0 = TensorViewBuilder()
.ndims(3)
.contiguity({true, true, true})
.strideOrder({2, 0, 1})
.build();
TensorView* tv1 = TensorViewBuilder()
.ndims(3)
.shape({1, -1, 1})
.contiguity({std::nullopt, std::nullopt, true})
.strideOrder({2, 0, 1})
.build();
fusion->addInput(tv0);
fusion->addInput(tv1);
auto tv2 = add(tv0, tv1);
auto tv3 = add(tv2, IrBuilder::create<Val>(1.0, DataType::Float));
tv3->setAllocationDomain({tv3->axis(0), tv3->axis(2), tv3->axis(1)}, true);
fusion->addOutput(tv3);

FusionExecutorCache fec(std::move(fusion_ptr));
fec.profile(true);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::empty_strided({1024, 128, 25}, {128*25, 1, 128}, options);
at::Tensor t1 = at::empty_strided({1, 128, 1}, {128, 1, 128}, options);
auto cg_outputs = fec.runFusionWithInputs({t0, t1});
EXPECT_EQ(getVecSizeForPointwise(fec), 4);
Collaborator:
I think we should also check that the input cache of tv0 and tv1 is vectorized? Because it is possible that, if not done correctly, only the output tv is vectorized, and inputs are just unrolled.
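
One way to express that check (a sketch, not code from this PR; it assumes the test file's existing includes plus <algorithm>, and that helpers such as TensorView::getLeafDomain() and ir_utils::consumerTvsOf() are available in this version of the codebase):

// Hypothetical helper: true if any leaf IterDomain of `tv` is parallelized
// with ParallelType::Vectorize.
auto isVectorized = [](TensorView* tv) {
  const auto& leaf = tv->getLeafDomain();
  return std::any_of(leaf.begin(), leaf.end(), [](IterDomain* id) {
    return id->getParallelType() == ParallelType::Vectorize;
  });
};
// In a scheduled fusion, the input caches are the TensorViews produced by
// cacheAfter(), i.e. the direct consumers of tv0/tv1 (assumption):
//   TensorView* tv0_cache = ir_utils::consumerTvsOf(tv0).at(0);
//   TensorView* tv1_cache = ir_utils::consumerTvsOf(tv1).at(0);
//   EXPECT_TRUE(isVectorized(tv0_cache));
//   EXPECT_TRUE(isVectorized(tv1_cache));

For the FusionExecutorCache-based tests in this file, the check would have to run against the scheduled copy of the fusion held by the most recent kernel runtime rather than the original fusion object; the same pattern applies to the similar review comments on the cases below.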

testValidate(fusion, cg_outputs, {t0, t1}, __LINE__, __FILE__);
}

TEST_F(PointwiseTest, VectorizationFactorAnalysisCase0) {
auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

TensorView* tv0 = TensorViewBuilder()
.ndims(3)
.contiguity({true, true, std::nullopt})
.shape({-1, -1, 1})
.build();
TensorView* tv1 = TensorViewBuilder()
.ndims(3)
.contiguity({true, true, true})
.build();
fusion->addInput(tv0);
fusion->addInput(tv1);
auto tv2 = add(tv0, tv1);
fusion->addOutput(tv2);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1024, 2, 1}, options);
at::Tensor t1 = at::randn({1024, 2, 512}, options);
std::vector<c10::IValue> aten_inputs = {t0, t1};

// NOTE: force pointwise scheduler here just for testing purpose
auto params = getPointwiseHeuristics(fusion, aten_inputs);
auto lparams = schedulePointwise(fusion, aten_inputs);
FusionExecutor fe;
fe.compileFusion(fusion, aten_inputs, lparams);
auto cg_outputs = fe.runFusion(aten_inputs, lparams);

EXPECT_EQ(params->vectorize, true);
EXPECT_EQ(params->unroll_factor, 4);
Collaborator:
I think we should check that, in the scheduled fusion, tv0's input cache is not vectorized?
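
For this case, where schedulePointwise() has scheduled `fusion` in place, the negative check could reuse the isVectorized helper sketched above (again an assumption that ir_utils::consumerTvsOf() resolves the input cache):

// Hypothetical: tv0 has a broadcast innermost dimension, so its cache should
// not be vectorized even though the output is.
EXPECT_FALSE(isVectorized(ir_utils::consumerTvsOf(tv0).at(0)));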


testValidate(fusion, cg_outputs, aten_inputs, __LINE__, __FILE__);
}

TEST_F(PointwiseTest, VectorizationFactorAnalysisCase1) {
auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

TensorView* tv0 = TensorViewBuilder()
.ndims(3)
.contiguity({true, std::nullopt, true})
.shape({-1, 1, -1})
.build();
TensorView* tv1 = TensorViewBuilder()
.ndims(3)
.contiguity({true, true, true})
.build();
fusion->addInput(tv0);
fusion->addInput(tv1);
auto tv2 = add(tv0, tv1);
fusion->addOutput(tv2);

FusionExecutorCache fec(std::move(fusion_ptr));
fec.profile(true);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1024, 1, 2}, options);
at::Tensor t1 = at::randn({1024, 512, 2}, options);
auto cg_outputs = fec.runFusionWithInputs({t0, t1});
EXPECT_EQ(getVecSizeForPointwise(fec), 2);
Collaborator:
I think we should also check that the input cache of tv0 is vectorized?

testValidate(fusion, cg_outputs, {t0, t1}, __LINE__, __FILE__);
}

TEST_F(PointwiseTest, VectorizationFactorAnalysisCase2) {
auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

TensorView* tv0 = TensorViewBuilder()
.ndims(3)
.contiguity({true, std::nullopt, true})
.shape({-1, 1, -1})
.build();
TensorView* tv1 = TensorViewBuilder()
.ndims(3)
.contiguity({true, true, true})
.build();
fusion->addInput(tv0);
fusion->addInput(tv1);
auto tv2 = add(tv0, tv1);
auto tv3 = transpose(tv2, 0, 1);
fusion->addOutput(tv3);

FusionExecutorCache fec(std::move(fusion_ptr));
fec.profile(true);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1024, 1, 2}, options);
at::Tensor t1 = at::randn({1024, 512, 2}, options);
auto cg_outputs = fec.runFusionWithInputs({t0, t1});
EXPECT_EQ(getVecSizeForPointwise(fec), 4);
Collaborator:
I think we should also check that the input cache of tv0 is vectorized?

testValidate(fusion, cg_outputs, {t0, t1}, __LINE__, __FILE__);
}

TEST_F(PointwiseTest, VectorizationFactorAnalysisCase3) {
auto fusion_ptr = std::make_unique<Fusion>();
auto fusion = fusion_ptr.get();
FusionGuard fg(fusion);

TensorView* tv0 = TensorViewBuilder()
.ndims(3)
.contiguity({std::nullopt, true, true})
.shape({1, -1, -1})
.build();
TensorView* tv1 = TensorViewBuilder()
.ndims(3)
.contiguity({true, true, true})
.build();
fusion->addInput(tv0);
fusion->addInput(tv1);
auto tv2 = add(tv0, tv1);
auto tv3 = transpose(tv2, 0, 1);
fusion->addOutput(tv3);

FusionExecutorCache fec(std::move(fusion_ptr));
fec.profile(true);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1, 1024, 2}, options);
at::Tensor t1 = at::randn({512, 1024, 2}, options);
auto cg_outputs = fec.runFusionWithInputs({t0, t1});
EXPECT_EQ(getVecSizeForPointwise(fec), 2);
Collaborator:
I think we should also check that the input cache of tv0 is vectorized?

testValidate(fusion, cg_outputs, {t0, t1}, __LINE__, __FILE__);
}

} // namespace nvfuser