
C++ version of frontend tests #2291

Draft · jacobhinkle wants to merge 28 commits into devel
Conversation

jacobhinkle (Collaborator)

This PR adds C++ tests that reproduce the tests in test/test_nvfuser_frontend.py, with a best effort made to provide a manual schedule matching the automatic schedule generated in each Python test. Currently that matching is not enforced (it's done by inspecting the dumped transforms and kernels); automating it would probably require an == operator implemented on the IR.
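
(A minimal sketch of the kind of inspection described here, assuming the `Fusion::printMath()`/`printKernel()` debug printers; the `fusion` object is hypothetical:)

```cpp
// Dump the IR math and the generated kernel so the manual schedule can be
// compared by eye against the automatically scheduled version.
fusion.printMath();   // IR math: tensor expressions and transforms
fusion.printKernel(); // generated CUDA kernel source
```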

@csarofeen (Owner) left a comment


Are there existing frontend tests you're reimplementing? If so, could you add references to them in a comment next to each test?

I would like to see computeAt avoided in favor of transform propagation or, as @naoyam suggested, mostInlined().
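
(As a point of reference, scheduling via transform propagation plus full inlining looks roughly like the sketch below. The tensor and its transforms are hypothetical, and the names `TransformPropagator`, `MaxRootDomainInfoSpanningTree`, and `inlineMost` are taken from the nvfuser utilities as I understand them; treat the exact headers and signatures as assumptions:)

```cpp
// Sketch: schedule one reference tensor, replay its transforms onto the
// rest of the fusion, then inline everything as far as possible --
// instead of calling computeAt tensor by tensor.
TensorView* reference_tv = tv2; // hypothetical fusion output
reference_tv->merge(0);
reference_tv->split(0, 128);
reference_tv->axis(0)->parallelize(ParallelType::BIDx);
reference_tv->axis(1)->parallelize(ParallelType::TIDx);

// Propagate the reference's transforms to all connected TensorViews.
TransformPropagator propagator(reference_tv);
MaxRootDomainInfoSpanningTree(reference_tv).traverse(&propagator);

// Inline every tensor at its maximum position.
inlineMost();
```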

@jacobhinkle (Collaborator, Author)

I'm reimplementing the tests from https://github.com/csarofeen/pytorch/blob/cpp_frontend_tests/test/test_nvfuser_frontend.py#L58 and using the same naming scheme. I'll make this clearer in comments.

@jacobhinkle (Collaborator, Author)

Force-pushed (rebased onto devel following Jie's refactor).

This file will hold tests that reproduce those in test_nvfuser_frontend.py.
Added rFactor and computeAt along the split reduction dimension. Since the reduction dimension was split, we also need to specify a size for the blocks.
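
(The split/rFactor/computeAt pattern this commit refers to, as a minimal sketch; the tensor names and the factor of 128 are hypothetical:)

```cpp
// Sketch: split the reduction axis, rFactor the outer piece into a
// partial reduction, and bind the inner piece to threads. Because the
// reduction axis is split by a fixed factor, the launch parameters must
// supply a matching block size.
TensorView* tv1 = sum(tv0, {1});        // reduce axis 1 of a 2D input
tv1->split(1, 128);                     // [i0, r1/128, r128]
TensorView* tv1_rf = tv1->rFactor({1}); // tv1_rf does the partial reduction

tv1->axis(0)->parallelize(ParallelType::BIDx);
tv1->axis(1)->parallelize(ParallelType::TIDx);
tv1_rf->computeAt(tv1, 1);              // inline along the split dimension
tv1_rf->axis(2)->parallelize(ParallelType::TIDx);
```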
Currently the SuperBasicFP16 test works fine but the BasicFP16 test fails to compile, complaining that a producer is not in global memory based on parallelization type. This is likely because I haven't parallelized the t5float tensor properly, but I wonder why I did not encounter this in the other (2d) test SuperBasicFP16.
This fixes one bug, but now I have variable-sized static arrays popping
up. Also, I fixed my pre-commit hooks so clang-format was run, and
hopefully the CI will pass.
This fixes that compilation (it now runs and passes), but neither of the
basic tests matches the kernel generated by the pytest.
This test results in what looks like decent code. Again, it doesn't
match the pytest code exactly. Also, I have to manually specify not only
the block dimension but also the grid dimension for this test, and
I'm not 100% sure why yet.
The result is that this kernel better matches the Python-generated one:
it does not loop inside the kernel, and instead assumes there will be
enough blocks to cover the entire input.
This test runs and is very similar to the python test, but has multiple
predicates instead of one outer predicate, just as the SuperBasic test
does.
The ir_math_check() function TORCH_CHECKs whether two fusions produce the same
IR math string. This is useful for targeting autogenerated fusions. When
translating Python frontend tests, I've first manually scheduled until
the fusion compiles and runs, then attempted to match the automatically
scheduled kernels. This commit makes the second stage simpler. However,
if changes are made to the automatic scheduler, the manual
scheduling in these tests will need to be updated.
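
(A minimal sketch of what such a check could look like, assuming `Fusion::printMath()` writes to stdout and capturing it with gtest's internal helpers; the actual implementation in this PR may differ:)

```cpp
#include <gtest/gtest.h>
#include <string>

// Render a fusion's IR math as a string by capturing printMath() output.
std::string irMathString(Fusion* fusion) {
  testing::internal::CaptureStdout();
  fusion->printMath();
  return testing::internal::GetCapturedStdout();
}

// TORCH_CHECK that a manually scheduled fusion produces the same IR math
// string as the automatically scheduled one.
void ir_math_check(Fusion* manual, Fusion* automatic) {
  TORCH_CHECK(
      irMathString(manual) == irMathString(automatic),
      "Manual and automatic schedules produced different IR math");
}
```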
This makes the test pass now that I'm checking that the auto and manual IRs match.
Currently, `ir_kernel_check` is commented out since it is failing with
trivial variable number bumps for the SuperBasic{,FP16} tests. It is
low-priority, but at some point I'd like to understand why the
equivalent kernel math leads to slightly different (but still
equivalent) kernels in these cases. For now, I will move on to the other
tests.
This is mainly so that I can use printTransforms to compare IRs.
This test required using pointwiseSchedule. Note that it does not
currently pass, like many of the others, due to my misunderstanding of
ca_pos.
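
(Applying the pointwise scheduler manually looks roughly like this; `aten_inputs` is a hypothetical vector of runtime inputs, and the exact `getPointwiseHeuristics`/`schedulePointwise` signatures should be checked against the scheduler headers:)

```cpp
// Sketch: run the pointwise scheduler the same way the automatic path
// does, so a manual test can be compared against it.
auto pparams = getPointwiseHeuristics(&fusion, aten_inputs);
TORCH_CHECK(pparams != nullptr, "pointwise scheduler rejected this fusion");
schedulePointwise(&fusion, *pparams);
```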
Note that like SuperBasic{,FP16}, this test passes the math and
transforms check but produces a kernel whose variables are incremented
(this time by two). There are two inputs in this test, whereas in the other
case there is only one. Perhaps something is being done with the inputs
by the automatic scheduler to skip a variable number when the kernel is
generated. Anyway, the kernels are completely equivalent, but just not
_equal_ yet. I will move on to further tests and make another pass later
once I figure out what's happening with the variable names in the
kernels.
This is just the set of changes needed to get my tests compiling after Jie's
big refactor, which I've just rebased in.
To make these tests pass, including matching the generated kernels, I
used the WIP fusion_debug dump output from
#2326. This let me see a noisy
log of operations that revealed a few patterns I was unaware of. For
example, in the FrontendAdd test, the pointwise scheduler is used, and
the order of inlining at the end is not defined (and does change) due to
using unordered_set. Also, note that getPointwiseHeuristics creates a Val
(fusion.oneVal()), so it technically alters the Fusion by adding a Val to
its vals_ list.
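
(That side effect can be observed directly; a sketch, assuming `Fusion::vals()` exposes the registered Vals and that this is the first call to create the constant:)

```cpp
// Sketch: heuristics computation is not purely read-only -- on first use
// it registers a new constant Val (fusion.oneVal()) on the Fusion.
size_t n_vals_before = fusion.vals().size();
auto pparams = getPointwiseHeuristics(&fusion, aten_inputs);
TORCH_CHECK(fusion.vals().size() == n_vals_before + 1);
```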

Beyond those trivial things, I noticed differences in inlining patterns
between the pointwise and reduction schedulers, and also saw some
different ways to call methods like parallelizeAllLike. One thing I
haven't looked more into is the reorders that often happen at the
beginning of scheduling both pointwise and reduction fusions. They have
no effect on the generated kernels, but it is noticeable and I plan to
read up on it soon.
Again I used the fusion_debug dump from
#2326 to trace what the
reduction scheduler is doing. This time I learned about
multiReductionInliner, which uses two calls to parallelizeAllLike for
different types of ParallelTypes, followed by an undoing of unrolling
and vectorization on the reference tensor. The need for the latter is
still a little unclear to me.
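
(The two-phase parallelizeAllLike pattern described above, sketched; the position argument, tensor sets, and ParallelType groupings are assumptions based on this description rather than the scheduler's exact code:)

```cpp
// Sketch: propagate compute parallel types everywhere, propagate
// unroll/vectorize separately, then strip unroll/vectorize back off the
// reference tensor itself.
scheduler_utils::parallelizeAllLike(
    reference_tv, -1, ir_utils::allTvs(&fusion),
    {ParallelType::BIDx, ParallelType::TIDx});

scheduler_utils::parallelizeAllLike(
    reference_tv, -1, cached_inputs_and_outputs,
    {ParallelType::Unroll, ParallelType::Vectorize});

// Undo unrolling/vectorization on the reference tensor.
for (IterDomain* id : reference_tv->domain()->domain()) {
  if (id->getParallelType() == ParallelType::Unroll ||
      id->getParallelType() == ParallelType::Vectorize) {
    id->parallelize(ParallelType::Serial);
  }
}
```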
There were no real surprises here except that the representative tensor
seemed to be chosen differently than in the FP32 case, maybe because the
casting before cacheBefore pushed that choice further back in the
pipeline.
This was very similar to the FrontendAdd example since it also uses the
pointwise scheduler.
At this point `nvfuser_tests --gtest_filter='*Frontend*'` passes all 8
tests.
The only new thing in these tests is that, when given an explicit
broadcast dimension, we should use makeConcreteTensor with extents of 1,
in which case those IterDomains will automatically be set to broadcast.
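
(Concretely, a minimal sketch using the makeConcreteTensor test helper; the shape is hypothetical:)

```cpp
// A size-1 extent in makeConcreteTensor produces an IterDomain that is
// automatically marked as a broadcast dimension.
TensorView* tv0 = makeConcreteTensor({16, 1, 32}); // axis 1 -> broadcast
fusion.addInput(tv0);
```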