C++ version of frontend tests #2291
base: devel
Conversation
Are there existing front-end tests you're reimplementing? If so, could you add references to them in a comment next to the test?
I would like to see computeAt avoided in favor of transform propagation or, as @naoyam suggested, mostInlined().
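For reference, a minimal sketch of that suggested style, assuming the fork's scheduling utilities (`TransformPropagator`, `MaxRootDomainInfoSpanningTree`, `scheduler_utils::parallelizeAllLike`, `inlineMost`); names and exact signatures here are from memory, so check them against the current headers:

```cpp
// Sketch only: transform the reference tensor, propagate, then inline maximally,
// instead of calling computeAt tensor-by-tensor.
TensorView* reference_tv = tv2;  // hypothetical reference chosen by the schedule
reference_tv->split(0, 128);
reference_tv->axis(0)->parallelize(ParallelType::BIDx);
reference_tv->axis(1)->parallelize(ParallelType::TIDx);

// Replay the reference's transforms onto the rest of the fusion.
TransformPropagator propagator(reference_tv);
MaxRootDomainInfoSpanningTree(reference_tv).traverse(&propagator);

// Copy parallelization from the reference and inline as deeply as possible
// (the mostInlined()-style alternative to explicit computeAt calls).
scheduler_utils::parallelizeAllLike(reference_tv);
inlineMost();
```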
I'm reimplementing the tests from https://github.com/csarofeen/pytorch/blob/cpp_frontend_tests/test/test_nvfuser_frontend.py#L58 and using the same naming scheme. I'll make this clearer in comments.
Force-pushed from 0ac02f2 to a06224f (rebase to devel following Jie's refactor).
This file will hold tests that reproduce those in test_nvfuser_frontend.py.
Added rFactor and computeAt along the split reduction dimension. Since the reduction dimension was split, we also need to specify a size for the blocks.
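Roughly the pattern being described, sketched from memory against the fork's API (tensor names are placeholders, not the ones in the test):

```cpp
// Sketch: split the reduction axis, rFactor the outer chunk, and bind the inner
// chunk to threads; the fixed split factor is why a block size of 128 has to be
// supplied at launch.
TensorView* tv1 = sum(tv0, {1});        // [i0, r1]
tv1->split(1, 128);                     // [i0, r1/128, r128]
TensorView* tv_rf = tv1->rFactor({1});  // tv_rf reduces r1/128, tv1 reduces r128
tv0->computeAt(tv_rf, -1);              // inline the producer along the split axis

tv_rf->axis(0)->parallelize(ParallelType::BIDx);
tv_rf->axis(-1)->parallelize(ParallelType::TIDx);
tv1->axis(0)->parallelize(ParallelType::BIDx);
tv1->axis(-1)->parallelize(ParallelType::TIDx);
```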
Currently the SuperBasicFP16 test works fine, but the BasicFP16 test fails to compile, complaining that a producer is not in global memory given its parallelization type. This is likely because I haven't parallelized the t5float tensor properly, but I wonder why I didn't encounter this in the other (2D) test, SuperBasicFP16.
This fixes one bug, but now I have variable-sized static arrays popping up. Also, I fixed my pre-commit hooks so clang-format was run, and hopefully CI will pass.
This fixes that compilation error (the test now runs and passes), but neither of the basic tests matches the kernel generated by the pytest.
This test results in what looks like decent code. Again, it doesn't match the pytest code exactly. Also, I have to manually specify not only the block dimension but the grid dimension as well for this test, and I'm not 100% sure why yet.
The result is that this kernel better matches the python-generated one: it does not loop inside the kernel, and instead assumes there will be enough blocks to cover the entire input.
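Concretely, the "enough blocks to cover the input" style plus an explicit launch configuration might look like the following minimal sketch (pointwise fusion with placeholder sizes; `makeSymbolicTensor` is the usual test helper, and the `LaunchParams` argument order is my assumption and worth double-checking):

```cpp
// Sketch: bind the outer split to BIDx so the kernel has no serial loop over
// the input, then pass explicit grid/block dimensions to runFusion.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* tv0 = makeSymbolicTensor(1);
fusion.addInput(tv0);
TensorView* tv1 = add(tv0, IrBuilder::create<Double>(1.0));
fusion.addOutput(tv1);

tv1->split(0, 128);                             // [i0/128, 128]
tv1->axis(0)->parallelize(ParallelType::BIDx);  // grid covers the whole input
tv1->axis(1)->parallelize(ParallelType::TIDx);
tv0->computeAt(tv1, -1);

auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
at::Tensor t0 = at::randn({1 << 20}, options);

FusionExecutor fe;
fe.compileFusion(&fusion, {t0});
// Assumed argument order: gdimx, gdimy, gdimz, bdimx, bdimy, bdimz.
LaunchParams lparams((1 << 20) / 128, 1, 1, 128, 1, 1);
auto cg_outputs = fe.runFusion({t0}, lparams);
```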
This test runs and is very similar to the python test, but has multiple predicates instead of one outer predicate, just as the SuperBasic test does.
The ir_math_check() function TORCH_CHECKs whether two fusions produce the same ir math string. This is useful for targeting autogenerated fusions. When translating Python frontend tests, I've first manually scheduled until the fusion compiles and runs, then attempted to match the automatically scheduled kernels. This commit makes the second stage simpler. However, if changes are made to the automatic scheduler, then the manual scheduling in these tests would need to be updated.
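Since the helper itself isn't shown in this thread, the following is only a guess at its shape: capture the `printMath()` output of both fusions (assuming it writes to `std::cout`) and `TORCH_CHECK` that the strings match.

```cpp
// Hypothetical sketch of an ir_math_check()-style helper: compare the IR math
// printout of a manually scheduled fusion against an auto-scheduled one.
// Requires <iostream> and <sstream> in addition to the usual test includes.
std::string irMathString(Fusion* fusion) {
  std::stringstream ss;
  auto* old_buf = std::cout.rdbuf(ss.rdbuf());  // assumption: printMath() writes to std::cout
  fusion->printMath();
  std::cout.rdbuf(old_buf);
  return ss.str();
}

void ir_math_check(Fusion* manual, Fusion* automatic) {
  TORCH_CHECK(
      irMathString(manual) == irMathString(automatic),
      "Manually scheduled IR math does not match the auto-scheduled IR math");
}
```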
This makes the test pass now that I'm checking that the auto and manual IRs match.
Currently, `ir_kernel_check` is commented out since it is failing with trivial variable number bumps for the SuperBasic{,FP16} tests. It is low-priority, but at some point I'd like to understand why the equivalent kernel math leads to slightly different (but still equivalent) kernels in these cases. For now, I will move on to the other tests.
This is mainly so that I can use printTransforms to compare IRs.
This test required using pointwiseSchedule. Note that it does not currently pass, like many of the others, due to my misunderstanding of ca_pos.
Note that like SuperBasic{,FP16}, this test passes the math and transforms check but produces a kernel whose variables are incremented (this time by two). There are two inputs in this test where in the other case there is only one. Perhaps something is being done with the inputs by the automatic scheduler to skip a variable number when the kernel is generated. Anyway, the kernels are completely equivalent, but just not _equal_ yet. I will move on to further tests and make another pass later once I figure out what's happening with the variable names in the kernels.
These are just the changes necessary to get my tests compiling after Jie's big refactor, which I've just rebased in.
To make these tests pass, including matching the generated kernels, I used the WIP fusion_debug dump output from #2326. This let me see a noisy log of operations that revealed a few patterns I was unaware of. For example, in the FrontendAdd test the pointwise scheduler is used, and the order of inlining at the end is not defined (and does change) because an unordered_set is used. Also, notice that getPointwiseHeuristics creates a Val (fusion.oneVal()), so it does technically alter the Fusion by adding a Val to its vals_ list. Beyond those trivial things, I noticed differences in inlining patterns between the pointwise and reduction schedulers, and also saw some different ways to call methods like parallelizeAllLike. One thing I haven't looked into yet is the reorders that often happen at the beginning of scheduling both pointwise and reduction fusions. They have no effect on the generated kernels, but the difference is noticeable and I plan to read up on it soon.
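For anyone reproducing this, the auto-scheduled side being diffed against can be produced with roughly the following (a sketch from memory; `t0`/`t1` are placeholder inputs and the heuristics entry points may have moved since this was written):

```cpp
// Sketch: schedule a copy of the fusion with the pointwise scheduler so its
// printMath()/printKernel() output can be compared against the manual schedule.
Fusion auto_fusion = fusion;  // Fusion is copyable; schedule the copy, not the original
std::vector<c10::IValue> inputs = {t0, t1};

auto pparams = getPointwiseHeuristics(&auto_fusion, inputs);
TORCH_CHECK(pparams != nullptr, "expected pointwise heuristics for this fusion");
// Note: as mentioned above, getPointwiseHeuristics already adds fusion.oneVal()
// to the copy's vals_ list, so the copy is no longer identical to the original.
schedulePointwise(&auto_fusion, *pparams);

auto_fusion.printMath();
auto_fusion.printKernel();
```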
Again I used the fusion_debug dump from #2326 to trace what the reduction scheduler is doing. This time I learned about multiReductionInliner, which uses two calls to parallelizeAllLike for different types of ParallelTypes, followed by an undoing of unrolling and vectorization on the reference tensor. The need for the latter is still a little unclear to me.
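The two-phase parallelizeAllLike pattern described above might look something like this; the overload with a `ParallelType` filter is what I remember from `scheduler/utils.h`, but treat the exact signature (and the tensor subsets) as assumptions:

```cpp
// Sketch: propagate block/thread bindings to everything, propagate
// unroll/vectorize only to a subset, then strip unroll/vectorize back off the
// reference tensor, mirroring what multiReductionInliner appears to do.
scheduler_utils::parallelizeAllLike(
    reference_tv,
    /*pos=*/-1,
    /*selected_tvs=*/{},  // empty = all tensors
    /*selected_parallel_types=*/{ParallelType::BIDx, ParallelType::TIDx});

scheduler_utils::parallelizeAllLike(
    reference_tv,
    /*pos=*/-1,
    /*selected_tvs=*/cached_ios,  // hypothetical cached inputs/outputs
    /*selected_parallel_types=*/{ParallelType::Unroll, ParallelType::Vectorize});

// Undo unrolling/vectorization on the reference tensor itself.
for (IterDomain* id : reference_tv->domain()->domain()) {
  if (id->getParallelType() == ParallelType::Unroll ||
      id->getParallelType() == ParallelType::Vectorize) {
    id->parallelize(ParallelType::Serial);
  }
}
```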
There were no real surprises here except that the representative tensor seemed to be chosen differently than in the FP32 case, maybe because the casting before cacheBefore pushed that choice further back in the pipeline.
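The cast-then-cache output path being referred to is roughly the following (placeholder names, 2-D inputs assumed):

```cpp
// Sketch: FP16 inputs are cast to float for the math, the result is cast back
// to half, and the output gets a cacheBefore() copy prior to scheduling.
TensorView* tv2 = add(castOp(DataType::Float, tv0_fp16),
                      castOp(DataType::Float, tv1_fp16));
TensorView* tv3 = sum(tv2, {1});
TensorView* tv4 = castOp(DataType::Half, tv3);
fusion.addOutput(tv4);

// cacheBefore() inserts a producer copy of the output; with the cast sitting
// right before it, the scheduler's pick of representative tensor can shift
// relative to the FP32 version of the test.
TensorView* tv4_cache = tv4->cacheBefore();
```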
This was very similar to the FrontendAdd example since it also uses the pointwise scheduler.
Force-pushed from 19d5e47 to 4519bd1.
At this point `nvfuser_tests --gtest_filter='*Frontend*'` passes all 8 tests.
The only new thing in these tests is that when given an explicit broadcast dimension, we should use makeConcreteTensor with extents of 1, in which case those IterDomains will automatically be set to broadcast.
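In other words (a sketch; `makeConcreteTensor` is the test helper used throughout these tests, with -1 marking a symbolic extent):

```cpp
// Sketch: an explicit broadcast dimension becomes a concrete extent of 1,
// and those IterDomains are created as broadcast automatically.
TensorView* tv0 = makeConcreteTensor({-1, -1});  // [i0, i1]
TensorView* tv1 = makeConcreteTensor({1, -1});   // [b0, i1], axis 0 is broadcast
fusion.addInput(tv0);
fusion.addInput(tv1);
TensorView* tv2 = add(tv0, tv1);  // b0 lines up with i0, no explicit broadcast op needed
fusion.addOutput(tv2);
```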
This PR adds some C++ tests that reproduce the tests in `test/test_nvfuser_frontend.py`, with a best effort made to provide a manual schedule matching the automatic schedule generated in each Python test. Currently that matching is not enforced (it's done by inspecting the dumped transforms and kernels); automating it would probably require an `==` operator implemented on the IR.