
C++ version of frontend tests #2291

Draft · jacobhinkle wants to merge 28 commits into devel
Conversation

jacobhinkle (Collaborator)

This PR adds C++ tests that reproduce the tests in test/test_nvfuser_frontend.py, with a best effort made to provide a manual schedule matching the automatic schedule generated in each Python test. Currently that matching is not enforced (it's done by inspecting the dumped transforms and kernels); automating it would probably require an == operator implemented on the IR.
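
(A minimal sketch of the kind of inspection described here, assuming the `Fusion::printMath()`/`printKernel()` debug printers; the `fusion` object is hypothetical:)

```cpp
// Dump the IR math and the generated kernel so the manual schedule can be
// compared by eye against the automatically scheduled version.
fusion.printMath();   // IR math: tensor expressions and transforms
fusion.printKernel(); // generated CUDA kernel source
```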

@csarofeen (Owner) left a comment


Are there existing frontend tests you're reimplementing? If so, could you add references to them in a comment next to each test?

I would like to see computeAt avoided in favor of transform propagation or, as @naoyam suggested, mostInlined().
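
(As a point of reference, scheduling via transform propagation plus full inlining looks roughly like the sketch below. The tensor and its transforms are hypothetical, and the names `TransformPropagator`, `MaxRootDomainInfoSpanningTree`, and `inlineMost` are taken from the nvfuser utilities as I understand them; treat the exact headers and signatures as assumptions:)

```cpp
// Sketch: schedule one reference tensor, replay its transforms onto the
// rest of the fusion, then inline everything as far as possible --
// instead of calling computeAt tensor by tensor.
TensorView* reference_tv = tv2; // hypothetical fusion output
reference_tv->merge(0);
reference_tv->split(0, 128);
reference_tv->axis(0)->parallelize(ParallelType::BIDx);
reference_tv->axis(1)->parallelize(ParallelType::TIDx);

// Propagate the reference's transforms to all connected TensorViews.
TransformPropagator propagator(reference_tv);
MaxRootDomainInfoSpanningTree(reference_tv).traverse(&propagator);

// Inline every tensor at its maximum position.
inlineMost();
```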

@jacobhinkle (Collaborator, Author)

I'm reimplementing the tests from https://github.com/csarofeen/pytorch/blob/cpp_frontend_tests/test/test_nvfuser_frontend.py#L58 and using the same naming scheme. I'll make this clearer in comments.

@jacobhinkle (Collaborator, Author)

Force-pushed (rebased onto devel following Jie's refactor).

This file will hold tests that reproduce those in test_nvfuser_frontend.py.
Added rFactor and computeAt along the split reduction dimension. Since the reduction dimension was split, we also need to specify a size for the blocks.
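
(The split/rFactor/computeAt pattern this commit refers to, as a minimal sketch; the tensor names and the factor of 128 are hypothetical:)

```cpp
// Sketch: split the reduction axis, rFactor the outer piece into a
// partial reduction, and bind the inner piece to threads. Because the
// reduction axis is split by a fixed factor, the launch parameters must
// supply a matching block size.
TensorView* tv1 = sum(tv0, {1});        // reduce axis 1 of a 2D input
tv1->split(1, 128);                     // [i0, r1/128, r128]
TensorView* tv1_rf = tv1->rFactor({1}); // tv1_rf does the partial reduction

tv1->axis(0)->parallelize(ParallelType::BIDx);
tv1->axis(1)->parallelize(ParallelType::TIDx);
tv1_rf->computeAt(tv1, 1);              // inline along the split dimension
tv1_rf->axis(2)->parallelize(ParallelType::TIDx);
```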
Currently the SuperBasicFP16 test works fine but the BasicFP16 test fails to compile, complaining that a producer is not in global memory based on parallelization type. This is likely because I haven't parallelized the t5float tensor properly, but I wonder why I did not encounter this in the other (2d) test SuperBasicFP16.
This fixes one bug, but now I have variable-sized static arrays popping
up. Also, I fixed my pre-commit hooks so clang-format was run, and
hopefully the CI will pass.
This fixes that compilation (it now runs and passes), but neither of the
basic tests matches the kernel generated by the pytest.
This test results in what looks like decent code. Again, it doesn't
match the pytest code exactly. Also, I have to manually specify not only
the block dimension but also the grid dimension for this test, and
I'm not 100% sure why yet.
The result is that this kernel better matches the Python-generated one:
it does not loop inside the kernel, and instead assumes there will be
enough blocks to cover the entire input.
This test runs and is very similar to the python test, but has multiple
predicates instead of one outer predicate, just as the SuperBasic test
does.
The ir_math_check() function TORCH_CHECKs whether two fusions produce the same
IR math string. This is useful for targeting autogenerated fusions. When
translating Python frontend tests, I've first manually scheduled until
the fusion compiles and runs, then attempted to match the automatically
scheduled kernels. This commit makes the second stage simpler. However,
if changes are made to the automatic scheduler, the manual
scheduling in these tests will need to be updated.
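
(A minimal sketch of what such a check could look like, assuming `Fusion::printMath()` writes to stdout and capturing it with gtest's internal helpers; the actual implementation in this PR may differ:)

```cpp
#include <gtest/gtest.h>
#include <string>

// Render a fusion's IR math as a string by capturing printMath() output.
std::string irMathString(Fusion* fusion) {
  testing::internal::CaptureStdout();
  fusion->printMath();
  return testing::internal::GetCapturedStdout();
}

// TORCH_CHECK that a manually scheduled fusion produces the same IR math
// string as the automatically scheduled one.
void ir_math_check(Fusion* manual, Fusion* automatic) {
  TORCH_CHECK(
      irMathString(manual) == irMathString(automatic),
      "Manual and automatic schedules produced different IR math");
}
```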
This makes the test pass now that I'm checking that the auto and manual IRs match.
Currently, `ir_kernel_check` is commented out since it is failing with
trivial variable number bumps for the SuperBasic{,FP16} tests. It is
low-priority, but at some point I'd like to understand why the
equivalent kernel math leads to slightly different (but still
equivalent) kernels in these cases. For now, I will move on to the other
tests.
This is mainly so that I can use printTransforms to compare IRs.
This test required using pointwiseSchedule. Note that it does not
currently pass, like many of the others, due to my misunderstanding of
ca_pos.
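
(Applying the pointwise scheduler manually looks roughly like this; `aten_inputs` is a hypothetical vector of runtime inputs, and the exact `getPointwiseHeuristics`/`schedulePointwise` signatures should be checked against the scheduler headers:)

```cpp
// Sketch: run the pointwise scheduler the same way the automatic path
// does, so a manual test can be compared against it.
auto pparams = getPointwiseHeuristics(&fusion, aten_inputs);
TORCH_CHECK(pparams != nullptr, "pointwise scheduler rejected this fusion");
schedulePointwise(&fusion, *pparams);
```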
Note that like SuperBasic{,FP16}, this test passes the math and
transforms check but produces a kernel whose variables are incremented
(this time by two). There are two inputs in this test, whereas in the other
case there is only one. Perhaps something is being done with the inputs
by the automatic scheduler to skip a variable number when the kernel is
generated. Anyway, the kernels are completely equivalent, but just not
_equal_ yet. I will move on to further tests and make another pass later
once I figure out what's happening with the variable names in the
kernels.
This is just the set of changes needed to get my tests compiling after Jie's
big refactor, which I've just rebased in.
To make these tests pass, including matching the generated kernels, I
used the WIP fusion_debug dump output from
#2326. This let me see a noisy
log of operations that revealed a few patterns I was unaware of. For
example, in the FrontendAdd test, the pointwise scheduler is used, and
the order of inlining at the end is not defined (and does change) due to
using unordered_set. Also, note that getPointwiseHeuristics creates a Val
(fusion.oneVal()), so it technically alters the Fusion by adding a Val to
its vals_ list.
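
(That side effect can be observed directly; a sketch, assuming `Fusion::vals()` exposes the registered Vals and that this is the first call to create the constant:)

```cpp
// Sketch: heuristics computation is not purely read-only -- on first use
// it registers a new constant Val (fusion.oneVal()) on the Fusion.
size_t n_vals_before = fusion.vals().size();
auto pparams = getPointwiseHeuristics(&fusion, aten_inputs);
TORCH_CHECK(fusion.vals().size() == n_vals_before + 1);
```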

Beyond those trivial things, I noticed differences in inlining patterns
between the pointwise and reduction schedulers, and also saw some
different ways to call methods like parallelizeAllLike. One thing I
haven't looked more into is the reorders that often happen at the
beginning of scheduling both pointwise and reduction fusions. They have
no effect on the generated kernels, but it is noticeable and I plan to
read up on it soon.
Again I used the fusion_debug dump from
#2326 to trace what the
reduction scheduler is doing. This time I learned about
multiReductionInliner, which uses two calls to parallelizeAllLike for
different types of ParallelTypes, followed by an undoing of unrolling
and vectorization on the reference tensor. The need for the latter is
still a little unclear to me.
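
(The two-phase parallelizeAllLike pattern described above, sketched; the position argument, tensor sets, and ParallelType groupings are assumptions based on this description rather than the scheduler's exact code:)

```cpp
// Sketch: propagate compute parallel types everywhere, propagate
// unroll/vectorize separately, then strip unroll/vectorize back off the
// reference tensor itself.
scheduler_utils::parallelizeAllLike(
    reference_tv, -1, ir_utils::allTvs(&fusion),
    {ParallelType::BIDx, ParallelType::TIDx});

scheduler_utils::parallelizeAllLike(
    reference_tv, -1, cached_inputs_and_outputs,
    {ParallelType::Unroll, ParallelType::Vectorize});

// Undo unrolling/vectorization on the reference tensor.
for (IterDomain* id : reference_tv->domain()->domain()) {
  if (id->getParallelType() == ParallelType::Unroll ||
      id->getParallelType() == ParallelType::Vectorize) {
    id->parallelize(ParallelType::Serial);
  }
}
```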
There were no real surprises here except that the representative tensor
seemed to be chosen differently than in the FP32 case, maybe because the
casting before cacheBefore pushed that choice further back in the
pipeline.
This was very similar to the FrontendAdd example since it also uses the
pointwise scheduler.
At this point `nvfuser_tests --gtest_filter='*Frontend*'` passes all 8
tests.
The only new thing in these tests is that, when given an explicit
broadcast dimension, we should use makeConcreteTensor with extents of 1,
in which case those IterDomains will automatically be set to broadcast.
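
(Concretely, a minimal sketch using the makeConcreteTensor test helper; the shape is hypothetical:)

```cpp
// A size-1 extent in makeConcreteTensor produces an IterDomain that is
// automatically marked as a broadcast dimension.
TensorView* tv0 = makeConcreteTensor({16, 1, 32}); // axis 1 -> broadcast
fusion.addInput(tv0);
```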