Move more TPM modules and setup routines to common #148
Conversation
Nice. As before, might take some time to look over this in detail.
Force-pushed dc37408 to d9ccd88 ("This makes TPM_FIELDS now all double precision")
Looks good, but I'd like a bit more assurance that the additional casting won't impact performance. Maybe it's insignificant, but we should demonstrate that with evidence. I'll try running some benchmarks.
Here are some performance results from commit 9198e86 and commit d9ccd88.
I think it's safe to merge this.
Looks good, and nice testing from @samhatfield
Another small step in the direction of using multiple handles with different backends and resolutions.
Moving more TPM modules and setup routines into a common, precision- and backend-independent library removes code duplication and multiple compilation, and allows the globals in these modules to be reused across the different backends.
I managed to move all but one of the TPM modules.
The missing one for now is TPM_FLT, because it essentially contains the Legendre coefficients for each backend in a specific precision. The CPU version also contains the fast Legendre coefficients and butterfly structures, which are not ported to the GPU.
Because of this, it will still take some work to see how we can reuse the Legendre coefficients of the same resolution across different precisions and backends without recomputing them.
Most of the work was spent on TPM_FIELDS. The flattened GPU arrays have been moved to a new TPM_FIELDS_FLAT module, and I understand that the code will soon be refactored to remove the use of the flattened arrays.
Note that results will not be bit-identical, but the change should be performance-neutral. Could someone benchmark/verify this for me?