UsdSkel _GetJointWorldInverseBindTransforms has a race, and may compute the transforms multiple times #1742
Comments
Hi @williamkrick , sorry you're hitting this. I just wanted to be honest and let you know it's unlikely we'll be able to look into this anytime soon; our UsdSkel expertise is not high, currently, and the team that ostensibly supports it is small and Presto-based, not Maya-based. |
@spiffmon I've hit the same problem that @williamkrick hit. Basically, the lazy computation in _GetJointWorldInverseBindTransforms() makes every function that calls it not thread-safe. @williamkrick we have a workaround in our project: just precalculate these invBindingTransforms somewhere when the skeleton is loaded (to avoid the lazy computation). |
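A minimal standard-library sketch of the workaround described above: warm the lazy cache once on the loading thread, before any worker exists, so that later calls only read. LazySkel, GetInverse and SumInParallel are hypothetical names made up for illustration, and std::vector stands in for VtArray; none of this is UsdSkel API.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a skeleton definition with a lazily computed cache.
class LazySkel {
public:
    explicit LazySkel(std::vector<double> bind) : _bind(std::move(bind)) {}

    // Lazy getter: the first call mutates the object, later calls only read.
    const std::vector<double>& GetInverse() {
        if (!_computed) {
            _inverse.resize(_bind.size());
            for (std::size_t i = 0; i < _bind.size(); ++i) {
                _inverse[i] = 1.0 / _bind[i];
            }
            ++_computeCount;
            _computed = true;
        }
        return _inverse;
    }

    int ComputeCount() const { return _computeCount; }

private:
    std::vector<double> _bind;
    std::vector<double> _inverse;
    bool _computed = false;
    int _computeCount = 0;
};

// Workaround: fill the cache on the calling thread, then fan out to workers.
double SumInParallel(LazySkel& skel, int numThreads) {
    skel.GetInverse();  // precompute while no other thread can observe the object

    std::vector<double> partial(numThreads, 0.0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&skel, &partial, t] {
            // The cache is already filled, so this call performs no writes.
            for (double v : skel.GetInverse()) {
                partial[t] += v;
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    double total = 0.0;
    for (double p : partial) {
        total += p;
    }
    return total;
}
```

The sketch only stays safe because nothing resets the cache while the workers are running; if the cached value can be invalidated later, the warm-up has to be repeated before the next parallel phase.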
@spiffmon no worries, I just wish I had a solid answer on what the race is. @frankzhang11 thanks for the idea! I'll add that and, fingers crossed, it'll fix my crash. |
I don't think I've hit this crash in my use of UsdSkel, but here's my guess at a possible cause:
I suspect there could be an issue if a
|
Filed as internal issue #USD-7143 |
Wow @cameronwhite thank you! I had forgotten that VtArray has copy-on-write semantics. The order of events you've laid out here seems to plausibly cause a crash. I will try to use the debugger to force this order of events to occur and see if I crash. |
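For readers unfamiliar with copy-on-write, here is a toy single-threaded illustration of the detach step that this discussion hinges on. CowArray is a hypothetical class built on std::shared_ptr purely for illustration; it is not VtArray's implementation and makes no claim about VtArray's internals beyond the copy-on-write behavior mentioned in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy copy-on-write array (hypothetical, NOT VtArray).
class CowArray {
public:
    explicit CowArray(std::size_t n, double value = 0.0)
        : _data(std::make_shared<std::vector<double>>(n, value)) {}

    std::size_t size() const { return _data->size(); }

    // Const access never changes which buffer this object points at.
    const double* cdata() const { return _data->data(); }

    // Non-const access "detaches": if the buffer is shared with another
    // CowArray, this object first switches to a private copy.
    double* data() {
        if (_data.use_count() > 1) {
            _data = std::make_shared<std::vector<double>>(*_data);
        }
        return _data->data();
    }

private:
    std::shared_ptr<std::vector<double>> _data;
};

// Returns true if a mutable access on a shared array moved it to new storage
// while the other copy kept the original buffer and its original contents.
bool MutableAccessDetaches() {
    CowArray cached(4, 1.0);
    CowArray local = cached;                 // cheap copy: shares the buffer
    const double* before = cached.cdata();
    const bool sharedAtFirst = (local.cdata() == before);

    cached.data()[0] = 2.0;                  // non-const access triggers the detach

    return sharedAtFirst &&
           cached.cdata() != before &&       // cached now owns a new buffer
           local.cdata() == before &&        // local still sees the old one
           local.cdata()[0] == 1.0;
}
```

In this toy, the check of use_count() and the reassignment of _data are not synchronized, so running data() on one thread while another thread copies or destroys a CowArray that shares the same object would be a data race. That is the general shape of hazard being discussed, under the assumption that the real container behaves similarly.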
VtArray write is not thread safe |
Is it a bug that it is not thread safe? It seems like a thread-safe version could be made (assuming each thread has its own refcount pointer to the array). If two threads tried to write at the same time it would make two copies and then throw away the original after the copies were made. |
I’d reserve the final word for @gitamohr, but I don’t think we can
reasonably expect more from VtArray than we do std::shared_ptr. VtArray is
low-level, central currency, we may have millions of them, and they need to
be low-overhead, so the prospect of requiring granular tls for each seems
really heavy?
Not denying there are significant gotchas with VtArray… possibly it’s
over-constrained, but all the constraints still exist somewhere in our
codebase, unfortunately. We’ve taken suggestions for API to make it easier
and more prominent to extract a const-array from a mutable one, which is a
source of many of the gotchas. That might be useful here? But if not,
thread-spawning code needs to provide copies of arrays to workers rather
than references, and hopefully that doesn’t degrade performance too much?
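The std::shared_ptr comparison above can be sketched with the standard library alone. In this hedged example each worker captures its own handle by value, so the only shared state is the atomically maintained reference count and the immutable payload. Shared and SumWithPerWorkerCopies are hypothetical names; std::shared_ptr<const std::vector<int>> is used as an analogy and is not a claim about how VtArray is implemented.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <thread>
#include <vector>

// Read-only payload behind a reference-counted handle (analogy only).
using Shared = std::shared_ptr<const std::vector<int>>;

int SumWithPerWorkerCopies(const Shared& source, int numThreads) {
    std::vector<int> partial(numThreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        // `handle = source` copies the handle on the spawning thread, so each
        // worker owns a private handle instead of a reference to `source`.
        workers.emplace_back([handle = source, &partial, t] {
            partial[t] = std::accumulate(handle->begin(), handle->end(), 0);
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

Copying a handle costs one atomic increment rather than a deep copy of the elements, which is the kind of overhead the comment above hopes will not degrade performance too much.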
|
On my side I've found that I can force the issue to occur using the debugger and tightly controlling the timing of the various threads. I don't think that's surprising given the technical discussion going on here but it's nice to have practical backup to our ideas. I also figured out that this race can't happen in UsdView. I can work around the issue by implementing my own class derived from HdExtComputation which also calls |
In this case I'd propose that Aside from avoiding redundant work, having threads calling non-const methods on the cached value while other threads may be in the middle of copying it seems like a bad idea in general. |
- In methods like _ComputeJointWorldInverseBindTransforms(), check the compute flag again after acquiring the lock to avoid potentially recomputing the result again if multiple threads were waiting on the mutex. Although the computed result would not change, it is not safe to call mutable member functions of the VtArray (which can cause a copy-on-write detach) while other threads may be in the middle of making a copy of it.
- Prefer using operator|= to atomically set the flag rather than doing a read -> bitwise OR -> atomic store sequence which could cause flags to be lost if there are concurrent writes. Currently the writes are all guarded by the same mutex so the previous approach was not problematic, but the new approach is safer if e.g. in the future there are separate locks for each cached array.
Bug: PixarAnimationStudios#1742
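The two points in this commit message can be sketched with the standard library as follows. This is an illustration of the general pattern (re-check the flag under the lock, then set it with an atomic read-modify-write), not the actual UsdSkel code; Definition, the flag values and ComputeCountAfterConcurrentReads are hypothetical, and std::vector stands in for VtArray.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical flag bits, one per cached array.
enum ComputeFlag : int { kInverseBind = 1 << 0, kOtherCache = 1 << 1 };

class Definition {
public:
    const std::vector<double>& GetInverse() {
        // Fast path: skip the lock once the flag is visible.
        if (!(_flags.load(std::memory_order_acquire) & kInverseBind)) {
            _ComputeInverse();
        }
        return _inverse;
    }

    // Only meaningful once all threads using the object have been joined.
    int ComputeCount() const { return _computeCount; }

private:
    void _ComputeInverse() {
        std::lock_guard<std::mutex> lock(_mutex);
        // Second check: another thread may have finished the work while this
        // one was waiting on the mutex, so the cache must not be touched again.
        if (_flags.load(std::memory_order_acquire) & kInverseBind) {
            return;
        }
        _inverse.assign(8, 1.0);   // placeholder for the real computation
        ++_computeCount;
        // Single atomic read-modify-write: concurrent updates to other bits
        // cannot be lost, unlike a separate load, OR and store.
        _flags |= kInverseBind;
    }

    std::atomic<int> _flags{0};
    std::mutex _mutex;
    std::vector<double> _inverse;
    int _computeCount = 0;         // guarded by _mutex
};

// Hammers one Definition from several threads and reports how many times the
// computation actually ran.
int ComputeCountAfterConcurrentReads(int numThreads) {
    Definition def;
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back([&def] { def.GetInverse(); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return def.ComputeCount();
}
```

With the second check in place the count is 1 regardless of how the threads interleave, because every writer serializes on the mutex and all but the first return early.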
Description of Issue
_GetJointWorldInverseBindTransforms tries to limit itself to computing the inverse bind transforms once by testing to see if the ComputeFlag has been set in UsdSkel_SkelDefinition::_flags before performing the calculation. However, if this code is executed in parallel, multiple threads could read the value of _flags before the computation finishes and ComputeFlag is set. This could cause multiple threads to enter _ComputeJointWorldInverseBindTransforms and do the computation.
There is a lock in _ComputeJointWorldInverseBindTransforms which prevents anything really bad from happening (as far as I can see, more on that later) but we do waste a little bit of time on load re-doing this calculation.
How did I discover this? I'm working on CPU USDSkel for MayaUSD, and I occasionally crash on file load near here. Specifically, the calling code is UsdSkelSkeletonQuery::_ComputeSkinningTransforms, and I crash at line 378 in the inline dtor for local variable inverseBindXforms. The dtor has the ref count for the VtArray going to zero and crashing trying to delete the underlying control block. However, tracing the code tells me that the reference count should never be going to zero here, because the data underlying inverseBindTransforms should always be held by the UsdSkel_SkelDefinition in _jointWorldInverseBindXforms. The key accident that keeps it safe is when the transforms are re-computed the resize call discovers the array is already the correct size and does nothing. This prevents the storage from changing and breaking other threads that are already accessing _jointWorldInverseBindXforms through their own local VtArray.
Clearly I'm missing something, so I'm hoping y'all can take a look and tell me if you see a race that could cause the crash.
I tried reproducing this in USDView and I couldn't reproduce the crash there. I didn't try investigating to see if multiple threads could re-do _ComputeJointWorldInverseBindTransforms, but I think it probably can happen there too.
Sorry for the relative vagueness of this issue, I'd prefer to pin down exactly what the race is but I'm stuck trying to figure out how the crash could occur and I don't see it.
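The check-then-compute window described above can be forced deterministically in a small standard-library sketch. Everything here is hypothetical and for illustration only: the `arrived` counter exists purely to hold every thread between the flag check and the lock, which reproduces the unlucky interleaving on every run, and std::vector stands in for VtArray. It shows the redundant computation only, not the crash.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Returns how many threads ended up running the "computation" when the flag
// is checked only outside the lock.
int CountUnguardedComputations(int numThreads) {
    std::atomic<bool> computed{false};  // plays the role of the compute flag
    std::atomic<int> arrived{0};        // demo-only rendezvous counter
    std::mutex mutex;
    std::vector<double> cache;
    int computeCount = 0;               // guarded by mutex

    auto worker = [&] {
        // Check outside the lock, as in the pattern described in the issue.
        const bool needsCompute = !computed.load();

        // Demo only: wait until every thread has done its check, so that all
        // of them observe the flag as unset before anyone computes.
        arrived.fetch_add(1);
        while (arrived.load() < numThreads) {
            std::this_thread::yield();
        }

        if (needsCompute) {
            std::lock_guard<std::mutex> lock(mutex);
            // No second check here, so every waiting thread redoes the work.
            cache.assign(8, 1.0);       // placeholder for the real computation
            ++computeCount;
            computed.store(true);
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back(worker);
    }
    for (auto& t : threads) {
        t.join();
    }
    return computeCount;
}
```

The lock keeps the writes from overlapping, but each thread still rewrites the cache in turn, which matches the "waste a little bit of time" observation; whether a rewrite is harmless depends on what the container does on mutable access.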
Steps to Reproduce
System Information (OS, Hardware)
Windows
Package Versions
USD 21.11, the MayaUSD branch I linked above
Build Flags