
UsdSkel _GetJointWorldInverseBindTransforms has a race, and may compute the transforms multiple times #1742

Open
williamkrick opened this issue Jan 14, 2022 · 11 comments


@williamkrick
Contributor

Description of Issue

_GetJointWorldInverseBindTransforms tries to limit itself to computing the inverse bind transforms once by testing whether the ComputeFlag has been set in UsdSkel_SkelDefinition::_flags before performing the calculation. However, if this code is executed in parallel, multiple threads could read the value of _flags before the computation finishes and the ComputeFlag is set. This could cause multiple threads to enter _ComputeJointWorldInverseBindTransforms and redo the computation.

There is a lock in _ComputeJointWorldInverseBindTransforms which prevents anything really bad from happening (as far as I can see; more on that later), but we do waste a little time on load redoing this calculation.
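
To make the shape of the race concrete, here is a minimal, self-contained sketch (hypothetical names; std::vector stands in for VtArray, and the real UsdSkel_SkelDefinition differs in detail):

```cpp
#include <atomic>
#include <mutex>
#include <vector>

class SkelDefinition {
public:
    // Called concurrently by multiple skinning queries.
    std::vector<double> GetJointWorldInverseBindTransforms() {
        // Unsynchronized check: several threads can all observe the flag
        // unset before the first finisher publishes it...
        if (!(_flags.load() & ComputedFlag)) {
            _ComputeJointWorldInverseBindTransforms();  // ...so all recompute.
        }
        return _xforms;  // copied while another thread may still be writing
    }

private:
    enum { ComputedFlag = 1 };

    void _ComputeJointWorldInverseBindTransforms() {
        std::lock_guard<std::mutex> lock(_mutex);
        // The lock serializes the writers, but it neither prevents the
        // redundant recomputation nor stops readers copying _xforms
        // outside the lock.
        _xforms.assign(16, 0.0);  // stand-in for the real computation
        _flags |= ComputedFlag;
    }

    std::atomic<int> _flags{0};
    std::mutex _mutex;
    std::vector<double> _xforms;
};
```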

How did I discover this? I'm working on CPU UsdSkel support for MayaUSD, and I occasionally crash on file load near here. Specifically, the calling code is UsdSkelSkeletonQuery::_ComputeSkinningTransforms, and I crash at line 378 in the inline dtor for the local variable inverseBindXforms. The dtor takes the VtArray's ref count to zero and crashes trying to delete the underlying control block. However, tracing the code tells me the reference count should never reach zero here, because the data underlying inverseBindXforms should always be held by the UsdSkel_SkelDefinition in _jointWorldInverseBindXforms. The key accident that keeps it safe is that when the transforms are recomputed, the resize call discovers the array is already the correct size and does nothing. This prevents the storage from changing and breaking other threads that are already accessing _jointWorldInverseBindXforms through their own local VtArray.

Clearly I'm missing something, so I'm hoping y'all can take a look and tell me if you see a race that could cause the crash.

I tried reproducing this in USDView and couldn't trigger the crash there. I didn't investigate whether multiple threads could redo _ComputeJointWorldInverseBindTransforms in that setting, but I think it probably can happen there too.

Sorry for the relative vagueness of this issue; I'd prefer to pin down exactly what the race is, but I'm stuck trying to figure out how the crash could occur and I don't see it.

Steps to Reproduce

  1. Sync and build my branch of MayaUSD: Autodesk/maya-usd@abd3eda
  2. Load a scene with USDSkel where multiple mesh rprims are bound to the same skeleton. You'll probably have to re-load a number of times to reproduce the crash. Depending on the scene it may happen once in ten tries or less.

System Information (OS, Hardware)

Windows

Package Versions

USD 21.11, the MayaUSD branch I linked above

Build Flags

@spiffmon
Member

Hi @williamkrick , sorry you're hitting this. I just wanted to be honest and let you know it's unlikely we'll be able to look into this anytime soon; our UsdSkel expertise is not high, currently, and the team that ostensibly supports it is small and Presto-based, not Maya-based.

@frankzhang11

@spiffmon I've hit the same problem @williamkrick describes. Basically, the lazy computation in _GetJointWorldInverseBindTransforms() makes every function that calls it non-thread-safe.

@williamkrick we had a workaround in our project: just precalculate these inverse bind transforms somewhere when the skeleton is loaded, to avoid the lazy computation (see the sketch below).
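
One hedged way to apply that, assuming the public UsdSkelSkeletonQuery::ComputeSkinningTransforms() pulls on the cached inverse bind transforms internally (the WarmSkelCaches helper is hypothetical; adapt it to your load path):

```cpp
#include "pxr/base/vt/types.h"
#include "pxr/usd/usd/timeCode.h"
#include "pxr/usd/usdSkel/skeletonQuery.h"

PXR_NAMESPACE_USING_DIRECTIVE

// Hypothetical helper: call once, from a single thread, right after load.
void WarmSkelCaches(const UsdSkelSkeletonQuery& skelQuery, UsdTimeCode time)
{
    VtMatrix4dArray xforms;
    // Discard the result; the point is the side effect of populating the
    // UsdSkel_SkelDefinition's cached inverse bind transforms so that later
    // parallel callers only take the read path.
    skelQuery.ComputeSkinningTransforms(&xforms, time);
}
```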

@williamkrick
Contributor Author

@spiffmon no worries, I just wish I had a solid answer on what the race is.

@frankzhang11 thanks for the idea! I'll add that and, fingers crossed, it'll fix my crash.

@cameronwhite
Contributor

I don't think I've hit this crash in my use of UsdSkel, but here's my guess at a possible cause:

  • After the first thread finishes _ComputeJointWorldInverseBindTransforms(), any threads entering UsdSkel_SkelDefinition::_GetJointWorldInverseBindTransforms() will grab the computed result from _jointWorldInverseBindXforms. Since it's a VtArray, this bumps the ref count and shares the underlying data.

  • Any threads that were still waiting on the mutex in _ComputeJointWorldInverseBindTransforms() will eventually enter the critical section and redo the computation. As one of the earlier comments pointed out, the _jointWorldInverseBindXforms array is already the right size and doesn't need to be reallocated. However, if some other thread(s) have already made a copy in the meantime (described above), then writing to _jointWorldInverseBindXforms would trigger a _DetachIfNotUnique().

I suspect there could be an issue if a VtArray instance is in the middle of _DetachIfNotUnique() while another thread is attempting to copy it, e.g.:

  • thread A has a local copy of the array, which shares the data (refcount == 2)
  • thread B tries to write to the primary instance, entering the body of _DetachIfNotUnique() since _IsUnique() is false
  • thread C starts to copy the primary instance, but hasn't bumped the refcount yet
  • thread A finishes and decrements the refcount (refcount == 1)
  • thread B decrements the refcount, taking it to zero and destroying the data, before switching to its new copy of the data
  • thread C now has an array with a reference to the deleted data block
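
A stripped-down copy-on-write array showing the shape of this hazard (an illustration only, not VtArray's actual implementation):

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

struct ControlBlock {
    std::atomic<int> refCount{1};
    std::size_t size = 0;
    double* data = nullptr;
};

class CowArray {
public:
    explicit CowArray(std::size_t n) : _cb(new ControlBlock) {
        _cb->size = n;
        _cb->data = new double[n]();
    }

    // Thread C runs this: between reading other._cb and bumping its
    // refcount, threads A and B can drop the count to zero and free the
    // block, leaving this copy dangling.
    CowArray(const CowArray& other) : _cb(other._cb) {
        _cb->refCount.fetch_add(1);
    }

    ~CowArray() { _Release(_cb); }  // thread A's decrement happens here

    // Thread B runs this before writing: the uniqueness check and the
    // detach are not atomic with respect to concurrent copies.
    void DetachIfNotUnique() {
        if (_cb->refCount.load() == 1) {  // "_IsUnique()"
            return;
        }
        ControlBlock* fresh = new ControlBlock;
        fresh->size = _cb->size;
        fresh->data = new double[_cb->size];
        std::memcpy(fresh->data, _cb->data, _cb->size * sizeof(double));
        _Release(_cb);  // may delete the block thread C is still copying from
        _cb = fresh;
    }

private:
    static void _Release(ControlBlock* cb) {
        if (cb->refCount.fetch_sub(1) == 1) {
            delete[] cb->data;
            delete cb;
        }
    }

    ControlBlock* _cb;
};
```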

@jilliene

Filed as internal issue #USD-7143

@williamkrick
Contributor Author

Wow @cameronwhite thank you! I had forgotten that VtArray has copy-on-write semantics. The order of events you've laid out here seems to plausibly cause a crash. I will try to use the debugger to force this order of events to occur and see if I crash.

@frankzhang11

VtArray writes are not thread-safe.

@spitzak

spitzak commented Jan 20, 2022

Is it a bug that it is not thread-safe? It seems like a thread-safe version could be made (assuming each thread has its own refcount pointer to the array). If two threads tried to write at the same time, it would make two copies and then throw away the original after the copies were made.

@spiffmon
Member

spiffmon commented Jan 20, 2022 via email

@williamkrick
Contributor Author

On my side I've found that I can force the issue to occur by using the debugger to tightly control the timing of the various threads. I don't think that's surprising given the technical discussion going on here, but it's nice to have practical backup for our ideas.

I also figured out that this race can't happen in UsdView. HdStExtComputation::Sync() calls GetExtComputationInput on each input, including the joint world inverse bind transforms. The compute is an sprim, so this sync occurs serially and avoids the race.

I can work around the issue by implementing my own class derived from HdExtComputation which also calls GetExtComputationInput. I can't reuse HdStExtComputation because it does Hydra-GL-specific things like allocating buffer array ranges. With this change the crash doesn't seem to occur.
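
A rough sketch of that workaround; MyExtComputation is a hypothetical name, and a real implementation would also need to manage outputs and dirty-bit details. Pulling each scene input during the (serial) sprim sync forces the lazy UsdSkel computation before parallel access begins:

```cpp
#include "pxr/imaging/hd/extComputation.h"
#include "pxr/imaging/hd/sceneDelegate.h"

PXR_NAMESPACE_USING_DIRECTIVE

class MyExtComputation final : public HdExtComputation {
public:
    explicit MyExtComputation(const SdfPath& id) : HdExtComputation(id) {}

    void Sync(HdSceneDelegate* sceneDelegate,
              HdRenderParam* renderParam,
              HdDirtyBits* dirtyBits) override
    {
        // Let the base class update input names, kernel, etc.
        HdExtComputation::Sync(sceneDelegate, renderParam, dirtyBits);

        // Pull every scene input now, while sprims sync serially, so the
        // joint world inverse bind transforms get computed exactly once.
        for (const TfToken& input :
                 sceneDelegate->GetExtComputationSceneInputNames(GetId())) {
            sceneDelegate->GetExtComputationInput(GetId(), input);
        }
    }
};
```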

@cameronwhite
Contributor

In this case I'd propose that _ComputeJointWorldInverseBindTransforms() should re-check the ComputeFlag after entering the critical section and skip the computation if another thread has already performed it (or use a similar pattern like std::call_once).

Aside from avoiding redundant work, having threads call non-const methods on the cached value while other threads may be in the middle of copying it seems like a bad idea in general.
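
A minimal sketch of that double-checked pattern, reusing the hypothetical SkelDefinition names from the sketch earlier in the thread (not the real USD code):

```cpp
void SkelDefinition::_ComputeJointWorldInverseBindTransforms()
{
    std::lock_guard<std::mutex> lock(_mutex);
    // Re-check under the lock: threads that lost the race return without
    // touching the cached array again (no resize, no copy-on-write detach).
    if (_flags.load() & ComputedFlag) {
        return;
    }
    _xforms.assign(16, 0.0);  // stand-in for the real computation
    _flags |= ComputedFlag;   // atomic read-modify-write publish
}
```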

cameronwhite added a commit to sideeffects/USD that referenced this issue Mar 28, 2023
- In methods like _ComputeJointWorldInverseBindTransforms(), check the
  compute flag again after acquiring the lock, to avoid recomputing the
  result if multiple threads were waiting on the mutex. Although the
  computed result would not change, it is not safe to call mutable member
  functions of the VtArray (which can cause a copy-on-write detach) while
  other threads may be in the middle of making a copy of it.

- Prefer using operator|= to atomically set the flag rather than
  doing a read -> bitwise OR -> atomic store sequence which could cause
  flags to be lost if there are concurrent writes.
  Currently the writes are all guarded by the same mutex so the previous
  approach was not problematic, but the new approach is safer if e.g. in the
  future there are separate locks for each cached array.

Bug: PixarAnimationStudios#1742
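
The commit's second point in standalone form (ComputedFlag is a hypothetical flag bit):

```cpp
#include <atomic>

constexpr int ComputedFlag = 1 << 0;  // hypothetical flag bit

std::atomic<int> flags{0};

void SetFlagSafely()
{
    // Single atomic read-modify-write: concurrent writers cannot lose bits.
    flags |= ComputedFlag;
}

void SetFlagRacily()
{
    // Load -> OR -> store: a bit set by another thread between the load
    // and the store is silently overwritten.
    int tmp = flags.load();
    tmp |= ComputedFlag;
    flags.store(tmp);
}
```
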
cameronwhite added a commit to sideeffects/USD that referenced this issue Mar 30, 2023
marktucker pushed a commit to sideeffects/USD that referenced this issue Apr 26, 2023