Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional debugging information when a DrawCommand fails #7055

Open
thw0rted opened this issue Sep 20, 2018 · 8 comments
Open

Additional debugging information when a DrawCommand fails #7055

thw0rted opened this issue Sep 20, 2018 · 8 comments

Comments

@thw0rted
Copy link
Contributor

Some background: I have a big, complex application built on Cesium. One of the systems we load data from recently had a major upgrade and now our application is crashing, and I can't for the life of me figure out why. I'm trying to reproduce a test case, but I think if I knew how to reproduce it in a simpler context, I could probably figure out how to stop it from happening in the first place.

Each time it crashes, I get a stack trace that points to this line in Context.js, which I managed to ascertain is because drawCommand._vertexArray has been destroyed. (drawCommand.owner is a BillboardCollection, if that helps.) It sounded a lot like #6916 but a) I'm already using 1.49 which includes the fix for that issue, and b) I've actually seen several other issues with very similar stack traces even though the root causes were wildly different.

That brings me to what I actually hope to achieve here. Yes, I would love some suggestions of what my next steps should be trying to debug this, but more than that I'd like some capability to generate a more helpful message than "somebody destroyed this at some point, for some reason". Is this already possible? Could it be added, probably behind a debug flag, for performance reasons?

@hpinkos
Copy link
Contributor

hpinkos commented Sep 20, 2018

Hi @thw0rted, sorry you're run into this problem, these kinds of issues can be a pain to debug. Was this happening in other version of Cesium, or is it more likely it's related to changes in data from your server?

I'm not sure how we could improve the debugging information here. It sounds like somewhere along the way, VertexArray.destroy is being called, but it's being passed to a DrawCommand anyway. When destroy is called on an object, destroyObject replaces every function with a function that will throw a DeveloperError. This way we know there's a bug because we are using an object that has already been destroyed and that should never happen.

I scanned through the code for BillboardCollection and I couldn't find any obvious reason why the vertex array would be destroyed. You can try adding something like

if (va[index].va.isDestroyed()) { debugger; }

here: https://github.com/AnalyticalGraphicsInc/cesium/blob/master/Source/Scene/BillboardCollection.js#L1851

To see if you can catch it a little further up the chain.

If you're able to put together a reproducible Sandcastle example, let us know! That should make it a lot easier to track down the problem. My only other suggestion is that sometimes things like are caused by a race condition triggered by picking.

@thw0rted
Copy link
Contributor Author

I do often have this problem while picking, but I think I'm seeing similar errors sometimes when just showing or hiding DataSources (or adding / removing entities) as well. I'll add the check you describe and see if there's anything else helpful from the debugger, but what I'd really like to know is what's destroying the VertexArray in the first place.

@thw0rted
Copy link
Contributor Author

Update: I added the debugger; statement in both BillboardCollection and PointPrimitiveCollection, which uses the same logic around L1013, as well as a check for isDestroyed just before the _bind call I mentioned in the OP. I also added a console.log statement in each place because sometimes Firefox was a bit flaky about actually respecting debugger;. Every time I can cause the crash, the check before command.vertexArray = va[index].va; does not trigger, i.e., at the time of assignment, the VA has not been destroyed.

It's still kind of unreliable, but the crash generally happens either when picking, or when toggling DataSource.show. Very infrequently, it happens during app startup, which involves querying our server, building a DataSource, and adding entities to it. I'm going to keep fiddling with our application logic to see if there's something special about the Entities we create now, compared to how it worked for the old version of the server software; any additional guidance about how to track down what destroyed the VA / where the race condition could have come from, would be much appreciated.

@hpinkos
Copy link
Contributor

hpinkos commented Sep 21, 2018

Thanks for the update @thw0rted! It sounds like you're starting to narrow it down.
Sorry, I don't have any other suggestions. Maybe @bagnell has an idea?

@thw0rted
Copy link
Contributor Author

Another update: I finally got it bulletproof, but at a cost. If I totally remove all billboards, labels, and points (basically anything 2D) from the entities I make, it never crashes. Just to really be sure, I turned the Billboard into a BoxGraphics with an ImageMaterial using the same texture -- works great, except of course that I can't size it dynamically relative to the viewport.

The absolutely infuriating part is that on my development system, everything is fine, but working against the upgraded (production) system, on an isolated network, I get these crashes that I can't otherwise reproduce. I'll keep poking at it next week. I'm planning to try bringing over a build based on 1.48 (maybe even 1.47?) just to be certain it's not something introduced when I upgraded to 1.49.

@shunter
Copy link
Contributor

shunter commented Sep 21, 2018

You could change destroyObject to capture the current stack trace at the time of destruction. That would help track down who is unexpected destroying your object.

@thw0rted
Copy link
Contributor Author

Great idea @shunter , I've tried this and it turns out the stack dump is approximately

Since BillboardCollection#update is so huge, I thought I'd just poke around at where this._vaf is accessed and noticed that there is an early exit at L1597 when _vaf or _vaf.va is undefined. I figured, it can't hurt to tack on || this._vaf.isDestroyed(). This definitely decreased the incidence of crashes but somehow I still get the same kind of failures. I'm going to keep working to figure out how that's possible -- somehow, draw commands are being created with a destroyed VAF, or they're getting shared between commands and destroyed in some kind of race condition.

In the meantime, I'd like some feedback. Is the test I added worthwhile? Calls to isDestroyed shouldn't be expensive, right? Is it reasonable to make this change permanent even if I can't give you a test that shows how I can arrive at that line with a destroyed _vaf?

@thw0rted
Copy link
Contributor Author

Small update: just checking for _vaf.isDestroyed() in BillboardCollection didn't get rid of all the crashes, but adding if (va.isDestroyed()) { return; } at the beginning of continueDraw appears to have stopped all crashes. Again, I need some time to test and make sure it's totally stable, but same questions from the previous post apply: is this a bad addition? Will it hurt performance? It feels like a band-aid over a wound of unknown origin, so I'd really like to figure out what this race condition is.

A thought occurred to me that might be important: my dev system is comparatively powerful, and we probably have somewhat less test data than what's on the production system (fewer Entities loaded). I'll try to add a bunch of extra test data and see if performance factors into this -- if picking takes many frames to compute, maybe that accounts for the "stale" command with the deleted VAF?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants