Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple Websocket Crashes after updating to 0.43.0 #1808

Closed
philfi opened this issue May 26, 2021 · 24 comments
Closed

Multiple Websocket Crashes after updating to 0.43.0 #1808

philfi opened this issue May 26, 2021 · 24 comments
Assignees

Comments

@philfi
Copy link
Contributor

philfi commented May 26, 2021

Bug report

Recently updated from 0.35.0 to 0.43.0 and have encountered a few new crashes related to websockets usage. Didn't find any similar issues reported here but have found somewhat similar reports in the Starscream library, namely daltoniam/Starscream#773. However that issue was supposedly resolved in the 4.0.4 Starscream release that is used in the latest (0.43.0) release of apollo-ios. Have requested more information on the status of that Starscream issue but am curious if anyone else using graphql subscriptions has encountered this issue? Only have a few different crash stack traces to work off of atm but will update as the investigation proceeds. Any assistance/insight is greatly appreciated!

Versions

Please fill in the versions you're currently using:

  • apollo-ios SDK version: 0.43.0
  • Xcode version: 12.5
  • Swift version: 5.1
  • Package manager: Cocoapods

Steps to reproduce

Have been unable to reproduce thus far.

Further details

Will update with more information as we gather it but for now here are the stack traces that we have collected from crashes thus far, starting with the one that has been most frequent.

Stack Trace 1:

EXC_BREAKPOINT 0x0000000197b15fb8
Crashed: com.vluxe.starscream.websocket
0  CoreFoundation                 0x19aafb52c CFHash.cold.1 + 16
1  CoreFoundation                 0x19aa08320 CFHash + 532
2  CoreFoundation                 0x19aab3980 CFBasicHashGetCountOfKey + 212
3  CoreFoundation                 0x19aa0ceac CFSetContainsValue + 100
4  CoreFoundation                 0x19aa00574 CFRunLoopRemoveSource + 180
5  CFNetwork                      0x19b19452c _CFStreamErrorFromCFError + 175344
6  CoreFoundation                 0x19aa1ff80 CFWriteStreamWrite + 368
7  Starscream                     0x106bfd68c FoundationTransport.write(data:completion:) + 127 (FoundationTransport.swift:127)
8  Starscream                     0x106c13674 WSEngine.connectionChanged(state:) + 122 (WSEngine.swift:122)
9  Starscream                     0x106bfe038 FoundationTransport.stream(_:handle:) + 207 (FoundationTransport.swift:207)
10 Starscream                     0x106bfe26c @objc FoundationTransport.stream(_:handle:) + 4407140972 (<compiler-generated>:4407140972)
11 CoreFoundation                 0x19aa21c1c _signalEventSync + 216
12 CoreFoundation                 0x19aa21d34 ___signalEventQueue_block_invoke + 28
13 libdispatch.dylib              0x19a67c24c _dispatch_call_block_and_release + 32
14 libdispatch.dylib              0x19a67ddb0 _dispatch_client_callout + 20
15 libdispatch.dylib              0x19a68510c _dispatch_lane_serial_drain + 580
16 libdispatch.dylib              0x19a685c5c _dispatch_lane_invoke + 408
17 libdispatch.dylib              0x19a68fd78 _dispatch_workloop_worker_thread + 708
18 libsystem_pthread.dylib        0x1e6535814 _pthread_wqthread + 276
19 libsystem_pthread.dylib        0x1e653c76c start_wqthread + 8

Stack Trace 2:

EXC_BAD_ACCESS KERN_PROTECTION_FAILURE 0x0000001000000014
Crashed: com.vluxe.starscream.websocket
0  libobjc.A.dylib                0x1b5c27230 objc_opt_respondsToSelector + 20
1  CoreFoundation                 0x1a17ef9c8 _inputStreamCallbackFunc + 48
2  CoreFoundation                 0x1a1820f4c _signalEventSync + 212
3  CoreFoundation                 0x1a1820fac ___signalEventQueue_block_invoke + 24
4  libdispatch.dylib              0x1a14bd298 _dispatch_call_block_and_release + 24
5  libdispatch.dylib              0x1a14be280 _dispatch_client_callout + 16
6  libdispatch.dylib              0x1a149a4f0 _dispatch_lane_serial_drain$VARIANT$armv81 + 568
7  libdispatch.dylib              0x1a149afdc _dispatch_lane_invoke$VARIANT$armv81 + 404
8  libdispatch.dylib              0x1a14a4800 _dispatch_workloop_worker_thread + 692
9  libsystem_pthread.dylib        0x1ea0155a4 _pthread_wqthread + 272
10 libsystem_pthread.dylib        0x1ea018874 start_wqthread + 8

Stack Trace 3:

EXC_BAD_ACCESS KERN_INVALID_ADDRESS 0x0000000000000000
Crashed: com.vluxe.starscream.websocket
0  libobjc.A.dylib                0x1a0372110 objc_msgSend + 48
1  CoreFoundation                 0x18b2c25cc _inputStreamCallbackFunc + 52
2  CoreFoundation                 0x18b2f5c1c _signalEventSync + 216
3  CoreFoundation                 0x18b2f5d34 ___signalEventQueue_block_invoke + 28
4  libdispatch.dylib              0x18af5024c _dispatch_call_block_and_release + 32
5  libdispatch.dylib              0x18af51db0 _dispatch_client_callout + 20
6  libdispatch.dylib              0x18af5910c _dispatch_lane_serial_drain + 580
7  libdispatch.dylib              0x18af59c5c _dispatch_lane_invoke + 408
8  libdispatch.dylib              0x18af63d78 _dispatch_workloop_worker_thread + 708
9  libsystem_pthread.dylib        0x1d6e09814 _pthread_wqthread + 276
10 libsystem_pthread.dylib        0x1d6e1076c start_wqthread + 8

Stack Trace 4:

EXC_BAD_ACCESS KERN_INVALID_ADDRESS
Crashed: com.vluxe.starscream.websocket
0  libobjc.A.dylib                0x1a7575af4 objc_loadWeakRetained + 148
1  Starscream                     0x1024cdc00 FoundationTransport.read() + 174 (FoundationTransport.swift:174)
2  Starscream                     0x1024ce26c @objc FoundationTransport.stream(_:handle:) + 4407288428 (<compiler-generated>:4407288428)
3  CoreFoundation                 0x1924d5c1c _signalEventSync + 216
4  CoreFoundation                 0x1924d5d34 ___signalEventQueue_block_invoke + 28
5  libdispatch.dylib              0x19213024c _dispatch_call_block_and_release + 32
6  libdispatch.dylib              0x192131db0 _dispatch_client_callout + 20
7  libdispatch.dylib              0x19213910c _dispatch_lane_serial_drain + 580
8  libdispatch.dylib              0x192139c5c _dispatch_lane_invoke + 408
9  libdispatch.dylib              0x192143d78 _dispatch_workloop_worker_thread + 708
10 libsystem_pthread.dylib        0x1ddfe9814 _pthread_wqthread + 276
11 libsystem_pthread.dylib        0x1ddff076c start_wqthread + 8
@AnthonyMDev
Copy link
Contributor

Thank you for the report. These stack traces don't really give us enough information to deduce any sort of cause yet. It seems like it's more likely something that is going to be able to be tracked down with help from the Starscream community. But please keep us updated as you debug and come up with more information. If there is something involving our Starscream integration that is causing this, I'd love to know.

@philfi
Copy link
Contributor Author

philfi commented May 26, 2021

Yep didn't expect anyone to have a solve looking at these stack traces, was more looking to see if anyone else has encountered anything similar while upgrading to the more recent versions of apollo-ios (that include Starscream 4.x.x) that might aid in the information gathering and investigation of the issue. Will keep this thread updated accordingly.

@STomperi
Copy link
Contributor

I have encountered exactly the the same issue. Downgrading to Apollo GraphQL version 0.41.0 with StarScream 3.1.1 fixed crashing.

@philfi
Copy link
Contributor Author

philfi commented May 27, 2021

@STomperi thanks for weighing in, super helpful to hear that downgrading to a version running Starscream 3.1.1 is an option.

@AnthonyMDev
Copy link
Contributor

So the next step is diffing the versions to see what has changed! It could be a change in Apollo or in Starscream that is causing it. If you are able to confirm that it's a Starscream issue, we can just roll back our dependency for the time being. If it's an Apollo issue let me know!

@designatednerd
Copy link
Contributor

@AnthonyMDev Rolling back is a little easier said than done since it's a major version change and there were some pretty significant breaks, but it could be doable.

Looking at daltoniam/Starscream#773, it looks like there were some fixes applied that related to making sure things were nilled out in an internal deinit, so I went looking to make sure we're not doing anything silly on our end there. We're definitely cancelling the connection and removing ourselves as the delegate in our deinit, so I don't think that's the issue.

Looking at the docs around EXC_BREAKPOINT, are you able to get any further information from your crash report? It suggests maybe there's more info:

The Swift runtime uses trace traps for specific types of unrecoverable errors—see Addressing Crashes from Swift Runtime Errors for information on those errors. Some lower-level libraries, such as Dispatch, trap the process with this exception upon encountering an unrecoverable error, and log additional information about the error in the Additional Diagnostic Information section of the crash report. See Diagnostic Messages for information about those messages.

Does that help come up with any kind of messaging we might be able to get more info out of from the system?

I'm also curious if the issue is happening on versions of iOS < 13, since as I recall, a big piece of the 4.0 release of Starscream was setting up to use the URLSessionWebSocket under the hood for 13+.

@philfi
Copy link
Contributor Author

philfi commented May 28, 2021

@designatednerd haven't been able to reproduce myself but now that you mention it I am seeing this extra piece of information in some of the crash report data: *** CFHash() called with NULL ***. Will have to dig into that more, looks like there's some closed starscream issues referencing that message. I can confirm that of those affected in our user base, 99% are running iOS 14 devices though have also seen a few on 12 and 13.

@designatednerd
Copy link
Contributor

The crashes on 12 are the ones that pique my interest since that would be using a different underlying websocket engine instead of URLSessionWebSocket.

This issue seems like it might be relevant - @STomperi you submitted a patch with #1740 that wound up as part of the 0.43.0 release - are you saying that's not helping your crashes?

@philfi are you also seeing these crashes on 0.42.0? That has the 4.x version of Starscream but not the changes @STomperi added.

@philfi
Copy link
Contributor Author

philfi commented May 28, 2021

Sorry can only speak to 0.35.0 and 0.43.0 for now. Currently back on 0.35.0 and am planning on updating to 0.41.0 next (pre 4.x changes).

@designatednerd
Copy link
Contributor

OK cool, sounds like a plan.

@KavonSakul
Copy link

@designatednerd We are using 0.42.0 and we have been experiencing same crashes.

Crashed: com.vluxe.starscream.websocket
...
FoundationTransport.swift - Line 127
FoundationTransport.write(data:completion:) + 127

@kylebrowning
Copy link

Related,
image

Enabling the SplitNetworkTransport, and including the web socket results in massive memory leaks, in 12 minutes alone it generated 2.2gs of memory.

@designatednerd
Copy link
Contributor

Ick, that's no bueno.

Are you able to get more information about where the memory is leaking from Instruments, or provide a minimal replication project? That would be super, super helpful in tracking down the source of this.

@kylebrowning
Copy link

kylebrowning commented Jun 3, 2021

Basically if SSL connections are rejected it continuously tries to make new connections forever, never releasing the old ones.

@designatednerd
Copy link
Contributor

@kylebrowning can you confirm if you're on 0.42.0 vs 0.43.0? A fix for exponential socket connections went out with 0.43.0 so it would be good to know if there's still problems with that.

@kylebrowning
Copy link

I upgraded to 0.43.0 and it was still happening.

@designatednerd
Copy link
Contributor

blargh, but good to know. I'll see what we can figure out.

@kylebrowning
Copy link

@designatednerd maybe this will help.

This was in my console

write_timeout failed
2021-06-01 11:58:07.799951-0700 theathletic-ios-dev[5662:2600935] [connection] nw_endpoint_handler_set_adaptive_read_handler [C154.1 2606:4700::6810:e519.443 ready socket-flow (satisfied (Path is satisfied), viable, interface: en0, ipv4, ipv6, dns)] unregister notification for read_timeout failed
2021-06-01 11:58:07.800103-0700 theathletic-ios-dev[5662:2600935] [connection] nw_endpoint_handler_set_adaptive_write_handler [C154.1 2606:4700::6810:e519.443 ready socket-flow (satisfied (Path is satisfied), viable, interface: en0, ipv4, ipv6, dns)] unregister notification for write_timeout failed

When it eventually locked my UI on the main thread I hit pause on the execution and this is what I saw.
image (3)

@designatednerd
Copy link
Contributor

unregister notification for read_timeout failed seems...less than promising, I think that's under the hood not just of Starscream but of Apple's web socket implementation.

Will try to dive in to this.

@STomperi
Copy link
Contributor

The crashes on 12 are the ones that pique my interest since that would be using a different underlying websocket engine instead of URLSessionWebSocket.

This issue seems like it might be relevant - @STomperi you submitted a patch with #1740 that wound up as part of the 0.43.0 release - are you saying that's not helping your crashes?

@philfi are you also seeing these crashes on 0.42.0? That has the 4.x version of Starscream but not the changes @STomperi added.

That fix I made is for a crash that occurs very seldom, but was reliably noticed by thread sanitizer. So far I have seen only couple one or two crashes with app that has lot's of users.

On the other hand, StartScream 4.x crashes occur often. About once per day for each active user, so this bug is a really bad problem.

Also for me 0.42 crashes a lot in StarScream.

@AnthonyMDev
Copy link
Contributor

I want to invest some time in

Basically if SSL connections are rejected it continuously tries to make new connections forever, never releasing the old ones.

@kylebrowning Could you tell me what WebSocketEvent is being received to cause attemptReconnectionIfDesired() to be called?

If the web socket receives an error, it should stop trying to reconnect. When your SSL connection is rejected, what event is causing it to attempt to reconnect? I want to reproduce this case, but I'm not sure what the actual event that's causing it looks like.

@AnthonyMDev
Copy link
Contributor

@kylebrowning It would also be really helpful to get an idea of exactly what objects are leaking. When it keeps making new connections and retaining the old ones, what objects are you getting explosion on? I don't imagine that it's the actual Starscream.WebSocket object. It looks like that is only being created once. So when it calls connect() what is being created?

@AnthonyMDev
Copy link
Contributor

AnthonyMDev commented Jul 6, 2021

Due to the multiple stability issued we've experienced recently, we've decided to create a fork of Starscream and revert to using Starscream 3.x. See #1861 for more information.

This will remove some of the new functionality that Starscream 4.0 provided. Because we are on a fork of Starscream now, if you would like any of that functionality back, please feel free to make a PR to add it. You can make any additions to Starscream 3.x needed in a PR to our fork https://github.com/apollographql/Starscream and we would be happy to help with integrating them back into the ApolloWebSockets library.

A new version with these changes is expected to go out in the next couple of days. This will be a pretty big breaking change, but we believe it is for the best to solve the stability issues that have been uncovered. Please let us know if you continue to have issues after updating to this new version!

@AnthonyMDev AnthonyMDev modified the milestones: June 2021, July 2021 Jul 6, 2021
@AnthonyMDev
Copy link
Contributor

Version 0.45.0 has been released with the above mentioned changes. I'm closing this issue, in hopes that its resolved in this release. If you continue to have issues with Web Sockets please feel free to open a new issue!

@hwillson hwillson removed this from the MM-2021-07 milestone Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants