
StreamingUpdateProcessor reconnect seems to hang when hitting stream_read_timeout #209

Closed
pballandras opened this issue May 24, 2023 · 4 comments

Comments

@pballandras

pballandras commented May 24, 2023

Is this a support request?
🙅

Describe the bug
During the outage of May 18th, 2023, we saw a number of calls hang. The code using the SDK would just stop, with nothing apparently going on. I dug into the code briefly but couldn't figure out exactly what caused our calls to hang. In our code we call client.variation(...) (where client: LDClient), and that works well outside of outages. Of course, during outages things go wrong, but I'm not sure whether it's this call that hangs.

If you look at the logs below, you'll see that they end (that really is the end; the log isn't cut off) exactly 5 minutes after the StreamingUpdateProcessor initialized. In the code, the stream read timeout is also set to 5 minutes.

So, to describe the bug: it looks like there may be an infinite loop around the logic for the connection (or reconnection) to stream.launchdarkly.com. See the sketch after the list of code locations below for the kind of loop I have in mind.

Edit: Here's where I was in the code:

  • ldclient/impl/datasource/streaming.py at line 94
  • ldclient/impl/events/event_processor.py at line 473
  • ldclient/impl/util.py at line 114
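
To illustrate what I suspect (this is purely my own hypothetical pseudocode, not the SDK's actual implementation), a reconnect loop shaped like the one below never surfaces an error to the caller; if the blocking read inside it effectively has no timeout, or the retry path swallows every exception, the thread just sits there and the application appears to hang:

import time

def read_stream(connect):
    # Placeholder for a blocking SSE read; a real one would parse events
    # and raise when the read times out.
    with connect() as response:
        for line in response:          # blocks here waiting for data
            yield line

def run_streaming_processor(connect, handle, reconnect_delay=0.7):
    while True:                        # the loop itself never exits
        try:
            for line in read_stream(connect):
                handle(line)           # apply the update
        except IOError as exc:
            print(f"Unexpected error on stream connection: {exc}, will retry")
        print(f"Will reconnect after delay of {reconnect_delay}s")
        time.sleep(reconnect_delay)

Again, that is only a sketch of the failure mode I have in mind; the real logic lives around the lines listed above.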

To reproduce
Ok, so:

  1. create an outage at LD :trollface:

No, but for real: since the circumstances are kind of extreme, I'm not sure how to reproduce it. My guess is that you could reproduce this by timing out the heartbeat on the StreamingUpdateProcessor, but then again, I can't be sure... Sorry 🙏
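
For what it's worth, here is a rough, untested sketch of that reproduction idea: point the SDK at a local "stream" endpoint that accepts the connection and then goes silent, so the read times out the same way it did during the outage. The stream_uri and send_events parameter names are from my reading of the Config options, and the port, SDK key, and flag key are made up; also, the 5-minute stream read timeout itself looks like a constant around ldclient/impl/util.py rather than a Config setting, so a practical repro would probably need to patch it down to a few seconds.

import socket
import threading

import ldclient
from ldclient.config import Config

_open_connections = []

def silent_server(port=8765):
    # Accept connections but never send a byte, so every stream read
    # eventually times out.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen()
    while True:
        conn, _ = srv.accept()
        _open_connections.append(conn)   # hold the socket open, stay silent

threading.Thread(target=silent_server, daemon=True).start()

ldclient.set_config(Config(
    sdk_key="dummy-sdk-key",
    stream_uri="http://127.0.0.1:8765",  # silent local endpoint instead of LD
    send_events=False,
))
client = ldclient.get()

# With the stream read timeout shortened, this is where I'd expect to see the
# "Read timed out ... will retry" cycle from the logs below, and possibly the hang.
print(client.variation("some-flag-key", {"key": "test-user"}, False))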

Expected behavior
No hanging. Throwing an error would be absolutely fine in that case.

Logs

2023-05-18 11:35:56 INFO     Starting event processor
2023-05-18 11:35:56 INFO     Starting StreamingUpdateProcessor connecting to uri: https://stream.launchdarkly.com/all
2023-05-18 11:35:56 INFO     Waiting up to 5 seconds for LaunchDarkly client to initialize...
2023-05-18 11:35:56 INFO     StreamingUpdateProcessor initialized ok.
2023-05-18 11:35:56 INFO     Started LaunchDarkly Client: OK
[... time goes fast ...]
2023-05-18 11:40:56 WARNING  Unexpected error on stream connection: HTTPSConnectionPool(host='stream.launchdarkly.com', port=443): Read timed out., will retry
2023-05-18 11:40:56 INFO     Will reconnect after delay of 0.704595s
2023-05-18 11:51:11 WARNING  Error posting diagnostic event (will retry): HTTPSConnectionPool(host='events.launchdarkly.com', port=443): Read timed out. (read timeout=15)

SDK version
6.13.3

I thought we were on a more recent version. Looking at the changelog, I didn't find any explicit fix for this in later releases, but I may be wrong.

Language version, developer tools
Python 3.9

OS/platform
Ubuntu 18.04

Additional context
If I omitted anything, don't hesitate to ask; I've been on the other side of support cases, so I know users tend to forget the details.

Cheers!

@keelerm84
Member

Thank you for writing up this detailed issue, and of course my apologies for the issues you were experiencing.

I will try to look at this as soon as I can and let you know once we have a resolution.

Filed internally as sc-204447

@keelerm84
Member

Good morning. I have been looking into this, but so far I haven't been able to reproduce the problem.

To confirm a few things:

  • Do you have any special routing rules set up in your environment? Any proxies or firewalls that might affect these connections?

  • Was there unusually high load on the system during these times?

  • Are you using a remote data store (e.g. redis) or are you using the default in memory store?

  • You mention in the ticket looking at specific lines, but those don't seem to match up with the version you stated you are using. Just to confirm: this is a problem in 6.13.3, right?

  • Are there other processes within your application which might have locked the GIL and prevented other threads from continuing? (If this happens again, the sketch below is a cheap way to see where every thread is stuck.)
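
If this does recur, one low-overhead way to answer that last question is to register a stack-dump signal handler with the standard library's faulthandler module, then signal the process while it appears hung (the choice of SIGUSR1 here is just an example):

import faulthandler
import signal

# Print every thread's stack trace to stderr when the process receives
# SIGUSR1, e.g. via `kill -USR1 <pid>` while the application seems stuck.
faulthandler.register(signal.SIGUSR1, all_threads=True)

The resulting traceback would show exactly which line the SDK's stream thread (and every other thread) is blocked on.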

@pballandras
Author

Good morning!
It is indeed a hard case I've presented you with.

Do you have any special routing rules set up in your environment? Any proxies or firewalls that might affect these connections?

I would think not, since the rest of the time we are fine. This happened only twice, and both times very close to the outage. That said, I'll do a bit of research to be sure.

Was there unusually high load on the system during these times?

No, it's actually a pretty light workload.

Are you using a remote data store (e.g. redis) or are you using the default in memory store?

We use the default client, like this:

import os
import ldclient
from ldclient.client import LDClient

def _get_launch_darkly_client() -> LDClient:
    sdk_key = os.getenv(LAUNCHDARKLY_SDK_KEY_SECRET_ENV_VARIABLE) or _get_sdk_key()
    ldclient.set_sdk_key(sdk_key)
    return ldclient.get()
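
In other words, we never pass an explicit Config, so it should be the default in-memory store. For completeness, if we ever did need explicit settings, my understanding is it would look roughly like the sketch below; the connect_timeout and read_timeout names are my reading of the Config options and, as far as I can tell, they would not change the 5-minute stream read timeout seen in the logs above. The helper names are the same ones as in the snippet above.

import os

import ldclient
from ldclient.client import LDClient
from ldclient.config import Config

def _get_launch_darkly_client_with_config() -> LDClient:
    # Same key lookup as above, but with an explicit Config instead of set_sdk_key.
    sdk_key = os.getenv(LAUNCHDARKLY_SDK_KEY_SECRET_ENV_VARIABLE) or _get_sdk_key()
    ldclient.set_config(Config(
        sdk_key=sdk_key,
        connect_timeout=10,  # seconds to establish connections
        read_timeout=15,     # seconds for non-streaming reads (e.g. events)
    ))
    return ldclient.get()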

You mention in the ticket looking at specific lines, but those don't seem to match up with the version you stated you are using. Just to confirm: this is a problem in 6.13.3, right?

Yes, that's my bad. I was convinced (and then thoroughly disappointed that we were not) that we were on at least 7.x. So I looked at the 8.x and 7.x code, only to realize we were on 6.13.3.

Are there other processes within your application which might have locked the GIL and prevented other threads from continuing?

At that point, nothing was in the way; the only thing running was the flag fetch.


Keep in mind this might be somewhat of a hard edge case. In the meantime, I'm going to look into more details on our side too.

LaunchDarklyReleaseBot pushed a commit that referenced this issue May 31, 2023
@pballandras
Author

I'll close this, as I don't think either of us wants to spend more time on it. The problem has not come back and only happened under extreme circumstances, so it's simply not worth our time to investigate further. I'll see you again if there's ever another problem of this sort!

Cheers!
