fix(metrics): handle the case where the worker is already assigned to a thread #1171
Conversation
Strange that tests are failing; they aren't failing locally. Edit: the only difference between my build and the build in the tests is that mine is built with the watcher.
Yep, this is indeed the case. Compiling with the watcher causes the tests to be non-deterministic. I'll investigate. I also ran into #1172
There are still some issues remaining with this metric: when a worker restarts, the original thread is given to the "cgi pool" and a new worker is instantiated from the "cgi pool", which may or may not be the same thread as before. This will cause the metric to eventually increase to the total number of PHP threads. I'll wait for #1137, which may address this issue.
worker.go (outdated)
// if the worker can stay up longer than backoff*2, it is probably an application error
upFunc := sync.Once{}
go func() {
    backingOffLock.RLock()
    wait := backoff * 2
    backingOffLock.RUnlock()
    time.Sleep(wait)
    worker.ready <- struct{}{}
I'm not sure this is the correct approach. What if the worker script takes longer than 200ms to start and then fails? Additionally, this will unnecessarily delay startup in most cases.
This line should probably be moved to metrics.ReadyWorker, right after frankenphp_handle_request() was called.
Honestly, I'm not sure why we need to wait for them to be fully ready, other than to ensure that tests can be run. This implementation ensures that we can continue and be reasonably confident that the workers are executing and not immediately crashing. Whether the application is ready to serve or not is in the domain of the application, through liveness and health checks.
In fact, in the current implementation, it continues before all threads are ready anyway.
Currently it will definitely wait for all threads to be ready via the workersReadyWG, which only reaches 0 after each thread has called frankenphp_handle_request(). With this implementation we could reach a 'ready' state even if the worker script doesn't contain frankenphp_handle_request().
I guess it depends on when you would consider the worker thread to be 'ready', but why go with 'reasonable confidence' when you can have absolute confidence instead? (and be faster in most cases)
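For context, a rough, self-contained sketch of the workersReadyWG pattern described above (the real counter lives in FrankenPHP's worker code; thread count and structure here are only illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const numThreads = 4

	// Rough equivalent of workersReadyWG: one count per worker thread.
	var workersReadyWG sync.WaitGroup
	workersReadyWG.Add(numThreads)

	for i := 0; i < numThreads; i++ {
		go func(id int) {
			// In FrankenPHP, this Done() would correspond to the worker script
			// calling frankenphp_handle_request() for the first time.
			workersReadyWG.Done()
			// ... keep serving requests on this thread ...
			_ = id
		}(i)
	}

	// Startup only proceeds once every thread has reached its request loop,
	// which is the "absolute confidence" argued for above.
	workersReadyWG.Wait()
	fmt.Println("all worker threads are ready")
}
```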
Currently it will definitely wait for all threads to be ready via the workersReadyWG, which only reaches 0 after each thread has called frankenphp_handle_request().
I honestly don't remember what I ran into last night that specifically triggered this rabbit hole, other than that turning on the watcher caused all kinds of problems with the metrics being output. TL;DR: I was just trying to simplify things to understand why they were causing issues, then running into other issues as I went, yak-shaving my way through it all.
I ran into a number of minor issues that sometimes made it past the atomics, like restarting during shutdown. Atomics are terrible unless you are using CAS, since you are not guaranteed to get the latest value... so I deleted them. From there, it was just figuring out what was going wrong with the WorkersReadyWG because, as mentioned, the current architecture doesn't really make sense for it, but we don't want to run tests until workers are started.
So most of the refactoring is just getting rid of atomics. That then uncovered tests that failed 100% of the time instead of randomly, particularly when restarting workers... and so on.
After all this, I still don't know why watchers specifically caused my issues 😆
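A small, self-contained illustration of the "latest value" problem mentioned above: a separate load-then-store can lose updates under contention, while a CAS loop cannot. The restart counter here is purely hypothetical, not FrankenPHP's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var restarts int64
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			// Racy read-modify-write: a separate Load and Store can both observe
			// the same stale value, silently losing one of the increments:
			//   old := atomic.LoadInt64(&restarts)
			//   atomic.StoreInt64(&restarts, old+1)

			// CAS loop: retry until the swap succeeds against the current value.
			for {
				old := atomic.LoadInt64(&restarts)
				if atomic.CompareAndSwapInt64(&restarts, old, old+1) {
					break
				}
			}
		}()
	}

	wg.Wait()
	fmt.Println("restarts:", restarts) // always 100 with the CAS loop
}
```

For a plain increment, atomic.AddInt64 would also be correct; a CAS loop only becomes necessary for more complex read-modify-write updates.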
I guess it depends on when you would consider the worker thread to be 'ready', but why go with 'reasonable confidence' when you can have absolute confidence instead? (and be faster in most cases)
I moved it.
TL;DR: I was just trying to simplify things to understand why it is causing issues
I don't think I'll merge this PR; instead, I'll keep working on a PR for your PR. This metric is fubar for now, but I think it's worth focusing on the work there.
Yeah, I originally intended my refactor PR to just be a small step towards more control over PHP threads. But then I realized it could potentially also fix Fibers, and it kind of ended up doing more than I wanted it to.
The problem with these big PRs is that they tend to block other PRs (like this one).
Maybe we should take smaller steps in our PRs. Unfortunately, I never did figure out what was broken here, short of refactoring the entire worker stuff. :(
It might be worth working on an actual design in a Discussion? That way we'd be able to work on different parts independently and break the work down into smaller PRs. Maybe we could take a 'strangler fig' approach and reimplement it separately from the current code: we could use build tags to switch between worker implementations (at least while we're working on it), and people could test it more easily. A rough sketch follows.
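One way that build-tag switch could look — file names, the tag, and startWorkers are all hypothetical placeholders, not existing code:

```go
// worker_stable.go — current implementation, compiled by default.
//go:build !experimental_worker

package frankenphp

func startWorkers() {
	// existing worker pool logic
}

// worker_experimental.go — reimplemented pool, kept in a separate file.
//go:build experimental_worker

package frankenphp

func startWorkers() {
	// new worker pool logic
}
```

Building with `go build -tags experimental_worker` would then select the new implementation, while the default build keeps the current one.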
#1137 indeed addresses the issue with worker threads turning into regular threads. There'll probably be a lot of merge conflicts with this PR.
@AlliBalliBaba, I don't see any requested changes?
Thank you!
fixes #1163
There are also some miscellaneous fixes in here.
cc: @AlliBalliBaba