
fix(metrics): handle the case where the worker is already assigned to a thread #1171

Merged: 14 commits merged into main from fix/worker-ready-metric on Nov 21, 2024

Conversation

@withinboredom (Collaborator) commented on Nov 15, 2024:

fixes #1163

There are also some miscellaneous fixes in here.

  1. I noticed that using the watcher introduced race conditions with the "ready waitgroup".
  2. I removed the atomics while trying to wrap my head around what was going on.
  3. I discovered `fatal error: opcache_reset not found` (#1172).
  4. Instead of a WaitGroup, I used a channel, which avoids the race conditions and is also much simpler to reason about (see the sketch after this list).
  5. I found a potential memory leak when a worker restarts.
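
As a hedged illustration of point 4, here is a minimal, self-contained sketch of channel-based readiness signaling; the names are illustrative, not the PR's actual code:

```go
package main

import "fmt"

func main() {
	const threads = 4
	ready := make(chan struct{})

	for i := 0; i < threads; i++ {
		go func() {
			// ... thread startup work ...
			ready <- struct{}{} // each thread signals exactly once
		}()
	}

	// Unlike a WaitGroup, the coordinator counts the signals itself,
	// so a mismatched Add/Done cannot silently corrupt the state, and
	// the receive is easy to combine with select/timeouts later.
	for i := 0; i < threads; i++ {
		<-ready
	}
	fmt.Println("all worker threads signaled ready")
}
```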

cc: @AlliBalliBaba

@withinboredom changed the title from "handle the case where the worker is already assigned to a thread" to "[metrics] handle the case where the worker is already assigned to a thread" on Nov 15, 2024
@withinboredom (Collaborator, Author) commented on Nov 15, 2024:

Strange that the tests are failing; they aren't failing locally.

Edit: the only difference between my build and the build in the tests is that mine is built with `nowatcher`. I'll install the watcher and see if that is the issue.

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

Yep, this is indeed the case. Compiling with the watcher causes the tests to be non-deterministic. I'll investigate. I also ran into #1172.

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

There are still some issues remaining with this metric:

When a worker restarts, the original thread is given back to the "cgi pool" and a new worker is instantiated from that pool, which may or may not be the same thread as before. This will cause the metric to eventually climb to the total number of PHP threads.

I'll wait for #1137, which may address this issue.
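
To make that drift concrete, here is a toy model of the lifecycle described above (purely illustrative, not FrankenPHP's code):

```go
package main

import "fmt"

func main() {
	const totalPHPThreads = 8
	readyWorkers := 0 // the gauge behind the ready-worker metric

	// Each restart hands the old thread back to the cgi pool and marks
	// a (possibly different) thread ready, but nothing ever un-marks
	// the old one, so the gauge only drifts in one direction.
	for restart := 0; restart < totalPHPThreads; restart++ {
		readyWorkers++ // new thread reports ready
		// missing: readyWorkers-- for the thread returned to the pool
	}
	fmt.Println(readyWorkers) // eventually equals totalPHPThreads
}
```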

worker.go (outdated, excerpt under review):

```go
// if the worker can stay up longer than backoff*2, it is probably an application error
upFunc := sync.Once{}
go func() {
	backingOffLock.RLock()
	wait := backoff * 2
	backingOffLock.RUnlock()
	time.Sleep(wait)
	worker.ready <- struct{}{}
```
@AlliBalliBaba (Collaborator) commented:

I'm not sure if this is the correct approach. What if the worker script takes longer than 200ms to start and then fails? Additionally, this will unnecessarily delay startup in most cases.
This line should probably be moved to metrics.ReadyWorker, right after frankenphp_handle_request() has been called.
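
A rough sketch of the suggested placement, with stand-in types (metrics.ReadyWorker's real signature is not shown in this thread, so treat all names here as assumptions):

```go
package main

import (
	"fmt"
	"sync"
)

type metrics struct{ once sync.Once }

// ReadyWorker is a stand-in for the real metrics call.
func (m *metrics) ReadyWorker(fileName string) {
	fmt.Println("ready:", fileName)
}

// workerLoop marks the worker ready at the moment it first blocks
// waiting for a request, i.e. when the script would have entered
// frankenphp_handle_request(), instead of after a backoff*2 sleep.
func workerLoop(fileName string, requests <-chan string, m *metrics) {
	for {
		m.once.Do(func() { m.ReadyWorker(fileName) })
		req, ok := <-requests
		if !ok {
			return
		}
		_ = req // ... dispatch the request to PHP ...
	}
}

func main() {
	reqs := make(chan string, 1)
	reqs <- "/index.php"
	close(reqs)
	workerLoop("worker.php", reqs, &metrics{})
}
```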

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

Honestly, I'm not sure why we need to wait for them to be fully ready, other than to ensure that the tests can be run. This implementation lets us continue while being reasonably confident that the workers are executing and not immediately crashing. Whether the application is ready to serve is the application's own domain, via liveness and health checks.

In fact, in the current implementation, it is continuing before all threads are ready anyway.

@AlliBalliBaba (Collaborator) commented:

Currently it will definitely wait for all threads to be ready via the workersReadyWG, which only reaches 0 after each thread has called frankenphp_handle_request(). With this implementation we could reach a 'ready' state even if the worker script doesn't contain frankenphp_handle_request() at all.
I guess it depends on when you would consider the worker thread to be 'ready', but why go with 'reasonable confidence' when you can have absolute confidence instead (and be faster in most cases)?

@withinboredom (Collaborator, Author) commented:

> Currently it will definitely wait for all threads to be ready via the workersReadyWG, which only reaches 0 after each thread has called frankenphp_handle_request().

I honestly don't remember what I ran into last night that specifically triggered this rabbit hole, other than that turning on the watcher caused all kinds of problems with the metrics output. TL;DR: I was just trying to simplify things to understand why it was causing issues, then running into other issues as I went, yak-shaving my way through it all.

I ran into a number of minor issues that sometimes made it past the atomics, like restarting during shutdown. Atomics are terrible unless you are using CAS: you are not guaranteed to read the latest value ... so I deleted them. From there, it was just figuring out what was going wrong with the WorkersReadyWG because, as mentioned, the current architecture doesn't really make sense for them, but we don't want to run tests until the workers are started.

So, most of the refactoring is just getting rid of atomics. This then uncovered tests that failed 100% of the time instead of randomly, particularly when restarting workers... and so on.

After all this, I still don't know why watchers specifically caused my issues 😆
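
On the atomics point, the difference between a racy load/store pair and a CAS loop in Go looks like this (a generic illustration, not code from this PR):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var counter int64
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Racy pattern: the value read may be stale by the time
			// the store happens, so updates can be silently lost:
			//   v := atomic.LoadInt64(&counter)
			//   atomic.StoreInt64(&counter, v+1)

			// CAS loop: retry until the swap succeeds against the
			// exact value we read, so no update is lost.
			for {
				v := atomic.LoadInt64(&counter)
				if atomic.CompareAndSwapInt64(&counter, v, v+1) {
					break
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter) // always 100 with CAS; often less with load/store
}
```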

@withinboredom (Collaborator, Author) commented:

> I guess it depends on when you would consider the worker thread to be 'ready', but why go with 'reasonable confidence' when you can have absolute confidence instead? (and be faster in most cases)

I moved it.

> TL;DR: I was just trying to simplify things to understand why it is causing issues

I don't think I'll merge this PR, but will continue working on a PR against your PR. This metric is fubar for now, but I think it is worth focusing on the work there.

@AlliBalliBaba (Collaborator) commented:

Yeah, I originally intended my refactor PR to just be a small step towards more control over PHP threads. But then I realized it could potentially also fix Fibers, and it kind of ended up doing more than I wanted it to.
The problem with these big PRs is that they tend to block other PRs (like this one).

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

Maybe we should take smaller steps in our PRs. Unfortunately, I never did figure out what was broken here, short of refactoring the entire worker stuff. :(

It might be worth working on an actual design in a Discussion? That way we'd be able to work on different parts independently and break the work down into smaller PRs. Maybe we could take a 'strangler fig' approach and reimplement it separately from the current code. We could use build tags to switch between worker implementations (at least while we're working on it), and people could test it more easily.
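
The build-tag switch could look roughly like this (the file names and the workerv2 tag are hypothetical, not anything agreed in this thread):

```go
// worker_legacy.go
//go:build !workerv2

package frankenphp

// startWorkers runs the current worker implementation.
func startWorkers() {
	// ... existing code path ...
}
```

```go
// worker_v2.go
//go:build workerv2

package frankenphp

// startWorkers runs the reimplemented worker pool, selected with:
//   go build -tags workerv2
func startWorkers() {
	// ... new code path ...
}
```

Both files declare the same function, and exactly one is compiled depending on the tag, so callers never change while the two implementations coexist.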

@AlliBalliBaba (Collaborator) commented:
#1137 indeed addresses the issue with worker threads turning into regular threads. There'll probably be a lot of merge conflicts with this PR.

@withinboredom (Collaborator, Author) commented:
@AlliBalliBaba, I don't see any requested changes?

@dunglas changed the title from "[metrics] handle the case where the worker is already assigned to a thread" to "fix(metrics): handle the case where the worker is already assigned to a thread" on Nov 21, 2024
@dunglas merged commit 08e99fc into main on Nov 21, 2024
46 checks passed
@dunglas deleted the fix/worker-ready-metric branch on November 21, 2024 at 12:23
@dunglas (Owner) commented on Nov 21, 2024:

Thank you!

Successfully merging this pull request may close: frankenphp_worker_php_ready_worker is skewing to negative values with worker restarts (#1163)

3 participants