
fix(metrics): handle the case where the worker is already assigned to a thread #1171

Merged: 14 commits merged into main from fix/worker-ready-metric on Nov 21, 2024

Conversation

@withinboredom (Collaborator) commented on Nov 15, 2024:

fixes #1163

There are also some miscellaneous fixes in here.

  1. I noticed that using the watcher introduced race conditions with the "ready waitgroup".
  2. I removed the atomics while trying to wrap my head around what was going on.
  3. I discovered `fatal error: opcache_reset not found` (#1172).
  4. Instead of a WaitGroup, I used a channel, which avoids the race conditions and is also much simpler to reason about (see the sketch after this list).
  5. I found a potential memory leak when a worker restarts.
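
As a hedged illustration of point 4, here is a minimal, self-contained sketch of channel-based readiness signaling; the names are illustrative, not the PR's actual code:

```go
package main

import "fmt"

func main() {
	const threads = 4
	ready := make(chan struct{})

	for i := 0; i < threads; i++ {
		go func() {
			// ... thread startup work ...
			ready <- struct{}{} // each thread signals exactly once
		}()
	}

	// Unlike a WaitGroup, the coordinator counts the signals itself,
	// so a mismatched Add/Done cannot silently corrupt the state, and
	// the receive is easy to combine with select/timeouts later.
	for i := 0; i < threads; i++ {
		<-ready
	}
	fmt.Println("all worker threads signaled ready")
}
```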

cc: @AlliBalliBaba

@withinboredom changed the title from "handle the case where the worker is already assigned to a thread" to "[metrics] handle the case where the worker is already assigned to a thread" on Nov 15, 2024
@withinboredom (Collaborator, Author) commented on Nov 15, 2024:

Strange that the tests are failing; they aren't failing locally.

Edit: the only difference between my build and the build in the tests is that mine is built with `nowatcher`. I'll install the watcher and see if that is the issue.

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

Yep, this is indeed the case. Compiling with the watcher causes the tests to be non-deterministic. I'll investigate. I also ran into #1172.

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

There are still some issues remaining with this metric:

When a worker restarts, the original thread is given back to the "cgi pool" and a new worker is instantiated from that pool, which may or may not be the same thread as before. This will cause the metric to eventually climb to the total number of PHP threads.

I'll wait for #1137, which may address this issue.
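
To make that drift concrete, here is a toy model of the lifecycle described above (purely illustrative, not FrankenPHP's code):

```go
package main

import "fmt"

func main() {
	const totalPHPThreads = 8
	readyWorkers := 0 // the gauge behind the ready-worker metric

	// Each restart hands the old thread back to the cgi pool and marks
	// a (possibly different) thread ready, but nothing ever un-marks
	// the old one, so the gauge only drifts in one direction.
	for restart := 0; restart < totalPHPThreads; restart++ {
		readyWorkers++ // new thread reports ready
		// missing: readyWorkers-- for the thread returned to the pool
	}
	fmt.Println(readyWorkers) // eventually equals totalPHPThreads
}
```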

worker.go (outdated, excerpt under review):

```go
// if the worker can stay up longer than backoff*2, it is probably an application error
upFunc := sync.Once{}
go func() {
	backingOffLock.RLock()
	wait := backoff * 2
	backingOffLock.RUnlock()
	time.Sleep(wait)
	worker.ready <- struct{}{}
```
@AlliBalliBaba (Collaborator) commented:

I'm not sure if this is the correct approach. What if the worker script takes longer than 200ms to start and then fails? Additionally, this will unnecessarily delay startup in most cases.
This line should probably be moved to metrics.ReadyWorker, right after frankenphp_handle_request() has been called.
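
A rough sketch of the suggested placement, with stand-in types (metrics.ReadyWorker's real signature is not shown in this thread, so treat all names here as assumptions):

```go
package main

import (
	"fmt"
	"sync"
)

type metrics struct{ once sync.Once }

// ReadyWorker is a stand-in for the real metrics call.
func (m *metrics) ReadyWorker(fileName string) {
	fmt.Println("ready:", fileName)
}

// workerLoop marks the worker ready at the moment it first blocks
// waiting for a request, i.e. when the script would have entered
// frankenphp_handle_request(), instead of after a backoff*2 sleep.
func workerLoop(fileName string, requests <-chan string, m *metrics) {
	for {
		m.once.Do(func() { m.ReadyWorker(fileName) })
		req, ok := <-requests
		if !ok {
			return
		}
		_ = req // ... dispatch the request to PHP ...
	}
}

func main() {
	reqs := make(chan string, 1)
	reqs <- "/index.php"
	close(reqs)
	workerLoop("worker.php", reqs, &metrics{})
}
```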

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

Honestly, I'm not sure why we need to wait for them to be fully ready, other than to ensure that the tests can be run. This implementation lets us continue while being reasonably confident that the workers are executing and not immediately crashing. Whether the application is ready to serve is the application's own domain, via liveness and health checks.

In fact, in the current implementation, it is continuing before all threads are ready anyway.

@AlliBalliBaba (Collaborator) commented:

Currently it will definitely wait for all threads to be ready via the workersReadyWG, which only reaches 0 after each thread has called frankenphp_handle_request(). With this implementation we could reach a 'ready' state even if the worker script doesn't contain frankenphp_handle_request() at all.
I guess it depends on when you would consider the worker thread to be 'ready', but why go with 'reasonable confidence' when you can have absolute confidence instead (and be faster in most cases)?

@withinboredom (Collaborator, Author) commented:

> Currently it will definitely wait for all threads to be ready via the workersReadyWG, which only reaches 0 after each thread has called frankenphp_handle_request().

I honestly don't remember what I ran into last night that specifically triggered this rabbit hole, other than that turning on the watcher caused all kinds of problems with the metrics output. TL;DR: I was just trying to simplify things to understand why it was causing issues, then running into other issues as I went, yak-shaving my way through it all.

I ran into a number of minor issues that sometimes made it past the atomics, like restarting during shutdown. Atomics are terrible unless you are using CAS: you are not guaranteed to read the latest value ... so I deleted them. From there, it was just figuring out what was going wrong with the WorkersReadyWG because, as mentioned, the current architecture doesn't really make sense for them, but we don't want to run tests until the workers are started.

So, most of the refactoring is just getting rid of atomics. This then uncovered tests that failed 100% of the time instead of randomly, particularly when restarting workers... and so on.

After all this, I still don't know why watchers specifically caused my issues 😆
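
On the atomics point, the difference between a racy load/store pair and a CAS loop in Go looks like this (a generic illustration, not code from this PR):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var counter int64
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Racy pattern: the value read may be stale by the time
			// the store happens, so updates can be silently lost:
			//   v := atomic.LoadInt64(&counter)
			//   atomic.StoreInt64(&counter, v+1)

			// CAS loop: retry until the swap succeeds against the
			// exact value we read, so no update is lost.
			for {
				v := atomic.LoadInt64(&counter)
				if atomic.CompareAndSwapInt64(&counter, v, v+1) {
					break
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter) // always 100 with CAS; often less with load/store
}
```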

@withinboredom (Collaborator, Author) commented:

> I guess it depends on when you would consider the worker thread to be 'ready', but why go with 'reasonable confidence' when you can have absolute confidence instead? (and be faster in most cases)

I moved it.

> TL;DR: I was just trying to simplify things to understand why it is causing issues

I don't think I'll merge this PR, but will continue working on a PR against your PR. This metric is fubar for now, but I think it is worth focusing on the work there.

@AlliBalliBaba (Collaborator) commented:

Yeah, I originally intended my refactor PR to just be a small step towards more control over PHP threads. But then I realized it could potentially also fix Fibers, and it kind of ended up doing more than I wanted it to.
The problem with these big PRs is that they tend to block other PRs (like this one).

@withinboredom (Collaborator, Author) commented on Nov 16, 2024:

Maybe we should take smaller steps in our PRs. Unfortunately, I never did figure out what was broken here, short of refactoring the entire worker stuff. :(

It might be worth working on an actual design in a Discussion? That way we'd be able to work on different parts independently and break the work down into smaller PRs. Maybe we could take a 'strangler fig' approach and reimplement it separately from the current code. We could use build tags to switch between worker implementations (at least while we're working on it), and people could test it more easily.
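
The build-tag switch could look roughly like this (the file names and the workerv2 tag are hypothetical, not anything agreed in this thread):

```go
// worker_legacy.go
//go:build !workerv2

package frankenphp

// startWorkers runs the current worker implementation.
func startWorkers() {
	// ... existing code path ...
}
```

```go
// worker_v2.go
//go:build workerv2

package frankenphp

// startWorkers runs the reimplemented worker pool, selected with:
//   go build -tags workerv2
func startWorkers() {
	// ... new code path ...
}
```

Both files declare the same function, and exactly one is compiled depending on the tag, so callers never change while the two implementations coexist.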

@AlliBalliBaba (Collaborator) commented:
#1137 indeed addresses the issue with worker threads turning into regular threads. There'll probably be a lot of merge conflicts with this PR.

@withinboredom (Collaborator, Author) commented:
@AlliBalliBaba, I don't see any requested changes?

@dunglas changed the title from "[metrics] handle the case where the worker is already assigned to a thread" to "fix(metrics): handle the case where the worker is already assigned to a thread" on Nov 21, 2024
@dunglas merged commit 08e99fc into main on Nov 21, 2024
46 checks passed
@dunglas deleted the fix/worker-ready-metric branch on November 21, 2024 at 12:23
@dunglas (Owner) commented on Nov 21, 2024:

Thank you!

Successfully merging this pull request may close: frankenphp_worker_php_ready_worker is skewing to negative values with worker restarts (#1163)

3 participants