
Performance fix for the logger in the executor #3734

Merged 4 commits into SeldonIO:master on Nov 13, 2021

Conversation

ivan-valkov (Contributor) commented:

What this PR does / why we need it:

The logger in the executor spins up a goroutine for each prediction, and that goroutine waits until the payload can be logged. With a slow downstream request logger and a large spike of requests, the executor can OOM.

This PR makes a few changes so the executor handles such traffic spikes more gracefully:

  • Increase the default work queue size and make it configurable. The channel itself now buffers the work; we no longer spawn goroutines that block on the channel write to act as a buffer.
  • Add a configurable timeout for when the buffer is full. Logs can be dropped if there is more work than the executor and downstream log processing can handle.
  • Increase the default number of workers. They spend most of their time waiting on I/O, so more of them can run than before.

There is also a general refactor of the producer/consumer pattern we use, to make it easier to understand and use without changing the behaviour.
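
A minimal sketch of the pattern described above, with illustrative names rather than the executor's actual types: predictions are enqueued straight into a sized channel, a fixed pool of workers drains it, and a full buffer only blocks the caller up to a timeout before the log is dropped.

```go
package logger

import (
	"errors"
	"sync"
	"time"
)

// LogRequest stands in for the payload the executor wants to log.
type LogRequest struct {
	Payload []byte
}

// Dispatcher buffers log work in a channel and drains it with a fixed
// worker pool, instead of spawning a goroutine per prediction.
type Dispatcher struct {
	workQueue    chan LogRequest
	writeTimeout time.Duration // <= 0 means drop immediately when the buffer is full
	wg           sync.WaitGroup
}

func NewDispatcher(workers, bufferSize int, writeTimeout time.Duration, handle func(LogRequest)) *Dispatcher {
	d := &Dispatcher{
		workQueue:    make(chan LogRequest, bufferSize),
		writeTimeout: writeTimeout,
	}
	for i := 0; i < workers; i++ {
		d.wg.Add(1)
		go func() {
			defer d.wg.Done()
			for req := range d.workQueue {
				handle(req) // I/O-bound call to the downstream request logger
			}
		}()
	}
	return d
}

// Enqueue tries to buffer a log request. If the buffer is full it waits up to
// writeTimeout and then drops the log rather than blocking the request path.
func (d *Dispatcher) Enqueue(req LogRequest) error {
	if d.writeTimeout <= 0 {
		select {
		case d.workQueue <- req:
			return nil
		default:
			return errors.New("log buffer full, dropping log")
		}
	}
	select {
	case d.workQueue <- req:
		return nil
	case <-time.After(d.writeTimeout):
		return errors.New("timed out waiting for log buffer, dropping log")
	}
}

// Close stops accepting new work and waits for the workers to drain the queue.
func (d *Dispatcher) Close() {
	close(d.workQueue)
	d.wg.Wait()
}
```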

This PR also adds a benchmark for profiling the executor's behaviour. It was used to confirm that, before this change, the number of goroutines grew linearly with the number of requests, and that those goroutines never finished while request logging was slow to process. With the new implementation, the number of goroutines allocated for logging stays steady regardless of request volume.
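
The benchmark itself lives in the PR; purely as an illustration, building on the Dispatcher sketch above and using runtime.NumGoroutine() to observe goroutine growth, it could look something like this:

```go
package logger

import (
	"runtime"
	"testing"
	"time"
)

// BenchmarkDispatcher simulates a spike of predictions against a slow
// downstream logger and reports the goroutine count, which should stay
// roughly constant with the channel-buffered implementation.
func BenchmarkDispatcher(b *testing.B) {
	// Simulate a slow downstream request logger.
	slowHandler := func(LogRequest) { time.Sleep(10 * time.Millisecond) }
	// Timeout of 0 means logs are dropped immediately on a full buffer.
	d := NewDispatcher(16, 10000, 0, slowHandler)
	defer d.Close()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Dropped logs are acceptable here; the point is that the goroutine
		// count stays flat even when the buffer fills up.
		_ = d.Enqueue(LogRequest{Payload: []byte(`{"data": "x"}`)})
	}
	// With the channel-buffered pool this stays roughly constant;
	// a goroutine-per-request design grows linearly with b.N.
	b.ReportMetric(float64(runtime.NumGoroutine()), "goroutines")
}
```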

We will need to expose the following flags in the operator so that they can be configured from there:

  • logger_workers
  • log_work_buffer_size
  • log_write_timeout_ms

Which issue(s) this PR fixes:

Fixes #3726

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

michaelcheah (Contributor) left a comment:

lgtm

Comment on lines +74 to +75
logWorkBufferSize = flag.Int("log_work_buffer_size", 10000, "Limit of buffered logs in memory while waiting for downstream request ingestion")
logWriteTimeoutMs = flag.Int("log_write_timeout_ms", 2000, "Timeout before giving up writing log if buffer is full. If <= 0 will immediately drop log on full log buffer.")

nit: Can this not use the default values in executor/logger/collector.go?
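
For illustration, a sketch of what that suggestion might look like, assuming exported defaults are added to the logger package (the constant names below are hypothetical, not necessarily what executor/logger/collector.go defines):

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical exported defaults that could live in executor/logger/collector.go
// so the flag defaults and the logger package never drift apart.
const (
	DefaultLogWorkBufferSize = 10000
	DefaultLogWriteTimeoutMs = 2000
)

var (
	logWorkBufferSize = flag.Int("log_work_buffer_size", DefaultLogWorkBufferSize,
		"Limit of buffered logs in memory while waiting for downstream request ingestion")
	logWriteTimeoutMs = flag.Int("log_write_timeout_ms", DefaultLogWriteTimeoutMs,
		"Timeout before giving up writing log if buffer is full. If <= 0 will immediately drop log on full log buffer.")
)

func main() {
	flag.Parse()
	fmt.Println("buffer:", *logWorkBufferSize, "timeout(ms):", *logWriteTimeoutMs)
}
```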

seldondev (Collaborator):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: michaelcheah
To complete the pull request process, please assign ivan-valkov
You can assign the PR to them by writing /assign @ivan-valkov in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ukclivecox (Contributor):

@ivan-valkov Great. Will open another PR to add Helm chart parameters.

ukclivecox merged commit eaf7a22 into SeldonIO:master on Nov 13, 2021
stephen37 pushed a commit to stephen37/seldon-core that referenced this pull request Dec 21, 2021
* WIP benchmark

* better benchmark with pprof

* fix wip

* cleanup
Successfully merging this pull request may close these issues.

OOM when stress test the Seldon model, which may be caused by the logging of request and response payloads
4 participants