orca: server side custom metrics implementation #5531
Conversation
orca/service_impl.go
Outdated
return &serviceImpl{minReportingInterval: opts.MinReportingInterval}
// Register creates a new ORCA service implementation configured using the
// provided options and registers the same on the provided service registrar.
func Register(s grpc.ServiceRegistrar, opts ServerOptions) (*Server, error) {
Delete and instead document how to use NewServer / have an example in a _test.go file?
Done. Got rid of the NewService() function. In the process, also figured out that I was not actually checking if the interval received in the StreamCoreMetrics RPC was less than the configured minimum. Added that as well.
orca/service_impl.go
Outdated
interval := req.GetReportInterval()
if err := interval.CheckValid(); err != nil {
Why do this extra check?
From what I can tell, there's some definition of "normal" in a time.Duration, but the library deals with it in a pretty sane manner. Why not just use what it figures out?
The AsDuration method performs the conversion on a best-effort basis. Durations with denormal values (e.g., nanoseconds beyond -99999999 and +99999999, inclusive; or seconds and nanoseconds with opposite signs) are normalized during the conversion to a time.Duration. To manually check for invalid Duration per the documented limitations in duration.proto, additionally call the CheckValid method:
We receive this value on the wire, and I believe it could carry an invalid duration value, because durationpb.New() does not do any kind of validation: https://github.com/protocolbuffers/protobuf-go/blob/v1.28.1/types/known/durationpb/duration.pb.go#L167
The duration.AsDuration() implementation (https://github.com/protocolbuffers/protobuf-go/blob/v1.28.1/types/known/durationpb/duration.pb.go#L176) returns minInt or maxInt in the case of an overflow.
If we use duration.CheckValid(), we will be able to return a meaningful error to the user and still continue to use the configured minimum.
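In code, the fallback described here would look roughly like the sketch below; the function name, logging, and exact fallback behavior are assumptions for illustration, not the PR's actual code.

package orcasketch // hypothetical package, for illustration only

import (
	"log"
	"time"

	"google.golang.org/protobuf/types/known/durationpb"
)

// reportInterval validates the wire duration with CheckValid and falls back
// to the configured minimum when it is invalid or too small.
func reportInterval(requested *durationpb.Duration, configuredMin time.Duration) time.Duration {
	if err := requested.CheckValid(); err != nil {
		// Invalid per duration.proto (nil, out-of-range nanos, overflow, ...):
		// surface a meaningful error and keep using the configured minimum.
		log.Printf("invalid report_interval %v: %v; using %v", requested, err, configuredMin)
		return configuredMin
	}
	if d := requested.AsDuration(); d >= configuredMin {
		return d
	}
	return configuredMin
}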
Why do you see this extra check as problematic? Do you just see it as a wasted effort?
It's problematic in my mind because it enforces an unnecessary constraint on the system. Namely, if you have >99999999 in the ns field, it's converted to a pretty reasonable number by the library, but then this check considers it to be invalid anyway.
We already reject it if it's too small. How about:
dur := interval.AsDuration()
if err := interval.CheckValid(); err != nil && dur > 30*time.Minute {
	// Probably not what the user intended.
	logger.Warning(...)
	return s.minReportingInterval
}
Also, do we really want the minimum interval in this case? Why is the default the same as the minimum?
GitHub is so messed up. None of the review comments show up along with the source code and I'm not able to reply to them all at once. I have taken care of all comments except this one.
I don't quite understand what you are suggesting here though. Why should we check for dur > 30*time.Minute?
There are multiple minimums:
- There is an absolute minimum, which is 30s, and this is what we call the default.
- There is a minimum enforced by the user when they create the service. This is ServiceOptions.MinReportingInterval, and it needs to be greater than or equal to the absolute minimum of 30s. If it is less than the absolute minimum, we end up using the absolute minimum for this (see the sketch after this list).
- Then, the incoming RPC can specify an interval, and this needs to be greater than the configured service minimum. If not, we use the service minimum.
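Bullets one and two amount to clamping the user-configured value at construction time, roughly like the sketch below (the constant and function names are assumed, not taken from the PR).

package orcasketch // hypothetical package, for illustration only

import "time"

// effectiveMinInterval clamps the user-configured minimum reporting interval
// to the 30s absolute minimum described above.
func effectiveMinInterval(configured time.Duration) time.Duration {
	const absoluteMin = 30 * time.Second
	if configured < absoluteMin {
		return absoluteMin
	}
	return configured
}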
Simplified it to match Java's implementation, where the only things that are checked are (see the sketch after this list):
- whether it is specified or not
- whether it is negative
- whether it is less than the configured minimum.
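A sketch of that simplified per-RPC check, replacing the CheckValid-based version sketched earlier (names are assumed; the PR's code may differ in details):

package orcasketch // hypothetical package, for illustration only

import (
	"time"

	"google.golang.org/protobuf/types/known/durationpb"
)

// chosenInterval applies the three checks listed above to the interval
// requested by the client.
func chosenInterval(requested *durationpb.Duration, configuredMin time.Duration) time.Duration {
	if requested == nil {
		// Not specified by the client.
		return configuredMin
	}
	d := requested.AsDuration()
	if d < 0 {
		// Negative.
		return configuredMin
	}
	if d < configuredMin {
		// Below the configured minimum.
		return configuredMin
	}
	return d
}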
orca/service.go
Outdated
// utilization metric.
func (s *Service) DeleteCPUUtilization() {
	s.mu.Lock()
	s.cpu = 0
Is there no difference between zero CPU utilization and deleted (presumably "not reporting")? If not, do we need this API?
CPU and memory are special because they have dedicated fields in the ORCA load report proto. I initially had the delete API for them, just to be in sync with Java. But I think it does not make much sense to have these delete APIs, for two reasons:
- they can always explicitly set it to 0.
- actual applications which are setting CPU and memory will hardly have a reason to delete them.
We can always add it if there is demand for it.
The other issue of not reporting when they are not set is not something that I'm doing right now. Maybe it will save a few bytes if done. Do you think it is worth doing?
The other issue of not reporting when they are not set is not something that I'm doing right now. Maybe it will save a few bytes if done. Do you think it is worth doing?
I'm unsure of the semantic differences between "unset" and "set to zero". If the protocol supports both (optional field vs. always present) then do we need to model that difference in our API as well? Did Java?
orca/orca.go
Outdated
// It is safe to access the underlying metric recorder inside the wrapper at
// this point, as the user's RPC handler is done executing, and therefore
// there will be no more calls to CallMetricRecorderFromContext().
rw, ok := ctxWithRecorder.Value(callMetricRecorderCtxKey{}).(*recorderWrapper)
It's a little awkward that we have to find this in the context considering we just stored it there.
Hmm ... it is. But is there another option?
I didn't want to prescribe one, and I also didn't want to suggest it since you just changed the function signature the other way, but...:
rw := newRecorderWrapper()
ctxWithRecorder := contextWithRecorder(ctx, rw)
...
// use rw directly
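Spelled out a bit more, the interceptor would keep the wrapper it created and use it after the handler returns, roughly as in the sketch below (stand-in types and names; not the PR's actual code).

package orcasketch // hypothetical package, for illustration only

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// recorderWrapper is a stand-in for the PR's recorder type.
type recorderWrapper struct{}

// trailer is a placeholder for serializing the recorded metrics.
func (r *recorderWrapper) trailer() metadata.MD { return nil }

type recorderKey struct{}

func unaryInt(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	rw := &recorderWrapper{}
	ctx = context.WithValue(ctx, recorderKey{}, rw) // handlers can still look it up
	resp, err := handler(ctx, req)

	// The handler has finished, so use rw directly rather than re-fetching it
	// from the context with a type assertion.
	if md := rw.trailer(); len(md) > 0 {
		grpc.SetTrailer(ctx, md)
	}
	return resp, err
}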
Done. Thanks.
orca/service.go
Outdated
@@ -90,16 +108,9 @@ func Register(s grpc.ServiceRegistrar, opts ServiceOptions) (*Service, error) {
	return nil, fmt.Errorf("concrete type of provided grpc.ServiceRegistrar is %T, only supported type is %T", s, &grpc.Server{})
Can we start with the API taking a *grpc.Server and change it in the future to a grpc.ServiceRegistrar when we can support it?
Done.
Based on our offline discussion, did not add the delete API back. So, things stay as-is for now. PTAL. Thanks.
Also changed the interceptor implementation to not pull the recorder out of the context, but use it directly.
Implementation of the server API as defined in https://github.com/grpc/proposal/blob/master/A51-custom-backend-metrics.md.
Fixes #5474.
RELEASE NOTES: n/a