emit processed bytes metric #10407
Conversation
@jihoonson The proposal looks great. I did this PR because bytes ingested is not available for streaming or batch tasks. I see your proposal only includes metrics for batch tasks; we can probably do another PR for emitting bytes ingested for Kafka/Kinesis directly. I believe the bytes ingested reported through your proposed changes will cover all types of tasks.
Yeah, it sounds reasonable to me to add bytes ingested for both streaming and batch. My proposal only talks about metrics for batch, but I have also been thinking about the metrics system for ingestion in general. Currently, both batch and streaming ingestion use the same metrics.
Yes, correct. More precisely, most metrics, including both the bytes read and the bytes written, will be available for each individual phase (determining partitions, indexing, etc.) as well as in the overall metrics across all phases.
I added this class with the vision that more metrics regarding ingestion can be added in the future, as this class is available at the task and InputSource/InputEntity level as well.
@jihoonson are you actively working on your proposal? Do you think you can reuse the code from this PR?
@pjain1 sorry, I forgot about this PR. I could reuse it, but I'm still not sure why these metrics are in a separate class vs having all metrics in one place. Are you thinking of a case where you want to selectively disable the new metric? If so, when would you want to do that? Even in that case, I would rather think about another way to selectively enable/disable metrics instead of having each metric in a different class. In the current implementation (before this PR), what doesn't make sense to me is sharing the same metrics between batch and realtime tasks, because what we want to see for them will be pretty different even though some metrics can be shared. So, IMO, it will probably be best to add new classes, each of which has all the metrics for batch and realtime tasks, respectively.
As long as we get the metrics about how many raw bytes are processed from the source (including scans for determining shard specs), I think I am OK with any approach you follow. It doesn't necessarily have to be the code from this PR. I am already using this code internally, so I thought if it were reused there would be fewer conflicts, but it's totally up to you. Thanks.
Apologies, I accidentally clicked the button, which published my previous comment incomplete. I've updated it now.
Makes sense, so as long as #10407 (comment) is satisfied, things seem good to me.
@pjain1 thanks. Yes, per-phase metrics and total metrics will be available for raw input bytes. Other than the issue we have talked about, this PR makes sense to me. I don't think my proposal necessarily blocks this PR or vice versa; I just wanted to make sure what design is best for us. I can probably review this PR this week.
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
@pjain1 I have started taking a look at this PR. Deepest apologies that it has not been reviewed yet. Any chance you can help resolve the conflicts?
This issue is no longer marked as stale. |
Hey @somu-imply, I can look into resolving the conflicts over the weekend. However, @jihoonson was working on something similar IIRC, so I'm not sure what its status is and whether this is needed anymore.
Closing this, as #13520 is already merged. Thanks a lot for the work on this, @somu-imply and @pjain1!
Description
Currently there is no way to know how much data is processed by a task during ingestion. This PR adds an `ingest/events/processedBytes` metric that emits the number of bytes read since the last emission time.

This PR adds an `InputStats` class, which is present in all task types and acts as a holder for task-level counts, such as processed bytes in this case. Standardized metrics across task types can thus be added in the future and emitted using `InputStatsMonitor`, which is automatically initialized for all tasks.

This PR provides a convenient wrapper class named `CountableInputEntity`, which can wrap any `InputEntity` to count the number of bytes processed through that `InputEntity`. This makes it easy for new implementations to emit this metric, simply by wrapping the base input entity in this class when creating an `InputEntityIteratingReader`.

Since Kafka and Kinesis do not use `InputEntity`, processed bytes are incremented directly in `SeekableStreamIndexTaskRunner`, as it has access to `InputStats`.

This does not support Firehoses.
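To illustrate the wrap-and-count idea behind `CountableInputEntity`, here is a minimal, self-contained Java sketch. It does not reproduce Druid's actual `InputEntity`/`InputStats` interfaces; instead it uses a hypothetical `SimpleInputStats` holder and a plain `java.io.FilterInputStream` wrapper to show the same technique: every byte read through the wrapper increments a shared counter, which a monitor could later emit as a delta.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for Druid's InputStats: a thread-safe holder for processed-byte counts.
class SimpleInputStats {
    private final AtomicLong processedBytes = new AtomicLong();

    void incrementProcessedBytes(long n) { processedBytes.addAndGet(n); }

    long getProcessedBytes() { return processedBytes.get(); }
}

// Generic analogue of CountableInputEntity: wraps any InputStream and counts bytes as they are read.
class CountingInputStream extends FilterInputStream {
    private final SimpleInputStats stats;

    CountingInputStream(InputStream in, SimpleInputStats stats) {
        super(in);
        this.stats = stats;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) stats.incrementProcessedBytes(1); // one byte consumed
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) stats.incrementProcessedBytes(n); // n bytes consumed
        return n;
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        SimpleInputStats stats = new SimpleInputStats();
        byte[] data = "some raw input rows".getBytes();
        try (InputStream in = new CountingInputStream(new ByteArrayInputStream(data), stats)) {
            while (in.read() != -1) { /* consume, as a reader would */ }
        }
        System.out.println(stats.getProcessedBytes()); // prints 19
    }
}
```

Because the counting happens transparently inside the wrapper, a reader built on top of it needs no changes; the same decorator pattern is what lets the PR add byte counting without touching each `InputEntity` implementation.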
Key changed/added classes in this PR
- `InputStats`
- `InputStatsMonitor`
- `CountableInputEntity`
- `AbstractBatchIndexTask`
- `SeekableStreamIndexTask`