Collect metrics from TF Events files #173
Comments
I cannot figure out a better solution than a PVC. We need help from the community.
Can you explain the issue? Is this just a question of making the events file accessible by two processes e.g.
Using a PVC to share the TF.Events file seems perfectly reasonable. We can also support object stores (S3, HDFS, GCS), since TF can read and write those directly.
I heard @gaocegege will open a WIP PR for TFJob support within the next week. We can implement the metrics collector for TF Events independently of TFJob support.
@YujiOshima How will this be compatible with PyTorch jobs?
@johnugeorge I will parse a tf.Event file with event.proto.
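For readers following along, here is a minimal sketch of the first half of that parsing step: splitting an events file into its raw records. It assumes the standard TFRecord framing (a little-endian uint64 length, a 4-byte masked CRC of the length, the payload, then a 4-byte masked CRC of the payload). The function name `iter_event_records` is hypothetical, and the CRC fields are skipped rather than verified because CRC32C is not in the Python standard library.

```python
import struct
from typing import BinaryIO, Iterator


def iter_event_records(stream: BinaryIO) -> Iterator[bytes]:
    """Yield the raw serialized payload of each TFRecord-framed record.

    Hypothetical helper: the CRC fields are read but not verified.
    """
    while True:
        # Header: uint64 payload length + uint32 masked CRC of that length.
        header = stream.read(12)
        if len(header) < 12:
            return  # end of file, or a header the writer has not finished
        (length,) = struct.unpack("<Q", header[:8])
        payload = stream.read(length)
        footer = stream.read(4)  # uint32 masked CRC of the payload (unchecked)
        if len(payload) < length or len(footer) < 4:
            return  # partial record at the tail of a file still being written
        yield payload
```

Each payload is a serialized `Event` message as defined in event.proto. With TensorFlow installed, it can probably be decoded with `event_pb2.Event.FromString(payload)` from `tensorflow.core.util`, after which scalar metrics appear as `tag` and `simple_value` pairs under `event.summary.value`. Stopping quietly on a partial tail record is a design choice that suits a collector polling a file while training is still running.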
@YujiOshima I am not aware of an official equivalent in PyTorch. What are the other options to support it?
@johnugeorge I think there are two ways.
The problem is that we are forcing the workers to use a particular library or a particular format. Currently, I think this is the only way.
@johnugeorge I don't think we are forcing users to do things a particular way; the idea is to make Katib pluggable with respect to how metrics are collected. Support for TF.Events is just one of the methods that we want to be well supported in order to support TF,
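A rough sketch of what "pluggable" could look like, purely as an illustration: collectors register themselves under a kind, and the caller picks one by name. Every name here (`register_collector`, `collect`, the `"stdout"` kind, the `Metric` tuple shape) is hypothetical and not Katib's actual API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical record shape: (metric name, step, value).
Metric = Tuple[str, int, float]
Collector = Callable[[str], List[Metric]]

_COLLECTORS: Dict[str, Collector] = {}


def register_collector(kind: str) -> Callable[[Collector], Collector]:
    """Register a collector function under a kind such as 'stdout'."""
    def decorator(fn: Collector) -> Collector:
        _COLLECTORS[kind] = fn
        return fn
    return decorator


@register_collector("stdout")
def collect_from_stdout(source: str) -> List[Metric]:
    """Parse 'name=value' lines from log text; the step is the line index."""
    metrics: List[Metric] = []
    for step, line in enumerate(source.splitlines()):
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a metric line
        try:
            metrics.append((name.strip(), step, float(value)))
        except ValueError:
            continue  # value was not numeric
    return metrics


def collect(kind: str, source: str) -> List[Metric]:
    """Dispatch to whichever collector was registered for this kind."""
    return _COLLECTORS[kind](source)
```

Under this shape, a TF.Events collector would be one more registered function next to the log-based one, so a PyTorch job that only prints metrics would not need to write event files.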
The TF Event metrics collector was added by #235.
@YujiOshima: Closing this issue.
Relevant issues
#87 Study Job CRD; Don't require users to write code to do HP tuning.
#39 support TFJob and other frameworks
We'd like to be able to collect metrics from a TF.Events file produced by a TensorFlow training job.
So at a high level what we need is