Parameter Server: Run TF server by default #36
Conversation
Thanks for taking this on!
images/default-ps/Dockerfile
Outdated
@@ -0,0 +1,5 @@
ARG BASE_IMAGE=tensorflow/tensorflow
I didn't realize we'd have to build our own Docker images. I thought grpc_tensorflow_server.py was already included in the standard TensorFlow image. Might be worth filing an FR for that.
images/default-ps/main.py
Outdated
@@ -0,0 +1,22 @@
# A very simple parameter server that joins the server defined by the cluster spec passed as environment variable
Can we use grpc_tensorflow_server.py (just make a copy)? Ideally this binary is eventually part of the TensorFlow Docker image. So the less divergence from the repo the better.
images/default-ps/build_and_push.sh
Outdated
@@ -0,0 +1,22 @@
#!/bin/bash
Can we avoid creating additional Docker images? What if we create a ConfigMap with grpc_tensorflow_server.py (or app.py) and then mount that into the PS container? I think that would allow us to use a stock TensorFlow image.
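To make the suggestion concrete, here is a hedged sketch of what that could look like. All names here are illustrative, not the PR's actual manifests:

```yaml
# Hypothetical ConfigMap carrying the PS server script.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tf-ps-server
data:
  grpc_tensorflow_server.py: |
    # ...contents of grpc_tensorflow_server.py...
---
# Corresponding fragment of the PS pod template: mount the ConfigMap
# into a stock TensorFlow image and point the command at the script.
apiVersion: v1
kind: Pod
metadata:
  name: ps-example
spec:
  volumes:
  - name: ps-server
    configMap:
      name: tf-ps-server
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow
    volumeMounts:
    - name: ps-server
      mountPath: /ps-server
    command: ["python", "/ps-server/grpc_tensorflow_server.py"]
```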
pkg/spec/tf_job.go
Outdated
@@ -91,6 +92,8 @@ type TfReplicaSpec struct {
// TfPort is the port to use for TF services.
TfPort *int32 `json:"tfPort,omitempty" protobuf:"varint,1,opt,name=tfPort"`
TfReplicaType `json:"tfReplicaType"`
//TfVersion is only used when TfReplicaType == PS to automatically start a PS server
TfVersion string `json:"tfVersion,omitempty"`
Can a TfJobSpec just have an optional TfImage? This would be a Docker image URI for a container with default binaries. Right now we have two: "tensorboard" and "grpc_std_server".
Now that I'm looking at the code, I think it's better if we can avoid adding a layer of indirection in the form of a mapping from TF versions to Docker images.
@jlewi We can use https://hub.docker.com/r/tensorflow/tf_grpc_server/ instead of building our own images, but there are basically only two versions available.
@wbuchwalter I don't think so. I think those images are outdated and unlikely to be compatible with the latest versions of TF. I don't think TF provides a lot of cross-version guarantees. So ideally all Docker images in the job are derived from the same TensorFlow Docker image so they are using the same version of TF.
@jlewi I implemented your suggestion and it's now using a ConfigMap. Let me know if you are okay with this or if you think this is too brittle.
@wbuchwalter Instead of doing the JSON-to-string conversion in Python, can we do it in the controller? E.g., in the call to setDefaults you could just convert the spec to the format expected and pass it as an argument to the Python program.
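For illustration, the conversion under discussion amounts to flattening a map of job name to host list into the flat string format grpc_tensorflow_server takes for --cluster_spec ("name|host:port;host:port,name|..."). A minimal Python sketch, with the helper name and sample hosts made up:

```python
# Sketch: flatten a cluster spec (job name -> list of "host:port")
# into the string accepted by grpc_tensorflow_server's --cluster_spec.
def cluster_spec_to_string(cluster_spec):
    jobs = []
    for job_name in sorted(cluster_spec):  # sorted for a stable result
        jobs.append(job_name + "|" + ";".join(cluster_spec[job_name]))
    return ",".join(jobs)

spec = {
    "ps": ["ps-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"],
}
print(cluster_spec_to_string(spec))
# ps|ps-0:2222,worker|worker-0:2222;worker-1:2222
```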
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Python-based TensorFlow GRPC server.
Add a TODO indicating that we'd eventually like to get rid of this once grpc_tensorflow_server is included in the TF container.
Done.
@jlewi Since the
@wbuchwalter You're right, thanks for reminding me. It seems like there are two approaches.
There are a couple of things I don't like about doing it in Create.
So I think my preference would be to figure out how to make cluster_spec available in setDefaults. One solution would be to do two passes over the replicas in setDefaults. The first pass would figure out the port to use for each process. Then in the second pass, we can compute cluster_spec and use that to fill out Command.
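A rough sketch of that two-pass idea, in Python for brevity (the operator itself is Go, and all field names here are simplified stand-ins for the real spec):

```python
# Illustrative two-pass defaulting: pass 1 fixes ports, pass 2 uses the
# now-complete port assignments to build cluster_spec and fill in the
# PS command. Field names are hypothetical, not the operator's API.
def set_defaults(replicas, default_port=2222):
    # Pass 1: every replica gets a port before anything depends on it.
    for r in replicas:
        r.setdefault("tfPort", default_port)

    # Pass 2: ports are known, so cluster_spec can be computed...
    cluster_spec = {
        r["type"]: ["%s-%d:%d" % (r["type"], i, r["tfPort"])
                    for i in range(r["replicas"])]
        for r in replicas
    }
    # ...and used to fill out the default PS command.
    for r in replicas:
        if r["type"] == "ps" and "command" not in r:
            r["command"] = ["python", "grpc_tensorflow_server.py",
                            "--cluster_spec", str(cluster_spec)]
    return replicas

replicas = set_defaults([
    {"type": "ps", "replicas": 1},
    {"type": "worker", "replicas": 2},
])
print(replicas[0]["command"][:2])
# ['python', 'grpc_tensorflow_server.py']
```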
Sorry for the delay responding to this, currently travelling. Another option would be to be able to specify a
Unless you have other features in mind that could benefit from this mechanism, I think the current approach (doing the transform in Python at runtime) is still cleaner, albeit not ideal. What do you think? Do you see another way of doing what you said that I missed?
Thanks for the explanation (I should have thought of that). In that case, I think it's fine to do the transformation in Create and set the command appropriately. I'd prefer not to introduce extra Python code. I'd also prefer not to use PodLabels as a way of signaling to Create that it needs to set the command. Instead, can we set a private variable on either TfReplicaSet or TfJob to indicate that Create needs to set the command appropriately?
@jlewi Could you give it another look?
Thanks. Only one major comment regarding whether we should create a ConfigMap for each job.
pkg/controller/controller.go
Outdated
return watchVersion, nil
}

//Create a ConfigMap containing the source for a simple grpc server (pkg/controller/grpc_tensorflow_server.py)
Space after "//"
pkg/controller/controller.go
Outdated
func (c *Controller) createPSConfigMap() error {
//If a ConfigMap with the same name already exists, it was created by an earlier operator
//we delete and recreate it in case the grpc_tensorflow_server.py was updated in the meantime
cm, err := c.KubeCli.CoreV1().ConfigMaps(c.Namespace).Get(spec.PSConfigMapName(), metav1.GetOptions{})
Will this cause problems? Suppose a job is happily running using the supplied config map. Now suppose the operator gets restarted. The operator would then try to delete and recreate the ConfigMap which seems problematic since some jobs could already be mounting it.
Here are some options I can think of for avoiding this.
- Create a unique ConfigMap for every job
- Do not delete the ConfigMap if it already exists; just reuse it.
I favor the first option because the ConfigMap is just a temporary solution until the TensorFlow container has the gRPC server binary built in. I think ConfigMap per job makes it more robust and I think this is better than the slight efficiency we get by reusing ConfigMaps.
Good catch.
1 ConfigMap per job would also mean specifying the volumes configuration for the default PS in the create phase (probably where I specify the Command) because the name of the ConfigMap won't be known in advance anymore and will rely on the RuntimeId.
Are you ok with that?
SGTM.
pkg/spec/tf_job.go
Outdated
@@ -91,6 +96,10 @@ type TfReplicaSpec struct {
// TfPort is the port to use for TF services.
TfPort *int32 `json:"tfPort,omitempty" protobuf:"varint,1,opt,name=tfPort"`
TfReplicaType `json:"tfReplicaType"`
//TfImage is only used when TfReplicaType == PS to automatically start a PS server
Nit: Space after "//" here and for other comments please.
pkg/spec/tf_job.go
Outdated
@@ -75,7 +79,8 @@ const (
type ContainerName string

const (
TENSORFLOW ContainerName = "tensorflow"
TENSORFLOW ContainerName = "tensorflow"
PsDefaultImage = "wbuchwalter/mlkube-tensorflow-ps"
Can we default to a TensorFlow image now? Can we make this a flag to make it easily customizable?
Sorry, this is something I forgot to remove from my first implementation (with custom images).
Now, if you want to use the default PS, you would declare a template like this:
- replicas: 2
  tfReplicaType: PS
  tfImage: tensorflow/tensorflow:1.3.0
So it's already easily customizable.
I chose not to set any default for tfImage because in case of a version mismatch between the PS and the other nodes it might be quite hard for users to understand where the issue comes from.
Let me know if you would rather have a default.
So tfImage is specified at the Replica level? Why not make it a property of the job instead? This way we only have to specify it once and we can reuse it for the TensorBoard replica and the PS. If you want to specify a custom image at the Replica level you can already do that by filling out a PodTemplateSpec although it would be more verbose.
pkg/trainer/replicas.go
Outdated
@@ -88,6 +90,32 @@ func (s *TFReplicaSet) Labels() KubernetesLabels {
"runtime_id": s.Job.job.Spec.RuntimeId})
}

// Transforms the tfconfig to work with grpc_tensorflow_server
func transformClusterSpecForDefaultPS(clusterSpec ClusterSpec) string {
Can you write a unittest for this please?
pkg/trainer/replicas.go
Outdated
// We do the appropriate transformations here
cs := transformClusterSpecForDefaultPS(s.Job.ClusterSpec())
s.Spec.Template.Spec.Containers[0].Command = []string{"python", "/ps-server/grpc_tensorflow_server.py"}
s.Spec.Template.Spec.Containers[0].Args = []string{"--cluster_spec", cs, "--job_name", "ps", "--task_id", fmt.Sprintf("%v", index)}
Why did you split it into command and args rather than just using command?
No good reason, I will change this.
pkg/controller/controller.go
Outdated
//grab server sources from files
filePaths := map[string]string{
"grpc_tensorflow_server.py": "./grpc_tensorflow_server/grpc_tensorflow_server.py",
Can we make the path of grpc_tensorflow_server.py a flag that gets plumbed through?
Do you plan on updating the Dockerfile for the controller to include grpc_tensorflow_server.py?
To make sure I get this right: do you want this to be a flag at the controller level, so users could specify a custom grpc_tensorflow_server.py, or do you want a flag at the job level to customize where this file gets mounted in the pod?
I was thinking at the controller level, not the job level. Right now the Docker image and the controller both have to hardcode the same location for grpc_tensorflow_server.py. That seems brittle.
If you make it a flag in main.go that gets plumbed through, then the Docker image can put grpc_tensorflow_server.py anywhere in the image and we can just specify the location as an argument to main.py.
I think I responded to all your questions, but let me know if I missed something.
Hopefully it's starting to look better this time 😬
Thanks for your help getting this right.
Sorry for the slow reply.
Seems reasonable to me.
Seems reasonable as well.
@jlewi Ready for review. I also made two changes to the
Thanks.
Please consider filing an issue to add an E2E test to cover the case where we use default PS.
images/tf_operator/build_and_push.py
Outdated
@@ -51,6 +53,8 @@ def run(command, cwd=None):
help="Use Google Container Builder to build the image.")
parser.add_argument("--no-gcb", dest="use_gcb", action="store_false",
help="Use Docker to build the image.")
parser.add_argument("--no-push", dest="should_push", action="store_false",
help="Push the image once build is finished.")
Don't you also need to define a "--push" argument with action "store_true", and then do
parser.set_defaults(should_push=True)?
The help string for --no-push also looks incorrect.
Can you also pull in the latest changes that I just committed and resolve the conflict?
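For reference, a small self-contained sketch of the paired-flag pattern being suggested. The flag names come from the diff; the pairing with set_defaults is the standard argparse idiom, not code from this PR:

```python
import argparse

# --push / --no-push write to the same destination; set_defaults makes
# pushing the default, so --no-push is a true opt-out.
parser = argparse.ArgumentParser()
parser.add_argument("--push", dest="should_push", action="store_true",
                    help="Push the image once the build is finished.")
parser.add_argument("--no-push", dest="should_push", action="store_false",
                    help="Do not push the image after building.")
parser.set_defaults(should_push=True)

print(parser.parse_args([]).should_push)             # True
print(parser.parse_args(["--no-push"]).should_push)  # False
```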
@@ -147,7 +148,7 @@ func (c *Controller) handleTfJobEvent(event *Event) error {
//NewJob(kubeCli kubernetes.Interface, job spec.TfJob, stopC <-chan struct{}, wg *sync.WaitGroup)

c.stopChMap[clus.Metadata.Name] = stopC
c.jobs[clus.Metadata.Namespace + "-" + clus.Metadata.Name] = nc
c.jobs[clus.Metadata.Namespace+"-"+clus.Metadata.Name] = nc
nit: I think there should be spaces after the plus signs. Maybe run gofmt?
If I add the spaces back and run gofmt, it removes them.
I am running Go 1.9.1; could you have a different Go version with different gofmt rules?
pkg/spec/controller.go
Outdated
@@ -5,6 +5,9 @@ type ControllerConfig struct { | |||
// This should match the value specified as a container limit. | |||
// e.g. alpha.kubernetes.io/nvidia-gpu | |||
Accelerators map[string]AcceleratorConfig | |||
|
|||
// Path to the file containing the grpc server sources |
Nit "sources" -> "source"
pkg/spec/tf_job.go
Outdated
@@ -75,7 +79,8 @@ const ( | |||
type ContainerName string | |||
|
|||
const ( | |||
TENSORFLOW ContainerName = "tensorflow" | |||
TENSORFLOW ContainerName = "tensorflow" | |||
DefaultTFImage = "tensorflow/tensorflow:latest" |
Should we pin the default image to a particular version of TF, e.g. 1.2, 1.3, etc.? We can upgrade the default when we release a new TfJob CRD.
My concern is that TensorFlow introduces a lot of breaking changes with each TF version. So if people are relying on the default TF version supplied by the CRD, their jobs could start breaking suddenly when TF releases a new version.
pkg/trainer/replicas.go
Outdated
var buf bytes.Buffer
isFirstJob := true
for _, k := range keys {
nit: I'd prefer this code to be written as a series of joins so we don't have to have statements dealing with the first character.
So something like:

jobs := []string{}
for _, jobType := range keys {
	hosts := []string{}
	for _, h := range clusterSpec[jobType] {
		hosts = append(hosts, h)
	}
	s := jobType + "|" + strings.Join(hosts, ";")
	jobs = append(jobs, s)
}
return strings.Join(jobs, ",")

The code might not generate the correct result, but hopefully it illustrates what I mean by using strings.Join.
pkg/trainer/replicas.go
Outdated
log.Errorf("Error building PS ConfigMap: %v", err) | ||
return err | ||
} | ||
_, err = s.ClientSet.CoreV1().ConfigMaps(NAMESPACE).Create(cm) |
Should we be using the namespace of the job rather than the constant NAMESPACE?
I think #39 added support for creating TfJobs in other namespaces.
It looks like all the other resources use s.Job.job.Metadata.Namespace for the namespace.
Can you delete the constant "NAMESPACE" in training.go, since it should then be unused?
pkg/trainer/replicas.go
Outdated
@@ -237,6 +333,16 @@ func (s *TFReplicaSet) Delete() error { | |||
} | |||
} | |||
|
|||
// If the ConfigMap for the default parameter server exists, we delete it | |||
_, err = s.ClientSet.CoreV1().ConfigMaps(NAMESPACE).Get(s.defaultPSConfigMapName(), meta_v1.GetOptions{}) |
Update NAMESPACE as mentioned above.
@jlewi Should be good for another round.
/test all
Reviewed 1 of 8 files at r1, 4 of 14 files at r4, 3 of 11 files at r5, 1 of 1 files at r6, 2 of 4 files at r7.
images/tf_operator/build_and_push.py, line 57 at r5: Great, thanks.
pkg/controller/controller.go, line 151 at r5: Thanks. I filed a PR to set up lint checks as part of our testing.
LGTM. I'm going to try to fix the test though before I merge it.
/test all
images/tf_operator/build_and_push.py
Outdated
@@ -112,6 +118,8 @@ def run(command, cwd=None): | |||
else: | |||
run(["docker", "build", "-t", image, context_dir]) | |||
logging.info("Built image: %s", image) | |||
|
|||
if args.should_push: |
The if args.should_push statement should be inside the else block. We only push the image if we aren't using GCB.
Fixed.
/retest
@wbuchwalter: you can't request testing unless you are a kubernetes member. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test |
/test all
Woo Hoo! |
First draft for #16.
- tfVersion: while we could default to latest or something else, I feel like it might be difficult to figure out where the issue comes from in case of a version mismatch between the PS and the workers.
- 1.1.0: do we want to add more releases?
I will need to change the repository where the default PS images are pushed before merging.