v1.1.1 (2021-08-03)
- Add job namespace to `tf_operator_jobs_*` counters (#1283, @alembiewski)
- feat: upgrade kubeflow common and volcano version (#1276, @shinytang6)
- Add task type annotation for pods when EnableGangScheduling is true. (#1268, @jiangkaihua)
- Fix invalid pointer when tfjob is deleted (#1285, @johnugeorge)
- fix get_logs pod_names type and iteration blocking (#1280, @Windfarer)
- fix calling custom_api.delete_namespaced_custom_object args error (#1281, @Windfarer)
- fix: Remove the dup comment tag (#1274, @gaocegege)
- Fix: Remove Github CD workflow (#1263, @PatrickXYS)
- Fix: the "follow" of TFJobClient.get_logs (#1254, @Windfarer)
- Update container image for v1.1.1 (#1328, @Jeffwan)
- add a specific version of tensorflow_datasets (#1305, @jazzsir)
- Remove vendor folder (#1288, @Jeffwan)
- add podgroups rule in cluster-role.yaml (#1272, @huone1)
- Use remote Kustomize build option in standalone installation instructions (#1266, @verult)
v1.1.0 (2021-03-24)
- feat: Remove k8s.io/kubernetes (#1235, @gaocegege)
- Migrate to public ECR (#1256, @PatrickXYS)
- feat: Add API Documentation WIP (#1249, @gaocegege)
- feat: Update developers guide and readme (#1244, @gaocegege)
- Move TF Operator e2e tests to AWS Prow (#1204, @ChanYiLin)
- crd definition support multiple evaluator (#1240, @oikomi)
- support multiple evaluators (#1239, @oikomi)
- feat: Change the message for running condition (#1230, @gaocegege)
- feat(server): Use apiextension client to check if crd exists (#1228, @gaocegege)
- checkCRDExists func return true when k8s cluster is not connected (#1207, @oikomi)
- feat: Add CD using GitHub Actions (#1196, @gaocegege)
- Migrate controller implementation to kubeflow/common fashion (#1171, @ChanYiLin)
- Support success policy for TFJob (#1165, @terrytangyuan)
- add distributed training example of using TF 2.1 Strategy API (#1164, @jazzsir)
- Set completion time when job exceed specified deadline. (#1150, @SimonCqk)
- Support ClusterSpec Propagation Feature in TF 1.14 (#1149, @zhujl1991)
- Add watch function for TFJob python Client API (#1122, @jinchihe)
- Enhance tfjobs sdk docs (#1114, @jinchihe)
- Generate TFJob Python SDK (#1103, @jinchihe)
- feat: Support pprof when monitoring is specified (#1102, @gaocegege)
- feat: Use kubeflow/common (#1088, @gaocegege)
- Add support for aarch64 (#1098, @MrXinWang)
- feat: Do not set TF_CONFIG for local training (#1080, @gaocegege)
- feat: Replace gometalinter with golangci-lint (#1081, @gaocegege)
- Add controller-name label for Pod and service (#1067, @hougangliu)
- Add qps and burst options (#1063, @ScorpioCPH)
- Avoid unnecessary update when tfjob is complete (#1051, @cheyang)
- set annotation automatically when EnableGangScheduling is set to true (#1032, @ChanYiLin)
- feat(pod): Support custom gang scheduler via CLI argument (#1050, @gaocegege)
- Fix kubeflow overlay (#1260, @PatrickXYS)
- fix: Do not validate evaluator (#1238, @gaocegege)
- fix: Remove default resync period (#1237, @gaocegege)
- fix: Observe the creation when failed to create the pod (#1236, @gaocegege)
- fix: Remove vendor cp command (#1232, @gaocegege)
- Fix completion time setting bug (#1226, @shaowei-su)
- feat(deploy): Add standalone deployment yaml (#1218, @gaocegege)
- Fix updateStatus no worker Crashoff (#1215, @kuikuikuizzZ)
- fix: Fix the log message (#1203, @gaocegege)
- Fix the typo (#1178, @pingsutw)
- Fix setup cluster issue and Pylint issue in CI tests (#1179, @jinchihe)
- Fix the link to run_e2e_workflow.py script (#1154, @terrytangyuan)
- Fix evaluator runconfig (#1146, @richardsliu)
- Fix sdk test issue that's caused by kubenertes Client bug. (#1143, @jinchihe)
- fix(controller): calculate satisfied with && instead of || (#1120, @GuoHaiqing)
- fix comment, add +optional flag to comment. (#1137, @EDGsheryl)
- fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured (#1118, @leileiwan)
- fix the reconcile flow (#1111, @ChanYiLin)
- Fix example Mnist With Summaries (#1073, @andreyvelich)
- fix bug: When executing `tf-operator.v1 -version`, GitSHA is always 'not provided' (#1046, @asdfsx)
- fix(UI): show correct namespace and name when deleting job through dashboard (#1044, @gbin10533)
- Minor fix to add CoreV1 to scheme (#1037, @johnugeorge)
- fix(docs): Fix link for simple_TFJob_test (#1038, @gaocegege)
- fix: Remove dup code (#1022, @gaocegege)
- tf-operator: Consolidate manifests (#1255, @yanniszark)
- TFJob Operator: Move manifests development upstream (#1247, @yanniszark)
- Update vendor as kubeflow/common is updated. (#1252, @jiangkaihua)
- docs: Add Ant Group to ADOPTERS.md (#1243, @terrytangyuan)
- chore: Add tencent cloud (#1234, @gaocegege)
- add vip (#1233, @oikomi)
- chore: Update changelog (#1227, @gaocegege)
- Update kubeflow common to 0.3.2 (#1225, @shaowei-su)
- chore: Remove useless expectation (#1217, @gaocegege)
- chore: Update codegen (#1211, @gaocegege)
- add Evaluator type for CRD example (#1209, @oikomi)
- add err log for create client set failed and code minor optimization (#1210, @oikomi)
- chore: Remove the kanban update workflow (#1201, @gaocegege)
- chore: Refactor cmd (#1199, @gaocegege)
- bugfix for multi_worker_strategy-with-keras.py (#1198, @jiaqianjing)
- Fix error when `conditions` is empty. (#1185, @Corea)
- b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language (#1190, @sculd)
- chore: Update OWNERS (#1177, @gaocegege)
- Update developer_guide.md (#1176, @pingsutw)
- Update swagger-codegen-cli URL (#1172, @jinchihe)
- Use go mod (#1144, @xychu)
- Make tf_operator use static compilation in container (#1160, @MrXinWang)
- Update tf_job_client.py remove unused variable. (#1157, @NikeNano)
- Update e2e_testing.md (#1155, @NikeNano)
- Disable istio sidecar injection in simple tfjob test (#1148, @Bobgy)
- OWNERS: Add ChanYiLin as approver (#1147, @ChanYiLin)
- Remove unused function arg (#1145, @zhujl1991)
- docs: Add roadmap (#1140, @gaocegege)
- simple_tfjob_tests py3 version (#1134, @gabrielwen)
- add tf-operator test in py3 (#1133, @gabrielwen)
- Distroless image for TF operator (#1124, @krishnadurai)
- SDK support getting the TFJob training logs (#1130, @jinchihe)
- Copy third party vendor source code to Docker image (#1128, @richardsliu)
- Add third party licenses (#1127, @richardsliu)
- remove tfjob dashboard (#1119, @ChanYiLin)
- Update checking status API name (#1117, @jinchihe)
- Add more APIs for TFJob done (#1116, @jinchihe)
- feat: Add adopters in README (#1092, @gaocegege)
- Support for ppc64le (#1082, @zoyun)
- use multi-stage build to build tf-operator image (#1072, @hmtai)
- add ppc64le support for the example dist-mnist (#1084, @alongzhi)
- add the dockerfile for ppc64le (#1083, @alongzhi)
- Updating issue bot configs (#1074, @rbrishabh)
- Delete v1beta2 api (#1075, @johnugeorge)
- add ldflag verion (#1052, @yeya24)
- Add verify-codegen in travis CI (#1070, @ohmystack)
- Set tfjob defaults in test utils (#1071, @ohmystack)
- Update codegen (#1069, @ohmystack)
- rewrite dockerfile (#1062, @hmtai)
- Renaming labels to common types (#1064, @johnugeorge)
- add total suffix in counter metrics (#1055, @yeya24)
- Update k8s libraries to 1.12.3 (#1054, @johnugeorge)
- add flag kubeconfig (#1049, @yeya24)
- Easily detect the GOPATH in current development environment. (#1047, @xauthulei)
- Update gang scheduler name (#1028, @goodluckbot)
- Set worker 0 completed if pod's phase goto succeeded (#1042, @ScorpioCPH)
- Removing unnecessary Rbac authorization (#1036, @johnugeorge)
- refactor: add GenPodGroupName method to extract podGroupName in diffe… (#1034, @zlcnju)
- update release script (#1040, @kunmingg)
- Update image base to UBI8 GA (#1023, @pdmack)
v1.0.1-rc.2 (2021-01-27)
Merged pull requests:
- Fix completion time setting bug #1226 (shaowei-su)
- Update kubeflow common to 0.3.2 #1225 (shaowei-su)
v1.0.1-rc.1 (2021-01-18)
Closed issues:
- checkCRDExists func return true when k8s cluster is not connected #1206
- How to install it without kubeflow #1195
- Pod get re-created after it exited and get garbage collected #1186
- Surface Pod and other Errors that Prevent TFJob from starting #1131
- Jobs failing when a node is preempted #999
Merged pull requests:
- feat(deploy): Add standalone deployment yaml #1218 (gaocegege)
- chore: Remove useless expectation #1217 (gaocegege)
- Fix updateStatus no worker Crashoff #1215 (kuikuikuizzZ)
- chore: Update codegen #1211 (gaocegege)
- add err log for create client set failed and code minor optimization #1210 (oikomi)
- add Evaluator type for CRD example #1209 (oikomi)
- checkCRDExists func return true when k8s cluster is not connected #1207 (oikomi)
- fix: Fix the log message #1203 (gaocegege)
v1.0.1-rc.0 (2020-12-22)
Closed issues:
- tf-operator panic without worker role #1192
- TFJob completion with active services/endpoints resources #1191
- Having trouble viewing logs using Kubernetes dashboard #1189
- [feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
- TFJob cannot utilize GPUs in the node. #1184
- [bug] With Python SDK, TFJob won't stop running #1183
- [bug] [Python SDK] tfjob_client.get_logs broken #1182
- How to create a python sdk for mxnet-operator #1181
- [feature] python sdk should report errors in created TFJobs #1180
- Could not introduce k8s.io/kube-openapi@master #1174
- can tf-operator used in distribute scene, such as Multi-node #1173
- Multi-worker training with Keras only use one GPU #1169
- NCCL WARN Failed to open libibverbs.so[.1] #1168
- tf-job-operator pod restarts #1167
- swagger-codegen-cli-2.4.6.jar not found #1166
- Cut release for tf-operator project #1163
- Replace reconciler implementation with kubeflow/common JobController #1161
- Error while replicating mnist_with_summaries #1159
- Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
- TFjob pods hang without explanation #1156
- [Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
- evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
- Is there any case to run the different command in tfReplicaSpecs? #1138
- should gpu resource be released when tfjob failed because of image pull problem? #1136
- tf-job-operator CrashLoopBackOff #1135
- How to change the log level of tf-job-operator #1132
- Support getting the training process via Python SDK #1129
- Popgroup is not created automatically. #1121
- TFConfig should be demonstrated more specifically. #1115
- [chore] Remove tfjob dashboard #1113
- read TF_CONFIG env from configMap #1112
- Long job names result in jobs stuck forever #1101
- [Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
- can i install tf-operator alone without kubeflow? #1096
- c #1095
- TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
- TFJob tests should use pytest #1093
- Multiple Evaluator replicas gives InvalidTFJobSpec #1091
- Java client for current version of TFjob #1090
- [enhancement] Replace common with kubeflow/common #1087
- Lack of documents for deployment #1086
- Performance problem about pod informer #1079
- [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
- Separate cluster scoped and namespace scoped resources #1077
- TFJob 1.0 #1076
- [bug] Keep tf-job-role as deprecated label in this version #1068
- GenLabels may select wrong Pods #1066
- Can I create a tf-operator pod without using GO? #1065
- tf-job-dashboard cannot work #1060
- [discussion] Should We Add CleanPodPolicy PS? #1059
- Refactor dockerfile #1058
- remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
- Invalid value: "v1beta1": must appear in spec.versions #1056
- Example on EKS: Device or resource busy #1053
- can we add PriorityClassName when we create TF-job Podgroup? #1048
- TFjob still running while chief pod is completed #1045
- Is there any document for how to run TFJob in AllReduce Strategy #1039
- tf-operator version conficts #1035
- Add E2E test for gang-scheduling #1033
- gang schedule annotation #1031
- [feature] Can we use one headless service for one job? #1030
- Will tf-operator upgrading k8s to 1.13? #1029
- no error log for create tfjob fail #1026
- Creating tfjob in dashboard usability issues #1024
- Deleting tf-job through the dashboard is not working #1019
- Create common CRD validate and mutating webhook for all operator #1016
- error with kubeflow instalation #996
- Shall we consider upgrading k8s to 1.11.3 #985
- TFJob Dashboard is not support pvc #980
- ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
- Create CRD conversion webhook #967
- Performance issue when there is a lot of completed jobs #965
- Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
- Proposal for a Common Operator #960
- Delete pod with unknown status in reconcilePods #956
- Create distributed training example for TF 2.0 #953
- Consider using KubeBuilder to reduce boilerplate code #925
- e2e test for dashboard/backend/handler/api_handler.go #921
- Use pod group instead of PDB for gang scheduling #916
- shareProcessNamespace not working with TFJob #902
- [feasibility-research] Handle machine failure #900
- Should limit the size of logs of tf_operator container #888
- Log message severity isn't properly reported in stackdriver #864
- E2E test for invalid spec errors #810
- [v1alpha2] Delete resources according to cleanuppolicy exactly once #804
- refactor the code of TFJobController for unittest #757
- e2e test for cleanupTFJob #756
- [build] Replace Python with Make or Bazel #739
- Export TF/Tensorboard/TF Summaries to prometheus #722
- [discussion] Maintain Helm Chart #716
- [discussion] Capacity planning #708
- [v1alpha2] Generate CRD validation in Kubernetes 1.11 #622
- Set labels and annotations for svc created by tf_operator #609
- mnist test isn't part of CI #597
- [v1alpha2] Push the example docker image to google or dockerhub registry #590
- feat: use fake client-set and informer add controller unittest. #540
- Run submit_release_job.sh in CI #519
- Add environment name in ControllerConfig #450
- [dashboard] How to handle storage? #449
- [dashboard] GPU limits are not taken into account #448
- [dashboard] Ability to create a TensorBoard instance #447
- [examples] Add termination policy in examples/tf_job.yaml #438
- add boilerplate header #430
- [logging] Extra flag problem #427
- [CI] Add hack/verify-codegen.sh in Travis CI #426
- E2E workflows should ignore failures #423
- [enhancement] Add OWNERS in subdirectories #415
- [enhancement] Fix the warnings reported by goreportcard.com #394
- [discussion] Separate the operator and UI dashboard #389
- [enhancemnet] Separate release image and test image #385
- [enhancement][CI] Replace Travis CI with Prow #382
- use Python3 for all python code? #377
- What to do about example TFJob YAML specs? #375
- E2E test for non-default namespace #170
- OpenAPI Client Generation for Java, Python #167
- Prevent scheduling deadlocks #165
- TfDebugger support #132
- Refactor code in py into a proper python package #114
- Update instructions and code to work with Kubernetes 1.8 #108
- Build sample container as part of release process #81
- Run lint (Python, Go) as a presubmit test #53
- Optimize scheduling of TF Processes #35
- E2E test that verifies invalid jobs are failed #30
- E2E test(s) to verify that permanent and retryable errors are handled correctly. #29
Merged pull requests:
- chore: Remove the kanban update workflow #1201 (gaocegege)
- chore: Refactor cmd #1199 (gaocegege)
- bugfix for multi_worker_strategy-with-keras.py #1198 (jiaqianjing)
- feat: Add CD using GitHub Actions #1196 (gaocegege)
- b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language #1190 (sculd)
- Fix error when `conditions` is empty. #1185 (Corea)
- Fix setup cluster issue and Pylint issue in CI tests #1179 (jinchihe)
- Fix the typo #1178 (pingsutw)
- chore: Update OWNERS #1177 (gaocegege)
- Update developer_guide.md #1176 (pingsutw)
- Update swagger-codegen-cli URL #1172 (jinchihe)
- Migrate controller implementation to kubeflow/common fashion #1171 (ChanYiLin)
- Support success policy for TFJob #1165 (terrytangyuan)
- add distributed training example of using TF 2.1 Strategy API #1164 (jazzsir)
- Make tf_operator use static compilation in container #1160 (MrXinWang)
- Update tf_job_client.py remove unused variable. #1157 (NikeNano)
- Update e2e_testing.md #1155 (NikeNano)
- Fix the link to run_e2e_workflow.py script #1154 (terrytangyuan)
- Set completion time when job exceed specified deadline. #1150 (SimonCqk)
- Support ClusterSpec Propagation Feature in TF 1.14 #1149 (zhujl1991)
- Disable istio sidecar injection in simple tfjob test #1148 (Bobgy)
- OWNERS: Add ChanYiLin as approver #1147 (ChanYiLin)
- Fix evaluator runconfig #1146 (richardsliu)
- Remove unused function arg #1145 (zhujl1991)
- Use go mod #1144 (xychu)
- Fix sdk test issue that's caused by kubenertes Client bug. #1143 (jinchihe)
- docs: Add roadmap #1140 (gaocegege)
- fix comment, add +optional flag to comment. #1137 (EDGsheryl)
- simple_tfjob_tests py3 version #1134 (gabrielwen)
- add tf-operator test in py3 #1133 (gabrielwen)
- SDK support getting the TFJob training logs #1130 (jinchihe)
- Copy third party vendor source code to Docker image #1128 (richardsliu)
- Add third party licenses #1127 (richardsliu)
- Distroless image for TF operator #1124 (krishnadurai)
- Add watch function for TFJob python Client API #1122 (jinchihe)
- fix(controller): calculate satisfied with && instead of || #1120 (GuoHaiqing)
- remove tfjob dashboard #1119 (ChanYiLin)
- fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured #1118 (leileiwan)
- Update checking status API name #1117 (jinchihe)
- Add more APIs for TFJob done #1116 (jinchihe)
- Enhance tfjobs sdk docs #1114 (jinchihe)
- fix the reconcile flow #1111 (ChanYiLin)
- Generate TFJob Python SDK #1103 (jinchihe)
- feat: Support pprof when monitoring is specified #1102 (gaocegege)
- Add support for aarch64 #1098 (MrXinWang)
- feat: Add adopters in README #1092 (gaocegege)
- feat: Use kubeflow/common #1088 (gaocegege)
- add ppc64le support for the example dist-mnist #1084 (alongzhi)
- add the dockerfile for ppc64le #1083 (alongzhi)
- Support for ppc64le #1082 (zoyun)
- feat: Replace gometalinter with golangci-lint #1081 (gaocegege)
- feat: Do not set TF_CONFIG for local training #1080 (gaocegege)
- Delete v1beta2 api #1075 (johnugeorge)
- Updating issue bot configs #1074 (rbrishabh)
- Fix example Mnist With Summaries #1073 (andreyvelich)
- use multi-stage build to build tf-operator image #1072 (hmtai)
- Set tfjob defaults in test utils #1071 (ohmystack)
- Add verify-codegen in travis CI #1070 (ohmystack)
- Update codegen #1069 (ohmystack)
- Add controller-name label for Pod and service #1067 (hougangliu)
- Renaming labels to common types #1064 (johnugeorge)
- Add qps and burst options #1063 (ScorpioCPH)
- rewrite dockerfile #1062 (hmtai)
- add total suffix in counter metrics #1055 (yeya24)
- Update k8s libraries to 1.12.3 #1054 (johnugeorge)
- add ldflag verion #1052 (yeya24)
- Avoid unnecessary update when tfjob is complete #1051 (cheyang)
- feat(pod): Support custom gang scheduler via CLI argument #1050 (gaocegege)
- add flag kubeconfig #1049 (yeya24)
- Easily detect the GOPATH in current development environment. #1047 (xauthulei)
- fix bug: When executing `tf-operator.v1 -version`, GitSHA is always 'not provided' #1046 (asdfsx)
- fix(UI): show correct namespace and name when deleting job through dashboard #1044 (gbin10533)
- Set worker 0 completed if pod's phase goto succeeded #1042 (ScorpioCPH)
- update release script #1040 (kunmingg)
- fix(docs): Fix link for simple_TFJob_test #1038 (gaocegege)
- Minor fix to add CoreV1 to scheme #1037 (johnugeorge)
- Removing unnecessary Rbac authorization #1036 (johnugeorge)
- refactor: add GenPodGroupName method to extract podGroupName in diffe… #1034 (zlcnju)
- Update gang scheduler name #1028 (goodluckbot)
v1.0.0-rc.0 (2019-06-24)
Closed issues:
- Prometheus support in TF Job #988
- TFJob 1.0 #968
- Revisit Pdb calls during the reconciles while job is completed #824
- RFC: adding more examples of TFJob for distributed learning tasks #436
Merged pull requests:
- set annotation automatically when EnableGangScheduling is set to true #1032 (ChanYiLin)
- Update image base to UBI8 GA #1023 (pdmack)
- fix: Remove dup code #1022 (gaocegege)
v0.5.3 (2019-06-03)
Closed issues:
- Podgroup is constantly created and deleted after tfjob is success or failure #1011
- tfjob startTime should set immediately after create instead of wait pod of one replicaType are all running #1000
- Create TFJob v1 documentation #990
Merged pull requests:
- fix bug for check PodPending #1021 (wackxu)
- Add uuid to id for leader election #1020 (fisherxu)
- Prometheus Monitoring for TF Operator #1018 (krishnadurai)
- Add wackxu as reviewers #1017 (wackxu)
- do not ignore DeletedFinalStateUnknown event when delete pod #1015 (wackxu)
- Add pending status for pastBackoffLimitOnFailure #1014 (wackxu)
- make resyncPeriod configurable #1013 (wackxu)
- fix sync PodGroup logic #1012 (wackxu)
- Polish documentation for TFJob V1 #1009 (richardsliu)
- Scope presubmits by version #1008 (richardsliu)
- Fix panic as pod is nil #1006 (ScorpioCPH)
- Corrects Dev Guide with up to date paths #1005 (krishnadurai)
- fix set startTime logic #1001 (wackxu)
- status: Avoid setting last transition time #982 (gaocegege)
v0.5.2 (2019-05-23)
Closed issues:
- Failed to update TFJob status in version v1 #1003
- tf-operator delete pod and service repeatedly #997
- Update kustomize files for tf-operator v1 #991
- Can not create tfjob using examples/v1beta1/dist-mnist/tf_job_mnist.yaml in self-created k8s cluster and tf-operator #975
- Cannot running tfjob pod #944
- [Test Flake] 503 accessing the test server exit handler #793
Merged pull requests:
- Remove deprecated field #1007 (ScorpioCPH)
- Delete v1beta1 #1004 (richardsliu)
- Fix presubmits and update TFJob examples for v1beta2 and v1 #1002 (richardsliu)
v0.5.1 (2019-05-15)
Closed issues:
- tf-operator panic when cleanupTFJob #994
- Create TFJob v1 API and controller from v1beta2 #989
- MasterRole label initialization #987
- Missing evaluator info from cluster section of TFCONFIG #972
- [FeatureRequest] Support dynamic volume provisioning for TFJob and PyTorchJob #949
- How to prevent tfjob from Running while there are still pods in Pending status? #948
- tf operator ui could not list and create tf job #946
- Consider restructuring tests under the shared control package #938
- TF operator v1beta2 API #935
- [v1beta2] Add ActiveDeadlineSeconds and BackoffLimit #550
Merged pull requests:
- fix repeat delete service and pod #998 (wackxu)
- set CompletionTime first when tfjob exceeds limit #995 (wackxu)
- TF job v1 #993 (richardsliu)
- Revert "Fix ineffassign error in masterRole assignment" #992 (terrytangyuan)
- Prune tf-operator OWNERS file #986 (richardsliu)
- Remove trailing spaces in distributed_tfjob.yaml #983 (terrytangyuan)
- modify dockerfile for ppc64le #981 (dreamryx)
- Update func name to pastBackoffLimit in comment #979 (terrytangyuan)
- Fix incorrect event message for PodGroup deletion #978 (terrytangyuan)
- enhance tfjob validation error message #977 (hougangliu)
- Fix ineffassign error in masterRole assignment #974 (terrytangyuan)
- Issue Label Bot Alias Yaml #973 (hamelsmu)
- Skip status update if no changes #969 (xychu)
- Minor changes #966 (johnugeorge)
- Avoid hardcoded tf container name in log #940 (ywskycn)
v0.5.0 (2019-03-26)
Closed issues:
- Support for multiple CRD versions #932
- tf-job-operator RBAC #929
- Use kube-batch as scheduler by default when gang-scheduling is enabled #920
- Rename top level python package - py -> kubeflow-tf-job #914
- [scalability testing] large number of replicas (100) #830
- [scalability testing] large number of jobs (100?) running concurrently? #829
- [doc] API Documentation #731
Merged pull requests:
- add ActiveDeadlineSeconds and BackoffLimit features #963 (ChanYiLin)
- Update tf-operator base image #962 (pdmack)
- Remove usage of crd client for checking CRD existence #961 (johnugeorge)
- Use kube-batch as scheduler by default when gang-scheduling is enabled #957 (zionwu)
- Use podGroup instead of PDB in v1beta2 #954 (thandayuthapani)
- udpate quick start for tfjobs #952 (jinchihe)
- Renaming the labels to consistent format #951 (johnugeorge)
- Add doc.go to v1beta1 API #950 (richardsliu)
- Add doc.go to common/v1beta2 #947 (richardsliu)
- renaming top level python package (issue #914) #945 (zabbasi)
- Support multiple CRD versions for TFJob #943 (richardsliu)
- Replace kube-arbitrator with kube-batch #936 (terrytangyuan)
- refactor the dockerfile #893 (chenzhiwei)
v0.4.0 (2019-02-13)
Closed issues:
- Deprecate v1alpha2 controller and API #934
- Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #928
- Use status subresource in TFJob CRD #927
- Remove genclient:noStatus and call updateStatus() from controller #924
- tfjob dashboard namespaced #923
- TFJob with 1 replicas can't use gang-scheduling #922
- [v1alpha2] Support for custom rpc_layer in TFConfig #906
- is there any lighter way to deploy tf-operators? #904
- When the distributed training job fails, the PS node and some worker node pod are deleted, and only worker 0 is retained. #903
- [feasibility-research] TF AllReduce Strategy #901
- Add validation for evaluator #894
- Running TFJob on GPU only #887
- There is a spelling mistake in developer_guide.md #882
- PS failed but tfjob status is running #881
- how can I get distributed tfjob log when set "cleanPodPolicy: All" #877
- the information of "tfReplicaStatuses" is none when tfjob is in In termination state #889
- Support custom defined cluster domain #875
- TFJob doesn't properly handle PS error. #869
- Code restructuring #866
- Delete v1alpha1 controller and API #865
- how to save the model on PVC #850
- Support error handling for TF distributed strategies #844
- TF operator UI not showing jobs #836
- Why are lastTransitionTime's all the same #806
- Kubernetes API review for TFOperator #742
- [v1alpha2] Add e2e test cases for evaluator #651
- E2E test to validate pod names #645
- Distribution strategies #628
- [discussion] specify total GPU count for distributed training #384
- E2E tests should reuse clusters #214
- Support Draft for packaging #136
- Set termination timestamp #109
Merged pull requests:
- Upgrading k8s to 1.11.2 #942 (johnugeorge)
- Add status subresource to CRD; use UpdateStatus #939 (richardsliu)
- Add v1beta2 APIs and controller logic #937 (richardsliu)
- Delete v1alpha2 API #933 (richardsliu)
- Removing PDB check for Min available replicas #930 (johnugeorge)
- Remove gosimple, go 1.8 and add gotype #919 (stpabhi)
- fix wrong var used of wait_for_condition method #918 (hougangliu)
- Update backend apiVersion to v1beta1 to fix tfjob list. #917 (stpabhi)
- Add .swp to gitignore. #912 (gabrielwen)
- Verify pod names. #911 (gabrielwen)
- Add more detailed events/messages for TFJobs #910 (richardsliu)
- Add evaluator to E2E test. #909 (gabrielwen)
- Move ks_util into kubeflow/testing #908 (jlewi)
- job: Fix log output #905 (wangzewang)
- Minor fixes #899 (johnugeorge)
- GetCondition func fix #898 (johnugeorge)
- Don't reinitialize replica statuses after TFJob completes #897 (richardsliu)
- Adding master role label for TFJob #896 (johnugeorge)
- Add validation for evaluator #895 (DeliangFan)
- build backend in travis ci #892 (chenzhiwei)
- bug: get condition just get last, should get special type condition #891 (zjj2wry)
- feat: optimize code #890 (zjj2wry)
- Fix tf-operator presubmit failures by installing go manually #885 (richardsliu)
- Fix spelling mistake in developer_guide.md #884 (cndaimin)
- Add mnist example with TF summary #880 (richardsliu)
- Fix typo in developer_guide.md #878 (Jeffwan)
- Support custom defined cluster domain #876 (ScorpioCPH)
- Upgrade k8s dependency to 1.10.1 #874 (richardsliu)
- Add an e2etest for testing restart policy #873 (ChanYiLin)
v0.4.0-rc.1 (2018-11-28)
Closed issues:
- [v1alpha2] E2E test for replica restart policy #639
v0.4.0-rc.0 (2018-11-19)
Closed issues:
- create TFjob resource object successfully, but did not create pod #871
- Create a script/tool to migrate users to v1beta1 API #858
- Implement v1beta1 controller for TFjob #857
- Add examples using TF distributed training #843
- Add E2E tests for TensorFlow distribution strategies #842
- run tfjob failded with self build image #840
- Create v0.3-branch #838
- Update kube-arbitrator to kube-batch #837
- [docs] Add instructions about how to contribute e2e test cases #822
- Build the tf-operator every night #747
- Document how to use gang scheduling with TFJob #743
- tf-operator should ensure that CRD exists #710
- Improve our test harness to make it easy to write lots of E2E tests #373
Merged pull requests:
- gopkg: Use version instead of branch #872 (gaocegege)
- Update tf-operator v1beta1 documentation and examples #870 (richardsliu)
- Delete v1alpha1 API and controller #868 (richardsliu)
- Import jobcontroller from common package #867 (richardsliu)
- TF operator v1beta1 e2etests #863 (richardsliu)
- Add an e2etest for running distributed training TFJob #862 (richardsliu)
- TF operator v1beta1 API implementation #861 (richardsliu)
- Add an example for TF distributed training #860 (richardsliu)
- TF operator v1beta1 APIs #859 (richardsliu)
- vendor: Update to 1.10 #856 (gaocegege)
- Rename kube-arbitrator to kube-batch. #855 (rexxar-liang)
- Add the ability to skip a E2E test #853 (richardsliu)
- Add documentation for writing E2E tests #852 (richardsliu)
- Refactor tf-operator E2E tests (part 2) #849 (richardsliu)
- Add richardsliu to OWNERS #847 (richardsliu)
- Remove v1alpha1 E2E tests #846 (richardsliu)
- Refactor E2E tests (part 1) #845 (richardsliu)
- add frontend-dir & port flag to backend program #841 (lovejoy)
- Update changelog to include changes in v0.3. #839 (jlewi)
- travis: Fix a typo #835 (gaocegege)
- fix ERROR: logging before flag.Parse issue #834 (lovejoy)
- Ankush Signing Out #833 (ankushagarwal)
- feat: ensure that tfjob crd exists. #820 (Muzry)
v0.3.0 (2018-09-22)
Closed issues:
- How to run in stand-alone mode #826
- Event reporting pod exited with non-zero exit code is improperly formatted #818
- invalid-tfjob test results don't show up in gubernator/test grid #816
- Invalid TFJob spec can cause the TFJob operator pod to crash repeatedly #813
- Should scheduleName be a TFJob field or is it sufficient to be a podTemplateField #801
- reconcile should be triggered on update; even if no changes #800
- Backwards compatibility support "Master" as chief #794
- Add Pytorch V1alpha2 Implementation #785
- [enhancement] Add SchedulerName in V1alpha2 #782
- Ability to prefer using all gpus on a single node #781
- test_runner.py is using wrong util module for JobTimeoutError #780
- [Test Flake] Intermittent test failures: tensorflow.python.framework.errors_impl.UnavailableError: OS Error #778
- Latest docker Image on wrong commit #775
- PS still running after tfjob is complete #774
- TF_CONFIG in tf-operator:v20180724-13863edf missing Environment: cloud #772
- TF_CONFIG cluster spec has wrong FQDN name #770
- Error syncing tfjob: Failed to found the port #768
- Events don't show up in kubectl describe tfjobs #763
- E2E test for TF estimator API #762
- v1alpha2 doesn't work TF.estimator for TF <= 1.6 ; need to add environment:cloud to TF_CONFIG #761
- Update and move README.md to website #760
- Scope TFJob operator to only claim jobs in a given namespace #759
- Surface invalid spec errors in a more user friendly way #755
- TFJobs UI returns 500s and json parse errors displaying pod information or creating job #754
- [v1alpha2] Job should be marked completed when worker 0 exits but other workers are still running #751
- [testing] CleanPodPolicy needs E2E test #750
- v1 and v2 E2E tests appear to be stomping on each other #748
- [Test Flake] tf_job_client.py needs to handle case where conditions is none #744
- tf-dashboard show workers of all the tfjobs when querying a specific tfjob #737
- [build] Delete build/images/tf_operator/build_and_push.py #736
- tf-operator synPdb failed when enable-gang-scheduler #729
- not proper log message #727
- Unable to check logs in TFJob ui for v1apha2 #723
- Pod stuck in unknown status when kubernetes node is down #720
- [proposal] cleanup jobs after finished #718
- [v1alpha2] Remove redundant code about status #713
- [v1alpha2] Invalid Job Status #712
- Model exchange #709
- [v1alpha2] Invalid job spec not reported in TFJob status #707
- [v1alpha2] Invalid Job spec crashes operator #706
- [v1alpha2] Support cluster spec via command line argument #705
- [v1alpha2] Error when host name is not svc.cluster.local #703
- unable to create a tfjob in the UI; namespace not set #701
- Wrong comment when setting default CleanPodPolicy #698
- how to upgrade smoothly from v1alpha1 to v1alpha2? #697
- file_cache is unavailable when using oauth2client >= 4.0.0 #696
- [v1alpha2] Validate the TFJob converted from unstructured #682
- [v1alpha2] CreatedCondition is not set #680
- [v1alpha2] ks apply on existing job; "unable to find api field in struct Unstructured for the json field "metadata"" #674
- Make it easier to debug/develope E2E tests #655
- [v1alpha2][log] Use logrus instead of glog in service_control #635
- latest.Status.StartTime is nil:invalid memory address or nil pointer dereference #608
- tf-operator throws runtime error: invalid memory address or nil pointer dereference #596
- [v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #575
- Get rid of the restriction that the container should be named "tensorflow" #563
- [proposal]TFJob condition for v1alpha2 #562
- [feature] Add Cleanup Policy to TFJob Spec #536
- Update releaser to use Argo. #400
- Enable kube-arbitrator as scheduler for tensorflow #349
Merged pull requests:
- Fix postusbmit builds; registry is incorrect. #832 (jlewi)
- postsubmit should push image to gcr.io/kubeflow-images-public #831 (jlewi)
- Fix typos #828 (ScorpioCPH)
- Cleanup unused functions #827 (johnugeorge)
- allow users to config resources while creating job in dashboard #825 (ChanYiLin)
- remove schedulerName from tfjob spec #823 (ChanYiLin)
- Adding corev1 scheme to Events #821 (johnugeorge)
- pod: Fix eventf #819 (gaocegege)
- Set artifacts dir so that the output of invalid-job test will be picked up. #817 (jlewi)
- If a TFJob spec is invalid mark the job as failed with an appropriate condition #815 (jlewi)
- Estimator e2etest #814 (richardsliu)
- Minor restructuring #812 (johnugeorge)
- Fix a bunch of issues with logging. #811 (jlewi)
- tensorflow: Support old versions of estimator #809 (gaocegege)
- Adding to Owners #808 (johnugeorge)
- tfjob: Send event to object #807 (gaocegege)
- Revert "Avoid triggering reconcileTFJobs if no TFJob update (#796)" #805 (gaocegege)
- Dockerfile: Fix typo #803 (gaocegege)
- Dockerfile: Replace v1 with v2 #802 (gaocegege)
- Typo in TTLSecondsAfterFinished json field #799 (jian-he)
- Avoid logging for non-TFJob pod #798 (jian-he)
- Job completion time is not set for job FAILED state #797 (jian-he)
- reconcileTFJobs is always triggered even with no update #796 (jian-he)
- Add an e2etest to verify clean pod policy in TF job operator #795 (richardsliu)
- Restructing common utility functions #792 (johnugeorge)
- Mark TFJob succeeded if worker 0 completed. #791 (ScorpioCPH)
- Rename the async keyword argument to async_req #790 (ojarjur)
- Scope tf-operator to a namespace #789 (ankushagarwal)
- Fix the name of the "JobTime(out)Error" class #788 (ojarjur)
- Add SchedulerName in V1alpha2 #787 (ChanYiLin)
- OWNERS: Add ChanYiLin as reviewers #784 (ChanYiLin)
- Add retries to deal with test flakes related to UnavailableError. #779 (jlewi)
- Renaming TFJobController to TFController #777 (johnugeorge)
- Shared implementation of operator code #773 (johnugeorge)
- WaitForJob should use conditions for v1alpha2. #771 (jlewi)
- OpenAPI: update openapi_generated.go to support TTLSecondsAfterFinished #769 (jetmuffin)
- Refactoring TF operator code #767 (johnugeorge)
- TFCONFIG needs to set environment:cloud to support older versions. #766 (jlewi)
- Improve meta information in log messages to make it easier to debug jobs #765 (jlewi)
- Replace contents of README.md with a link to kubeflow.org #764 (jlewi)
- linter: Rename gas to gosec and fix linting errors #758 (gaocegege)
- Cleanup tf-job after a configured TTL #753 (ccding)
- Name Error: mv tj_job_mnist.yaml to tf_job_mnist.yaml #752 (xieydd)
- Prevent multiple versions of an E2E test from clobbering each other. #749 (jlewi)
- Revert "cleanup jobs after finished (#725)" #746 (ccding)
- Fix a test flake caused by conditions being None #745 (jlewi)
- OWNER: Add cheyang, Remove mitake #741 (gaocegege)
- build: Remove the useless script #740 (gaocegege)
- fix list all the pods of tfjob #738 (cheyang)
- test fix: None type check #735 (kunmingg)
- fix not proper log message #734 (ChanYiLin)
- Generate api information in OpenAPI model and register types to scheme #733 (jetmuffin)
- Add err msg for TFJob from Unstructured #732 (xychu)
- Add retrying to log_status function #728 (ankushagarwal)
- Update developer_guide.md for v1alpha2 #726 (lovejoy)
- cleanup jobs after finished #725 (ccding)
- Fix a log function issue #724 (lovejoy)
- delete pdb when tfjob is terminated #721 (ChanYiLin)
- [v1alpha2] Add PDB of TFReplicaSet for gang scheduling by kube-arbitrator #717 (codeflitting)
- [v1alpha2]Remove redundant code about status and fix bug of invalid job status #715 (codeflitting)
- OWNERS: Add yph152 and codeflitting as reviewers #714 (gaocegege)
- [v1alpha2] Add more validation of TFJobSpec #711 (codeflitting)
- Fix sub domain issue #704 (ScorpioCPH)
- add ValidateAlphaTwoTFJobSpec to check v1alpha2.TFJobSpec is valid #702 (codeflitting)
- [v1alpha2] controller_pod_test: Add test cases for evaluator #700 (codeflitting)
- Wrong comment when setting default CleanPodPolicy #699 (jian-he)
- Use logrus instead of glog in service_control #695 (codeflitting)
- [v1alpha2]update tfjob condition for Created #694 (yph152)
- CHANGELOG: Add #693 (gaocegege)
- setup cors for redirects under iap #688 (kkasravi)
- define cleanup policy #685 (cheyang)
v0.2.0-rc1 (2018-06-21)
Closed issues:
- [v1alpha2] Make restart policy a pointer #692
- [v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
- [v1alpha2] add pod label with job name (without namespace) #672
- [v1alpha2] Pods not deleted when job finishes #671
- [v1alpha2] conditions not updated #668
- [v1alpha2] Move control interface to separate pakckage #665
- [v1alpha2] Move test util to separate package #664
- [feasibility study] Investigate strategy to stop PS after job is completed #661
- Speedup E2E test by running build and setup cluster in parallel #659
- In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
- [v1alpha2] service names are prefixed with namespace #654
- [v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
- dep ensure give warning on k8s.io/apiserver #647
- [v1alpha2] pod names don't include random salt #644
- [v1alpha2]Unable to create pod #641
- GPU tests failing; ks env doesn't exist #640
- TFJob not marked as success when master exits but not workers #634
- v1alpha2 - pod names don't include replica type #633
- tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
- [v1alpha2] Set event for tfjob when spec is not valid #620
- [v1alpha2] RealServiceControl does not set owner reference #616
- tf_job_client blocks forever #606
- [v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
- [v1alpha2] Need ksonnet package #599
- Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
- [v1alpha2] Remove controller_utils.go #591
- [v1alpha2] Add CI test #589
- [question] dist_mnist example failed to run #588
- [enhancement] Fix the gofmt support #586
- can not set labels #580
- v1alpha2 should use headless services #574
- TFJob operator should pass through annotations to the pod #573
- [test] Test failed because of ImagePullBackOff #567
- [discussion] Do we need to maintain helm chart now? #564
- TfJob operator stops working on invalid spec #561
- Add a timeout flag in tf-operator to preserve resources after job completion for a given period #558
- [go] Use dep instead of glide to reduce the size of vendor #556
- [v1alpha2]tfjob restartPolicy for Never #555
- Servable not found for request: Latest(mnist) #552
- [v1alpha2] Enhance the logic about sync #547
- [v1alpha2] The state of distributed model training. #544
- [test] copy labels and anotations to pod from tfjob #543
- [v1alpha2] Potential bugs when there is one worker succeeded #538
- [v1alpha2] Use structured log #537
- Unable to deploy the example TfJob in the user guide #535
- [log] investigate zap #534
- [v1alpha2] Try to not to always claim pods #533
- [v1alpha2] Suppport customized port #532
- [v1alpha2][test] Avoid potential data race problem #530
- [v1alpha2] Do not set default to always for restartpolicy #524
- [v1alpha2] start using kubeconfig #522
- v1alpha2 integration #521
- E2E test steps should exit with non zero exit code if test fails #514
- TFJob operator surface queue metrics #503
- [v1alpha2] Sync commits with v1alpha1 #490
- [api] Remove pending pods from active pods #484
- [enhancement] Set StartTime for TFJob status #475
- [Feature] Support "eval" worker in tf-operator #444
- Use OpenAPI validation for CRDs in k8s 1.9 #437
- default install of kubeflow no longer install tf-job-dashboard #435
- Add appropriate logging fields to the tf-operator log messages #424
- Use DAG functionality of Argo in our E2E tests #422
- [enhancement] Refactor docs #379
- Post submits are failing with Argo #370
- tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
- Refactor TFJobStatus in CRD API #333
- Deprecate the TfImage field #330
- Deprecate TfPort and set default port for users #327
- [enhancement] Add e2e test cases for recorder #317
- Make the TfJob controller more event driven #314
- Potential data race, maybe #302
- [discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
- Does TfJob controller need to do master election? #263
- Setup Prow PR Dashboard #255
- API: some comments about API changes from PR #215 review #249
- e2e test for the case that the chief is not master #235
- Use conditions instead of phase #223
- Submitted tfjobs cease to start running under unknown conditions #203
- Tutorials #195
- Don't leave pods running just to get logs #128
- Add hyperparameter tuning? #112
- Phase is wrong unexpected TfJob phase: Done #110
- Copy chart to kubernetes/charts #93
- Create a web page to list releases #70
- tensorflow 1.4 and estimator support #61
- Set a default value for restartPolicy #55
- Use headless services for Training jobs #40
- More validation of TfJob #25
Merged pull requests:
- *: Add cleanpod policy for v1alpha2 #691 (gaocegege)
- status: Fail the TFJob if PS is failed #690 (gaocegege)
- Use tf_job_name not tf_job_key as the label name. #689 (jlewi)
- pkg: Delete pods and services after finished #686 (gaocegege)
- informer: Add comments and TODO #684 (gaocegege)
- and some safety check #683 (u2takey)
- Remove code that is no longer used. #681 (jlewi)
- change comment with more related link #679 (u2takey)
- return err if the spec area is nil after unmashal for tfjob v1alpha2 #678 (jiaxuanzhou)
- fix typo #677 (u2takey)
- fix restart policy with comment #676 (u2takey)
- label: Remove namespace from labels #675 (gaocegege)
- controller: Move control interface to control package #670 (gaocegege)
- defaults: Rename the type #669 (gaocegege)
- Enable the E2E tests for v1alpha2. #667 (jlewi)
- *: Move test util to separate package #666 (gaocegege)
- Update dep and vendor #663 (xychu)
- server: Make threadiness configurable #662 (gaocegege)
- dist-mnist: Move to examples #660 (gaocegege)
- tfjob: Add test for copy labels and annotations #658 (gaocegege)
- *: Remove namespace from service name #656 (gaocegege)
- pod: Add test for exit code #652 (gaocegege)
- [v1alpha2] Estimator support - Do not include evaluator in cluster spec #650 (xychu)
- pods: Add cluster spec test #649 (gaocegege)
- pod: Submit an event when the user specifies the restartpolicy for pod template #648 (gaocegege)
- v1alpha2 E2E tests for termination policy #646 (jlewi)
- status: Add test cases for failure #643 (gaocegege)
- Add proper error handling for deploying the tests. #642 (jlewi)
- pods: Add restart policy #638 (gaocegege)
- status: Support chief #637 (gaocegege)
- *: Set name for the pod tfjob.name-type-index #636 (gaocegege)
- Modify presubmits to support testing with v1alpha2 #632 (jlewi)
- Updates to enable e2e test for v1alpha2 #629 (ankushagarwal)
- Use logging.exception to capture stack traces in logs #627 (ankushagarwal)
- Pass TFJob API version instead of hardcoding it #626 (ankushagarwal)
- [v1alpha2] Add distributed state management #625 (yph152)
- pkg: Send events when reveive invalid spec #623 (gaocegege)
- pkg: Support customized port #621 (gaocegege)
- Add apiVersion parameter to simple_tfjob component #619 (ankushagarwal)
- api_handler: Fix import order #618 (gaocegege)
- service_control: Set owner ref for service and add test cases #617 (gaocegege)
- test: Add test cases for service ref manager and control interface #615 (gaocegege)
- controller: Refactor and add test cases for helper #614 (gaocegege)
- [dashboard] Upgrade to v1alpha2 #613 (wbuchwalter)
- controller: Improve coding styles #612 (gaocegege)
- Informer: Use unstructured #610 (gaocegege)
- TFJob client should not block forever trying to get the namespace object #607 (jlewi)
- crd: Add validation using OpenAPI 3.0 #605 (gaocegege)
- dist_mnist: Add unused_argv #604 (gaocegege)
- service: Refactor to the slice structure #603 (gaocegege)
- Delete the old releaser code which is no longer used. #602 (jlewi)
- Add tf-operator.v2 to release.py so that we build a Docker image containing the v1alph2 controller #601 (jlewi)
- Add a new command-line argument for release.py #595 (chaoleili)
- controller: Remove dup code and use k8s.io/kubernetes/controller #594 (gaocegege)
- test: Fix data race problem #593 (gaocegege)
- .travis.yml: Fix cmd errors #592 (gaocegege)
- Fix the gometalinter support #587 (wgliang)
- Format go code and fix spelling errors #585 (wgliang)
- docs: Add quick start for v1alpah2 #584 (gaocegege)
- mnist: Add correponding yaml config #583 (gaocegege)
- pod: Add update logic #582 (gaocegege)
- .pylinrc: Add dist_mnist #581 (gaocegege)
- .travis.yml: Add failure notification in GitHub #579 (gaocegege)
- controller_status: Remove pending pods from active pods #578 (gaocegege)
- api: OpenAPI support #577 (gaocegege)
- controller_service: Headless service #576 (gaocegege)
- replace glide with dep in the developer guide #572 (ChanYiLin)
- [v1alpha2]fix bug int to string for index #571 (yph152)
- Fix missing string for logging placeholder #570 (zacharyzhao)
- Update py_lint and py_test #569 (ankushagarwal)
- Update test worker image to kubeflow-ci #568 (ankushagarwal)
- chart: Remove #566 (gaocegege)
- add OwnerReferences to pdb #565 (ChanYiLin)
- Correct typos in README #559 (ntenenz)
- vendor: Use dep instead of glide and prune it #557 (gaocegege)
- set completion time on success #554 (u2takey)
- README: Add tf-operator v1alpha2 design doc #553 (gaocegege)
- Add dist mnist model for e2e test #549 (ScorpioCPH)
- controller: Refactor controller_pod #548 (gaocegege)
- Replace kubeflow-images-staging with kubeflow-images-public #546 (ankushagarwal)
- copy labels and anotations to pod from pod template #542 (u2takey)
- add workqueue and reflect metrics #541 (zjj2wry)
- fix the bug of keeping creating new pdb #539 (ChanYiLin)
- signals: Add #531 (gaocegege)
- OWNERS: Add @ddysher and @willb as reviewers #529 (gaocegege)
- developer_guide: Add instructions for v1alpha2 #528 (gaocegege)
- v1alpha2: Add implementation #526 (gaocegege)
- v1alpha2: Add API and codegen #523 (gaocegege)
- Reenable cluster teardown. #520 (jlewi)
- Only identify specific exit codes as retryable error #518 (0olwzo0)
- update OWNERS #516 (mitake)
- Create a script to release the TFJob operator image #515 (jlewi)
- Fix output on test failure #511 (jose5918)
- Adds gcloudignore #510 (jose5918)
- RFC: Add a new command for generating example TFjobs #509 (mitake)
- Use a CentOS 7 base image for the tf-operator image #469 (tmckayus)
v0.1.0 (2018-03-29)
Closed issues:
- [v1alpha2] Implement condition update #502
- E2E tests timing out; job appears to remain in running state even though job is done. #500
- [v1alpha2] TF_CONFIG should be configurable by user #499
- [test] All log is 404 in argo #496
- Presubmit shows succeeded, but some test actually failed. #479
- Waiting pods start too long #461
- [test] Add unit test for pkg/controller #455
- Create a suitable OWNERS file in /dashboard #443
- Tide is misconfigured for this repository. #433
- CI failed to setup the cluster #420
- [docs] Add dashboard readme #411
- Make coverall results advisory and not report as failure #406
- Presubmits failing due to lint #404
- [enhancement] Fix go vet errors which not caught by the compilers #395
- User facing website for Kubeflow that details how to choose a stack #371
- [discussion] How to set clusterspec #369
- [enhancement] Rename the cmd/tf_operator to cmd/tf-operator #363
- Local releaser fails due to version_tag #360
- Helm test failure not reported to gubernator #355
- [discussion] Whether to create CRD in helm charts #353
- Should resourcelock be in the same namespace as controller? #352
- Helm test tf-job does not pass validation #351
- Move tensorflow/k8s to kubeflow/tf-operator #350
- Get rid of TensorBoard replica #347
- Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs #346
- Deprecate the ENV MY_POD_NAMESPACE and MY_POD_NAME #341
- [feature] Does tfJob support setting different label/envVar for each worker(replicas >1)? #340
- [Discussion] Time to start tagging releases for the TF operator? #339
- [discussion] Should group name be tensorflow.org or kubeflow.io or kubeflow.org? #337
- dashboard silient error during calling non-existent tfjob #335
- in dashboard, silent error when nonexistent namespace is specified #334
- Deprecate the IsDefaultPS field #329
- [Convention] Replace Tf with TF in CRD #328
- Standardise labels for issues and PRs #326
- Manage Pods directly instead of using Job controllers #325
- TfJobs dashboard not showing jobs #324
- TfJobs dashboard doesn't work with K8s API server proxy or envoy proxy #323
- Recreating a failed/successful job with same name doesn't work #322
- Releaser incorrectly tags images as "dirty" #321
- Reenable the releaser #320
- E2E tests are not isolated #318
- Need to mark prow job as failed if any tests fail #315
- Remove outdated branch wbuchwalter-patch-1 #311
- E2E test delete and recreate job with same name #310
- TrainingJob.reconcile not called periodically #309
- rename master to chief #306
- Assign resource quota for TensorBoard #304
- Jobs evicted for lack of memory, potentially add resource field to tf-job prototype #301
- [Discussion] Operators vs. controller pattern #300
- [bug] Add a default pod template for PS #297
- Bunch of pylint error messages #294
- Fix Head #293
- Operator deployment fails post-v20180108-190394d #292
- Promote last known good release #290
- [bug] metadata.ownerReferences.apiVersion is not set #288
- fail to run example job. invalid job spec: tfReplicaSpec.TfPort can't be nil #284
- [bug] Build log 404 in https://prow.k8s.io/?repo=tensorflow%2Fk8s #282
- [feature] Seperate the CRD and controller #281
- Gaps in test coverage #280
- Regression in flag name: controller-config-file #279
- [bug] glog before flag.Parse() #275
- build new code to new image and find some problem #274
- Fix the releaser so we can build new images #270
- deploy.py gives gcloud api error '... Version "1.8.1-gke.1" is invalid.' #268
- Pods terminated without waiting #267
- Attach appropriate header (copyright) to go files #266
- suppose i've install the tfjob in my k8s cluster #265
- what's the folder pkg for? #264
- Build failing because of lint issues #256
- what's the main change between version 0.2 and version 0.3? #247
- SetupCluster failures unexpected keyword argument 'client_configuration' #242
- GPU test marked as succeeded but airflow step is failing #240
- Use Kubeflow & ksonnet to install TfJob #239
- tf_smoke.py distributed computing doesn't work on minikube #238
- example-job can not work in private k8s cluster #233
- Test failures aren't properly reported in Gubernator #229
- [CRD] Request for input and output dirs in TFJobSpec #224
- TfJob should be marked as failed if setup fails #218
- panic: runtime error: invalid memory address or nil pointer dereference can not run in k8s 1.8.5 #212
- Rethink the TFJob CRD #209
- ksonnet configs for deploying the TfJob CRD & Controller #208
- Make default TfImage configurable by users #207
- refactor the TfJob to use Informer and Controller #206
- Use Argo workflow engine for CI/CD or releases #205
- Potential issue with Tensorboard / value of simple best-practices example with tboard #202
- Investigate using buildah to build our images #201
- E2E tests pre & postsubmits are failing #196
- Publishing a client to pypi #193
- Don't require a master or chief #192
- Make cloning the repo and building the artifacts separate commands in py/release.py #189
- Handle the case where grpcServerFilePath is the empty string #188
- Make Airflow logs accessible #185
- Complement docs for Python 3rd party dependencies #181
- Helm Test fails because grpcServerFilePath is the empty string #179
- Helm should only set --controller_config_file conditionally #175
- Troubleshooting Guide: no matches for tensorflow.org/, Kind=TfJob #174
- no matches for tensorflow.org/, Kind=TfJob #173
- Failed to build TFOperator #171
- E2E test for GPUs #164
- TfJob doesn't work on minikube #160
- Deleted jobs re-starting #156
- Use coveralls.io to report and check code coverage #155
- Clarify scope of tensorflow/k8s #150
- After init helm, install chart failed #149
- Helm test; insufficient permissions on RBAC clusters #135
- Need to trim trailing slash of host string in TfJobRestClient.Watch() #130
- results of lint test aren't reported in junit file used by gubernator #126
- Collaborators need to be K8s members to trigger tests #122
- Extend Test Infrastructure to run multiple E2E tests in parallel #120
- initResource() failed; findAllTfJobs returned error: #118
- Latest tag on gcr.io is not up to date #116
- duplicate #115
- postsubmit results aren't showing up in testrgrid #113
- TensorBoard replica set not deleted when job deleted. #107
- helm permission issue on 1.8.1 #106
- Run python unittests as part of pre/post/periodic tests #101
- E2E tests are failing #96
- E2E Test log should capture output from helm-test #95
- Rename TfJob kind to remove mlkube.io #89
- Setup travis for tensorflow/k8s #88
- Update repo to use its new location tensorflow/k8s #86
- mlkube.io -> tensorflow/k8s #85
- Update prow to use repo tensorflow/k8s #84
- periodic test is failing #83
- runner.py needs to create build-log.txt with stdout/stderr of test #82
- E2E tests leaking GKE clusters #80
- No results show up if you click on mlkube-build-periodic #76
- No results show up in prow test grid for presubmit jobs #75
- Include TfJob name in labels #72
- Simplify/Clarify Accelerators config #71
- Clean up examples; don't require cloning the repo #68
- How to create TF Jobs from the user side? #67
- Change version from beta -> alpha #65
- API Review #64
- Setup release process for CRD #63
- Post submit jobs don't correctly upload artifacts to GCS #62
- presubmit test(bootstrap.py) doesn't properly check out PRs #59
- E2E Test for default PS server #58
- UI / Kubernetes Dashboard Integration #57
- E2E test for GPUs #54
- Integrate with Prow for Continuous Testing #46
- Consider how we manage replicas (stateful sets, managing pods directly) #45
- Use K8s Garbage Collection #42
- func c.findAllTfJobs() in controller.go will never reach #41
- Rename project #34
- Structured (Json) logging for Tf Processes #32
- Permanent errors don't cause job failure #28
- If handling Add event fails, TfJob should be marked as failed with appropriate error #26
- Structured Logging For the operator #24
- Operator Log Spam; replicas.go:287] No container named: tensorflow found for pod; assuming POD is running #23
- Provide a default value for TfPort, replicas, and tfReplicaType #22
- Setup continuous build of containers #19
- Should this be converted to a Custom Resource Definition (CRD) in anticipation of 1.7 #17
- Run TensorFlow server for parameter servers by default #16
- TensorBoard Integration #13
- Dependency management #7
- Better GPU support #6
- TfJobRestClient.Create doesn't set kind appropriately #5
- Add a creationTimestamp #4
Merged pull requests:
- Fix outdated information about GPUs in README #513 (mindprince)
- Don't leave pods running when a job completes. #512 (jlewi)
- Fix bug with jobs not being marked as completed. #501 (jlewi)
- release: Fix style #498 (gaocegege)
- pkg: Fix the code changed in #486 #497 (gaocegege)
- fixed some golint warning #486 (AK-ayush)
- Support testing on minikube. #485 (jlewi)
- add LabelsByIndex method to eliminate code duplication #474 (rc-zhang)
- Use headless services for Training jobs #471 (rc-zhang)
- Fix field selectors in controller #465 (wbuchwalter)
- Run ks upgrade #464 (lluunn)
- Fix owners file id #462 (lluunn)
- Remove deprecated package retryutil #460 (ScorpioCPH)
- Change test cluster to kubeflow-ci #459 (lluunn)
- *: Remove APIExtension clientset #454 (gaocegege)
- travis: Ignore generated code #453 (gaocegege)
- Create PDB of TFReplicaSet for gang scheduling by kube-arbitrator #452 (mitake)
- Add OWNERS file for dashboard #446 (wbuchwalter)
- Make local release cross-platform + fix #445 (wbuchwalter)
- Add proxying to front-end development server. #442 (wbuchwalter)
- Fix dashboard + proxy incompatibility #441 (wbuchwalter)
- change kubeflow.io to kubeflow.org #440 (Jimexist)
- Remove unreachable code #434 (ScorpioCPH)
- *: Remove type ContainerName #432 (gaocegege)
- add boilerplate header for go file #431 (wackxu)
- format the python files with yapf #429 (mitake)
- clientset: Fix code which is changed manually #428 (gaocegege)
- Delete Dockerfile to build a docker image to use for prow. #425 (jlewi)
- Fix setup_cluster. #421 (jlewi)
- Add ScorpioCPH as approver/reviewer #419 (ScorpioCPH)
- Create resources (Services/Jobs) only once #418 (ScorpioCPH)
- Dashboard: Dev Guide #417 (wbuchwalter)
- Use logrus for structured logging #416 (ankushagarwal)
- Create an initial OWNERS file. #414 (jlewi)
- Docs should refer to Kubeflow user guide for deploying the TFJob operrator #412 (jlewi)
- Run glide update to update glide.lock #410 (ankushagarwal)
- Fix typo in Makefile #409 (ankushagarwal)
- Add a field SchedulerName to TFJob for specifying a scheduler #408 (mitake)
- Fix lint issues with python3 and a bug in lint script #405 (jlewi)
- Support using our E2E workflow to build a Docker image for releases. #403 (jlewi)
- add go 1.10 support in travis #402 (Jimexist)
- use yapf to format python code #401 (Jimexist)
- Fix bug with jobs not working if you recreate a job with same name as previous job #399 (jlewi)
- Fixes go vet errors #397 (swiftdiaries)
- Fixed-363: Rename cmd/tf_operator -> cmd/tf-operator #393 (AK-ayush)
- README: Add community section and quick links #392 (gaocegege)
- Remove TensorBoard related code in operator #391 (gaocegege)
- Fix something after move to kubeflow/tf-operator #390 (sdf611097)
- Add a prow_config.yaml file to configure our prow jobs. #388 (jlewi)
- fix a typo in the README file. #387 (ChanYiLin)
- *: Replace the repo name #386 (gaocegege)
- travis: Add go build command #383 (gaocegege)
- config.sh: Remove #381 (gaocegege)
- Use ksonnet to easily define TFJobs to be run as tests #374 (jlewi)
- Fix repo name env #372 (jose5918)
- controller.go: Fix a glog typo #368 (gaocegege)
- fix -version option: print version #367 (caogj)
- *: Add copyright owner in go files #364 (gaocegege)
- Fix local releaser #361 (jose5918)
- nit: try to simplify e2e main.go #359 (Jimexist)
- Use Argo rather than Airflow to run our E2E tests #358 (jlewi)
- Add an option to release.py to specify the tag for the image to use. #357 (jlewi)
- Fix helm test #356 (jose5918)
- feat(group): Update CRD group to kubeflow.org #354 (gaocegege)
- Deprecate the ENV MY_POD_NAME and use default namespace #348 (ScorpioCPH)
- feat(crd): Separate CRD and controller #345 (gaocegege)
- Create Pod instead of Job #344 (ScorpioCPH)
- Deprecate IsDefaultPS in TFJob CRD API #343 (ScorpioCPH)
- Update documentation #342 (jose5918)
- feat(dashboard): Namespace handling #338 (wbuchwalter)
- feat(dashboard): better error handling in dashboard code #336 (Jimexist)
- Rename Tf to TF #332 (ScorpioCPH)
- Delete binary file #331 (ScorpioCPH)
- Take test failures into account when setting prow job status #319 (jlewi)
- remove unused file rename.sh #316 (caogj)
- add UpdateFunc to handle update events #313 (mqliang)
- pkg: Add recorder support #312 (gaocegege)
- Fix a bunch of problems in TfJob CRD that crept in while tests were broken #308 (jlewi)
- replace TPR with CRD #307 (mqliang)
- fix broken link #305 (caogj)
- Fix python lint checks #303 (jlewi)
- Fix setting defaults. #299 (jlewi)
- Add service account name to dashboard if RBAC. #298 (ConnorDoyle)
- The flag should be --controller-config-file. #295 (jlewi)
- Fix the junit XML file format. #291 (jlewi)
- *: Fix API Version #289 (gaocegege)
- *: Implement the List interface for TfJobList #278 (gaocegege)
- cmd: Fix the flag error caused by pflag #277 (gaocegege)
- types.go: Fix CRDKind #276 (gaocegege)
- Move around due to new directories layout #273 (ScorpioCPH)
- bugfix: set faliures=true if failed deleting configmap #272 (mqliang)
- Fix our continuous release process #271 (jlewi)
- update initialClusterVersion to 1.7.11-gke.1 #269 (cwbeitel)
- Misc Cleanup. #262 (jlewi)
- Add proposed directories layout #261 (ScorpioCPH)
- record event when tf_operator failover #260 (zjj2wry)
- follow kubernetes flag convension #259 (zjj2wry)
- refactor dashboard backend, use versioned tfjob clientset #258 (zjj2wry)
- apply goimports -w to generated files #257 (Jimexist)
- add gometaliner into travis build #254 (Jimexist)
- fix(no-dup): reduce dup code in printVersion #253 (Jimexist)
- Improve utilities for E2E tests. #251 (jlewi)
- Fix leaking of clusters in E2E tests #80 #250 (jlewi)
- feat(pipenv): Use pipenv to lock down python dependencies #248 (Jimexist)
- fix(lint): add prop types and fix all eslint errors #246 (Jimexist)
- refactor code and format imported package #245 (zjj2wry)
- feat(lint): apply prettier to format frontend src/ code #244 (Jimexist)
- feature(lint): use prettier and lint-staged for frontend javascript code #243 (Jimexist)
- Fix issues with tf_job_gpu test #241 (jlewi)
- Use the release/test python scripts pulled from the repo. #237 (jlewi)
- Don't run glide install in travis builds. #236 (jlewi)
- refactor the controller logic #234 (wackxu)
- feat(coverage): add coveralls support #232 (Jimexist)
- use glide install --strip-vendor remove subpackage vendor #231 (zjj2wry)
- update k8s dependency to stable version #230 (wackxu)
- let tfJob image configurable #228 (zjj2wry)
- remove todo, add gitSHA into version information #227 (zjj2wry)
- controller.go: Fix a print error #226 (gaocegege)
- replace tf-job-operator-config configmap when it already exists #225 (zjj2wry)
- Add the vendor directory to the repository. #222 (zjj2wry)
- allow using WORKER:0 as chief #221 (lluunn)
- Fix issue with handling of json errors. #220 (jlewi)
- Set state to failed if there is a problem initializing job #219 (jlewi)
- On GKE mounting volumes should no longer be required for GPUs. #217 (jlewi)
- update developer guide #216 (ddysher)
- Refactor the TfJob to use K8s libraries #215 (wackxu)
- Add a basic GPU job test as part of our E2E tests. #213 (jlewi)
- minor spelling porxy => proxy #211 (cbockman)
- Add terminationPolicy to TfJobSpec #204 (lluunn)
- Split cloning the repo and building the images into two steps in our airflow pipeline #200 (jlewi)
- Create separate commands to clone and build the repo #199 (jlewi)
- Install yarn and nodejs inside the Airflow container. #198 (jlewi)
- Update the Airflow deployment to use Docker images built from a clean tree #197 (jlewi)
- Fix some cuda issues on Azure #194 (wbuchwalter)
- Fixing front page documentation to have grpcServerFilePath #190 (hyperbolic2346)
- Add an option to build Docker images with GCB. #187 (jlewi)
- replace deprecated tf.initialize_all_variables #184 (DjangoPeng)
- build_and_push.py: Support python3 #183 (gaocegege)
- tf_job_design_doc: Fix the apiVersion #182 (gaocegege)
- py: Add requirements.txt #180 (gaocegege)
- resolve a merge conflict imported by commit ae8c31 #178 (DjangoPeng)
- tf_job_design_doc.md: Fix a typo #177 (gaocegege)
- Fix helm templates so that we don't require a configmap. #176 (jlewi)
- replace Google and Golang repos with corresponding github repos #172 (DjangoPeng)
- Stop hardcoding namespace for TfJob config map #169 (haitch)
- Tooling to make it easier to run a bunch of TfJob tests. #168 (jlewi)
- Run python lint and unittests as part of our E2E test pipeline #166 (jlewi)
- A binary to run pylint and python unittests #163 (jlewi)
- fix dev guide #162 (lluunn)
- Integrate Airflow with Prow #158 (jlewi)
- rename jlewi/mlkube.io in glide.yaml #153 (moon03432)
- add Create(), Delete() in TfJobClient interface #152 (moon03432)
- change jobname from task-runtimeid-index to jobname-task-runtimeid-index #151 (moon03432)
- Create binaries to run steps in an E2E test pipeline. #148 (jlewi)
- Fix a typo in the command line help. #147 (jlewi)
- ignore too-many-locals. #146 (jlewi)
- On RBAC clusters, test needs a service account with appropriate permissions #145 (jlewi)
- Airflow pipeline to run our tests #144 (jlewi)
- fix(*): amend the number of worker and ps in example yaml spec for a distributed job #142 (lienhua34)
- fix a log issue #141 (moon03432)
- rename clus to tfjob in controller.go #138 (moon03432)
- rename InClusterConfig() to GetClusterConfig() #137 (moon03432)
- Remove trailing slash of host #134 (ScorpioCPH)
- Turn release.py into a binary to build the artifacts for all the different contexts #133 (jlewi)
- Minor fix typo and redundancy #131 (ScorpioCPH)
- Update developer_guide.md #129 (Jimexist)
- Use K8s Garbage Collection #127 (jlewi)
- Dashboard V1 #125 (wbuchwalter)
- More verbose logging of resource deletion #124 (jlewi)
- Fix rbac settings in chart. #123 (jlewi)
- Fix issue in tpr_util.Delete() #121 (wbuchwalter)
- Tag docker images with "latest". #119 (jlewi)
- Update API group in the chart #117 (sozercan)
- Helm instructions #111 (jlewi)
- Name label #105 (jlewi)
- Update helm install syntax in readme #104 (sozercan)
- Change group to tensorflow.org and version to v1alpha1. #103 (jlewi)
- [WIP] Notebook demonstrating use of TfJob on GKE #102 (jlewi)
- Fix bugs in the release script. #100 (jlewi)
- Fix bugs in the release script. #99 (jlewi)
- Update release.py so we can run it continuously. #98 (jlewi)
- Fix the E2E test by specifying cloud when deploying the helm package. #97 (jlewi)
- Need to set environment to enable Estimators with TF <=1.3 #94 (jlewi)
- Update README.md #92 (Jimexist)
- Add python lint check to travis and fix python lint issues #91 (jlewi)
- #71 Simplify accelerators config #90 (wbuchwalter)
- Update test infrastructure to use repo tensorflow/k8s #87 (jlewi)
- Create symbolic links in GCS to output of presubmit results. #79 (jlewi)
- Fix periodic results (#76) #78 (jlewi)
- Another attempt to fix periodic jobs. #77 (jlewi)
- Fix location of the post submit results. #74 (jlewi)
- Overhaul the documentation #73 (jlewi)
- Release scripts #69 (jlewi)
- Record latest green from postsubmit #66 (jlewi)
- Fix presubmit jobs and periodic jobs #60 (jlewi)
- Fix periodic test #56 (jlewi)
- Updated chart with batch.jobs and extensions.deployments cluster roles #52 (sozercan)
- Added RBAC support for tf-operator chart #51 (sozercan)
- PR to test Prow presubmit integration. #50 (jlewi)
- E2E test for the CRD #49 (jlewi)
- Create configs for setting up Prow for continuous testing. #47 (jlewi)
- Fix bug that prevents permanent errors from causing job failure. #44 (jlewi)
- Always check for existing TfJobs and instantiate controllers for them. #43 (jlewi)
- support multi namespaces #39 (loadwiki)
- Use Jinja templates and a Python script to build example Docker images for examples #37 (jlewi)
- Parameter Server: Run TF server by default #36 (wbuchwalter)
- Set default values for Replicas, TfPort, TfReplicaType. #31 (jlewi)
- Fix a couple bugs. #27 (jlewi)
- [WIP] Update to CustomResourceDefinition instead of ThirdPartyResource. #20 (jlewi)
- Update glide config. #18 (jlewi)
- Add TensorBoard Integration #15 (wbuchwalter)
- Changes to support CI using Travis. #14 (jlewi)
- Add Environment Variables in Controller Config #12 (wbuchwalter)
- Fix tests #11 (wbuchwalter)
- Helm charts renaming #10 (wbuchwalter)
- Simplify GPU configuration process. #9 (jlewi)
- Fix build, add Glide for dependency management. #8 (wbuchwalter)
- Update links in README.md #3 (wbuchwalter)
- A more thorough E2E test. #2 (jlewi)
* This Changelog was automatically generated by github_changelog_generator