-
Notifications
You must be signed in to change notification settings - Fork 716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overhaul the documentation #73
Conversation
* Create a separate design doc, tf_job_design_doc.md that more fully explains the design and motivation. * Create a separate developer guide with information about building the repository and other information specific to developers/contributors. * Rewrite README.md to focus on information relevant to users of TfJob. Add a lot more information about how to monitor jobs as well as information explaining how it works.
@wbuchwalter PTAL |
Resolve dependencies (if you don't have glide install, check how to do it [here](https://github.com/Masterminds/glide/blob/master/README.md#install)) | ||
|
||
``` | ||
glide install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move the fix for the vendoring issue here?
Otherwise go install
will fail and users might spend time trying to debug before reading further.
README.md
Outdated
We expect | ||
helm or ksonnet could be used to add syntactic sugar to create more convenient APIs for users not familiar | ||
with Kubernetes. | ||
Custom Resources require Kubernetes 1.7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: require Kubernetes >= 1.7
README.md
Outdated
|
||
Logging still needs work. | ||
## Staging code | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be adressed in the README?
I think it falls outside the scope of this project, so maybe remove it from the README to get a clearer document.
Also while this alternatives are definitely doable, I am not sure they should really be encouraged, putting that in the README (especially if this falls under the tensorflow/ org) almost encourage people to do it this way.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I removed it.
I do think there's value in documenting how to setup K8s clusters to enable a team of datascientists/ml researchers to be successful. For that I do think access to a shared filesystem both locally and on the K8s cluster would be very valuable. I'll punt on that for now.
README.md
Outdated
|
||
``` | ||
glide install | ||
${REPLICA-TYPE}-${RUNTIME_ID}-${INDEX}-${RADNOM} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ${RANDOM}
README.md
Outdated
then your logs may also be shipped to an appropriate data store for | ||
further analysis. | ||
|
||
#### GKE | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is not something specific to TfJob, is it worth mentionning?
I'm guessing this is already documented on K8s or Stackdriver doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think users will want to know how to get the relevant logs for a TFJob; i.e. how to get the logs for the master? worker i? etc...
So I think the docs should explain that for relevant clouds (feel free to put together a PR with the same for Azure).
Unfortunately, the story on GCP leaves something to be desired because you can't filter by pod labels currently. Eventually once that's fixed and #72 is fixed users will be to just filter by various pod labels and I will update the docs to explain relevant filter expressions for TfJob.
I think that's relevant to using TfJob effectively.
README.md
Outdated
* **Avoiding Breakages** | ||
* During Alpha there is no guarantees about TfJob API | ||
compaitibility. | ||
* To avoid being broken by changes you can pin to particular |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: you can pin to a particular
Looks good, love the design doc, more project should have one. |
Review status: 0 of 4 files reviewed at latest revision, 6 unresolved discussions. developer_guide.md, line 16 at r2 (raw file): Previously, wbuchwalter (William Buchwalter) wrote…
Done. README.md, line 36 at r2 (raw file): Previously, wbuchwalter (William Buchwalter) wrote…
Done. README.md, line 64 at r2 (raw file): Previously, wbuchwalter (William Buchwalter) wrote…
Done. README.md, line 339 at r2 (raw file): Previously, jlewi (Jeremy Lewi) wrote…
Done. README.md, line 496 at r2 (raw file): Previously, wbuchwalter (William Buchwalter) wrote…
Done. Comments from Reviewable |
Reviewed 2 of 4 files at r1, 2 of 2 files at r3. Comments from Reviewable |
Going to take that as an lgtm. |
Create a separate design doc, tf_job_design_doc.md that more fully
explains the design and motivation.
Create a separate developer guide with information about building the
repository and other information specific to developers/contributors.
Rewrite README.md to focus on information relevant to users of TfJob.
Add a lot more information about how to monitor jobs as well as information
explaining how it works.
This change is