Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul the documentation #73

Merged
merged 4 commits into from
Oct 24, 2017
Merged

Overhaul the documentation #73

merged 4 commits into from
Oct 24, 2017

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Oct 23, 2017

  • Create a separate design doc, tf_job_design_doc.md that more fully
    explains the design and motivation.

  • Create a separate developer guide with information about building the
    repository and other information specific to developers/contributors.

  • Rewrite README.md to focus on information relevant to users of TfJob.
    Add a lot more information about how to monitor jobs as well as information
    explaining how it works.


This change is Reviewable

* Create a separate design doc, tf_job_design_doc.md that more fully
  explains the design and motivation.
* Create a separate developer guide with information about building the
  repository and other information specific to developers/contributors.

* Rewrite README.md to focus on information relevant to users of TfJob.
  Add a lot more information about how to monitor jobs as well as information
  explaining how it works.
@jlewi jlewi requested a review from wbuchwalter October 23, 2017 01:03
@jlewi
Copy link
Contributor Author

jlewi commented Oct 23, 2017

@wbuchwalter PTAL

Resolve dependencies (if you don't have glide install, check how to do it [here](https://github.com/Masterminds/glide/blob/master/README.md#install))

```
glide install
Copy link
Contributor

@wbuchwalter wbuchwalter Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move the fix for the vendoring issue here?
Otherwise go install will fail and users might spend time trying to debug before reading further.

README.md Outdated
We expect
helm or ksonnet could be used to add syntactic sugar to create more convenient APIs for users not familiar
with Kubernetes.
Custom Resources require Kubernetes 1.7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: require Kubernetes >= 1.7

README.md Outdated

Logging still needs work.
## Staging code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be adressed in the README?
I think it falls outside the scope of this project, so maybe remove it from the README to get a clearer document.

Also while this alternatives are definitely doable, I am not sure they should really be encouraged, putting that in the README (especially if this falls under the tensorflow/ org) almost encourage people to do it this way.
What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I removed it.

I do think there's value in documenting how to setup K8s clusters to enable a team of datascientists/ml researchers to be successful. For that I do think access to a shared filesystem both locally and on the K8s cluster would be very valuable. I'll punt on that for now.

README.md Outdated

```
glide install
${REPLICA-TYPE}-${RUNTIME_ID}-${INDEX}-${RADNOM}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ${RANDOM}

README.md Outdated
then your logs may also be shipped to an appropriate data store for
further analysis.

#### GKE

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is not something specific to TfJob, is it worth mentionning?
I'm guessing this is already documented on K8s or Stackdriver doc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think users will want to know how to get the relevant logs for a TFJob; i.e. how to get the logs for the master? worker i? etc...

So I think the docs should explain that for relevant clouds (feel free to put together a PR with the same for Azure).

Unfortunately, the story on GCP leaves something to be desired because you can't filter by pod labels currently. Eventually once that's fixed and #72 is fixed users will be to just filter by various pod labels and I will update the docs to explain relevant filter expressions for TfJob.

I think that's relevant to using TfJob effectively.

README.md Outdated
* **Avoiding Breakages**
* During Alpha there is no guarantees about TfJob API
compaitibility.
* To avoid being broken by changes you can pin to particular
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you can pin to a particular

@wbuchwalter
Copy link
Contributor

Looks good, love the design doc, more project should have one.
I am travelling again, but I'll try to complete the doc with Azure instructions this week, as well as refactor the helm chart.

@jlewi
Copy link
Contributor Author

jlewi commented Oct 24, 2017

Review status: 0 of 4 files reviewed at latest revision, 6 unresolved discussions.


developer_guide.md, line 16 at r2 (raw file):

Previously, wbuchwalter (William Buchwalter) wrote…

Could you move the fix for the vendoring issue here?
Otherwise go install will fail and users might spend time trying to debug before reading further.

Done.


README.md, line 36 at r2 (raw file):

Previously, wbuchwalter (William Buchwalter) wrote…

Nit: require Kubernetes >= 1.7

Done.


README.md, line 64 at r2 (raw file):

Previously, wbuchwalter (William Buchwalter) wrote…

Nit: you can pin to a particular

Done.


README.md, line 339 at r2 (raw file):

Previously, jlewi (Jeremy Lewi) wrote…

Good point. I removed it.

I do think there's value in documenting how to setup K8s clusters to enable a team of datascientists/ml researchers to be successful. For that I do think access to a shared filesystem both locally and on the K8s cluster would be very valuable. I'll punt on that for now.

Done.


README.md, line 496 at r2 (raw file):

Previously, wbuchwalter (William Buchwalter) wrote…

Nit: ${RANDOM}

Done.


Comments from Reviewable

@wbuchwalter
Copy link
Contributor

Reviewed 2 of 4 files at r1, 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 6 unresolved discussions.


Comments from Reviewable

@jlewi
Copy link
Contributor Author

jlewi commented Oct 24, 2017

Going to take that as an lgtm.

@jlewi jlewi merged commit 4887cfd into master Oct 24, 2017
@jhseu jhseu deleted the design_doc branch November 3, 2017 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants