-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update data-pipelines.md #1645
Update data-pipelines.md #1645
Conversation
Saw a typo and got a bit carried away. I'm a big DVC fan! π« π¨βπ»
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @noahshpak we appreciate your interest! I see that typo in the 3rd bullet, thanks for updating the text. But I do have some followup comments (below, one per bullet). Please lmk if you don't have the capacity to deal with this and we'll take it over. Best
content/docs/start/data-pipelines.md
Outdated
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing | ||
project in way that it can be reproduced (built) is the fist necessary step | ||
before introducing CI/CD systems. | ||
reproducible ML pipelines (builds) facilitates CI/CD systems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely better. But not sure about "(builds)" What builds? Builds what? "builds" the plural noun? No need to summarize so much that it's ambiguous, let's try to be as explicit as necessary. Same with "facilitates": how does it facilitate them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback! I agree. Hopefully my additions are more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A followup here, and let me commit my other changes... I may have more follow-ups β³
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more changes I'll apply:
(Planing to merge the Restyled PR.) |
content/docs/start/data-pipelines.md
Outdated
model). Storing these | ||
files in Git makes it easy to version and share. | ||
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - reproducible | ||
ML pipelines allow CI/CD systems to retrain models on fresh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this describes a very specific use case. CI/CD can include running the last stage for example, full retraining, just training remotely instead of local ... etc
content/docs/start/data-pipelines.md
Outdated
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - reproducible | ||
ML pipelines allow CI/CD systems to retrain models on fresh | ||
datasets with identical training, save the results, and even produce reports | ||
about the whole process. See [CML.dev](https://cml.dev/) for some examples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this @shcheklein ?
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - reproducible | |
ML pipelines allow CI/CD systems to retrain models on fresh | |
datasets with identical training, save the results, and even produce reports | |
about the whole process. See [CML.dev](https://cml.dev/) for some examples. | |
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - reproducible | |
ML pipelines allow CI/CD systems to reproduce all or part of the process (e.g. | |
model training) on production datasets. Using DVC, they can then save the | |
results, and even produce reports about the whole process. See | |
[CML.dev](https://cml.dev/) for some examples. |
Feel free to edit/commit it and merge the Restyled PR if you agree π or lmk and I'll do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it still describes one very specific case "reproduce all or part of the process ... on production datasets"
let's go back a bit, gusy. What was the problem with initial version? So, that I understand how to help you solve it.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, let's just fix the typo then? if the problem is not clear. Changing language a bit is fine either if it sounds/reads better, but let's not change the essence if there is no clear understanding what exactly are we fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I meant. The typo is fixed in the suggestion I left. We just need to commit it. Allow me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me the original version echos reproducibility rather than pipeline flexibility. The piece that matters for CI/CD is DVC can figure out when something important has changed (data, code, etc) and re-run a pipeline correctly. Like what @jorgeorpinel said "pipelines allow CI/CD systems reproduce all or part of the process."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Noah, I'll chat with Ivan and continue this elsewhere if needed, since this has been merged. But feel free to contribute another PR if you prefer π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! Again, thanks for all of the feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. I agree with the pipelines allow CI/CD systems reproduce all or part of the process
part. on production datasets
and mentioning train
make it specific.
The piece that matters for CI/CD is DVC can figure out when something important has changed (data, code, etc) and re-run a pipeline correctly
not sure this it true. Since it's already committed it can force dvc repro
for example to guarantee that external system gets the same result. There are other examples when we can "abuse" CI/CD to do the training first time- we don't change anything per-se - everything comes with a commit, system runs it first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think CML.dev link still make a lot of sense.
- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing | ||
project in way that it can be reproduced (built) is the fist necessary step | ||
projects in way that it can be reproduced (built) is the fist necessary step |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it -> they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops I merged the REstyled PR #1647. I'll continue this elsewhere... β³
Thanks again @noahshpak ! |
Saw a typo and got a bit carried away. I'm a big DVC fan! π« π¨βπ»