diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 89c92db07e..60384d8dc4 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -1,4 +1,10 @@
-# Data Pipelines
+---
+title: 'Get Started: Data Pipelines'
+description: 'Learn how to build and use DVC pipelines to capture, organize,
+version, and reproduce your data science and machine learning workflows.'
+---
+
+# Get Started: Data Pipelines
 
 Versioning large data files and directories for data science is great, but not
 enough. How is data filtered, transformed, or used to train ML models? DVC
@@ -6,8 +12,10 @@ introduces a mechanism to capture _data pipelines_ — series of data processes
 that produce a final result.
 
 DVC pipelines and their data can also be easily versioned (using Git). This
-allows you to better organize your project, and reproduce your workflow and
-results later exactly as they were built originally!
+allows you to better organize projects, and reproduce your workflow and results
+later — exactly as they were built originally! For example, you could capture a
+simple ETL workflow, organize a data science project, or build a detailed
+machine learning pipeline.
 
 ## Pipeline stages
 
@@ -299,7 +307,7 @@ important problems:
   and which commands will generate the pipeline results (such as an ML model).
   Storing these files in Git makes it easy to version and share.
 - _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing
-  projects in way that it can be reproduced (built) is the fist necessary step
+  projects in a way that they can be reproduced (built) is the first necessary step
   before introducing CI/CD systems. See our sister project,
   [CML](https://cml.dev/) for some examples.
 
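
For context on the _Reproducibility_ bullet in the last hunk: the `dvc.yaml` files it mentions describe each pipeline stage as a command plus its dependencies and outputs. As a rough sketch (the stage name, script, and data paths below are illustrative, not taken from this page), a minimal `dvc.yaml` might look like:

```yaml
# dvc.yaml: a minimal, hypothetical single-stage pipeline
stages:
  prepare:                     # stage name (illustrative)
    cmd: python src/prepare.py data/data.xml  # command DVC runs for this stage
    deps:                      # inputs the stage depends on
      - src/prepare.py
      - data/data.xml
    outs:                      # outputs DVC caches and versions
      - data/prepared
```

Running `dvc repro` executes the stage and records its results in `dvc.lock`; committing both files to Git is what makes the pipeline versionable and shareable, as the bullet describes.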