From 55c081049f5cfd54216a5f60e6d32ebc7ba94b6a Mon Sep 17 00:00:00 2001 From: kurianbenoy Date: Fri, 5 Jun 2020 13:37:01 +0530 Subject: [PATCH 1/9] remove instances of simply --- .../2017-08-23-ml-model-ensembling-with-fast-iterations.md | 2 +- content/blog/2019-03-05-march-19-dvc-heartbeat.md | 2 +- content/blog/2019-05-21-may-19-dvc-heartbeat.md | 2 +- content/blog/2020-02-17-a-public-reddit-dataset.md | 4 ++-- content/blog/2020-04-16-april-20-community-gems.md | 2 +- content/docs/command-reference/install.md | 2 +- content/docs/command-reference/remote/add.md | 2 +- content/docs/tutorials/get-started/data-access.md | 2 +- content/docs/user-guide/external-dependencies.md | 2 +- content/docs/user-guide/managing-external-data.md | 2 +- src/utils/shared/expiration.js | 2 +- 11 files changed, 12 insertions(+), 12 deletions(-) diff --git a/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md b/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md index 9b9082d5c5..9883526928 100644 --- a/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md +++ b/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md @@ -197,7 +197,7 @@ process for this project `gist:gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644#dvc.bat` If you then further edit ensemble configuration setup in `code/config.R`, you -can simply leverage the power of DVC as for automatic dependencies resolving and +can leverage the power of DVC as for automatic dependencies resolving and tracking to rebuild the new ensemble prediction as follows `gist:gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee` diff --git a/content/blog/2019-03-05-march-19-dvc-heartbeat.md b/content/blog/2019-03-05-march-19-dvc-heartbeat.md index 1b839f2225..f9e1ce1e28 100644 --- a/content/blog/2019-03-05-march-19-dvc-heartbeat.md +++ b/content/blog/2019-03-05-march-19-dvc-heartbeat.md @@ -144,7 +144,7 @@ liking and see your data files listed there. 
### Q: [Managing data and pipelines with DVC on HDFS](https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426) With DVC, you could connect your data sources from HDFS with your pipeline in -your local project, by simply specifying it as an external dependency. For +your local project, by specifying it as an external dependency. For example let’s say your script `process.cmd` works on an input file on HDFS and then downloads a result to your local workspace, then with DVC it could look something like: diff --git a/content/blog/2019-05-21-may-19-dvc-heartbeat.md b/content/blog/2019-05-21-may-19-dvc-heartbeat.md index da6653ce14..8413daa702 100644 --- a/content/blog/2019-05-21-may-19-dvc-heartbeat.md +++ b/content/blog/2019-05-21-may-19-dvc-heartbeat.md @@ -256,7 +256,7 @@ $ dvc metrics show metrics.json \ There are a few options to add a new dependency: -- simply opening a file with your favorite editor and adding a dependency there +- opening a file with your favorite editor and adding a dependency there without md5. DVC will understand that that stage is changed and will re-run and re-calculate md5 checksums during the next DVC repro; diff --git a/content/blog/2020-02-17-a-public-reddit-dataset.md b/content/blog/2020-02-17-a-public-reddit-dataset.md index b7bb152bd4..7ef70e0db7 100644 --- a/content/blog/2020-02-17-a-public-reddit-dataset.md +++ b/content/blog/2020-02-17-a-public-reddit-dataset.md @@ -110,7 +110,7 @@ you'll need to [install DVC](https://dvc.org/doc/install); one of the simplest ways is `pip install dvc`. Say you have a directory on your local machine where you plan to build some -analysis scripts. Simply run +analysis scripts. 
You can run the command: ```dvc $ dvc get https://github.com/iterative/aita_dataset \ @@ -225,7 +225,7 @@ $ dvc import https://github.com/iterative/aita_dataset \ ``` Then, because the dataset in your workspace is linked to our dataset repository, -you can update it by simply running: +you can update it by running: ```dvc $ dvc update aita_clean.csv diff --git a/content/blog/2020-04-16-april-20-community-gems.md b/content/blog/2020-04-16-april-20-community-gems.md index 64376744a2..3f208bf360 100644 --- a/content/blog/2020-04-16-april-20-community-gems.md +++ b/content/blog/2020-04-16-april-20-community-gems.md @@ -106,7 +106,7 @@ $ dvc pull process_data_stage.dvc You can also use `dvc pull` at the level of individual files. This might be needed if your DVC pipeline file creates 10 outputs, for example, and you only want to pull one (say, `model.pkl`, your trained model) from remote DVC storage. -You'd simply run +You'd just run ```dvc $ dvc pull model.pkl diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index d554b12662..9eabf54e5e 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -262,7 +262,7 @@ matching what is referenced by the DVC-files. To follow this example, start with the same workspace as before, making sure it is not in a _detached HEAD_ state by running `git checkout master`. -If we simply edit one of the code files: +We can edit one of the code files: ```dvc $ vi src/featurization.py diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 04d9cce860..5f209a3455 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -197,7 +197,7 @@ $ dvc remote add -d myremote "azure://" To start using a GDrive remote, fist add it with a [valid URL format](/doc/user-guide/setup-google-drive-remote#url-format). 
Then -simply use any DVC command that needs it (e.g. `dvc pull`, `dvc fetch`, +use any DVC command that needs it (e.g. `dvc pull`, `dvc fetch`, `dvc push`), and follow the instructions to connect your Google Drive with DVC. For example: diff --git a/content/docs/tutorials/get-started/data-access.md b/content/docs/tutorials/get-started/data-access.md index 9416e69a1c..25500426a7 100644 --- a/content/docs/tutorials/get-started/data-access.md +++ b/content/docs/tutorials/get-started/data-access.md @@ -27,7 +27,7 @@ includes files and directories tracked by **both Git and DVC**. ## Just download it -One way is to simply download the data with `dvc get`. This is useful when +One way is to download the data with `dvc get`. This is useful when working outside of a DVC project environment, for example in an automated ML model deployment task: diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index fbd42f231f..82c26d22fe 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -35,7 +35,7 @@ directory. ## Examples As examples, let's take a look at a [stage](/doc/command-reference/run) that -simply moves a local file from an external location, producing a `data.txt.dvc` +moves a local file from an external location, producing a `data.txt.dvc` stage file (DVC-file). > Note that some of these commands use the `/home/shared` directory, typical in diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 994640eadc..e45ad57b17 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -43,7 +43,7 @@ in the same external/remote file system first. 
## Examples For the examples, let's take a look at a [stage](/doc/command-reference/run) -that simply moves local file to an external location, producing a `data.txt.dvc` +that moves local file to an external location, producing a `data.txt.dvc` DVC-file. ### Local file system path diff --git a/src/utils/shared/expiration.js b/src/utils/shared/expiration.js index 7e426ea7b3..5b882008f7 100644 --- a/src/utils/shared/expiration.js +++ b/src/utils/shared/expiration.js @@ -17,7 +17,7 @@ function getExpirationDate({ date, expires }) { /* This is the primary logic to check if a date is expired, - It simply uses Moment to parse a date input and comparse that to the current + It uses Moment to parse a date input and comparse that to the current time. Use this on the result of getExpirationDate to get both pieces of information. From 093bf5e91098972fd561e592d1c43e36c0af0995 Mon Sep 17 00:00:00 2001 From: kurianbenoy Date: Fri, 5 Jun 2020 13:48:23 +0530 Subject: [PATCH 2/9] remove instances of word just --- content/docs/api-reference/open.md | 4 ++-- content/docs/command-reference/checkout.md | 2 +- content/docs/command-reference/commit.md | 4 ++-- content/docs/command-reference/import-url.md | 6 +++--- content/docs/command-reference/list.md | 2 +- content/docs/command-reference/metrics/show.md | 2 +- content/docs/command-reference/pull.md | 8 ++++---- content/docs/command-reference/push.md | 6 +++--- content/docs/command-reference/status.md | 2 +- content/docs/command-reference/update.md | 2 +- content/docs/tutorials/get-started/data-access.md | 2 +- content/docs/tutorials/get-started/data-pipelines.md | 4 ++-- content/docs/tutorials/get-started/data-versioning.md | 2 +- content/docs/tutorials/get-started/experiments.md | 2 +- content/docs/tutorials/pipelines.md | 4 ++-- 15 files changed, 26 insertions(+), 26 deletions(-) diff --git a/content/docs/api-reference/open.md b/content/docs/api-reference/open.md index cddae26836..50d71540bb 100644 --- 
a/content/docs/api-reference/open.md +++ b/content/docs/api-reference/open.md @@ -113,7 +113,7 @@ should handle the event-driven parsing of the document in this case.) This increases the performance of the code (minimizing memory usage), and is typically faster than loading the whole data into memory. -> If you just needed to load the complete file contents into memory, you can use +> If you want to load the complete file contents into memory, you can use > `dvc.api.read()` instead: > > ```py @@ -127,7 +127,7 @@ typically faster than loading the whole data into memory. ## Example: Accessing private repos -This is just a matter of using the right `repo` argument, for example an SSH URL +This is a matter of using the right `repo` argument, for example an SSH URL (requires that the [credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh) locally): diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md index e0be68bc46..4a29ad5eaf 100644 --- a/content/docs/command-reference/checkout.md +++ b/content/docs/command-reference/checkout.md @@ -151,7 +151,7 @@ baseline-experiment <- First simple version of the model bigrams-experiment <- Uses bigrams to improve the model ``` -We can now just run `dvc checkout` that will update the most recent `model.pkl`, +We can now run `dvc checkout` that will update the most recent `model.pkl`, `data.xml`, and other files that are tracked by DVC. The model file hash `662eb7f64216d9c2c1088d0a5e2c6951` will be used in the `train.dvc` [stage file](/doc/command-reference/run): diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index 213e931c9f..42b944db14 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -44,7 +44,7 @@ further detailed below. other change that doesn't cause changed stage outputs. 
However, DVC will notice that some dependencies and have changed, and expect you to reproduce the whole pipeline. If you're sure no pipeline results would change, - just use `dvc commit` to force update the related DVC-files and cache. + use `dvc commit` to force update the related DVC-files and cache. Let's take a look at what is happening in the first scenario closely. Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the @@ -280,4 +280,4 @@ Data and pipelines are up to date. ``` Instead of reproducing the pipeline for changes that do not produce different -results, just use `commit` on both Git and DVC. +results, use `commit` on both Git and DVC. diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index ab5894f4d0..5b011e7632 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -29,7 +29,7 @@ external data source changes. Example scenarios: - A shared dataset on a remote storage that is managed and updated outside DVC. > Note that `dvc get-url` corresponds to the first step this command performs -> (just download the file or directory). +> (downloads the file or directory). The `dvc import-url` command helps the user create such an external data dependency without having to manually copying files from the supported remote @@ -78,7 +78,7 @@ Specific explanations: is necessary to track if the specified remote file (URL) changed to download it again. -- `remote://myremote/path/to/file` notation just means that a DVC +- `remote://myremote/path/to/file` notation means that a DVC [remote](/doc/command-reference/remote) `myremote` is defined and when DVC is running. DVC automatically expands this URL into a regular S3, SSH, GS, etc URL by appending `/path/to/file` to the `myremote`'s configured base path. 
@@ -146,7 +146,7 @@ $ git checkout 2-remote $ mkdir data ``` -You should now have a blank workspace, just before +You should now have a blank workspace, before [Versioning Basics](/doc/tutorials/get-started/data-versioning). diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index 7464ed557b..1f74a08497 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -19,7 +19,7 @@ positional arguments: DVC, by effectively replacing data files, models, directories with DVC-files (`.dvc`), hides actual locations and names. This means that you don't see data files when you browse a DVC repository on Git hosting (e.g. -Github), you just see the DVC-files. This makes it hard to navigate the project +Github), you see the DVC-files. This makes it hard to navigate the project to find data artifacts for use with `dvc get`, `dvc import`, or `dvc.api`. diff --git a/content/docs/command-reference/metrics/show.md b/content/docs/command-reference/metrics/show.md index 398a5d4ee4..8bece891e5 100644 --- a/content/docs/command-reference/metrics/show.md +++ b/content/docs/command-reference/metrics/show.md @@ -32,7 +32,7 @@ compares them with a previous version. ## Options - `-a`, `--all-branches` - print metric file contents in all Git branches - instead of just those present in the current workspace. It can be used to + instead of using those present in the current workspace. It can be used to compare different experiments. Note that this can be combined with `-T` below, for example using the `-aT` flag. diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 4905677898..31023dd114 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -35,7 +35,7 @@ The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used. See `dvc remote` for more information on how to configure a remote. 
-With no arguments, just `dvc pull` or `dvc pull --remote `, it downloads +With no arguments, use `dvc pull` or `dvc pull --remote `, it downloads only the files (or directories) missing from the workspace by searching all [DVC-files](/doc/user-guide/dvc-file-format) currently in the project. It will not download files associated with earlier commits @@ -59,7 +59,7 @@ reflinks or hardlinks to put it in the workspace without copying. See ## Options - `-a`, `--all-branches` - determines the files to download by examining - DVC-files in all Git branches instead of just those present in the current + DVC-files in all Git branches instead of those present in the current workspace. It's useful if branches are used to track experiments or project checkpoints. Note that this can be combined with `-T` below, for example using the `-aT` flag. @@ -94,7 +94,7 @@ reflinks or hardlinks to put it in the workspace without copying. See - `-j `, `--jobs ` - number of threads to run simultaneously to handle the downloading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs + `4 * cpu_count()`. For SSH remotes, the default value is `4`. Using more jobs may improve the total download speed if a combination of small and large files are being fetched. @@ -136,7 +136,7 @@ The workspace looks almost like in this └── train.dvc ``` -We can now just run `dvc pull` to download the most recent `data/data.xml`, +We can now run `dvc pull` to download the most recent `data/data.xml`, `model.pkl`, and other DVC-tracked files into the workspace: ```dvc diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 65f41494e4..e5a0be54c7 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -54,7 +54,7 @@ none are specified on the command line nor in the configuration. 
The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used. See `dvc remote` for more information on how to configure a remote. -With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads +With no arguments, `dvc push` or `dvc push --remote REMOTE`, it uploads only the files (or directories) that are new in the local repository to remote storage. It will not upload files associated with earlier commits in the repository (if using Git), nor will it upload files that have not @@ -73,7 +73,7 @@ to push. ## Options - `-a`, `--all-branches` - determines the files to upload by examining DVC-files - in all Git branches instead of just those present in the current workspace. + in all Git branches instead of using files present in the current workspace. It's useful if branches are used to track experiments or project checkpoints. Note that this can be combined with `-T` below, for example using the `-aT` flag. @@ -103,7 +103,7 @@ to push. - `-j `, `--jobs ` - number of threads to run simultaneously to handle the uploading of files from the remote. The default value is - `4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs + `4 * cpu_count()`. For SSH remotes, the default value is `4`. Using more jobs may improve the total download speed if a combination of small and large files are being fetched. diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 8f8a3d6d98..9e6257ed9f 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -107,7 +107,7 @@ workspace) is different from remote storage. Bringing the two into sync requires (specified in the `core.remote` config option). - `-a`, `--all-branches` - compares cache content against all Git branches - instead of just the current workspace. This basically runs the same status + instead of the current workspace. 
This basically runs the same status command in every branch of this repo. The corresponding branches are shown in the status output. Applies only if `--cloud` or a `-r` remote is specified. Note that this can be combined with `-T` below, for example using the `-aT` diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 4c6d359d99..30d00349c7 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -71,7 +71,7 @@ As DVC mentions, the import stage (DVC-file) `model.pkl.dvc` is created. This [stage file](/doc/command-reference/run) is frozen by default though, so to [reproduce](/doc/command-reference/repro) it, we would need to run `dvc unfreeze` on it first, then `dvc repro` (and `dvc freeze` again). Let's -just run `dvc update` on it instead: +run `dvc update` on it instead: ```dvc $ dvc update model.pkl.dvc diff --git a/content/docs/tutorials/get-started/data-access.md b/content/docs/tutorials/get-started/data-access.md index 25500426a7..c63c00aac1 100644 --- a/content/docs/tutorials/get-started/data-access.md +++ b/content/docs/tutorials/get-started/data-access.md @@ -25,7 +25,7 @@ cats-dogs.dvc The benefit of this command over browsing a Git hosting website is that the list includes files and directories tracked by **both Git and DVC**. -## Just download it +## Download it One way is to download the data with `dvc get`. This is useful when working outside of a DVC project environment, for example in an diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md index 977b7978fd..cf1537e68b 100644 --- a/content/docs/tutorials/get-started/data-pipelines.md +++ b/content/docs/tutorials/get-started/data-pipelines.md @@ -163,7 +163,7 @@ This would be a good point to commit the changes with Git. 
This includes any ## Reproduce -Imagine you're just cloning the repository created so far, in +Imagine you're cloning the repository created so far, in another computer. It's extremely easy for anyone to reproduce the result end-to-end, by using `dvc repro`. @@ -198,7 +198,7 @@ executes the necessary commands to rebuild all the pipeline ## Visualize Having built our pipeline, we need a good way to understand its structure. -Seeing a graph of connected stage files would help. DVC lets you do just that, +Seeing a graph of connected stage files would help. DVC lets you do that, without leaving the terminal! ```dvc diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md index 650beb3787..282b22d6bf 100644 --- a/content/docs/tutorials/get-started/data-versioning.md +++ b/content/docs/tutorials/get-started/data-versioning.md @@ -228,7 +228,7 @@ after `git clone` and `git pull`. ### 👉 Expand to simulate a fresh clone of this repo -Let's just remove the directory added so far, both from workspace +Let's remove the directory added so far, both from workspace and cache: ```dvc diff --git a/content/docs/tutorials/get-started/experiments.md b/content/docs/tutorials/get-started/experiments.md index 6035d19d2d..1f1fdcfae5 100644 --- a/content/docs/tutorials/get-started/experiments.md +++ b/content/docs/tutorials/get-started/experiments.md @@ -139,7 +139,7 @@ back and forth. To find the best-performing experiment or track the progress, described in one of the previous sections). Let's run evaluate for the latest `bigrams` experiment we created earlier. 
It -mostly takes just running the `dvc repro`: +mostly takes running the `dvc repro`: ```dvc $ git checkout master diff --git a/content/docs/tutorials/pipelines.md b/content/docs/tutorials/pipelines.md index 5283c3ea2c..dc7f50958f 100644 --- a/content/docs/tutorials/pipelines.md +++ b/content/docs/tutorials/pipelines.md @@ -183,7 +183,7 @@ outs: persist: false ``` -Just like the DVC-file we created earlier with `dvc add`, this stage file uses +Like the DVC-file we created earlier with `dvc add`, this stage file uses `md5` hashes (that point to the cache) to describe and version control dependencies and outputs. Output `data/Posts.xml` file is saved as `.dvc/cache/a3/04afb96060aad90176268345e10355` and linked (or copied) to the @@ -331,7 +331,7 @@ $ dvc metrics show It's time to save our [pipeline](/doc/command-reference/pipeline). You can confirm that we do not tack files or raw datasets with Git, by using the -`git status` command. We are just saving a snapshot of the DVC-files that +`git status` command. We are saving a snapshot of the DVC-files that describe data, transformations (stages), and relationships between them. ```dvc From bf8eb468e6573988896ff81c77c3716a5413008b Mon Sep 17 00:00:00 2001 From: Kurian Benoy Date: Mon, 8 Jun 2020 14:54:30 +0530 Subject: [PATCH 3/9] Update content/docs/api-reference/open.md Co-authored-by: Jorge Orpinel --- content/docs/api-reference/open.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/api-reference/open.md b/content/docs/api-reference/open.md index 50d71540bb..9ad4c1f1bd 100644 --- a/content/docs/api-reference/open.md +++ b/content/docs/api-reference/open.md @@ -127,7 +127,7 @@ typically faster than loading the whole data into memory. 
## Example: Accessing private repos -This is a matter of using the right `repo` argument, for example an SSH URL +The key for this is to use the right `repo` argument, for example an SSH URL (requires that the [credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh) locally): From 9c793063da046231f56df1df1bad40f948fc5554 Mon Sep 17 00:00:00 2001 From: kurianbenoy Date: Fri, 5 Jun 2020 15:51:02 +0530 Subject: [PATCH 4/9] remove instances of just --- content/docs/understanding-dvc/how-it-works.md | 2 +- content/docs/use-cases/shared-development-server.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/understanding-dvc/how-it-works.md b/content/docs/understanding-dvc/how-it-works.md index 732433fca5..4f4a6b28e5 100644 --- a/content/docs/understanding-dvc/how-it-works.md +++ b/content/docs/understanding-dvc/how-it-works.md @@ -84,7 +84,7 @@ $ cd myrepo $ git pull # download tracked data from remote storage $ dvc checkout # checkout data files - $ ls -l data/ # You just got gigabytes of data through Git and DVC: + $ ls -l data/ # You downloaded gigabytes of data through Git and DVC: total 1017488 -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv diff --git a/content/docs/use-cases/shared-development-server.md b/content/docs/use-cases/shared-development-server.md index 4131d5af6b..b808958eed 100644 --- a/content/docs/use-cases/shared-development-server.md +++ b/content/docs/use-cases/shared-development-server.md @@ -103,7 +103,7 @@ $ git commit -m "process clean data" $ git push ``` -And now you can just as easily make their work appear in your workspace with: +And now you can make their previous work appear in your workspace with: ```dvc $ git pull From eb642f9561e70554451e845288fbe7d3329e2348 Mon Sep 17 00:00:00 2001 From: kurianbenoy Date: Wed, 10 Jun 2020 08:33:26 +0530 Subject: [PATCH 5/9] modified changes in blog post --- 
...3-ml-model-ensembling-with-fast-iterations.md | 2 +- .../blog/2020-02-17-a-public-reddit-dataset.md | 16 ++++++++-------- .../blog/2020-04-16-april-20-community-gems.md | 2 +- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md b/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md index 9883526928..9b9082d5c5 100644 --- a/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md +++ b/content/blog/2017-08-23-ml-model-ensembling-with-fast-iterations.md @@ -197,7 +197,7 @@ process for this project `gist:gvyshnya/7f1b8262e3eb7a8b3c16dbfd8cf98644#dvc.bat` If you then further edit ensemble configuration setup in `code/config.R`, you -can leverage the power of DVC as for automatic dependencies resolving and +can simply leverage the power of DVC as for automatic dependencies resolving and tracking to rebuild the new ensemble prediction as follows `gist:gvyshnya/9d80e51ba3d7aa5bd37d100ed82376ee` diff --git a/content/blog/2020-02-17-a-public-reddit-dataset.md b/content/blog/2020-02-17-a-public-reddit-dataset.md index 7ef70e0db7..ac5acd5cb3 100644 --- a/content/blog/2020-02-17-a-public-reddit-dataset.md +++ b/content/blog/2020-02-17-a-public-reddit-dataset.md @@ -110,7 +110,7 @@ you'll need to [install DVC](https://dvc.org/doc/install); one of the simplest ways is `pip install dvc`. Say you have a directory on your local machine where you plan to build some -analysis scripts. You can run the command: +analysis scripts. You run: ```dvc $ dvc get https://github.com/iterative/aita_dataset \ @@ -317,10 +317,10 @@ refine these existing methods. And there’s almost certainly room to push the state of the art in asshole detection! If you're interested in learning more about using Reddit data, check out -[pushshift.io](https://pushshift.io/), a database that contains basically all of -Reddit's content (so why make this dataset? 
I wanted to remove some of the -barriers to analyzing text from r/AmItheAsshole by providing an -already-processed and cleaned version of the data that can be downloaded with a -line of code; pushshift takes some work). You might use pushshift's API and/or -praw to augment this dataset in some way- perhaps to compare activity in this -subreddit with another, or broader patterns on Reddit. +[pushshift.io](https://pushshift.io/), a database that contains all of Reddit's +content (so why make this dataset? I wanted to remove some of the barriers to +analyzing text from r/AmItheAsshole by providing an already-processed and +cleaned version of the data that can be downloaded with a line of code; +pushshift takes some work). You might use pushshift's API and/or praw to augment +this dataset in some way- perhaps to compare activity in this subreddit with +another, or broader patterns on Reddit. diff --git a/content/blog/2020-04-16-april-20-community-gems.md b/content/blog/2020-04-16-april-20-community-gems.md index 3f208bf360..c791cda4fb 100644 --- a/content/blog/2020-04-16-april-20-community-gems.md +++ b/content/blog/2020-04-16-april-20-community-gems.md @@ -106,7 +106,7 @@ $ dvc pull process_data_stage.dvc You can also use `dvc pull` at the level of individual files. This might be needed if your DVC pipeline file creates 10 outputs, for example, and you only want to pull one (say, `model.pkl`, your trained model) from remote DVC storage. 
-You'd just run +You'd run: ```dvc $ dvc pull model.pkl From 63d113ae243682703c700e7c51db5cafeafb5197 Mon Sep 17 00:00:00 2001 From: kurianbenoy Date: Wed, 10 Jun 2020 08:36:00 +0530 Subject: [PATCH 6/9] format code with prettier --- content/blog/2019-03-05-march-19-dvc-heartbeat.md | 6 +++--- content/blog/2019-05-21-may-19-dvc-heartbeat.md | 6 +++--- content/docs/command-reference/list.md | 4 ++-- content/docs/command-reference/push.md | 6 +++--- content/docs/command-reference/remote/add.md | 5 ++--- content/docs/command-reference/status.md | 9 ++++----- content/docs/command-reference/update.md | 4 ++-- content/docs/tutorials/get-started/data-access.md | 6 +++--- content/docs/tutorials/get-started/data-pipelines.md | 6 +++--- content/docs/tutorials/get-started/data-versioning.md | 4 ++-- content/docs/tutorials/pipelines.md | 10 +++++----- content/docs/user-guide/external-dependencies.md | 4 ++-- 12 files changed, 34 insertions(+), 36 deletions(-) diff --git a/content/blog/2019-03-05-march-19-dvc-heartbeat.md b/content/blog/2019-03-05-march-19-dvc-heartbeat.md index f9e1ce1e28..863069e2e9 100644 --- a/content/blog/2019-03-05-march-19-dvc-heartbeat.md +++ b/content/blog/2019-03-05-march-19-dvc-heartbeat.md @@ -144,9 +144,9 @@ liking and see your data files listed there. ### Q: [Managing data and pipelines with DVC on HDFS](https://discordapp.com/channels/485586884165107732/485596304961962003/545562334983356426) With DVC, you could connect your data sources from HDFS with your pipeline in -your local project, by specifying it as an external dependency. For -example let’s say your script `process.cmd` works on an input file on HDFS and -then downloads a result to your local workspace, then with DVC it could look +your local project, by specifying it as an external dependency. 
For example +let’s say your script `process.cmd` works on an input file on HDFS and then +downloads a result to your local workspace, then with DVC it could look something like: ```dvc diff --git a/content/blog/2019-05-21-may-19-dvc-heartbeat.md b/content/blog/2019-05-21-may-19-dvc-heartbeat.md index 8413daa702..090340a9e0 100644 --- a/content/blog/2019-05-21-may-19-dvc-heartbeat.md +++ b/content/blog/2019-05-21-may-19-dvc-heartbeat.md @@ -256,9 +256,9 @@ $ dvc metrics show metrics.json \ There are a few options to add a new dependency: -- opening a file with your favorite editor and adding a dependency there - without md5. DVC will understand that that stage is changed and will re-run - and re-calculate md5 checksums during the next DVC repro; +- opening a file with your favorite editor and adding a dependency there without + md5. DVC will understand that that stage is changed and will re-run and + re-calculate md5 checksums during the next DVC repro; - use `dvc run --no-exec` is another option. It will rewrite the existing file for you with new parameters. diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index 1f74a08497..3b7eef0941 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -19,8 +19,8 @@ positional arguments: DVC, by effectively replacing data files, models, directories with DVC-files (`.dvc`), hides actual locations and names. This means that you don't see data files when you browse a DVC repository on Git hosting (e.g. -Github), you see the DVC-files. This makes it hard to navigate the project -to find data artifacts for use with `dvc get`, `dvc import`, or +Github), you see the DVC-files. This makes it hard to navigate the project to +find data artifacts for use with `dvc get`, `dvc import`, or `dvc.api`. 
`dvc list` prints a virtual view of a DVC repository, as if files and
diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md
index e5a0be54c7..137e42d1f9 100644
--- a/content/docs/command-reference/push.md
+++ b/content/docs/command-reference/push.md
@@ -54,9 +54,9 @@ none are specified on the command line nor in the configuration. The default
remote is used (see `dvc config core.remote`) unless the `--remote` option is
used. See `dvc remote` for more information on how to configure a remote.

-With no arguments, `dvc push` or `dvc push --remote REMOTE`, it uploads
-only the files (or directories) that are new in the local repository to remote
-storage. It will not upload files associated with earlier commits in the
+Used with no arguments, as `dvc push` or `dvc push --remote REMOTE`, it uploads
+only the files (or directories) that are new in the local repository to remote
+storage. It will not upload files associated with earlier commits in the
repository (if using Git), nor will it upload files that have not changed.
diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md
index 5f209a3455..1d1e676290 100644
--- a/content/docs/command-reference/remote/add.md
+++ b/content/docs/command-reference/remote/add.md
@@ -197,9 +197,8 @@ $ dvc remote add -d myremote "azure://"

To start using a GDrive remote, first add it with a
[valid URL format](/doc/user-guide/setup-google-drive-remote#url-format). Then
-use any DVC command that needs it (e.g. `dvc pull`, `dvc fetch`,
-`dvc push`), and follow the instructions to connect your Google Drive with DVC.
-For example:
+use any DVC command that needs it (e.g. `dvc pull`, `dvc fetch`, `dvc push`),
+and follow the instructions to connect your Google Drive with DVC. 
For example: ```dvc $ dvc remote add -d myremote gdrive://0AIac4JZqHhKmUk9PDA/dvcstore diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 9e6257ed9f..b6589956b3 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -107,11 +107,10 @@ workspace) is different from remote storage. Bringing the two into sync requires (specified in the `core.remote` config option). - `-a`, `--all-branches` - compares cache content against all Git branches - instead of the current workspace. This basically runs the same status - command in every branch of this repo. The corresponding branches are shown in - the status output. Applies only if `--cloud` or a `-r` remote is specified. - Note that this can be combined with `-T` below, for example using the `-aT` - flag. + instead of the current workspace. This basically runs the same status command + in every branch of this repo. The corresponding branches are shown in the + status output. Applies only if `--cloud` or a `-r` remote is specified. Note + that this can be combined with `-T` below, for example using the `-aT` flag. - `-T`, `--all-tags` - same as `-a` above, but applies to Git tags as well as the workspace. Note that both options can be combined, for example using the diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 30d00349c7..ba885f3fe5 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -70,8 +70,8 @@ Importing 'model.pkl (git@github.com:iterative/example-get-started)' As DVC mentions, the import stage (DVC-file) `model.pkl.dvc` is created. This [stage file](/doc/command-reference/run) is frozen by default though, so to [reproduce](/doc/command-reference/repro) it, we would need to run -`dvc unfreeze` on it first, then `dvc repro` (and `dvc freeze` again). 
Let's
-run `dvc update` on it instead:
+`dvc unfreeze` on it first, then `dvc repro` (and `dvc freeze` again). Let's run
+`dvc update` on it instead:

```dvc
$ dvc update model.pkl.dvc
diff --git a/content/docs/tutorials/get-started/data-access.md b/content/docs/tutorials/get-started/data-access.md
index c63c00aac1..1ad3b3767d 100644
--- a/content/docs/tutorials/get-started/data-access.md
+++ b/content/docs/tutorials/get-started/data-access.md
@@ -27,9 +27,9 @@ includes files and directories tracked by **both Git and DVC**.

## Download it

-One way is to download the data with `dvc get`. This is useful when
-working outside of a DVC project environment, for example in an
-automated ML model deployment task:
+One way is to download the data with `dvc get`. This is useful when working
+outside of a DVC project environment, for example in an automated
+ML model deployment task:

```dvc
$ dvc get https://github.com/iterative/dataset-registry \
diff --git a/content/docs/tutorials/get-started/data-pipelines.md b/content/docs/tutorials/get-started/data-pipelines.md
index cf1537e68b..8f138cb169 100644
--- a/content/docs/tutorials/get-started/data-pipelines.md
+++ b/content/docs/tutorials/get-started/data-pipelines.md
@@ -163,9 +163,9 @@ This would be a good point to commit the changes with Git. This includes any

## Reproduce

-Imagine you're cloning the repository created so far, in
-another computer. It's extremely easy for anyone to reproduce the result
-end-to-end, by using `dvc repro`.
+Imagine you're cloning the repository created so far, on another
+computer. It's extremely easy for anyone to reproduce the result end-to-end by
+using `dvc repro`.

diff --git a/content/docs/tutorials/get-started/data-versioning.md b/content/docs/tutorials/get-started/data-versioning.md
index 282b22d6bf..d7e817c78f 100644
--- a/content/docs/tutorials/get-started/data-versioning.md
+++ b/content/docs/tutorials/get-started/data-versioning.md
@@ -228,8 +228,8 @@ after `git clone` and `git pull`.

### 👉 Expand to simulate a fresh clone of this repo

-Let's remove the directory added so far, both from workspace
-and cache:
+Let's remove the directory added so far, both from workspace and
+cache:

```dvc
$ rm -f datadir .dvc/cache/a3/04afb96060aad90176268345e10355
diff --git a/content/docs/tutorials/pipelines.md b/content/docs/tutorials/pipelines.md
index dc7f50958f..1565a41528 100644
--- a/content/docs/tutorials/pipelines.md
+++ b/content/docs/tutorials/pipelines.md
@@ -183,9 +183,9 @@ outs:
  persist: false
```

-Like the DVC-file we created earlier with `dvc add`, this stage file uses
-`md5` hashes (that point to the cache) to describe and version
-control dependencies and outputs. Output `data/Posts.xml` file is saved as
+Like the DVC-file we created earlier with `dvc add`, this stage file uses `md5`
+hashes (that point to the cache) to describe and version control
+dependencies and outputs. The output file `data/Posts.xml` is saved as
`.dvc/cache/a3/04afb96060aad90176268345e10355` and linked (or copied) to the
workspace, as well as added to `.gitignore`.

@@ -331,8 +331,8 @@ $ dvc metrics show

It's time to save our [pipeline](/doc/command-reference/pipeline). You can
confirm that we do not track files or raw datasets with Git by using the
-`git status` command. We are saving a snapshot of the DVC-files that
-describe data, transformations (stages), and relationships between them.
+`git status` command. We are saving a snapshot of the DVC-files that describe
+data, transformations (stages), and relationships between them.

```dvc $ git add *.dvc auc.metric data/.gitignore diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 82c26d22fe..274121349c 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -35,8 +35,8 @@ directory. ## Examples As examples, let's take a look at a [stage](/doc/command-reference/run) that -moves a local file from an external location, producing a `data.txt.dvc` -stage file (DVC-file). +moves a local file from an external location, producing a `data.txt.dvc` stage +file (DVC-file). > Note that some of these commands use the `/home/shared` directory, typical in > Linux distributions. From bc5e288b8655e48fac8eff3e2be170eae4488dd1 Mon Sep 17 00:00:00 2001 From: kurianbenoy Date: Wed, 10 Jun 2020 08:55:55 +0530 Subject: [PATCH 7/9] update cmd checkout description and open-command --- content/docs/api-reference/open.md | 2 +- content/docs/command-reference/checkout.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/api-reference/open.md b/content/docs/api-reference/open.md index 9ad4c1f1bd..c5c0a8065e 100644 --- a/content/docs/api-reference/open.md +++ b/content/docs/api-reference/open.md @@ -113,7 +113,7 @@ should handle the event-driven parsing of the document in this case.) This increases the performance of the code (minimizing memory usage), and is typically faster than loading the whole data into memory. 
-> If you want to load the complete file contents into memory, you can use
+> If you need to load the complete file contents into memory, you can use
> `dvc.api.read()` instead:
>
> ```py
diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md
index 4a29ad5eaf..8b824ab767 100644
--- a/content/docs/command-reference/checkout.md
+++ b/content/docs/command-reference/checkout.md
@@ -151,8 +151,8 @@ baseline-experiment <- First simple version of the model
bigrams-experiment <- Uses bigrams to improve the model
```

-We can now run `dvc checkout` that will update the most recent `model.pkl`,
-`data.xml`, and other files that are tracked by DVC. The model file hash
+We can now run `dvc checkout` to update the most recent `model.pkl`, `data.xml`,
+and other files that are tracked by DVC. The model file hash
`662eb7f64216d9c2c1088d0a5e2c6951` will be used in the `train.dvc`
[stage file](/doc/command-reference/run):

From f4ac7286500fc70d55491989da91cd458fe6ae7f Mon Sep 17 00:00:00 2001
From: kurianbenoy
Date: Wed, 10 Jun 2020 09:20:28 +0530
Subject: [PATCH 8/9] Changes as requested as part of reviews in #1396

---
 content/docs/command-reference/checkout.md   | 4 ++--
 content/docs/command-reference/commit.md     | 6 +++---
 content/docs/command-reference/import-url.md | 2 +-
 content/docs/command-reference/install.md    | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/docs/command-reference/checkout.md b/content/docs/command-reference/checkout.md
index 8b824ab767..c92df101fd 100644
--- a/content/docs/command-reference/checkout.md
+++ b/content/docs/command-reference/checkout.md
@@ -102,8 +102,8 @@ be pulled from remote storage using `dvc pull`.

## Examples

-Let's employ a simple workspace with some data, code, ML models,
-pipeline stages, such as the DVC project created for the
+Let's create a workspace with some data, code, ML models, and pipeline
+stages, such as the DVC project created for the
[Get Started](/doc/tutorials/get-started). Then we can see what happens with
`git checkout` and `dvc checkout` as we switch from tag to tag.
diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md
index 42b944db14..8abaea9c67 100644
--- a/content/docs/command-reference/commit.md
+++ b/content/docs/command-reference/commit.md
@@ -95,8 +95,8 @@ reproducibility in those cases.

## Examples

-Let's employ a simple workspace with some data, code, ML models,
-pipeline stages, such as the DVC project created for the
+Let's create a workspace with some data, code, ML models, and pipeline
+stages, such as the DVC project created for the
[Get Started](/doc/tutorials/get-started). Then we can see what happens with
`git commit` and `dvc commit` in different situations.

@@ -280,4 +280,4 @@ Data and pipelines are up to date.
```

Instead of reproducing the pipeline for changes that do not produce different
-results, use `commit` on both Git and DVC.
+results, just use `commit` on both Git and DVC.
diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md
index 5b011e7632..b5ded05358 100644
--- a/content/docs/command-reference/import-url.md
+++ b/content/docs/command-reference/import-url.md
@@ -29,7 +29,7 @@ external data source changes. Example scenarios:

- A shared dataset on a remote storage that is managed and updated outside DVC.

> Note that `dvc get-url` corresponds to the first step this command performs
-> (downloads the file or directory).
+> (just downloads the file or directory).

The `dvc import-url` command helps the user create such an external data
dependency without having to manually copy files from the supported remote
diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md
index 9eabf54e5e..b550ba4206 100644
--- a/content/docs/command-reference/install.md
+++ b/content/docs/command-reference/install.md
@@ -262,7 +262,7 @@ matching what is referenced by the DVC-files. To follow this example, start
with the same workspace as before, making sure it is not in a _detached HEAD_
state by running `git checkout master`.

-We can edit one of the code files:
+Let's imagine we have modified the file `src/featurization.py`:

```dvc
$ vi src/featurization.py

From 7ea6c1261177ec8e83451a8e310243f6322108a2 Mon Sep 17 00:00:00 2001
From: kurianbenoy
Date: Wed, 10 Jun 2020 09:24:00 +0530
Subject: [PATCH 9/9] update cmd import-url docs

---
 content/docs/command-reference/import-url.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md
index b5ded05358..ed5c0e0b6d 100644
--- a/content/docs/command-reference/import-url.md
+++ b/content/docs/command-reference/import-url.md
@@ -146,7 +146,7 @@ $ git checkout 2-remote
$ mkdir data
```

-You should now have a blank workspace, before
+You should now have a blank workspace, just before
[Versioning Basics](/doc/tutorials/get-started/data-versioning).
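The `open.md` hunk in patch 7 above contrasts streaming a file with loading its complete contents into memory. A minimal sketch of that streaming pattern, using a plain in-memory file object as a hypothetical stand-in (the `dvc.api` calls in the comments are assumptions taken from the surrounding docs, not part of this patch):

```python
import io

# Hypothetical stand-in for the file-like object that `dvc.api.open()` would
# yield; with DVC installed the equivalent would be roughly:
#     with dvc.api.open("data/file.txt", repo=some_repo_url) as fd: ...
fd = io.StringIO("line 1\nline 2\nline 3\n")

# Event-driven / streaming parse: only one line is held in memory at a time,
# which is why this is typically faster than reading the whole file first.
longest = 0
for line in fd:
    longest = max(longest, len(line.rstrip("\n")))

print(longest)  # -> 6
```

By contrast, `dvc.api.read()` (the alternative the hunk points to) corresponds to a single `fd.read()`: the complete contents come back as one string, at the cost of holding them all in memory.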