Corrections in #27 applied. Titles changed back to "Step n"

iterative · Mar 8, 2021 · dbe14ca · dbe14ca
1 parent 38d398a
commit dbe14ca
Showing 7 changed files with 91 additions and 119 deletions.
diff --git a/get-started/versioning/01-track-a-file-or-directory.md b/get-started/versioning/01-track-a-file-or-directory.md
@@ -1,3 +1,5 @@
+# Track a File or Directory
+
 Let's get a data file from the [Get
 Started](https://dvc.org/doc/start/data-and-model-versioning) example project:
 
@@ -7,18 +9,17 @@ dvc get \
   get-started/data.xml -o data/data.xml
 ```{{execute}}
 
-> The command [`dvc get`][cmdget] is like streamlined and smart `wget`
-> that can be used to retrieve artifacts from DVC projects hosted on Git
-> repositories. You don't even need an _initialized_ DVC repository to use
-> `get`
+> The command [`dvc get`][cmdget] is like a smart `wget` that can be used to
+> retrieve artifacts from DVC repositories. You don't even need an
+> _initialized_ DVC repository to use `dvc get`.
 
 `ls -lh data/`{{execute}}
 
-To track a large file, ML model or a whole directory with DVC we use `dvc add`:
+To track a large file, ML model, or a whole directory with DVC, we use `dvc add`:
 
 `dvc add data/data.xml`{{execute}}
 
-DVC has listed `data.xml` on `.gitignore` to make sure that we don't commit it
+DVC has listed `data.xml` in `.gitignore` to make sure that we don't commit it
 to Git.
 
 `cat data/.gitignore`{{execute}}

diff --git a/get-started/versioning/02-how-does-it-work.md b/get-started/versioning/02-how-does-it-work.md
@@ -1,43 +1,32 @@
-Let's take a look at inside the `.dvc` file to learn how it keeps track of our
-data file: 
+# How does it work?
 
-For a data file named `data.xml`, DVC keeps the tracking information in
-`data.xml.dvc`. 
+Let's take a look inside the `.dvc` file to learn how it keeps track of our
+data:
 
 `cat data/data.xml.dvc`{{execute}}
 
-Note that this is valid YAML 1.2 file. There is a field named `md5` in the
-file. 
-
-This field is used to keep track the changes in `data.xml`. Let's check whether
-the hash calculated by `md5sum` is identical with the hash calculated by DVC. 
+> This is a valid YAML 1.2 file. The `md5` field is used to keep track of
+> changes in `data.xml`. Let's check whether the hash calculated by `md5sum` is
+> identical to the hash calculated by DVC.
 
 ```
 md5sum data/data.xml 
 
 grep 'md5' data/data.xml.dvc 
 ```{{execute}}
 
-DVC uses `md5` field to actually address the file in its cache. The hash
-is a pointer to the content in DVC cache. 
-
-If the content of `data.xml` is changed, its MD5 hash will change. DVC is able
-to detect the change by comparing the new MD5 hash to the MD5 hash kept in
+DVC uses the hash to address the file in the cache. DVC detects
+the change by comparing the new MD5 hash to the hash kept in
 `data.xml.dvc`.
 
-Now let's see how cache is organized:
+Let's see how the cache is organized:
 
 `tree .dvc/cache`{{execute}}
 
-DVC saved a copy of `data/data.xml` to `.dvc/cache/` using its MD5 hash as a
-file name. The first two characters of the hash is used as a directory name and
-the rest as a file name. 
-
-When a data or model file's content changes, its address in the cache changes
-too. This allows to keep track history of large data files.
+DVC saved a copy of `data/data.xml` to `.dvc/cache/`, using the first two
+characters of its MD5 hash as the directory name and the rest as the file name.
 
-> The default setting for DVC is to copy content both in the cache and the
-> workspace. To save storage space and time when dealing with large files, DVC
-> can use `symlinks`, `hardlinks`, or `reflinks`, depending on the file system. 
-> Read more in the user guide for [Large Dataset
+> To save storage space and time when dealing with large files, DVC can use
+> `symlinks`, `hardlinks`, or `reflinks` depending on the file system. Read
+> more in the user guide for [Large Dataset
 > Optimization](https://dvc.org/doc/user-guide/large-dataset-optimization).
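
Reviewer note: the cache layout described in this step can be sketched with plain shell tools. This is a minimal illustration, assuming the layout stated above (first two hash characters as the directory, the rest as the file name); the sample file and variable names are hypothetical, and newer DVC releases may organize the cache differently.

```shell
# Hypothetical sample file standing in for data/data.xml.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/data.xml"

# Hash the content the same way the tutorial does with md5sum.
hash=$(md5sum "$workdir/data.xml" | cut -d' ' -f1)

# Split the hash: first two characters name the directory,
# the remaining characters name the file.
dir=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)

# Path where a content-addressed copy would live under this layout.
cache_path=".dvc/cache/$dir/$rest"
echo "$cache_path"
```

Because the path is derived only from the content, two files with identical content map to the same cache entry.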
diff --git a/get-started/versioning/03-data-remotes.md b/get-started/versioning/03-data-remotes.md
@@ -1,27 +1,25 @@
-Most of the times it's necessary to share data files with team members. DVC
-allows to set up a central location to share such content safely.
-[Remotes][bcremote] that are accessible by other systems or team members can be
-set up and versioned data and model files can be shared by uploading to and
-downloading from them. 
+# Data Remotes
+
+DVC allows you to set up [remotes][bcremote] that are accessible by other
+systems or team members.
 
-[bcremote]: https://dvc.org/doc/user-guide/basic-concepts/remote
 
 To set up a data remote, you would typically use Amazon S3, Google Cloud, or
-another supported storage type like this:
+[another supported storage type][cmdremote], like this:
 
 ```
 dvc remote add --default \
     my-storage s3://my-bucket/dvc-storage
 ```
 
-> Note that DVC remotes are unlike Git remotes. DVC remotes are cache and content
+> Note that DVC remotes are _unlike_ Git remotes. DVC remotes are data 
 > storage locations _without_ a commit history. History of data and model files
-> are tracked in `*.dvc` text files and these are typically shared with other
-> team members as if they are code/text files. 
+> is tracked in `*.dvc` text files, and these are typically shared with other
+> team members as if they were code/text files.
 
-It's possible to use another directory in the same disk as a _remote_ also.
-This allows fast backups and we'll use this feature to keep this scenario
-simple. 
+It's possible to use a _local_ directory in the same file system as a _remote_
+as well.  This allows fast backups.  We'll use this feature to keep this
+scenario simple.
 
 ```
 dvc remote add --default \
@@ -32,20 +30,16 @@ You can get a list of configured remotes using:
 
 `dvc remote list`{{execute}}
 
-These configurations are typically stored in Git:
+The configuration is typically stored in Git:
 
 `git diff .dvc/config`{{execute}}
 
-Let's commit the configuration to Git: 
+Let's commit the changes to Git: 
 
 ```
 git commit .dvc/config \
     -m "Configure data storage"
 ```{{execute}}
 
-
-> DVC supports the following remote storage types: Google Drive, Amazon S3,
-> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
-> Please refer to
-> [dvc remote add](https://dvc.org/doc/command-reference/remote/add) for more
-> details and examples.
+[cmdremote]: https://dvc.org/doc/command-reference/remote/add
+[bcremote]: https://dvc.org/doc/user-guide/basic-concepts/remote
diff --git a/get-started/versioning/04-saving-and-retrieving-data.md b/get-started/versioning/04-saving-and-retrieving-data.md
@@ -1,17 +1,17 @@
-In order to see the status of tracked data and model files and if they are
-stored in remotes, we can use 
+# Saving and Retrieving Data
 
-`dvc status --cloud`{{execute}}
+The command `dvc status` can compare the data in the workspace with the data
+saved in the cache.
 
-to compare files between the data stored on cache and the default remote.
+`dvc status --cloud`{{execute}}
 
 Without the `-c, --cloud` option 
 
 `dvc status`{{execute}}
 
-checks whether workspace files are up to date with _local cache._
+checks whether workspace files are up to date with the _local cache._
 
-As `dvc status --cloud` shows we have a missing file in `mystorage`.  
+As `dvc status --cloud` shows, we have a missing file in `mystorage`.  
 
 We manage
 data and model files in remotes with `dvc push` and `dvc pull`. These are used
@@ -24,21 +24,20 @@ previous step:
 `dvc push`{{execute}}
 
 > Note that `dvc push` uses _default remote_. There can be multiple remotes for
-> a project and if you want to push to a particular one, you can use `--remote`
+> a project, and if you want to push to a particular one, you can use the `--remote`
 > option for `dvc push`. 
 
-After pushing let's check the status again: 
+Let's check the status again: 
 
 `dvc status --cloud`{{execute}}
 
-Now, let's take a look at the content of `/tmp/data-storage` to see what
-happens in when we push a file to the cache. 
+Now, let's take a look at the content of `/tmp/data-storage` (the location of
+our remote) to see what happens when we push a file to the remote.
 
 `tree /tmp/data-storage/`{{execute}}
 
-As you can see the structure of `.dvc/cache` and `/tmp/data-storage` is
-similar. Both of them contains the same files addressed by the same hash
-values. 
+The structure of `.dvc/cache` and `/tmp/data-storage` is identical. Both of them
+contain the same files addressed by the same hash values.
 
 `tree .dvc/cache`{{execute}}
 
@@ -48,46 +47,44 @@ Suppose we deleted `data.xml` from the workspace _and_ from the cache.
 
 `rm -f data/data.xml`{{execute}}
 
-We deleted all _local_ versions of `data.xml`. Neither `.dvc/cache` nor `data/` contains our data file.
+We deleted all _local_ versions of `data.xml`. Neither `.dvc/cache` nor `data/`
+contains our data file.
 
-`tree -a -I .git` 
+`ls -R | grep 'data.xml'`{{execute}}
 
-This is also reported by 
+This is also reported by:
 
 `dvc status`{{execute}}
 
-In order to retrieve missing data and model files from remotes we can use:
+In order to retrieve missing data and model files from remotes, we can use:
 
 `dvc fetch`{{execute}}
 
-This command copies files from remote to _local_ cache but it doesn't update
-the workspace with the data file. 
-
-Let's see the status of workspace now: 
+This command copies files from remote to _local_ cache, but it doesn't update
+the workspace with the data file.
 
 `dvc status`{{execute}}
 
-Note that `data.xml` is missing from the workspace but we can get it from local
-cache: 
+Note that `data.xml` is still missing from the workspace, but we can get it
+from the local cache:
 
 `dvc checkout`{{execute}}
 
-`checkout` command copies (or creates link to) local cache files to the
-workspace. Let's check the status again: 
+`dvc checkout` links (or copies) files from the local cache to the workspace.
 
 `dvc status`{{execute}}
 
-DVC has a single command that both copies to local cache and the workspace. Instead of two step operation with `fetch` and `checkout`, we can use `pull`. Let's delete the data file and the cache again.
+Instead of a two-step operation with `fetch` and `checkout`, we can use `pull`. Let's delete the data file and the cache again.
 
 ```
 rm -rf .dvc/cache 
-rm -f data/data.xml```{{execute}}
+rm -f data/data.xml
+```{{execute}}
 
 We can retrieve the file from remote with a single command: 
 
 `dvc pull`{{execute}}
 
 `dvc pull` is usually used after `git clone`, `git pull`, or `git checkout` to
 synchronize the data with the code. Along with `dvc push`, it provides a basic
-collaboration workflow, similar to `git push` and `git pull`, that facilitates
-sharing of data.
+collaboration workflow, similar to `git push` and `git pull`.
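
Reviewer note: the two-step flow described in this step (remote to local cache, then cache to workspace) can be mimicked with ordinary directories. This is only a sketch under stated assumptions: the `remote`, `cache`, and `workspace` directories and the `cp` calls are hypothetical stand-ins for what `dvc fetch` and `dvc checkout` do, not their actual implementation.

```shell
# Hypothetical stand-ins for the remote, the local cache, and the workspace.
root=$(mktemp -d)
remote="$root/remote"; cache="$root/cache"; workspace="$root/workspace"
mkdir -p "$remote" "$cache" "$workspace"

# Seed the stand-in remote with one content-addressed object.
printf 'hello\n' > "$root/seed"
hash=$(md5sum "$root/seed" | cut -d' ' -f1)
obj="$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-)"
mkdir -p "$remote/$(dirname "$obj")"
mv "$root/seed" "$remote/$obj"

# Step 1 (like fetch): copy the object from the remote into the cache.
# The workspace is not touched yet.
mkdir -p "$cache/$(dirname "$obj")"
cp "$remote/$obj" "$cache/$obj"
after_fetch=$( [ -e "$workspace/data.xml" ] && echo present || echo missing )

# Step 2 (like checkout): materialize the cached object in the workspace.
cp "$cache/$obj" "$workspace/data.xml"
after_checkout=$( [ -e "$workspace/data.xml" ] && echo present || echo missing )

echo "after fetch: $after_fetch, after checkout: $after_checkout"
```

Running both steps back to back corresponds to what the step above describes for `dvc pull`.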
diff --git a/get-started/versioning/05-making-changes.md b/get-started/versioning/05-making-changes.md
@@ -1,6 +1,8 @@
-Let's make the dataset smaller. It's not what would you usually do (usually we
-want more data), but since there are some RAM limitations to this online
-platform we have to do this to being able to run training later:
+# Making Changes
+
+Let's make the dataset smaller. It's not what we'd usually do. In general, we
+want more data, but since there are limitations in this platform, we need to do
+this to run the training later:
 
 ```
 head -n 12000 data/data.xml > data/data.xml.1
@@ -11,24 +13,21 @@ We check the status of the project:
 
 `dvc status`{{execute}}
 
-When the file `data/data.xml` is changed, DVC can detect it by analyzing the
-corresponding `.dvc` file. Run `dvc add` to save its new version to cache and
-update `data/data.xml.dvc` to match the new hash of the data file:
+Run `dvc add` to save the new version to the cache and update `data/data.xml.dvc`.
 
 `dvc add data/data.xml`{{execute}}
 
-It updates the MD5 hash value in the corresponding `dvc` file: 
+It updates the hash value in the corresponding `.dvc` file:
 
 `git diff`{{execute}}
 
-We can commit the changed hash value to track the version of the file:
+Commit the changed hash value to track the version of the file:
 
 `git commit -a -m "Dataset updates"`{{execute}}
 
-We push the new version of file to the default remote: 
+Push the new version of the file to the default remote: 
 
 `dvc push`{{execute}}
 
-So, each version of `data/data.xml.dvc` in Git history is linked to a version
-of the data file in the DVC cache. This connection is done through
-the MD5 hash of the cached file that is also its name on the cache.
+So, each version of `data/data.xml.dvc` in the Git history is linked to a
+version of the data file in the DVC cache.
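
Reviewer note: the change detection this step relies on (comparing the current hash of the data file with the hash recorded in the `.dvc` file) can be illustrated without DVC. This is a simplified sketch: the one-line `md5:` file below is a hypothetical stand-in for `data.xml.dvc`, which contains more fields in practice.

```shell
# Hypothetical three-line dataset standing in for data/data.xml.
workdir=$(mktemp -d)
printf 'line 1\nline 2\nline 3\n' > "$workdir/data.xml"

# Record the current hash in a simplified stand-in for data.xml.dvc.
recorded=$(md5sum "$workdir/data.xml" | cut -d' ' -f1)
printf 'md5: %s\n' "$recorded" > "$workdir/data.xml.dvc"

# Shrink the dataset, as the tutorial does with head.
head -n 2 "$workdir/data.xml" > "$workdir/data.xml.1"
mv "$workdir/data.xml.1" "$workdir/data.xml"

# Compare the new hash with the recorded one.
current=$(md5sum "$workdir/data.xml" | cut -d' ' -f1)
stored=$(grep 'md5' "$workdir/data.xml.dvc" | cut -d' ' -f2)
if [ "$current" = "$stored" ]; then
    status="up to date"
else
    status="modified"
fi
echo "$status"
```

Re-recording the hash after the change, as `dvc add` does for the real `.dvc` file, would bring the two values back in sync.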
diff --git a/get-started/versioning/06-switching-between-versions.md b/get-started/versioning/06-switching-between-versions.md
@@ -1,41 +1,33 @@
-Technically speaking DVC is not a version control system.
+# Switching between Versions
 
-Although DVC can work without a VCS, history of `.dvc` files is better tracked
-with Git. Otherwise we may end up with files in the cache whose paths in the
-workspace are missing. 
+Technically speaking, DVC is not a version control system.
 
-Git serves as the version control system for text and code files. DVC in turn
-creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in
-the workspace efficiently to match them.
+Git serves as the VCS for text and code files, although DVC can work without it.
 
-Let's get the previous version of the dataset `data/data.xml`:
+Let's get the previous version of `data/data.xml`:
 
-`git checkout HEAD^1 data/data.xml.dvc`{{execute}}
+`git checkout HEAD~ data/data.xml.dvc`{{execute}}
 
 When we check what changed using:
 
 `git diff --staged`{{execute}}
 
-we see that `data.xml.dvc` now contains the previous hash value of `data.xml`.
-
-We can see that current hash and the hash value in `data.xml.dvc` is different. 
+we see that `data.xml.dvc` now contains the previous hash value of `data.xml`.
 
 `md5sum data/data.xml`{{execute}}
 
-To synchronize `data.xml` with the version addressed in `data.xml.dvc` we use 
+To synchronize `data.xml` with the version addressed in `data.xml.dvc`:
 
 `dvc checkout`{{execute}}
 
-and this copies the previous version of `data.xml` from from local cache. `dvc
-checkout` command synchronizes data files in the workspace to match the `.dvc`
-files content. Now we can see that the value in `data.xml.dvc` and the hash
-value of `data.xml` are identical. 
+Now we can see that the value in `data.xml.dvc` and the hash value of
+`data.xml` are identical.
 
 ```
 grep 'md5' data/data.xml.dvc
 md5sum data/data.xml
 ```{{execute}}
 
-Instead of `checkout` we can use `pull` again.  The difference is that `dvc
-pull` also downloads missing data into cache, while `dvc checkout` only can
-restore data that already in cache.
+Instead of `dvc checkout`, we could use `dvc pull` again. Pulling also
+downloads missing data from the remote storage, whereas `dvc checkout` can only
+restore data that's already in the local cache.
diff --git a/get-started/versioning/index.json b/get-started/versioning/index.json
@@ -6,27 +6,27 @@
     "details": {
         "steps": [
             {
-                "title": "Track a file or directory",
+                "title": "Step 1",
                 "text": "01-track-a-file-or-directory.md"
             },
             {
-                "title": "How does it work?",
+                "title": "Step 2",
                 "text": "02-how-does-it-work.md"
             },
             {
-                "title": "Data remotes",
+                "title": "Step 3",
                 "text": "03-data-remotes.md"
             },
             {
-                "title": "Saving and retrieving data",
+                "title": "Step 4",
                 "text": "04-saving-and-retrieving-data.md"
             },
             {
-                "title": "Making changes",
+                "title": "Step 5",
                 "text": "05-making-changes.md"
             },
             {
-                "title": "Switching between versions",
+                "title": "Step 6",
                 "text": "06-switching-between-versions.md"
             }
         ],