Add intro text for Computations section #90

Merged (2 commits) on Mar 8, 2023
2 changes: 1 addition & 1 deletion BigData/_toc.yml
@@ -8,7 +8,7 @@ parts:
- file: data/data-format.md
- file: data/data-netcdf.md
- file: data/data-zarr.md
- caption: Large-scale data analysis
- caption: Analysis of large-scale datasets
chapters:
- file: computations/computations-intro.md
sections:
9 changes: 7 additions & 2 deletions BigData/computations/computations-intro.md
@@ -1,5 +1,10 @@
# Large-scale data analysis

Overview
## Overview
This section includes examples of common types of computations performed on large climate datasets. We focus on the theory of how these computations are handled on chunked datasets when using tools like `Xarray` and `Dask`. We then provide examples of the specific functions that are often used to carry out each computation and, where possible, demonstrations of these tools in use.

## How is analysis on large-scale data different from that on smaller datasets?
Large datasets are often too big to load into RAM on the computer or server that you use to do your analysis. However, we often don't need to do a computation on the entire dataset. The [section on large-scale climate data](https://acdguide.github.io/BigData/data/data-netcdf.html) discusses how large datasets are typically saved in "chunks". Analysis on large-scale datasets also makes use of these chunks, applying computations only to the portion of the dataset needed for a particular calculation.
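
As a minimal sketch of what this looks like in practice (the file name, variable name, and chunk size below are hypothetical), `Xarray` can open a dataset in chunks so that only the pieces needed for a given calculation are ever read:

```python
import xarray as xr

# Open the file lazily, split into chunks of one year along the time dimension.
# Only the metadata is read at this point; no data values are loaded.
ds = xr.open_dataset("tas_day_historical.nc", chunks={"time": 365})

print(ds["tas"].chunks)  # chunk layout along each dimension
```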

An important concept for doing analysis on large datasets is the idea of "lazy computation", previously mentioned in the [data structure section](https://acdguide.github.io/BigData/data/data-structure.html). This is what the software package `Xarray` uses in conjunction with tools like `Dask`. When you read a dataset in `Xarray`, it will just read in the metadata (e.g. the variable names, the dimensions, the units, the size of each dimension, and any other metadata that the data creators provided). As you write code to do a computation, the actual calculation isn't carried out until you explicitly request it, typically with a call like `.compute()` or `.load()`. Only then is the computation performed, by pulling in the specific chunks needed to complete the calculation. These concepts are incredibly powerful, and allow for quick analysis of big datasets!
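
Continuing the sketch above (the file and variable names are illustrative assumptions), no computation happens until the result is explicitly requested:

```python
import xarray as xr

ds = xr.open_dataset("tas_day_historical.nc", chunks={"time": 365})  # as above, hypothetical file

# This only builds a graph of chunked tasks; nothing has been calculated yet
time_mean = ds["tas"].mean(dim="time")

# Only now are the required chunks read in and processed
result = time_mean.compute()  # .load() behaves similarly
```

Writing the result to file, e.g. with `.to_netcdf()`, also triggers the computation.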

Index
6 changes: 4 additions & 2 deletions BigData/computations/computations.ipynb
@@ -127,6 +127,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "neither-armstrong",
"metadata": {},
@@ -141,7 +142,7 @@
"\n",
"### Min / Max / Mean / Stddev\n",
"\n",
"Functions like these are pretty simple to calculate regardless of dataset size, as they don't require the entire dataset to be in memory. You can just loop over the dimension to be reduced calculating value so far up to that step\n",
"Functions like these are pretty simple to calculate regardless of dataset size, as they don't require the entire dataset to be in memory. You can just loop over the dimension to be reduced by calculating the value so far up to that step\n",
"\n",
"In pseudocode (in Python you're better off using `data.min(axis=0)`, as that's optimised compared to a loop)\n",
"\n",
@@ -299,13 +300,14 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "vital-restoration",
"metadata": {},
"source": [
"### Climatologies\n",
"\n",
"Climatologies combine multiple years worth of data into a single sample year, for instance a daily mean climatology would output a year of data, with each day in the output the mean of all the days with the same month and day in the input.\n",
"Climatologies combine multiple years worth of data into a single sample year. For instance, a daily mean climatology would output a year of data, with each output day being the mean of all days with the same date from the input. In other words, the output for, e.g., March 3rd would be the average across all March 3rds in the multi-year dataset.\n",
"\n",
"Leap years require some consideration in a daily climatology, as those days will have 1/4 the samples of other days. Also consider how you are counting - with a day of year counting Feb 29 in a leap year will be matched up with 1 Mar in a non-leap year. "
]