Add intro text for Computations section #90

Merged (2 commits) on Mar 8, 2023
2 changes: 1 addition & 1 deletion BigData/_toc.yml
@@ -8,7 +8,7 @@ parts:
- file: data/data-format.md
- file: data/data-netcdf.md
- file: data/data-zarr.md
- caption: Large-scale data analysis
- caption: Analysis of large-scale datasets
chapters:
- file: computations/computations-intro.md
sections:
9 changes: 7 additions & 2 deletions BigData/computations/computations-intro.md
@@ -1,5 +1,10 @@
# Large-scale data analysis

Overview
## Overview
This section includes examples of common types of computations performed on large climate datasets. We focus on the theory of how these computations are handled on chunked datasets when using tools like `Xarray` and `Dask`. We then provide examples of the specific functions that are often used to carry out each computation and, where possible, demonstrations of these tools in use.

## How is analysis on large-scale data different from that on smaller datasets?
Large datasets are often too big to load into RAM on the computer or server that you use to do your analysis. However, we often don't need to do a computation on the entire dataset. The [section on large-scale climate data](https://acdguide.github.io/BigData/data/data-netcdf.html) discusses how large datasets are typically saved in "chunks". Analysis on large-scale datasets also makes use of these chunks, applying computations only to the portion of the dataset needed for a particular calculation.
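
As a minimal sketch of what this looks like in practice (the file name, variable name, and chunk size below are hypothetical), `Xarray` can open a dataset in chunks so that only the pieces needed for a given calculation are ever read:

```python
import xarray as xr

# Open the file lazily, split into chunks of one year along the time dimension.
# Only the metadata is read at this point; no data values are loaded.
ds = xr.open_dataset("tas_day_historical.nc", chunks={"time": 365})

print(ds["tas"].chunks)  # chunk layout along each dimension
```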

An important concept for doing analysis on large datasets is the idea of "lazy computation", previously mentioned in the [data structure section](https://acdguide.github.io/BigData/data/data-structure.html). This is what the software package `Xarray` uses in conjunction with tools like `Dask`. When you read a dataset in `Xarray`, it will just read in the metadata (e.g. the variable names, the dimensions, the units, the size of each dimension, and any other metadata that the data creators provided). As you write code to do a computation, the actual calculation isn't carried out until you explicitly request it, typically with a call like `.compute()` or `.load()`. Only then is the computation performed, by pulling in the specific chunks needed to complete the calculation. These concepts are incredibly powerful, and allow for quick analysis of big datasets!
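
Continuing the sketch above (the file and variable names are illustrative assumptions), no computation happens until the result is explicitly requested:

```python
import xarray as xr

ds = xr.open_dataset("tas_day_historical.nc", chunks={"time": 365})  # as above, hypothetical file

# This only builds a graph of chunked tasks; nothing has been calculated yet
time_mean = ds["tas"].mean(dim="time")

# Only now are the required chunks read in and processed
result = time_mean.compute()  # .load() behaves similarly
```

Writing the result to file, e.g. with `.to_netcdf()`, also triggers the computation.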

Index
6 changes: 4 additions & 2 deletions BigData/computations/computations.ipynb
@@ -127,6 +127,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "neither-armstrong",
"metadata": {},
@@ -141,7 +142,7 @@
"\n",
"### Min / Max / Mean / Stddev\n",
"\n",
"Functions like these are pretty simple to calculate regardless of dataset size, as they don't require the entire dataset to be in memory. You can just loop over the dimension to be reduced calculating value so far up to that step\n",
"Functions like these are pretty simple to calculate regardless of dataset size, as they don't require the entire dataset to be in memory. You can just loop over the dimension to be reduced by calculating the value so far up to that step\n",
"\n",
"In pseudocode (in Python you're better off using `data.min(axis=0)`, as that's optimised compared to a loop)\n",
"\n",
@@ -299,13 +300,14 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "vital-restoration",
"metadata": {},
"source": [
"### Climatologies\n",
"\n",
"Climatologies combine multiple years worth of data into a single sample year, for instance a daily mean climatology would output a year of data, with each day in the output the mean of all the days with the same month and day in the input.\n",
"Climatologies combine multiple years worth of data into a single sample year. For instance, a daily mean climatology would output a year of data, with each output day being the mean of all days with the same date from the input. In other words, the output for, e.g., March 3rd would be the average across all March 3rds in the multi-year dataset.\n",
"\n",
"Leap years require some consideration in a daily climatology, as those days will have 1/4 the samples of other days. Also consider how you are counting - with a day of year counting Feb 29 in a leap year will be matched up with 1 Mar in a non-leap year. "
]