chore(docs): correct typos and improve stylistic consistency #232

Merged
12 changes: 6 additions & 6 deletions docs/data_datacube.md
@@ -4,13 +4,13 @@

The `datacube.py` script collects Sentinel-2, Sentinel-1, and DEM data over individual MGRS tiles. The source list of the MGRS tiles to be processed is provided in an input file with MGRS geometries. Each run of the script will collect data for one of the MGRS tiles in the source file. The tile to be processed is based on the row index number provided as input. The MGRS tile ID is expected to be in the `name` property of the input file.
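
As a rough illustration of the tile lookup described above, a minimal sketch (assuming a geopandas-readable MGRS file with a `name` column; the path and row index are placeholders) could look like this:

```python
import geopandas as gpd

# Load the MGRS source file and pick one tile by row index, mirroring the lookup
# described above. The tile ID is read from the `name` property.
mgrs = gpd.read_file("mgrs_sample.fgb")  # placeholder path
row = mgrs.iloc[1]                       # row index passed as input
tile_id = row["name"]
print(tile_id, row.geometry.bounds)
```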

For the target MGRS tile, the script loops through the years between 2017 and 2023 in random order. For each year, it will search for the least cloudy Sentinel-2 scene. Based on the date of the selected Sentinel-2 scene, it will search for the Sentinel-1 scenes that are the closest match to that date, with a maximum of +/- 3 days of difference. It will include multiple Sentinel-1 scenes until the full MGRS tile is covered. There are cases where no matching Sentinel-1 scenes can be found, in which case the script moves to the next year. The script stops when 3 matching datasets were collected for 3 different years. Finally, the script will also select the intersecting part of the Copernicus Digital Elevation Model (DEM).
For the target MGRS tile, the script loops through the years between 2017 and 2023 in random order. For each year, it will search for the least cloudy Sentinel-2 scene. Based on the date of the selected Sentinel-2 scene, it will search for the Sentinel-1 scenes that are the closest match to that date, with a maximum of +/- 3 days of difference. It will include multiple Sentinel-1 scenes until the full MGRS tile is covered. If no matching Sentinel-1 scenes can be found, the script moves to the next year. The script stops when 3 matching datasets have been collected for 3 different years. Finally, the script will also select the intersecting part of the Copernicus Digital Elevation Model (DEM).
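
The per-year search can be sketched roughly as below, using Microsoft Planetary Computer's public STAC API; the collection names and helper structure are assumptions for illustration, not the script's actual code.

```python
from datetime import timedelta

import planetary_computer
import pystac_client

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

def least_cloudy_s2(bbox, year):
    """Return the least cloudy Sentinel-2 L2A item for one year, or None."""
    search = catalog.search(
        collections=["sentinel-2-l2a"],
        bbox=bbox,
        datetime=f"{year}-01-01/{year}-12-31",
    )
    items = list(search.items())
    return min(items, key=lambda item: item.properties["eo:cloud_cover"], default=None)

def matching_s1(bbox, s2_item):
    """Return Sentinel-1 items within +/- 3 days of the selected Sentinel-2 scene."""
    start = (s2_item.datetime - timedelta(days=3)).strftime("%Y-%m-%d")
    end = (s2_item.datetime + timedelta(days=3)).strftime("%Y-%m-%d")
    search = catalog.search(
        collections=["sentinel-1-rtc"], bbox=bbox, datetime=f"{start}/{end}"
    )
    return list(search.items())
```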

The script will then download all of the Sentinel-2 scene, and match the data cube with the corresponding Sentinel-1 and DEM data. The scene level data is then split into smaller chips of a fixed size of 512x512 pixels. The Sentinel2, Sentinel-1 and DEM bands are then packed together in a single TIFF file for each chip. These are saved locally and synced to a S3 bucket at the end of the script. The bucket name can be specified as input.
The script will then download the Sentinel-2 scene and match the data cube with the corresponding Sentinel-1 and DEM data. The scene-level data is then split into smaller chips of a fixed size of 512x512 pixels. The Sentinel-2, Sentinel-1 and DEM bands are then packed together in a single TIFF file for each chip. These are saved locally and synced to an S3 bucket at the end of the script. The bucket name can be specified as input.
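
The chipping step can be pictured with a short rasterio sketch; the stacked input file is a placeholder, since the real script assembles the bands in memory before writing the per-chip TIFFs.

```python
import rasterio
from rasterio.windows import Window

CHIP_SIZE = 512

with rasterio.open("stacked_scene.tif") as src:  # placeholder input
    profile = src.profile
    for row_off in range(0, src.height - CHIP_SIZE + 1, CHIP_SIZE):
        for col_off in range(0, src.width - CHIP_SIZE + 1, CHIP_SIZE):
            window = Window(col_off, row_off, CHIP_SIZE, CHIP_SIZE)
            chip = src.read(window=window)  # all bands for this 512x512 window
            profile.update(
                width=CHIP_SIZE,
                height=CHIP_SIZE,
                transform=src.window_transform(window),
            )
            with rasterio.open(f"chip_{row_off}_{col_off}.tif", "w", **profile) as dst:
                dst.write(chip)
```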

For testing and debugging, the data size can be reduced by specifying a pixel window using the `subset` parameter. Data will then be requested only for the specified pixel window. This will reduce the data size considerably which speeds up the processing during testing.

The example run below will search for data for the geometry with row index 1 in a with a local MGRS sample file, for a 1000x1000 pixel window.
The example run below will search for data for the geometry with row index 1 in a local MGRS sample file for a 1000x1000 pixel window.

```bash
python datacube.py --sample /home/user/Desktop/mgrs_sample.fgb --bucket "my-bucket" --subset "1000,1000,2000,2000" --index 1
@@ -38,14 +38,14 @@ docker push $ecr_repo_id.dkr.ecr.us-east-1.amazonaws.com/fetch-and-run

### Prepare AWS batch

To prepare batch, we need to create a compute environment, job queue, and job
To prepare a batch, we need to create a compute environment, job queue, and job
definition.

Example configurations for the compute environment and the job definition are
provided in the `batch` directory.

The `submit.py` script contains a loop for submitting jobs to the queue. An
alternative to this individual job submissions would be to use array jobs, but
alternative to these individual job submissions would be to use array jobs, but
for now the individual submissions are simpler and failures are easier to track.
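
As a sketch of what such a submission loop can look like (queue and job definition names are placeholders, and passing the tile index through an environment variable is an assumption):

```python
import boto3

batch = boto3.client("batch")

# Submit one job per MGRS tile index; the range is a placeholder.
for index in range(10):
    batch.submit_job(
        jobName=f"datacube-{index}",
        jobQueue="fetch-and-run-queue",
        jobDefinition="fetch-and-run",
        containerOverrides={
            "environment": [{"name": "INDEX", "value": str(index)}],
        },
    )
```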

### Create ZIP file with the package to execute
Expand All @@ -54,7 +54,7 @@ Package the model and the inference script into a zip file. The `datacube.py`
script is the one that will be executed on the instances.

Put the scripts in a zip file and upload the zip package into S3 so that
the batch fetch and run can use it.
the batch fetch-and-run can use it.

```bash
zip -FSrj "batch-fetch-and-run.zip" ./scripts/pipeline* -x "scripts/pipeline*.pyc"
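
# Then upload the package to S3 so the fetch-and-run job can retrieve it.
# The bucket below is a placeholder, not the project's actual location.
aws s3 cp batch-fetch-and-run.zip s3://<your-bucket>/batch-fetch-and-run.zip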
29 changes: 14 additions & 15 deletions docs/data_labels.md
@@ -1,10 +1,10 @@
# Benchmark dataset labels

A benchmark dataset is a collection of data used for evaluating the performance
of algorithms, models or systems in a specific field of study. These datasets
are crucial in providing a common ground for comparing different approaches,
of algorithms, models, or systems in a specific field of study. These datasets
are crucial for providing common ground for comparing different approaches,
allowing researchers to assess the strengths and weaknesses of various methods.
For Clay, we evaluate our model on benchmark datasets with suitable downstream
For Clay, we evaluate our model on benchmark datasets that have suitable downstream
tasks.

For our initial benchmark dataset, we've implemented the
Expand All @@ -14,40 +14,39 @@ evaluation of finetuning on a downstream task. The task itself is
[segmentation](https://paperswithcode.com/task/semantic-segmentation) of water
pixels associated with recorded flood events.

The original dataset consists of 2/3 of our Foundation model's datacube inputs
The original dataset consists of two out of three of our Foundation model's datacube inputs
(Sentinel-1 and Sentinel-2) along with raster water mask labels for both
sensors. Each image is 512x512 pixels in terms of width and height. The
original Sentinel-2 images are L1C, which is Top-of-Atmosphere reflectance. We
are training Clay with surface reflectance, however, so we ultimately used the
geospatial bounds from the GeoTIFF and image timestamp (from the granule name)
to query
sensors. Each image is 512x512 pixels. The
original Sentinel-2 images are L1C, which is Top-of-Atmosphere reflectance. We train
Clay with surface reflectance, however, so we ultimately used the geospatial bounds
from the GeoTIFF and image timestamp (from the granule name) to query
[Microsoft Planetary Computer's STAC API for L2A (Bottom-of-Atmosphere a.k.a. "surface reflectance") Sentinel-2](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a)
scenes in the same time and space, with the same channels expected by Clay. We
then followed the same `datacube` creation logic to generate datacubes with
Sentinel-1 VV and VH and the Copernicus digital elevation model (DEM). We also
Sentinel-1 VV and VH and the Copernicus Digital Elevation Model (DEM). We also
ensured that the Sentinel-1 data was within a +/- 3 day interval of each
reference Sentinel-2 scene (same method used by the benchmark dataset authors)
and that the Sentinel-1 data was indeed already included in the benchmark
datasets list of granules. The datacubes generated have all three inputs
dataset's list of granules. The datacubes generated have all three inputs
matching the exact specs of the Foundation model's training data, at 512x512
pixels.
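
The bounds-and-timestamp query can be sketched along these lines; the chip filename and date are placeholders, and this is not the exact code used to build the dataset.

```python
import planetary_computer
import pystac_client
import rasterio
from rasterio.warp import transform_bounds

# Read the geospatial bounds of a benchmark L1C chip and reproject to lat/lon.
with rasterio.open("chip_s2_l1c.tif") as src:  # placeholder path
    bbox = transform_bounds(src.crs, "EPSG:4326", *src.bounds)

date = "2019-02-15"  # parsed from the granule name; value here is made up

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
search = catalog.search(collections=["sentinel-2-l2a"], bbox=bbox, datetime=date)
items = list(search.items())
print(f"Found {len(items)} L2A scenes covering the chip on {date}")
```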

Here is an example of a datacube we generated for the dataset:

![datacube](https://github.com/Clay-foundation/model/assets/23487320/94dffcf5-4075-4c17-ac96-01c11bcb299b)

The images left to right show a true color representation of the Sentinel-2
scene, the Sentinel-1 VH polarization and the digital elevation model.
The images, left to right, show a true-color representation of the Sentinel-2
scene, the Sentinel-1 VH polarization, and the Digital Elevation Model.

![gt](https://github.com/Clay-foundation/model/assets/23487320/4ac92af7-6931-4249-a920-7d29453b9b31)

Here we have something similar, but this time just the Sentinel-1 and
Sentinel-2 scenes with the Sentinel-1 water mask (ground truth) overlaid.

Last note on this benchmark dataset that we've adapted for Clay, we made sure
Last note on this benchmark dataset that we've adapted for Clay: we made sure
to preserve the metadata for timestamp and geospatial coordinates in the
datacube such that we can embed information in the way that the Clay Foundation
model expects. We also preserve the flood event information too, for analysis
model expects. We also preserve the flood event information for analysis
during finetuning.

The script for generating these datacubes is at
30 changes: 15 additions & 15 deletions docs/model_embeddings.md
@@ -1,25 +1,25 @@
# Generating vector embeddings

Once you have a pretrained model, it is now possible to pass some input images
into the encoder part of the Vision Transformer, and produce vector embeddings
Once you have a pretrained model, it is possible to pass some input images
into the encoder part of the Vision Transformer and produce vector embeddings
which contain a semantic representation of the image.

## Producing embeddings from the pretrained model

Step by step instructions to create embeddings for a single MGRS tile location
(e.g. 27WXN).
Step-by-step instructions to create embeddings for a single MGRS tile location
(e.g. 27WXN):

1. Ensure that you can access the 13-band GeoTIFF data files.

```
aws s3 ls s3://clay-tiles-02/02/27WXN/
```

This should report a list of filepaths if you have the correct permissions,
otherwise, please set up authentication before continuing.
This should report a list of filepaths if you have the correct permissions.
Otherwise, please set up authentication before continuing.

2. Download the pretrained model weights, and put them in the `checkpoints/`
folder.
2. Download the pretrained model weights and put them in the `checkpoints/`
folder:

```bash
aws s3 cp s3://clay-model-ckpt/v0/clay-small-70MT-1100T-10E.ckpt checkpoints/
Expand All @@ -37,7 +37,7 @@ Step by step instructions to create embeddings for a single MGRS tile location
For example, an AWS g5.4xlarge instance would be a cost effective option.
```

3. Run model inference to generate the embeddings.
3. Run model inference to generate the embeddings:

```bash
python trainer.py predict --ckpt_path=checkpoints/clay-small-70MT-1100T-10E.ckpt \
Expand All @@ -51,7 +51,7 @@ Step by step instructions to create embeddings for a single MGRS tile location
This should output a GeoParquet file containing the embeddings for MGRS tile
27WXN (recall that each 10000x10000 pixel MGRS tile contains hundreds of
smaller 512x512 chips), saved to the `data/embeddings/` folder. See the next
sub-section for details about the embeddings file.
subsection for details about the embeddings file.

The `embeddings_level` flag determines how the embeddings are calculated.
The default is `mean`, resulting in one average embedding per MGRS tile of
Expand All @@ -61,9 +61,9 @@ Step by step instructions to create embeddings for a single MGRS tile location
dimensionality of the encoder output, including the band group
dimension. The array size of those embeddings is 6 * 16 * 16 * 768.

The embeddings are flattened into one dimensional arrays because pandas
The embeddings are flattened into one-dimensional arrays because pandas
does not allow for multidimensional arrays. This makes it necessary to
reshape the flattened arrays to access the patch level embeddings.
reshape the flattened arrays to access the patch-level embeddings.

```{note}
For those interested in how the embeddings were computed, the predict step
@@ -113,7 +113,7 @@ Example: `27WXN_20200101_20231231_v001.gpq`

### Table schema

Each row within the GeoParquet table is generated from a 512x512 pixel image,
Each row within the GeoParquet table is generated from a 512x512 pixel image
and contains a record of the embeddings, spatiotemporal metadata, and a link to
the GeoTIFF file used as the source image for the embedding. The table looks
something like this:
@@ -161,9 +161,9 @@ Further reading:
- https://cloudnativegeo.org/blog/2023/10/the-geoparquet-ecosystem-at-1.0.0
```
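
To get a feel for the table, the GeoParquet file can be opened with geopandas; the path below follows the naming pattern described above and is a placeholder.

```python
import geopandas as gpd

gdf = gpd.read_parquet("data/embeddings/27WXN_20200101_20231231_v001.gpq")

# One row per 512x512 chip, with the flattened embedding, spatiotemporal
# metadata, and a link to the source GeoTIFF.
print(len(gdf))
print(gdf.columns.tolist())
```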

## Converting to patch level embeddings
## Converting to patch-level embeddings

In the case where patch level embeddings are requested, the resulting array
In the case where patch-level embeddings are requested, the resulting array
will have all patch embeddings ravelled in one row. Each row represents a
512x512 pixel image, and contains 16x16 patch embeddings.
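
A minimal numpy sketch of the reshape, assuming the patch-level sizes quoted earlier (6 band groups, a 16x16 patch grid, and 768-dimensional embeddings) and an `embeddings` column name, which is an assumption:

```python
import numpy as np

row = gdf.iloc[0]  # one 512x512 chip, loaded as in the sketch above
flat = np.asarray(row["embeddings"])    # flattened patch embeddings
patches = flat.reshape(6, 16, 16, 768)  # band group, patch row, patch column, embedding

# For example, the embedding of the top-left patch in the first band group:
vector = patches[0, 0, 0, :]
```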

16 changes: 8 additions & 8 deletions docs/run_region.md
@@ -63,26 +63,26 @@ mgrs_aoi.to_file("data/mgrs/mgrs_aoi.fgb")

This will select the MGRS tiles that intersect with your AOI. The processing
will then happen for each of the MGRS tiles. This will most likely provide
slightly more data than the AOI itself, as the whole tile data will downloaded
slightly more data than the AOI itself, as the whole tile data will be downloaded
for each matched MGRS tile.

Each run of th datacube script will take an index as input, which is the index
Each run of the datacube script will take an index as input, which is the index
of the MGRS tile within the input file. This is why we need to download the
data in a loop.
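
A minimal sketch of that loop, reusing the `datacube.py` arguments shown earlier (paths, bucket, and the number of tiles are placeholders):

```bash
for index in 0 1 2; do
    python datacube.py --sample data/mgrs/mgrs_aoi.fgb --bucket "my-bucket" --index $index
done
```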

A list of date ranges can be specified. The script will look for the least
cloudy Sentinel-2 scene for each date range, and match Sentinel-1 dates near
cloudy Sentinel-2 scene for each date range and match Sentinel-1 dates near
the identified Sentinel-2 dates.

The output folder can be specified as a local folder, or a bucket can be
specified to upload the data to S3.
The output folder can be specified as a local folder or a bucket can be
specified if you want to upload the data to S3.

Note that for the script to run, a Microsoft Planetary Computer token needs
to be set up, consult the [Planetary Computer SDK](https://github.com/microsoft/planetary-computer-sdk-for-python)
to be set up. Consult the [Planetary Computer SDK](https://github.com/microsoft/planetary-computer-sdk-for-python)
documentation on how to set up the token.
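
One common way to provide the token is through an environment variable; the variable name below is based on the SDK's documented setup, but check the SDK documentation for the authoritative mechanism.

```bash
# Assumed environment-variable setup; verify against the Planetary Computer SDK docs.
export PC_SDK_SUBSCRIPTION_KEY="<your-planetary-computer-subscription-key>"
```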

By default, the datacube script will download all the data available for each
MGRS tile it processes. So the output might include imagery chips that are
MGRS tile it processes, so the output might include imagery chips that are
outside of the AOI specified.

To speed up processing in the example below, we use the subset argument to
@@ -110,7 +110,7 @@ done
The checkpoints can be accessed directly from Hugging Face
at https://huggingface.co/made-with-clay/Clay.

The following command will run the model to create the embeddings,
The following command will run the model to create the embeddings
and automatically download and cache the model weights.

```bash