Docs: promoting to production (#162)
Update dataset ingest links and add new overview of publishing data to production

---------

Co-authored-by: smohiudd <[email protected]>
anayeaye and smohiudd authored Aug 20, 2024
1 parent 1a8b557 commit 24b60ad
Showing 5 changed files with 74 additions and 444 deletions.
5 changes: 4 additions & 1 deletion .markdownlint-cli2.jsonc
@@ -1,5 +1,8 @@
{
"config": {
"MD013": false // disable line length checks
"MD013": false, // disable line length checks
"MD033": {
"allowed_elements": [ "a", "b", "details", "summary" ]
}
}
}
98 changes: 70 additions & 28 deletions README.md
@@ -2,32 +2,42 @@

[![GitHub Workflow Status (with event)](https://img.shields.io/github/actions/workflow/status/nasa-impact/veda-data/ci.yaml?style=for-the-badge&label=CI)](https://github.com/NASA-IMPACT/veda-data/actions/workflows/ci.yaml)

This repository houses config data used to load datasets into the [VEDA catalog](https://nasa-impact.github.io/veda-docs/services/apis.html). Inclusion in the VEDA catalog is a prerequisite for displaying datasets in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/).

The config data provided here is processed by the [veda-data-airflow](https://github.com/NASA-IMPACT/veda-data-airflow) ingestion system. See [Dataset submission process](#dataset-submission-process) for details about submitting work to the ingestion system.

## Dataset submission process

![veda-data-publication][veda-data-publication]

To add data to VEDA you will:

1. **Stage your files:** Upload files to the staging bucket `s3://veda-data-store-staging` (which you can do with a VEDA JupyterHub account; [request access here](https://nasa-impact.github.io/veda-docs/services/jupyterhub.html)) or to a self-hosted S3 bucket that has shared read access with the VEDA services.

2. **Generate STAC metadata in the staging catalog:** Metadata must first be added to the staging catalog at [staging.openveda.cloud/api/stac](https://staging.openveda.cloud/api/stac). You will need to create a dataset config file and submit it to the `/workflows/dataset/publish` endpoint to generate a STAC Collection and Item records for the files you uploaded in step 1 (a minimal sketch of this submission appears after this list). See the detailed steps for the [dataset submission process](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/) in the contributing section of [veda-docs](https://nasa-impact.github.io/veda-docs), where you can also find a full ingestion workflow example in the [GEOGLAM ingest notebook](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/example-template/example-geoglam-ingest.html).

3. **Acceptance testing:** Perform acceptance testing appropriate for your data. In most cases this means opening a dataset PR in [veda-config](https://github.com/NASA-IMPACT/veda-config) to generate a dashboard preview of the data; see [veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html) for instructions on generating a dashboard preview.

4. **Promote to production!** Open a PR in the [veda-data](https://github.com/NASA-IMPACT/veda-data) repo with the dataset config metadata you used to add your data to the staging catalog in step 2, placing your config in `ingestion-data/production/dataset-config`. When your PR is approved, this configuration will be used to generate records in the production VEDA catalog!

5. **[Optional] Share your data:** Share your data in the [VEDA Dashboard](https://www.earthdata.nasa.gov/dashboard/) by submitting a PR to [veda-config](https://github.com/NASA-IMPACT/veda-config) (see [veda-docs/contributing/dashboard-configuration](https://nasa-impact.github.io/veda-docs/contributing/dashboard-configuration/dataset-configuration.html)), and add JupyterHub-hosted usage examples to [veda-docs/contributing/docs-and-notebooks](https://nasa-impact.github.io/veda-docs/contributing/docs-and-notebooks.html).
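
As a rough sketch of the step 2 submission (the base URL, auth handling, and file path below are assumptions for illustration; only the `/workflows/dataset/publish` endpoint itself comes from the steps above):

```python
import json

import requests

# Assumed values for illustration; adjust for your deployment and dataset.
WORKFLOWS_API = "https://staging.openveda.cloud/api/workflows"  # assumed base URL
TOKEN = "<your-access-token>"  # obtain through your deployment's auth flow

# Load the dataset config prepared for the staging catalog.
with open("ingestion-data/staging/dataset-config/my-dataset.json") as f:
    dataset_config = json.load(f)

# Submit the config to the dataset publish workflow.
response = requests.post(
    f"{WORKFLOWS_API}/dataset/publish",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=dataset_config,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```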

## Project ingestion data structure

When submitting STAC records for ingestion, a pull request can be made with the data structured as described below. The `ingestion-data/` directory contains artifacts of the ingestion configuration used to publish to the staging and production catalogs.

> **Note**
> Various ingestion workflows are supported and documented below, but only the configuration metadata used to publish to the VEDA catalog is stored in this repo. It is not expected that every ingestion will follow exactly the same pattern, nor will each ingested collection have all types of configuration metadata here. The primary method used to ingest is [**`dataset-config`**](#stagedataset-config).

### `<stage>/collections/`

The `ingestion-data/<stage>/collections/` directory holds JSON files representing the data for VEDA collection metadata (STAC). STAC collection metadata can be generated from an id, title, and description using PySTAC; see this [veda-docs/contributing notebook example](https://nasa-impact.github.io/veda-docs/notebooks/veda-operations/stac-collection-creation.html) to get started.
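
For example, a minimal collection skeleton could be generated with PySTAC along these lines (all values below are placeholders, not a real VEDA collection):

```python
import json
from datetime import datetime

import pystac

# Placeholder metadata; replace with real values for your dataset.
collection = pystac.Collection(
    id="my-collection-id",
    title="My Collection",
    description="Short description of the dataset.",
    license="CC0-1.0",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime(2020, 1, 1), None]]),
    ),
)

# Serialize to the JSON structure stored in ingestion-data/<stage>/collections/.
print(json.dumps(collection.to_dict(), indent=2))
```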

Each collection configuration file should follow this format:

<details>
<summary><b>/collections/collection_id.json</b></summary>

```json
{
  "id": "<collection-id>",
  ...
}
```

</details>

### `<stage>/discovery-items/`

The `ingestion-data/<stage>/discovery-items/` directory holds JSON files representing the inputs for initiating the discovery, ingest, and publication workflows. A file can contain either a single input event or a list of input events.

Each discovery input file should follow this format:

<details>
<summary><b>/discovery-items/collection_id.json</b></summary>

```json
{
  "collection": "<collection-id>",
  "discovery": "<s3/cmr>",

  ## for s3 discovery
  "prefix": "<s3-key-prefix>",
  "bucket": "<s3-bucket>",
  "filename_regex": "<filename-regex>",
  "datetime_range": "<month/day/year>",

  ## for cmr discovery
  "version": "<collection-version>",
  "temporal": ["<start-date>", "<end-date>"],
  "bounding_box": ["<bounding-box-as-comma-separated-LBRT>"],
  "include": "<filename-pattern>",

  ### misc
  "cogify": "<true/false>",
  "upload": "<true/false>",
  "dry_run": "<true/false>"
}
```

</details>
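
For illustration, a single s3 discovery input event could be assembled and written with a short script like the one below (the collection id, bucket, regex, and `datetime_range` value are hypothetical):

```python
import json

# Hypothetical single s3 discovery input event.
event = {
    "collection": "my-collection-id",
    "discovery": "s3",
    "bucket": "veda-data-store-staging",
    "prefix": "my-collection-id/",
    "filename_regex": "^(.*)\\.tif$",
    "datetime_range": "month",
}

# Write the event where discovery inputs are stored in this repo.
with open("ingestion-data/staging/discovery-items/my-collection-id.json", "w") as f:
    json.dump(event, f, indent=2)
```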

### `<stage>/dataset-config/`

The `ingestion-data/<stage>/dataset-config/` directory holds JSON files that can be used with the `dataset/publish` workflows endpoint, combining both collection metadata and discovery items. For an example of this ingestion workflow, see the [GEOGLAM ingest notebook in veda-docs/contributing/dataset-ingestion](https://nasa-impact.github.io/veda-docs/contributing/dataset-ingestion/example-template/example-geoglam-ingest.html).

<details>
<summary><b>/dataset-config/collection_id.json</b></summary>

```json
{
  ...
    }
  ]
}
```

</details>

### `production/transfer-config`

This directory contains the configuration needed to execute a standalone Airflow DAG that copies data from a specified staging bucket and prefix to a permanent location in `s3://veda-data-store`, using the `collection_id` as a prefix.
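
Conceptually, the transfer behaves like the boto3 sketch below (the real work is done by the Airflow DAG; the bucket names, regex, and key layout here are assumptions for illustration):

```python
import re

import boto3

# Placeholder config mirroring a transfer-config file.
config = {
    "collection": "my-collection-id",
    "origin_bucket": "veda-data-store-staging",
    "origin_prefix": "my-collection-id/",
    "filename_regex": "^(.*)\\.tif$",
    "dry_run": True,
}

s3 = boto3.client("s3")
pattern = re.compile(config["filename_regex"])

# Copy every matching staged object to s3://veda-data-store/<collection-id>/.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=config["origin_bucket"], Prefix=config["origin_prefix"]):
    for obj in page.get("Contents", []):
        filename = obj["Key"].split("/")[-1]
        if not pattern.match(filename):
            continue
        target_key = f"{config['collection']}/{filename}"
        if config["dry_run"]:
            print(f"would copy {obj['Key']} -> {target_key}")
        else:
            s3.copy_object(
                Bucket="veda-data-store",
                Key=target_key,
                CopySource={"Bucket": config["origin_bucket"], "Key": obj["Key"]},
            )
```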

Each transfer configuration file should follow this format:

<details>
<summary><b>/production/transfer-config/collection_id.json</b></summary>

```json
{
  "collection": "<collection-id>",

  ## the location of the staged files
  "origin_bucket": "<s3-bucket>",
  "origin_prefix": "<s3-key-prefix>",
  "bucket": "<s3-bucket>",
  "filename_regex": "<filename-regex>",

  ### misc
  "dry_run": "<true/false>"
}
```

</details>

## Validation

This repository provides a script for validating all collections in the `ingestion-data/` directory.
First, install the requirements (preferably in a virtual environment):

```shell
pip install -r requirements.txt
```

…

```shell
pip-compile
```

This will update `requirements.txt` with a complete, realized set of Python dependencies.

[veda-data-publication]: ./docs/publishing-data.excalidraw.png
Binary file added docs/publishing-data.excalidraw.png
