Merge pull request #21 from IVproger/19-docs
Update Docs
ArtemSBulgakov authored Jul 23, 2024
2 parents f0ae1c7 + ca8b4fe commit fe8296e
Showing 13 changed files with 79 additions and 21 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -85,3 +85,6 @@ We use Docker Compose to run all services of Airflow and ZenML server.
```
3. Wait for all models to train.
4. Access MLFlow server at http://localhost:5000.
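
Once the stack is up, a quick way to confirm the MLFlow server is reachable is to list its experiments from Python. A minimal sketch, assuming the `mlflow` client package is installed in your local environment:

```python
import mlflow

# Point the client at the server from step 4.
mlflow.set_tracking_uri("http://localhost:5000")

client = mlflow.MlflowClient()
for experiment in client.search_experiments():
    print(experiment.experiment_id, experiment.name)
```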

## Docs
Each folder contains a `README.md` file with a short description of every file. The code is well documented with inline comments and descriptive symbol names.
9 changes: 9 additions & 0 deletions api/README.md
@@ -0,0 +1,9 @@
# API
```
api
├── api.Dockerfile # Dockerfile for starting the Flask backend
├── app.py # Flask entrypoint
├── gradio_app.py # Gradio config file
├── gradio.Dockerfile # Dockerfile for starting the Gradio frontend
└── ml.Dockerfile # Dockerfile for starting the model with MLFlow
```
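
For orientation, a Flask backend like `app.py` typically loads the MLFlow model and exposes a predict endpoint. A minimal sketch, not the actual `app.py`; the model URI, port, and JSON schema are assumptions:

```python
import mlflow.pyfunc
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
# Hypothetical registry URI; the real model name and stage may differ.
model = mlflow.pyfunc.load_model("models:/champion/Production")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON list of feature records, e.g. [{"feature": 1.0, ...}].
    features = pd.DataFrame(request.get_json())
    return jsonify(model.predict(features).tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)
```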
16 changes: 15 additions & 1 deletion configs/README.md
@@ -1 +1,15 @@
# MLops-project
# Configs
```
configs
├── data_sample.yaml # Sample data configuration
├── data_transformations.yaml # Defines transformations applied to data rows
├── data_version.txt # Current data version (for convenience)
├── experiment.yaml # MLFlow experiment definition
├── main.yaml # Main configuration file
├── model
│   ├── lr.yaml # Parameters for Logistic Regression
│   ├── model.yaml # Definition of folds and metrics
│   ├── rf.yaml # Parameters for Random Forest
│   └── xgboost.yaml # Parameters for XGBoost
└── README.md # Project documentation
```
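
This layout matches the Hydra convention (a `main.yaml` entry point plus a `model` config group). If that is how these configs are consumed, a training script would load them roughly like this (a sketch; the `config_path` and the `train` function are illustrative):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../configs", config_name="main", version_base=None)
def train(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # the fully merged configuration
    print(cfg.model)               # whichever model group was selected

if __name__ == "__main__":
    train()
```

With this setup, `python train.py model=xgboost` would swap in `model/xgboost.yaml` from the command line.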
8 changes: 7 additions & 1 deletion data/README.md
@@ -1,2 +1,8 @@
# Data outline
- `samples/sample.csv` - our primary sample file. Synced via DVC.
```
data
├── README.md
└── samples
    ├── sample.csv # Our primary sample file, synced via DVC
    └── sample.csv.dvc # DVC pointer file for sample.csv
```
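
Because `sample.csv` is tracked by DVC, it can be read through the `dvc.api` helpers without a full checkout. A minimal sketch, assuming the DVC remote is configured and `rev` points at a revision that has the file:

```python
import dvc.api
import pandas as pd

# Stream the DVC-tracked sample straight from the remote.
with dvc.api.open("data/samples/sample.csv", rev="main") as f:
    df = pd.read_csv(f)

print(df.head())
```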
1 change: 0 additions & 1 deletion docs/README.md

This file was deleted.

3 changes: 2 additions & 1 deletion models/README.md
@@ -1 +1,2 @@
# MLops-project
# Models
This folder contains the trained "champion" model for each architecture.
15 changes: 10 additions & 5 deletions notebooks/README.md
@@ -1,6 +1,11 @@
# Notebooks outline
- `business_data_understanding.ipynb` - showcases the Business Data Understanding of the project.
- `data_analysis.ipynb` - EDA for the dataset.
- `data_quality.ipynb` - Defines data requirements for the project and showcases data checks.
- `feature_descriptions.csv` - Human-understandable description of all of the available features (taken from the dataset's datacard on Kaggle)
- `poc.ipynb` - Proof-of-concept model showcase that solves the business problem.
```
notebooks
├── business_data_understanding.ipynb # Showcases the Business Data Understanding of the project
├── data_analysis.ipynb # EDA for the dataset
├── expectations.ipynb # Great Expectations data checks
├── data_quality.ipynb # Defines data requirements for the project and showcases data checks
├── poc.ipynb # Proof-of-concept model showcase that solves the business problem
├── xgboost_experiment.ipynb # Proof-of-concept XGBoost model
└── feature_descriptions.csv # Human-readable descriptions of all available features (from the dataset's Kaggle datacard)
```
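
As a rough illustration of what `expectations.ipynb` and `data_quality.ipynb` involve, a Great Expectations check looks roughly like this (a sketch against the 0.16+ fluent API; the column name is a placeholder, not one of the project's actual expectations):

```python
import great_expectations as gx
import pandas as pd

context = gx.get_context()
df = pd.read_csv("data/samples/sample.csv")

# Wrap the dataframe in a validator and run one illustrative check.
validator = context.sources.pandas_default.read_dataframe(df)
result = validator.expect_column_values_to_not_be_null(column="some_column")
print(result.success)
```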
1 change: 0 additions & 1 deletion outputs/README.md

This file was deleted.

3 changes: 2 additions & 1 deletion reports/README.md
@@ -1 +1,2 @@
# MLops-project
# Reports
This folder contains Giskard reports (if any meaningful reports have been generated and tracked with Git).
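
For reference, generating such a report with Giskard's scan API looks roughly like this (a sketch; the dataset path, target column, and the stand-in prediction function are all placeholders for the project's real model):

```python
import giskard
import numpy as np
import pandas as pd

df = pd.read_csv("data/samples/sample.csv")
dataset = giskard.Dataset(df=df, target="label")  # "label" is a placeholder

def predict_fn(batch: pd.DataFrame) -> np.ndarray:
    # Stand-in for the champion model: constant class probabilities.
    return np.tile([0.9, 0.1], (len(batch), 1))

model = giskard.Model(model=predict_fn, model_type="classification",
                      classification_labels=[0, 1])
report = giskard.scan(model, dataset)
report.to_html("reports/giskard_scan.html")
```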
17 changes: 12 additions & 5 deletions scripts/README.md
@@ -1,5 +1,12 @@
# Scripts outline
- `install_requirements.sh` - installs all of the requirements. Make sure that you have activated a local environment.
- `test_data.sh` - installs, tests, and runs GX on the data.
- `airflow_activate.sh` - runs all Airflow services to track and run DAGs.
- `airflow_cleanup.sh` - kills all Airflow processes to clean up the working directory and restart Airflow pipelines.
# Scripts
```
scripts
├── airflow_activate.sh # Outdated by Docker
├── airflow_cleanup.sh # Outdated by Docker
├── airflow_logs.sh # Outdated by Docker
├── extend_activate.sh # "Extends" shell to include AIRFLOW_HOME env var
├── extract_data.sh # Runs src.data Python script
├── install_requirements.sh # Installs dependencies
├── push_sample_version.sh # Outdated
└── test_data.sh # Samples, validates, versions, and commits data
```
4 changes: 2 additions & 2 deletions services/README.md
@@ -1,2 +1,2 @@
# GX outline
- `gx/great_expectations.yaml` - primary configuration file for GX
# Services outline
This folder contains config files related to Airflow and GX
6 changes: 6 additions & 0 deletions services/airflow/dags/README.md
@@ -0,0 +1,6 @@
```
pipelines
├── data_extract_v0_dag.py # Data extraction Airflow pipeline (unmaintained)
├── data_extract_v1_dag.py # Data extraction Airflow pipeline (latest)
├── data_prepare_dag.py # Data preparation Airflow pipeline
└── data_prepare.py # Data preparation ZenML pipeline
```
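
For orientation, the extraction DAGs follow the standard Airflow TaskFlow shape. A minimal sketch, assuming Airflow 2.x; the task bodies, names, and schedule here are placeholders, not the real pipeline logic:

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def data_extract_sketch():
    @task
    def extract() -> str:
        return "data/samples/sample.csv"  # placeholder extraction step

    @task
    def validate(path: str) -> None:
        print(f"validating {path}")       # placeholder validation step

    validate(extract())

data_extract_sketch()
```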
14 changes: 11 additions & 3 deletions src/README.md
@@ -1,4 +1,12 @@
# Source code outline
- `data_quality.py` - Python script to run Great eXpectations on the dataset
- `data_transformations.py` - Python script with all data transformation functions for EDA and POC model
- `data.py` - script for data sampling and validation
```
src
├── data.py # Functions to manipulate data. If run on its own, downloads, samples, validates, and versions data
├── data_quality.py # Unmaintained; `load_context_and_sample_data` is still used in two notebooks. Kept for archival reasons
├── data_transformations.py # Functions for data transformation
├── evaluate.py # Script that validates a given model
├── main.py # Entry point that runs model training (MLFlow)
├── model.py # MLFlow-related functions
├── utils.py # Utility functions
└── validate.py # Giskard model validation. Generates a Giskard report
```
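
To make the `main.py`/`model.py` flow concrete, a typical MLFlow training run has this shape (a sketch only; the model, parameters, and metric are illustrative, not the project's actual setup):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("sketch-experiment")

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = LogisticRegression(C=1.0).fit(X_train, y_train)
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```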
