Commit

merge codebase changes on main into default (#92)
* wip: dataloader first draft

* Fixing train, val, and test path

* Added initial project structure

Added a set of directories with (mostly) empty/dummy .py files for now, so that everyone can see how the project will be structured. On top of the present directories, there will also be a datasets and a logs directory, the latter created dynamically at train or validation time.

* rename file, remove one-hot encode

* Revert "wip: dataloader first draft"

* Updating component loading section

* sequence dataloader baseline model

* fixing a couple typos

* Delete src/metrics directory

Deleting metrics directory as it was decided we'll have only one file with all metrics.

* Added refactored DDPM and UNet from notebook V2

Refactored Lucas's DDPM, UNet and units and added them as PL modules.

* Update diffusion.py

Added "instantiate_from_config" import.

* Update ddpm.py

Added nucleotides as a parameter with a default of 4 to the sample method.

* wip: separate train/val/test subclasses

* Delete codebase/src/data directory

* Updated PL dataloader

* placeholder test file

* Update unet_lucas.py

Added default function import.

* Added matching dummy test files

* complete: initial dataloader

* Added config template

Designed config template mainly for PL-related parameters. Keeping multiprocessing arguments for multi-GPU for the first test, which we'll change to multi-node. Diffusion and UNet parameters can easily vary.

* Delete dummy_config.yaml

* delete test_diffusion

* fix: fixed function naming convention

* feat: Add initial CI proposal

* feat: Add a simple pyproject config file

* wip: train.py + configs

* config folder structure update

* fix datapath param of datasets

* add additional sequence encoding schemes + separate transforms

* add tests for sequence dataloader

* add additional asserts for data batches

* check sequence lengths in datasets

* add more tests for invalid data

* style: run black

* feat: Refactor schedules and remove time_difference

* feat: Add type hints to schedule utility functions

* feat: Refactor noise schedule fn

* feat: refactor q_sample fn

* feat: add type hints to q_sample

* feat: drop bit_scale

* feat: run black and switch to torch.log

* feat: drop t_index

* feat: refactor p_sample fn

* feat: refactor p_sample_loop fn

* feat: refactor sample fn

* feat: refactor training_step fn

* feat(ci): Add `codebase` branch to CI

Based on discussion with @mateibejan1, running the tests on the `codebase` branch is also essential. It's the branch which is under heavy development and we should ensure all tests pass before we merge into `codebase` as well.

* reqs: add `pandas` to requirements.txt

* reqs: add `torch` to requirements.txt

* reqs: bump torch to `1.11.0` for compatibility

* fix(ci): run pytest as a module

* reqs: pin torchvision to `0.12.0`

* reqs: add `pytorch-lightning`

* fix: failing CI tests for dataloader across platforms

* fix: failing CI tests for dataloader - wrap transforms

* fix: failing CI tests for dataloader - no multiprocessing for transforms

* Add Lucas' conditioned UNet

* Update EMA with Lucas' version

* Added mean_flat util from P2 paper

* Added P2 weighting skeleton.

Need to figure out how to use P2 weighting on DNA data.

* misc: create a PR template

Fixes #51

* misc: add doc strings and type hints to the PR template

cc: @mateibejan1

* Add files via upload

* Add files via upload

* Add files via upload

Updated DDPM with Noah's refactored notebook version. Preemptively added p2_weighting; need to figure out if/how it works on bit sequences.

* Add files via upload

* Add files via upload

* style: run black

* feat: add type hints to `utils/misc.py`

* feat: add type hints to utils/metrics

* feat: add type hints to utils/schedules

* feat: add type hints to unet_bitdiffusion

* feat: add type hints to unet_lucas

* feat: add type hints to ddim

* feat: add type hints to seq dataloader

* feat: add type hints to unet_lucas_cond

* Delete ddim.py

Deprecated.

* Delete unet_bitdiffusion.py

Deprecated.

* Update unet_conditional.yaml

Changed default number of timesteps from 1000 to 200.

* Update unet_conditional.yaml

Moved unet_config params inside the diffusion model's params, so it mirrors the hierarchical relationship between the diffusion class and the unet class.

* Update misc.py

Minor dict property name changes.

* Update diffusion.py

* Update diffusion.py

* Update default.yaml

* Update unet_lucas.py

* initial test lucas unet

* add test vq

* ddm

* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci

* merge codebase-hydra-restructure into main (#90)

* WIP new folder structure

* ema parameter fix

* Base dataloader instantiation with full hydra config successful, missing full params

* Update sequence_dataloader.py

* Remove outputs folder, update .gitignore

* Update network.py

* Update sequence_datamodule.py

* Update sequence_datamodule.py

---------

Co-authored-by: cmvcordova <[email protected]>
Co-authored-by: cmvcordova <[email protected]>
Co-authored-by: Matei Bejan <[email protected]>

---------

Co-authored-by: ssenan <[email protected]>
Co-authored-by: Matei Bejan <[email protected]>
Co-authored-by: Bendidi Ihab <[email protected]>
Co-authored-by: Saurav Maheshkar <[email protected]>
Co-authored-by: Jan Sobotka <[email protected]>
Co-authored-by: ceziegler <[email protected]>
Co-authored-by: jamesthesnake <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: cmvcordova <[email protected]>
Co-authored-by: cmvcordova <[email protected]>
11 people authored Mar 7, 2023
1 parent 73cf225 commit da6599f
Showing 57 changed files with 3,400 additions and 0 deletions.
22 changes: 22 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,22 @@
## What does this PR do?

<!-- Please include a summary of what the PR aims to do -->

Fixes `#<issue_number>`

### Checklist

- [ ] This change is discussed in a Github **issue/discussion** (`#<link>`).
- [ ] Did you **document** your changes? (if necessary)
- [ ] Have you **documented (doc strings)** your code? (if necessary)
- [ ] Have you added type hints to your code? (if necessary)
- [ ] Did you write any **new necessary high-coverage tests**?
- [ ] Did you verify new and **existing tests pass** locally with your changes?

### PR Review

<!-- Tag Relevant People here -->

Reviewers:

### Any Other Comments
36 changes: 36 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,36 @@
name: "Continuous Integration & Testing"

on:
push:
branches:
- main
- codebase
pull_request:
branches:
- main
- codebase

jobs:
build:
runs-on: ${{ matrix.os }}

strategy:
matrix:
python-version: ["3.8", "3.9", "3.10"]
os: [ubuntu-latest, windows-latest, macos-latest]

steps:
- uses: actions/checkout@v3
- name: Setup Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: "pip"

- name: Install dependencies
run: |
python -m pip install --upgrade pip wheel setuptools
python -m pip install -r requirements.txt
- name: Test with PyTest
run: |
python -m pytest -v .
156 changes: 156 additions & 0 deletions .idea/workspace.xml

Some generated files are not rendered by default.

27 changes: 27 additions & 0 deletions pyproject.toml
@@ -0,0 +1,27 @@
[tool.pylint.messages_control]
disable = [
    "no-member",
    "too-many-arguments",
    "too-few-public-methods",
    "no-else-return",
    "duplicate-code",
    "too-many-branches",
    "redefined-builtin",
    "dangerous-default-value",
]

[tool.pylint.format]
max-line-length = 88

[tool.black]
line-length = 88

[[tool.mypy.overrides]]
module = "*"  # overrides sections require a module key
ignore_missing_imports = true

[tool.pytest.ini_options]
testpaths = ["tests"]
filterwarnings = [
    "ignore::DeprecationWarning",
    "ignore::UserWarning",
]
8 changes: 8 additions & 0 deletions requirements.txt
@@ -0,0 +1,8 @@
black==22.10.0
mypy==0.982
pandas==1.3.5
pylint==2.15.4
pytest==7.1.3
pytorch-lightning==1.7.0
torch==1.11.0
torchvision==0.12.0
96 changes: 96 additions & 0 deletions src/README.md
@@ -0,0 +1,96 @@
## Config Structure

The current (provisional) config folder structure is as follows:

```
├── configs
│   ├── callbacks
│   │   └── default.yaml
│   ├── dataset
│   │   └── sequence.yaml
│   ├── logger
│   │   └── wandb.yaml
│   ├── model
│   │   ├── unet.yaml
│   │   ├── unet_conditional.yaml
│   │   └── unet_bitdiffusion.yaml
│   ├── paths
│   │   └── default.yaml
│   └── train.yaml
```

As new items (models, datasets, etc.) are added, a corresponding config file can be included so that only minimal parameter changes are needed across experiments.

## How to Run

Below is the main training config file, which can be edited to fit whatever training setup is desired. Every entry under `defaults` is defined in one of the config files listed above.

<details>
<summary><b>Training config</b></summary>

```yaml
defaults:
  - model: unet_conditional
  - dataset: sequence
  - logger: wandb
  - callbacks: default

ckpt: null # path to checkpoint
seed: 42
batch_size: 32
devices: gpu
benchmark: True
ckpt_dir: # path still to be defined
accelerator: gpu
strategy: ddp
min_epochs: 5
max_epochs: 100000
gradient_clip_val: 1.0
accumulate_grad_batches: 1
log_every_n_steps: 1
check_val_every_n_epoch: 1 #for debug purposes
save_last: True
precision: 32
```
</details>
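
Most of the flat keys above map directly onto `pytorch_lightning.Trainer` arguments. A minimal sketch of that mapping, assuming the composed config is available as `cfg` (this `Trainer` call is illustrative, not the repo's actual training script):

```python
import pytorch_lightning as pl

# Illustrative mapping of the flat config keys above onto a Trainer;
# argument names follow pytorch-lightning 1.7.0, as pinned in requirements.txt.
trainer = pl.Trainer(
    accelerator=cfg.accelerator,
    strategy=cfg.strategy,
    min_epochs=cfg.min_epochs,
    max_epochs=cfg.max_epochs,
    gradient_clip_val=cfg.gradient_clip_val,
    accumulate_grad_batches=cfg.accumulate_grad_batches,
    log_every_n_steps=cfg.log_every_n_steps,
    check_val_every_n_epoch=cfg.check_val_every_n_epoch,
    precision=cfg.precision,
    benchmark=cfg.benchmark,
)
```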
### Using Hydra config in a Jupyter notebook

Including the following at the beginning of a Jupyter notebook will initialize Hydra, compose the training config defined above, and print it:
```python
from hydra import compose, initialize
from omegaconf import OmegaConf

initialize(version_base=None, config_path="./src/configs")
cfg = compose(config_name="train")
print(OmegaConf.to_yaml(cfg))
```

When composing the config it is possible to override any of the default assignments.
Here is an example of overriding `batch_size` and `seed`:

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

initialize(version_base=None, config_path="./src/configs")
cfg = compose(config_name="train", overrides=["batch_size=64", "seed=1"])
print(OmegaConf.to_yaml(cfg))
```

More information on override syntax is available in the Hydra documentation:
https://hydra.cc/docs/advanced/override_grammar/basic/

For more information on Hydra initialization in Jupyter, see:
https://github.com/facebookresearch/hydra/blob/main/examples/jupyter_notebooks/compose_configs_in_notebook.ipynb
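
One notebook-specific caveat: `initialize()` registers global state and may only be called once per process, so re-running the cell in a long-lived kernel fails unless that state is cleared first. A minimal sketch:

```python
from hydra.core.global_hydra import GlobalHydra

# Clear Hydra's global state before re-initializing in the same kernel;
# otherwise a second initialize() call raises an error.
if GlobalHydra.instance().is_initialized():
    GlobalHydra.instance().clear()
```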

## Still To Do:

- Alter training script to accommodate all logs we wish to track using wandb
- Decide on default hyperparameters in train.yaml
- Further alter config folder structure to best suit our training and testing practices
- Define default paths for dataset within path config file so that directory can be referenced across various other configs
- Hydra config logs are currently written to the src directory, creating the folder structure ./src/outputs/YYYY-MM-DD/HH-MM-SS. If we wish to alter this, it can be done in a Hydra config file.
Binary file added src/__pycache__/config.cpython-39.pyc
Binary file not shown.
22 changes: 22 additions & 0 deletions src/config.py
@@ -0,0 +1,22 @@
### file to include dataclass definition
from dataclasses import dataclass, field
from typing import Any, List

from hydra.core.config_store import ConfigStore

### needs overhaul with new folder structure
### ignore for now
@dataclass
class DNADiffusionConfig:
    defaults: List[Any] = field(
        default_factory=lambda: [
            "_self_",
            {"optimizer": "adam"},
            {"lr_scheduler": "MultiStepLR"},
            {"unet": "unet_conditional"},
        ]
    )
    _target_: str = "__main__.trgt"  # dotpath describing location of callable
    timesteps: int = 200
    use_fp16: bool = True
    criterion: str = "torch.nn.MSELoss"  # dotpath; utils.metrics.MetricName
    use_ema: bool = True
    ema_decay: float = 0.999
    lr_warmup: int = 5000
    image_size: int = 200
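
The `ConfigStore` import above is not yet used. A minimal sketch of how this dataclass would typically be registered with Hydra (the entry name here is hypothetical):

```python
cs = ConfigStore.instance()
# Register the structured config so it can be selected by name,
# e.g. compose(config_name="dnadiffusion").
cs.store(name="dnadiffusion", node=DNADiffusionConfig)
```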
13 changes: 13 additions & 0 deletions src/configs/callbacks/default.yaml
@@ -0,0 +1,13 @@
save_checkpoint:
  target: pytorch_lightning.callbacks.ModelCheckpoint
  params:
    path: # to be entered
    monitor: val_loss
    mode: min
    save_top_k: 10
    save_last: True

learning_rate:
  target: pytorch_lightning.callbacks.LearningRateMonitor
  params:
    logging_interval: epoch
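
The `target`/`params` layout above matches the `instantiate_from_config` helper referenced in the commit messages. A minimal sketch of how such a helper typically resolves these entries (this body is an assumption, not the repo's exact implementation):

```python
import importlib

def instantiate_from_config(config: dict):
    """Resolve the dotted `target` path and call it with `params` as kwargs."""
    module_path, attr_name = config["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), attr_name)
    return cls(**(config.get("params") or {}))

# e.g. building every callback defined in the YAML above:
# callbacks = [instantiate_from_config(c) for c in callbacks_cfg.values()]
```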