Merge branch 'revision-upd-4' of https://github.com/calico/borzoi into revision-upd-4

Showing 9 changed files with 209 additions and 20 deletions.

## Interpretation

This tutorial describes how to compute gradient saliency scores (sequence attributions) with respect to various statistics computed for a list of input genes specified in a .gtf file. This example relies on the Mini Borzoi model trained on sample K562 RNA-seq data from the [train_model tutorial](https://github.com/calico/borzoi/tree/main/tutorials/latest/train_model), which is significantly weaker than the pre-trained, published Borzoi model.

To compute input gradients with respect to the log-sum of coverage across the exons of the example gene HBE1, run the script 'run_gradients_expr_HBE1.sh'.
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/latest/interpret_sequence
./run_gradients_expr_HBE1.sh
```

*Notes*:
- The track scale, squashing exponentiation, and clip-soft threshold are specified in the .py script arguments (flags: '--track_scale', '--track_transform', '--clip_soft'), and the values in the targets file are ignored. This means that the same data transformation parameters are applied to all tracks specified in the targets file. To calculate gradients for groups of tracks with different data transforms, separate these tracks into different targets files and execute the gradient script on each group separately.
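
The note above suggests splitting tracks into per-transform targets files. A minimal sketch of that grouping step in Python, using a toy in-memory table (the column names follow the targets file description in the make_data tutorial; the example values and identifiers are made up):

```python
import csv
from collections import defaultdict
from io import StringIO

# Toy stand-in for a tab-separated targets file (hypothetical values).
targets_tsv = (
    "\tidentifier\tfile\tclip\tclip_soft\tscale\tsum_stat\tstrand_pair\tdescription\n"
    "0\tK562_rna_plus\tk562_p.w5\t768\t384\t0.3\tsum_sqrt\t1\tRNA-seq +\n"
    "1\tK562_rna_minus\tk562_m.w5\t768\t384\t0.3\tsum_sqrt\t0\tRNA-seq -\n"
    "2\tK562_dnase\tk562_d.w5\t32\t16\t2.0\tsum\t2\tDNase\n"
)
rows = list(csv.DictReader(StringIO(targets_tsv), delimiter="\t"))

# Group tracks by their data-transform parameters; each group would get
# its own targets file and one gradient-script invocation.
groups = defaultdict(list)
for row in rows:
    key = (row["scale"], row["sum_stat"], row["clip_soft"])
    groups[key].append(row["identifier"])

for key, idents in groups.items():
    print(key, idents)
```

In practice, each group would be written back out as a separate targets file and passed to the gradient script in its own run.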

## Data Processing

This tutorial describes how to process a .bigwig sequencing experiment into compressed .w5 format, merge replicates, generate QC metrics, and finally create TFRecord files containing binned coverage values suitable for training Borzoi models. We will exemplify this for the ENCODE K562 RNA-seq experiment [ENCSR000AEL](https://www.encodeproject.org/experiments/ENCSR000AEL/).

First, activate the conda environment and run the script 'download_dependencies.sh' to download required auxiliary files.
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/latest/make_data
./download_dependencies.sh
```

Next, run the script 'download_bw.sh' to download sample ENCODE .bigwig files and arrange them in a folder structure.
```sh
./download_bw.sh
```

Then run the script 'process_w5.sh' to generate compressed .w5 files (hdf5) from the input .bigwig files, merge the two replicates, and calculate basic QC metrics. This .sh script internally calls 'bw_h5.py' to generate .w5 files, 'w5_merge.py' to merge replicates, and 'w5_qc.py' to calculate QC metrics.
```sh
./process_w5.sh
```

Finally, run the Makefile to create genome-wide binned coverage tracks, stored as compressed TFRecords.
```sh
make
```

In this example, the Makefile creates 8 cross-validation folds of TFRecords with input sequences of length 393216 bp, generated with a genome-wide stride of 43691 bp. The output coverage tracks corresponding to each input sequence are not cropped in the latest version of Borzoi models. This results in 12288 coverage bins per 393kb sequence. The specific .w5 tracks to include in the TFRecord generation, and the scales and pooling transforms applied to the bins of each experiment, are given in the targets file 'targets_human.txt'. Below is a description of the columns in this file.
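
The bin count follows from the sequence length: 12288 bins over an uncropped 393216 bp sequence implies 32 bp per bin. A quick sanity check:

```python
seq_len = 393216   # input sequence length in bp
n_bins = 12288     # coverage bins per sequence (no cropping)
bin_size = seq_len // n_bins
print(bin_size)  # 32 bp per bin
```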

*targets_human.txt*:
- (unnamed) => integer index of each track (must start from 0 when training a new model).
- 'identifier' => unique identifier of each experiment (and strand).
- 'file' => local file path to the .w5 file.
- 'clip' => hard clipping threshold applied to each bin, after soft-clipping.
- 'clip_soft' => soft clipping (squashing) threshold.
- 'scale' => scale value applied to each bp-level position before clipping.
- 'sum_stat' => type of bin-level pooling operation ('sum_sqrt' = sum and square-root).
- 'strand_pair' => integer index of the other stranded track of an experiment (same index as the current row if unstranded).
- 'description' => text description of the experiment.
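
To make the interplay of these columns concrete, here is a hedged numpy sketch of one plausible per-track transform. The exact order of operations and the soft-clip formula are assumptions, not taken verbatim from the data-processing scripts:

```python
import numpy as np

def bin_coverage(bp_values, scale=0.3, sum_stat="sum_sqrt",
                 clip_soft=384.0, clip=768.0, bin_size=32):
    """Sketch of the per-track transform implied by targets_human.txt.

    Assumed reading: scale each bp position, sum positions into bins,
    apply a square-root transform for 'sum_sqrt', squash values above
    clip_soft as clip_soft + sqrt(x - clip_soft), then hard-clip.
    """
    x = scale * np.asarray(bp_values, dtype=np.float64)
    n_bins = len(x) // bin_size
    x = x[: n_bins * bin_size].reshape(n_bins, bin_size).sum(axis=1)
    if sum_stat == "sum_sqrt":
        x = np.sqrt(x)
    over = x > clip_soft
    x[over] = clip_soft + np.sqrt(x[over] - clip_soft)
    return np.minimum(x, clip)

binned = bin_coverage(np.full(64, 10.0), scale=1.0, bin_size=32)
print(binned)  # two bins, each sqrt(320) ≈ 17.89
```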

*Notes*:
- See [here](https://github.com/calico/borzoi-paper/tree/main/data/training) for a description of the scripts called by the Makefile to create TFRecords.
- In the latest version of Borzoi models, the Makefile uses a modified hg38 fasta genome in which the allele with the highest overall frequency (from gnomAD) is substituted at each position.

## Variant Scoring

This tutorial describes how to predict variant effect scores for a small set of SNVs defined in a .vcf file. This example relies on the Mini Borzoi model trained on sample K562 RNA-seq data from the [train_model tutorial](https://github.com/calico/borzoi/tree/main/tutorials/latest/train_model), which is significantly weaker than the pre-trained, published Borzoi model. For examples showcasing variant effect prediction at a larger scale with the pre-trained model (e.g. fine-mapped eQTL classification benchmarks), we refer the user to the [borzoi-paper repository](https://github.com/calico/borzoi-paper/tree/main). Additionally, we refer the user to the **legacy** version of [this tutorial](https://github.com/calico/borzoi/tree/main/tutorials/legacy/score_variants), which uses the pre-trained, published model.

First, to calculate **gene-specific expression** scores, run the script 'score_expr_sed.sh'. Two different statistics are computed: (1) logSED (gene expression log fold change), and (2) logD2 (bin-level L2 norm across the coverage profile intersecting the exons of the gene).
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/latest/score_variants
./score_expr_sed.sh
```
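
A hedged numpy sketch of the two statistics as described above; the pseudocounts and log base are illustrative assumptions, not taken from the scoring scripts:

```python
import numpy as np

def log_sed(ref_cov, alt_cov, eps=1.0):
    # logSED: log fold change of summed (exonic) coverage, alt vs ref
    # (pseudocount eps is an assumption for numerical stability)
    return np.log2(alt_cov.sum() + eps) - np.log2(ref_cov.sum() + eps)

def log_d2(ref_cov, alt_cov, eps=1.0):
    # logD2: log of the bin-level L2 norm of the alt-vs-ref difference
    return np.log2(np.linalg.norm(alt_cov - ref_cov) + eps)

ref = np.array([4.0, 4.0, 8.0])
alt = np.array([8.0, 8.0, 16.0])
print(log_sed(ref, alt))  # log2(33) - log2(17) ≈ 0.957
```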

To calculate **gene-agnostic expression** scores, run the script 'score_expr_sad.sh'. One statistic is computed: logD2 (bin-level L2 norm across the entire predicted coverage track).
```sh
./score_expr_sad.sh
```

To calculate **gene-specific polyadenylation** scores, run the script 'score_polya.sh'. One statistic is computed: COVR (3' coverage ratio across pA junctions of the target gene).
```sh
./score_polya.sh
```

To calculate **gene-specific splicing** scores, run the script 'score_splice.sh'. One statistic is computed: nDi (normalized maximum absolute difference in coverage bins across the target gene span).
```sh
./score_splice.sh
```

Finally, the jupyter notebook 'run_variant_scripts.ipynb' is provided for convenience to execute all of the above scripts. The notebook also exemplifies how to navigate the variant prediction hdf5 files and print some example scores.
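
For readers who prefer plain Python over the notebook, the snippet below sketches how such an hdf5 file can be navigated with h5py. The file name, dataset names ('snp', 'logSED'), and layout are hypothetical; inspect the actual output of the scoring scripts (or the notebook) for the real structure.

```python
import h5py
import numpy as np

# Build a toy file mimicking one plausible layout:
# one row per SNV, one column per target track.
with h5py.File("toy_scores.h5", "w") as f:
    f.create_dataset("snp", data=np.array([b"rs0001", b"rs0002"]))
    f.create_dataset("logSED",
                     data=np.array([[0.12, -0.05], [1.30, 0.88]],
                                   dtype=np.float32))

with h5py.File("toy_scores.h5", "r") as f:
    print(list(f.keys()))  # inspect available datasets
    for snp_id, scores in zip(f["snp"][:], f["logSED"][:]):
        print(snp_id.decode(), float(scores.max()))
```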

## Model Training

This tutorial describes how to train smaller Borzoi models on the example RNA-seq experiment processed in the [make_data tutorial](https://github.com/calico/borzoi/tree/main/tutorials/latest/make_data).

To train a 'Mini Borzoi' ensemble (~40M parameters, 2 cross-validation folds), run the script 'train_mini.sh'. The model parameters are specified in 'params_mini.json'. This model can be trained with a batch size of 2 on a 24GB NVIDIA Titan RTX or RTX4090 GPU.
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/latest/train_model
./train_mini.sh
```

Alternatively, to train an even smaller 'Micro Borzoi' ensemble (~5M parameters), run the script 'train_micro.sh'. This model fits on the above GPUs with a batch size of 4, which means the learning rate can be doubled and each epoch finishes in half the time.
```sh
./train_micro.sh
```

*Notes*:
- See [here](https://github.com/calico/borzoi-paper/tree/main/model) for a description of the scripts called internally by the training .sh script.
- Rather than cropping the output predictions before applying the training loss, the latest version of Borzoi models applies a smooth position-specific loss weight that penalizes prediction errors less at the left/right boundaries.
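
One way such a boundary down-weighting can look, sketched in numpy. The linear ramp and its parameters below are illustrative assumptions, not the exact curve used by the Borzoi training code:

```python
import numpy as np

def position_loss_weights(n_bins, ramp_frac=0.25, min_weight=0.1):
    # Illustrative smooth weight: 1.0 in the center, ramping down
    # linearly to min_weight at the sequence boundaries (assumed shape).
    ramp = max(1, int(n_bins * ramp_frac))
    w = np.ones(n_bins)
    edge = np.linspace(min_weight, 1.0, ramp)
    w[:ramp] = edge
    w[-ramp:] = edge[::-1]
    return w

w = position_loss_weights(12288)
print(w[0], w[6144], w[-1])  # small at the edges, 1.0 in the middle
```

Multiplying the per-bin loss by such a weight vector lets boundary bins contribute to training without dominating it, which replaces hard cropping.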

## Interpretation

This tutorial describes how to compute gradient saliency scores (sequence attributions) with respect to various statistics computed for a list of input genes specified in a .gtf file. This example uses the pre-trained, published Borzoi model to compute gradients. To download this model, run the script 'download_models.sh' in the 'borzoi' root folder.

First, to compute input gradients with respect to the log-sum of coverage across the exons of the target gene, run the script 'run_gradients_expr_CFHR2.sh'.
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/legacy/interpret_sequence
./run_gradients_expr_CFHR2.sh
```

To compute input gradients with respect to the log-ratio of coverage immediately upstream and downstream of the distal polyA site of the target gene, run the script 'run_gradients_polya_CD99.sh'.
```sh
./run_gradients_polya_CD99.sh
```

To compute input gradients with respect to the log-ratio of coverage of an exon of the target gene relative to intronic coverage, run the script 'run_gradients_splice_GCFC2.sh'.
```sh
./run_gradients_splice_GCFC2.sh
```
Currently, the splicing gradient script chooses one exon at random to compute gradients for. While this approach was favorable for the specific analysis of the manuscript, we acknowledge that it is not particularly useful to users wanting to investigate an exon of their choice. We plan to update this script soon to allow users to specify which exon to calculate gradients for.

*Notes*:
- The track scale, squashing exponentiation, and clip-soft threshold are specified in the .py script arguments (flags: '--track_scale', '--track_transform', '--clip_soft'), and the values in the targets file are ignored. This means that the same data transformation parameters are applied to all tracks specified in the targets file. To calculate gradients for groups of tracks with different data transforms, separate these tracks into different targets files and execute the gradient script on each group separately.
- The legacy data transforms are activated in all of the above .sh scripts with the flag '--untransform_old'.

## Data Processing

This tutorial describes how to process a .bigwig sequencing experiment into compressed .w5 format, merge replicates, generate QC metrics, and finally create TFRecord files containing binned coverage values suitable for training Borzoi models. We will exemplify this for the ENCODE K562 RNA-seq experiment [ENCSR000AEL](https://www.encodeproject.org/experiments/ENCSR000AEL/).

First, activate the conda environment and run the script 'download_dependencies.sh' to download required auxiliary files.
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/legacy/make_data
./download_dependencies.sh
```

Next, run the script 'download_bw.sh' to download sample ENCODE .bigwig files and arrange them in a folder structure.
```sh
./download_bw.sh
```

Then run the script 'process_w5.sh' to generate compressed .w5 files (hdf5) from the input .bigwig files, merge the two replicates, and calculate basic QC metrics. This .sh script internally calls 'bw_h5.py' to generate .w5 files, 'w5_merge.py' to merge replicates, and 'w5_qc.py' to calculate QC metrics.
```sh
./process_w5.sh
```

Finally, run the Makefile to create genome-wide binned coverage tracks, stored as compressed TFRecords.
```sh
make
```

In this example, the Makefile creates 8 cross-validation folds of TFRecords with input sequences of length 393216 bp, generated with a genome-wide stride of 43691 bp. The output coverage tracks corresponding to each input sequence are cropped by 98304 bp on each side, before pooling the measurements in 32 bp bins. This results in 6144 coverage bins per 393kb sequence. The specific .w5 tracks to include in the TFRecord generation, and the scales and pooling transforms applied to the bins of each experiment, are given in the targets file 'targets_human.txt'. Below is a description of the columns in this file.
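
The crop and bin arithmetic works out as follows:

```python
seq_len = 393216   # input sequence length in bp
crop = 98304       # bp cropped from each side (legacy)
bin_size = 32      # bp per coverage bin
n_bins = (seq_len - 2 * crop) // bin_size
print(n_bins)  # 6144 coverage bins per 393kb sequence
```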

*targets_human.txt*:
- (unnamed) => integer index of each track (must start from 0 when training a new model).
- 'identifier' => unique identifier of each experiment (and strand).
- 'file' => local file path to the .w5 file.
- 'clip' => hard clipping threshold applied to each bin, after soft-clipping.
- 'clip_soft' => soft clipping (squashing) threshold.
- 'scale' => scale value applied to each 32 bp bin after clipping.
- 'sum_stat' => type of bin-level pooling operation ('sum_sqrt' = sum and exponentiate by 3/4).
- 'strand_pair' => integer index of the other stranded track of an experiment (same index as the current row if unstranded).
- 'description' => text description of the experiment.

*Notes*:
- See [here](https://github.com/calico/borzoi-paper/tree/main/data/training) for a description of the scripts called by the Makefile to create TFRecords.
- The **legacy** settings are activated in these data processing scripts with the flag '--transform_old' in the Makefile.
- The **legacy** approach crops the coverage tracks, a practice we have since abandoned in favor of a position-specific loss scale.

## Variant Scoring

This tutorial describes how to predict variant effect scores for a small set of SNVs defined in a .vcf file. For examples showcasing variant effect prediction at a larger scale (e.g. fine-mapped eQTL classification benchmarks), we refer the user to the [borzoi-paper repository](https://github.com/calico/borzoi-paper/tree/main). This example uses the pre-trained, published Borzoi model to predict variant effects. To download this model, run the script 'download_models.sh' in the 'borzoi' root folder.

First, to calculate **gene-specific expression** scores, run the script 'score_expr_sed.sh'. Two different statistics are computed: (1) logSED (gene expression log fold change), and (2) logD2 (bin-level L2 norm across the coverage profile intersecting the exons of the gene).
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/legacy/score_variants
./score_expr_sed.sh
```

To calculate **gene-agnostic expression** scores, run the script 'score_expr_sad.sh'. One statistic is computed: logD2 (bin-level L2 norm across the entire predicted coverage track).
```sh
./score_expr_sad.sh
```

To calculate **gene-specific polyadenylation** scores, run the script 'score_polya.sh'. One statistic is computed: COVR (3' coverage ratio across pA junctions of the target gene).
```sh
./score_polya.sh
```

To calculate **gene-specific splicing** scores, run the script 'score_splice.sh'. One statistic is computed: nDi (normalized maximum absolute difference in coverage bins across the target gene span).
```sh
./score_splice.sh
```

Finally, the jupyter notebook 'run_variant_scripts.ipynb' is provided for convenience to execute all of the above scripts. The notebook also exemplifies how to navigate the variant prediction hdf5 files and print some example scores.

*Notes*:
- The legacy data transforms are activated in all of the above .sh scripts with the flag '-u'.

## Model Training

This tutorial describes how to train smaller Borzoi models on the example RNA-seq experiment processed in the [make_data tutorial](https://github.com/calico/borzoi/tree/main/tutorials/legacy/make_data).

To train a 'Mini Borzoi' ensemble (~40M parameters, 2 cross-validation folds), run the script 'train_mini.sh'. The model parameters are specified in 'params_mini.json'. This model can be trained with a batch size of 2 on a 24GB NVIDIA Titan RTX or RTX4090 GPU.
```sh
conda activate borzoi_py310
cd ~/borzoi/tutorials/legacy/train_model
./train_mini.sh
```

Alternatively, to train an even smaller 'Micro Borzoi' ensemble (~5M parameters), run the script 'train_micro.sh'. This model fits on the above GPUs with a batch size of 4, which means the learning rate can be doubled and each epoch finishes in half the time.
```sh
./train_micro.sh
```

*Notes*:
- See [here](https://github.com/calico/borzoi-paper/tree/main/model) for a description of the scripts called internally by the training .sh script.
- The **legacy** model crops the predicted tracks (see layer 'Cropping1D' in the parameters file). In this example, the input sequence has length 393216 bp, and the cropping layer removes 3072 32-bp bins from each side, resulting in 6144 bins.
- In the **legacy** architecture, an extra, superfluous linear convolution is applied in each 'unet_conv' layer (see the bool 'upsample_conv' in the parameters file). This additional convolution has since been removed.
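
The cropping arithmetic in the note above can be verified directly:

```python
seq_len = 393216      # input sequence length in bp
bin_size = 32         # bp per bin
crop_bins = 3072      # bins removed from each side by Cropping1D
n_bins_uncropped = seq_len // bin_size
n_bins = n_bins_uncropped - 2 * crop_bins
print(n_bins_uncropped, n_bins)  # 12288 before cropping, 6144 after
```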