Merge pull request #18 from pinellolab/dev

Dev
pinellolab · Mar 7, 2024 · b3ce0b1 · b3ce0b1
2 parents 8bd29ca + d523ae8
commit b3ce0b1
Show file tree

Hide file tree

Showing 24 changed files with 515 additions and 106 deletions.
diff --git a/README.md b/README.md
@@ -25,8 +25,17 @@ This is an analysis toolkit for the pooled CRISPR reporter or sensor data. The r
 ### Screen data is saved as *ReporterScreen* object in the pipeline.
 BEAN stores mapped gRNA and allele counts in `ReporterScreen` object which is compatible with [AnnData](https://anndata.readthedocs.io/en/latest/index.html). See [Data Structure](#data-structure) section for more information.
 
-### Examples
-We provide example scripts in `tests/`. Running `pytest --sparse-ordering` generates example input/output files from running 1 and 2-4 sequentially.
+<br/><br/>
+
+## Examples
+| Tutorial link | [Library design](#pipeline-run-options-by-library-design) | Selection | Reporter |  
+|---------------| -------------- | --------- | -------- |
+| [LDL-C GWAS](docs/tutorials/ldl_var.md) | GWAS variant library | FACS sorting | Yes/No |
+| [LDL-C LDLR CDS](docs/tutorials/ldl_cds.md) | Coding sequence tiling libarary | FACS sorting | Yes/No |
+| TKO simulated (Coming soon!) | GWAS variant library | Survival / Proliferation | Yes/No |
+| TKO simulated, tiling (Coming soon!)  | Coding sequence tiling libarary | Survival / Proliferation | Yes/No |
+
+We also provide example scripts in `tests/`. Running `pytest --sparse-ordering` will generate example input/output files.
 
 ### Pipeline run options by library design
 The `bean-filter` and `bean-run` steps depend on the type of gRNA library design, where BEAN supports two modes of running.
@@ -83,8 +92,8 @@ File should contain following columns.
   * Option 2: `accessibility_signal`: ATAC-seq signal value of the target loci of each guide.  
 * For variant library (gRNAs are designed to target specific variants and ignores bystander edits)
   * `target`: This column denotes which target variant/element of each gRNA. This is not used in `bean-count[-samples]` but required to run `bean-run` in later steps.
-  * `target_group [Optional]`: If negative/positive control gRNA will be considered in `bean-qc` and/or `bean-run`, specify as "NegCtrl"/"PosCtrl" in this column. 
-  * `target_pos [Optional]`: If `--match_target_pos` flag is used, input file needs `target_pos` which specifies 0-based relative position of targeted base within Reporter sequence.
+  * `target_group`: If negative/positive control gRNA will be considered in `bean-qc` and/or `bean-run`, specify as "NegCtrl"/"PosCtrl" in this column. 
+  * `target_pos`: If `--match_target_pos` flag is used, input file needs `target_pos` which specifies 0-based relative position of targeted base within Reporter sequence.
 * For tiling library (gRNAs tile coding / noncoding sequences)
   * `strand`: Specifies gRNA strand information relative to the reference genome.
   * `chrom`: Chromosome of gRNA targeted locus.
@@ -97,12 +106,18 @@ File should contain following columns with header.
 * `R1_filepath`: Path to read 1 `.fastq[.gz]` file
 * `R2_filepath`: Path to read 1 `.fastq[.gz]` file
 * `sample_id`: ID of sequencing sample
-* `rep [Optional]`: Replicate # of this sample (Should NOT contain `.`)
-* `bin [Optional]`: Name of the sorting bin
-* `upper_quantile [Optional]`: FACS sorting upper quantile
-* `lower_quantile [Optional]`: FACS sorting lower quantile  
+* `rep`: Replicate # of this sample (Should NOT contain `.`)
+* `condition`: Name of the sorting bin (ex. `top`, `bot`), or label of timepoint (ex. `D5`, `D18`)  
+
+For FACS sorting screens:
+* `upper_quantile`: FACS sorting upper quantile
+* `lower_quantile`: FACS sorting lower quantile  
 
-Optional columns are not required but can be provided for compatibility with `bean-qc` and `bean-run`. See [example](tests/data/sample_list.csv).
+For proliferation / survival screens:
+* `time`: FACS sorting upper quantile
+
+
+Also see examples for [FACS sorting screen](tests/data/sample_list.csv).
 
 ### Output file format
 `count` or `count-samples` produces `.h5ad` and `.xlsx` file with guide and per-guide allele counts.  
@@ -111,35 +126,45 @@ Optional columns are not required but can be provided for compatibility with `be
 
 ### Parameters
 * `-b`, `--edited-base` (`str`, default: `None`): For base editors, the base that should be ignored when matching the gRNA sequence 
+* `-i`, `--input`: List of fastq and sample ids. See above for the format.
 * `-f`, `--sgRNA-filename` (`str`, default: `None`): sgRNA description file. (See above)
-* `--guide-start-seq`: Guide starts after this sequence in R1 (default: '')
-* `--guide-end-seq`: Guide ends after this sequence in R1 (default: '')
-* `-r`, `--count-reporter` (default: `False`): Count edited alleles in reporters.
-* `-q`, `--min-average-read-quality` (default: `30`): Minimum average quality score (phred33) to keep a read
-* `-s`, `--min-single-bp-quality` (default: `0`): Minimum single bp score (phred33) to keep a read (default: 0)
+
+Output formats
 * `-n`, `--name`: Name of the output file will be `bean_count_{name}.h5ad`.
 * `-o`, `--output-folder`: Output folder
+* `--tiling`: Specify that the guide library is tiling library 
+* `-r`, `--count-reporter` (default: `False`): Count edited alleles in reporters.  
+* `--string-allele` (default: `False`): Store allele as quality filtered string instead of Allele object
+* `-g`, `--count-guide-edits` (default: `False`): count the self editing of guides (default: False)
+* `-m`, `--count-guide-reporter-alleles`: count the matched allele of guide and reporter edit
+* `--match-target-pos` (default: `False`): Count the edit in the exact target position.
+* `--target-pos-col` (default: `target_pos`): Column name specifying the relative target position within *reporter* sequence. 
+* `--offset` (default: `False`): Guide file has `offest` column that will be added to the relative position of reporters. 
+* `--align-fasta` (default: ''): gRNA is aligned to this sequence to infer the offset. Can be used when the exact offset is not provided.
+
+
+Read structure
+* `--guide-start-seq`: Guide starts after this sequence in R1 (default: '')
+* `--guide-end-seq`: Guide ends after this sequence in R1 (default: '')  
+* `--guide-start-seqs-file` (default: `None`): CSV file path with per-sample `guide_start_seq` to be used, if provided. Formatted as `sample_id, guide_start_seq` 
+* `--guide-end-seqs-file` (default: `None`): CSV file path with per-sample `guide_end_seq` to be used, if provided. Formatted as `sample_id,guide_end_seq` 
 * `-l`, `--reporter-length` (default: `32`): Length of the reporter sequence.
+* `--gstart-reporter` (default: `6`): Start position of the guide sequence in the reporter
+* `--guide-bc` (default: `True`): Construct has guide barcode 
+* `--guide-bc-len` (default: `4`): Guide barcode sequence length at the beginning of the R2
+
+Mapping quality filters
+* `-q`, `--min-average-read-quality` (default: `30`): Minimum average quality score 
+(phred33) to keep a read
 * `--keep-intermediate` (default: `False`): Keep all intermediate files generated from filtering.
+* `-s`, `--min-single-bp-quality` (default: `0`): Minimum single bp score (phred33) to keep a read (default: 0)
 * `--qstart-R1` (default: `0`): Start position of the read when filtering for quality score of the read 1 
 * `--qend-R1` (default: `47`): End position of the read when filtering for quality score of the read 1 
 * `--qstart-R2` (default: `0`): Start position of the read when filtering for quality score of the read 2 
 * `--qend-R2` (default: `36`): End position of the read when filtering for quality score of the read 2 
-* `--gstart-reporter` (default: `6`): Start position of the guide sequence in the reporter
-* `--match-target-pos` (default: `False`): Count the edit in the exact target position.
-* `--target-pos-col` (default: `target_pos`): Column name specifying the relative target position within *reporter* sequence. 
-* `--guide-bc` (default: `True`): Construct has guide barcode 
-* `--guide-bc-len` (default: `4`): Guide barcode sequence length at the beginning of the R2
-* `--offset` (default: `False`): Guide file has `offest` column that will be added to the relative position of reporters. 
-* `--align-fasta` (default: ''): gRNA is aligned to this sequence to infer the offset. Can be used when the exact offset is not provided.
-* `--string-allele` (default: `False`): Store allele as quality filtered string instead of Allele object
-* `-g`, `--count-guide-edits` (default: `False`): count the self editing of guides (default: False)
-* `-m`, `--count-guide-reporter-alleles`: count the matched allele of guide and reporter edit
-* `--tiling`: Specify that the guide library is tiling library 
-* `-i`, `--input`: List of fastq and sample ids. See above for the format.
+
+Run options
 * `-t`, `--threads` (default: `10`): Number of threads to use
-* `--guide-start-seqs-file` (default: `None`): CSV file path with per-sample `guide_start_seq` to be used, if provided. Formatted as `sample_id, guide_start_seq` 
-* `--guide-end-seqs-file` (default: `None`): CSV file path with per-sample `guide_end_seq` to be used, if provided. Formatted as `sample_id,guide_end_seq` 
 * `--rerun` (default: `False`): Recount each sample. If `False`, existing count for each sample is taken.
 
 
@@ -206,20 +231,33 @@ Above command produces
 #### Additional Parameters
 * `--tiling` (default: `None`): If set as `True` or `False`, it sets the screen object to be tiling (`True`) or variant (`False`)-targeting screen when calculating editing rate. 
 * `--replicate-label` (default: `"rep"`): Label of column in `bdata.samples` that describes replicate ID.
-* `--condition-label` (default: `"bin"`)": Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time, etc.).
+* `--condition-label` (default: `"condition"`)": Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time, etc.).
 * `--sample-covariates` (default: `None`): Comma-separated list of column names in `bdata.samples` that describes non-selective experimental condition (drug treatment, etc.). The values in the `bdata.samples` should NOT contain `.`. 
 * `--no-editing` (default: `False`): Ignore QC about editing. Can be used for QC of other editing modalities.
-* `--target-pos-col` (default: `"target_pos"`): Target position column in `bdata.guides` specifying target edit position in reporter.
-* `--rel-pos-is-reporter` (default: `False`): Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If `False`, those are relative to spacer position.
-* `--edit-start-pos` (default: `2`): Edit start position to quantify editing rate on, 0-based inclusive.
-* `--edit-end-pos` (default: `7`): Edit end position to quantify editing rate on, 0-based exclusive.
-* `--count-correlation-thres` (default: `0.8`): Threshold of guide count correlation to mask out.
-* `--edit-rate-thres` (default: `0.1`): Median editing rate threshold per sample to mask out.
+
+Editing rate quantification
+* `--ctrl-cond` (default: `"bulk"`): Value in of column in `ReporterScreen.samples[condition_label]` where guide-level editing rate to be calculated
+
+  Editing rate is calculated with following parameters in variant screens: 
+  * `--target-pos-col` (default: `"target_pos"`): Target position column in `bdata.guides` specifying target edit position in reporter.
+
+  For tiling screens:
+  * `--rel-pos-is-reporter` (default: `False`): Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If `False`, those are relative to spacer position.
+  * `--edit-start-pos` (default: `2`): Edit start position to quantify editing rate on, 0-based inclusive.
+  * `--edit-end-pos` (default: `7`): Edit end position to quantify editing rate on, 0-based exclusive.
+
+LFC of positive controls
 * `--posctrl-col` (default: `group`): Column name in .h5ad.guides DataFrame that specifies guide category.
 * `--posctrl-val` (default: `PosCtrl`): Value in .h5ad.guides[`posctrl_col`] that specifies guide will be used as the positive control in calculating log fold change.
-* `--lfc-thres` (default: `0.1`): Positive guides' correlation threshold to filter out.
 * `--lfc-conds` (default: `"top,bot"`): Values in of column in `ReporterScreen.samples[condition_label]` for LFC will be calculated between, delimited by comma
-* `--ctrl-cond` (default: `"bulk"`): Value in of column in `ReporterScreen.samples[condition_label]` where guide-level editing rate to be calculated
+
+
+Sample filtering thresholds
+* `--count-correlation-thres` (default: `0.7`): Threshold of guide count correlation to mask out.
+* `--edit-rate-thres` (default: `0.1`): Mean editing rate threshold per sample to mask out.
+* `--lfc-thres` (default: `0.1`): Positive guides' correlation threshold to filter out.
+
+Other
 * `--recalculate-edits` (default: `False`): Even when `ReporterScreen.layers['edit_count']` exists, recalculate the edit counts from `ReporterScreen.uns['allele_count']`."
 
 <br/><br/>
@@ -237,7 +275,7 @@ Above command produces
 * `my_sorting_screen_filtered.h5ad` with filtered alleles stored in `.uns`,   
 * `my_sorting_screen_filtered.filtered_allele_stats.pdf`, and `my_sorting_screen_filtered.filter_log.txt` that report allele count stats in each filtering step.  
 
-You may want to adjust the flitering parameters to obtain optimal balance between # guides per variant & # variants that are scored. See example outputs of filtering step [here](docs/example_filtering_outputs/).
+You may want to adjust the flitering parameters to obtain optimal balance between # guides per variant & # variants that are scored. See example outputs of filtering step [here](docs/example_filtering_output/).
 
 
 ### Translating alleles
@@ -367,7 +405,7 @@ Above command produces
   * `--target-column` (default: `target`): Column key in `bdata.guides` that describes the target element of each guide.
   * `--guide-activity-col`: Column in `bdata.guides` DataFrame showing the editing rate estimated via external tools.
 * Sample annotations (`bdata.samples` column keys)
-  * `--condition-column` (default: `bin`): Column key in `bdata.samples` that describes experimental condition.
+  * `--condition-column` (default: `condition`): Column key in `bdata.samples` that describes experimental condition.
   * `-uq`, `--sorting-bin-upper-quantile-column` (default: `upper_quantile`): Column name with upper quantile values of each sorting bin in bdata.samples
   * `-lq`, `--sorting-bin-lower-quantile-column` (default: `lower_quantile`): Column name with lower quantile values of each sorting bin in bdata.samples
 

diff --git a/bean/annotate/translate_allele.py b/bean/annotate/translate_allele.py
@@ -595,7 +595,8 @@ def strsplit_edit(edit_str):
 
 def annotate_edit(
     edit_info: pd.DataFrame,
-    edit_col="edit",
+    edit_col: str = "edit",
+    control_tag: str = "CONTROL",
     splice_sites: Collection[
         int
     ] = None,  # TODO: may be needed to extended into multi-chromosome case
@@ -604,6 +605,7 @@ def annotate_edit(
 
     Args
     edit_info: pd.DataFrame with at least 1 column of 'edit_col', which has 'Edit' format.
+    control_tag: String tag identifying non-targeting control guides so their variant signal are not aggregated.
     splice_sites: Collection of integer splice site positions. If the edit position matches the positions, it will be annotated as 'splicing'.
 
     """
@@ -619,8 +621,13 @@ def annotate_edit(
     edit_info.loc[
         edit_info.pos.map(lambda s: not s.startswith("A")), "coding"
     ] = "noncoding"
-    edit_info.loc[edit_info.pos.map(lambda s: "CONTROL" in s), "group"] = "negctrl"
-    edit_info.loc[edit_info.pos.map(lambda s: "CONTROL" in s), "coding"] = "negctrl"
+    if control_tag is not None:
+        edit_info.loc[
+            edit_info.pos.map(lambda s: control_tag in s), "group"
+        ] = "negctrl"
+        edit_info.loc[
+            edit_info.pos.map(lambda s: control_tag in s), "coding"
+        ] = "negctrl"
     edit_info.loc[
         (edit_info.coding == "noncoding") & (edit_info.group != "negctrl"), "int_pos"
     ] = edit_info.loc[

diff --git a/bean/model/model.py b/bean/model/model.py
@@ -206,18 +206,23 @@ def ControlNormalModel(data, mask_thres=10, use_bcmatch=True):
                     data.sample_mask,
                     data.a0_bcmatch,
                 )
-                with poutine.mask(
-                    mask=torch.logical_and(
-                        data.X_bcmatch_masked.permute(0, 2, 1).sum(axis=-1)
-                        > mask_thres,
-                        data.repguide_mask,
-                    )
-                ):
-                    pyro.sample(
-                        "guide_bcmatch_counts",
-                        dist.DirichletMultinomial(a_bcmatch, validate_args=False),
-                        obs=data.X_bcmatch_masked.permute(0, 2, 1),
-                    )
+                try:
+                    with poutine.mask(
+                        mask=torch.logical_and(
+                            data.X_bcmatch_masked.permute(0, 2, 1).sum(axis=-1)
+                            > mask_thres,
+                            data.repguide_mask,
+                        )
+                    ):
+                        pyro.sample(
+                            "guide_bcmatch_counts",
+                            dist.DirichletMultinomial(a_bcmatch, validate_args=False),
+                            obs=data.X_bcmatch_masked.permute(0, 2, 1),
+                        )
+                except RuntimeError:
+                    print(data.X_bcmatch_masked.shape)
+                    print(data.repguide_mask.shape)
+                    exit(1)
 
     return alleles_p_bin
 

diff --git a/bean/model/run.py b/bean/model/run.py
@@ -102,7 +102,7 @@ def parse_args():
     )
     parser.add_argument(
         "--condition-col",
-        default="bin",
+        default="condition",
         type=str,
         help="Column key in `bdata.samples` that describes experimental condition.",
     )
@@ -281,8 +281,21 @@ def check_args(args, bdata):
         pass
     elif args.library_design == "tiling":
         if args.allele_df_key is None:
+            key_to_use = "allele_counts"
+            n_alleles = len(bdata.uns[key_to_use])
+            for key, df in bdata.uns.items():
+                if "allele_counts" not in key or not isinstance(df, pd.DataFrame):
+                    continue
+                if len(df) < n_alleles:
+                    key_to_use = key
+                    n_alleles = len(df)
+            warn(
+                f"--allele-df-key not provided for tiling screen. Using the most filtered allele counts with {n_alleles} alleles stored in {key_to_use}."
+            )
+            args.allele_df_key = key_to_use
+        elif args.allele_df_key not in bdata.uns:
             raise ValueError(
-                "--allele-df-key not provided for tiling screen. Feed in the key then allele counts in screen.uns[allele_df_key] will be used."
+                f"--allele-df-key {args.allele_df_key} not in ReporterScreen.uns. Check your input."
             )
     else:
         raise ValueError(
@@ -293,7 +306,7 @@ def check_args(args, bdata):
             bdata.guides[args.negctrl_col].map(lambda s: s.lower())
             == args.negctrl_col_value.lower()
         ).sum()
-        if not n_negctrl >= 20:
+        if not n_negctrl >= 10:
             raise ValueError(
                 f"Not enough negative control guide in the input data: {n_negctrl}. Check your input arguments."
             )