Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dev #18

Merged
merged 9 commits into from
Mar 7, 2024
Merged

Dev #18

Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
116 changes: 77 additions & 39 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,17 @@ This is an analysis toolkit for the pooled CRISPR reporter or sensor data. The r
### Screen data is saved as *ReporterScreen* object in the pipeline.
BEAN stores mapped gRNA and allele counts in `ReporterScreen` object which is compatible with [AnnData](https://anndata.readthedocs.io/en/latest/index.html). See [Data Structure](#data-structure) section for more information.

### Examples
We provide example scripts in `tests/`. Running `pytest --sparse-ordering` generates example input/output files from running 1 and 2-4 sequentially.
<br/><br/>

## Examples
| Tutorial link | [Library design](#pipeline-run-options-by-library-design) | Selection | Reporter |
|---------------| -------------- | --------- | -------- |
| [LDL-C GWAS](docs/tutorials/ldl_var.md) | GWAS variant library | FACS sorting | Yes/No |
| [LDL-C LDLR CDS](docs/tutorials/ldl_cds.md) | Coding sequence tiling libarary | FACS sorting | Yes/No |
| TKO simulated (Coming soon!) | GWAS variant library | Survival / Proliferation | Yes/No |
| TKO simulated, tiling (Coming soon!) | Coding sequence tiling libarary | Survival / Proliferation | Yes/No |

We also provide example scripts in `tests/`. Running `pytest --sparse-ordering` will generate example input/output files.

### Pipeline run options by library design
The `bean-filter` and `bean-run` steps depend on the type of gRNA library design, where BEAN supports two modes of running.
Expand Down Expand Up @@ -83,8 +92,8 @@ File should contain following columns.
* Option 2: `accessibility_signal`: ATAC-seq signal value of the target loci of each guide.
* For variant library (gRNAs are designed to target specific variants and ignores bystander edits)
* `target`: This column denotes which target variant/element of each gRNA. This is not used in `bean-count[-samples]` but required to run `bean-run` in later steps.
* `target_group [Optional]`: If negative/positive control gRNA will be considered in `bean-qc` and/or `bean-run`, specify as "NegCtrl"/"PosCtrl" in this column.
* `target_pos [Optional]`: If `--match_target_pos` flag is used, input file needs `target_pos` which specifies 0-based relative position of targeted base within Reporter sequence.
* `target_group`: If negative/positive control gRNA will be considered in `bean-qc` and/or `bean-run`, specify as "NegCtrl"/"PosCtrl" in this column.
* `target_pos`: If `--match_target_pos` flag is used, input file needs `target_pos` which specifies 0-based relative position of targeted base within Reporter sequence.
* For tiling library (gRNAs tile coding / noncoding sequences)
* `strand`: Specifies gRNA strand information relative to the reference genome.
* `chrom`: Chromosome of gRNA targeted locus.
Expand All @@ -97,12 +106,18 @@ File should contain following columns with header.
* `R1_filepath`: Path to read 1 `.fastq[.gz]` file
* `R2_filepath`: Path to read 1 `.fastq[.gz]` file
* `sample_id`: ID of sequencing sample
* `rep [Optional]`: Replicate # of this sample (Should NOT contain `.`)
* `bin [Optional]`: Name of the sorting bin
* `upper_quantile [Optional]`: FACS sorting upper quantile
* `lower_quantile [Optional]`: FACS sorting lower quantile
* `rep`: Replicate # of this sample (Should NOT contain `.`)
* `condition`: Name of the sorting bin (ex. `top`, `bot`), or label of timepoint (ex. `D5`, `D18`)

For FACS sorting screens:
* `upper_quantile`: FACS sorting upper quantile
* `lower_quantile`: FACS sorting lower quantile

Optional columns are not required but can be provided for compatibility with `bean-qc` and `bean-run`. See [example](tests/data/sample_list.csv).
For proliferation / survival screens:
* `time`: FACS sorting upper quantile


Also see examples for [FACS sorting screen](tests/data/sample_list.csv).

### Output file format
`count` or `count-samples` produces `.h5ad` and `.xlsx` file with guide and per-guide allele counts.
Expand All @@ -111,35 +126,45 @@ Optional columns are not required but can be provided for compatibility with `be

### Parameters
* `-b`, `--edited-base` (`str`, default: `None`): For base editors, the base that should be ignored when matching the gRNA sequence
* `-i`, `--input`: List of fastq and sample ids. See above for the format.
* `-f`, `--sgRNA-filename` (`str`, default: `None`): sgRNA description file. (See above)
* `--guide-start-seq`: Guide starts after this sequence in R1 (default: '')
* `--guide-end-seq`: Guide ends after this sequence in R1 (default: '')
* `-r`, `--count-reporter` (default: `False`): Count edited alleles in reporters.
* `-q`, `--min-average-read-quality` (default: `30`): Minimum average quality score (phred33) to keep a read
* `-s`, `--min-single-bp-quality` (default: `0`): Minimum single bp score (phred33) to keep a read (default: 0)

Output formats
* `-n`, `--name`: Name of the output file will be `bean_count_{name}.h5ad`.
* `-o`, `--output-folder`: Output folder
* `--tiling`: Specify that the guide library is tiling library
* `-r`, `--count-reporter` (default: `False`): Count edited alleles in reporters.
* `--string-allele` (default: `False`): Store allele as quality filtered string instead of Allele object
* `-g`, `--count-guide-edits` (default: `False`): count the self editing of guides (default: False)
* `-m`, `--count-guide-reporter-alleles`: count the matched allele of guide and reporter edit
* `--match-target-pos` (default: `False`): Count the edit in the exact target position.
* `--target-pos-col` (default: `target_pos`): Column name specifying the relative target position within *reporter* sequence.
* `--offset` (default: `False`): Guide file has `offest` column that will be added to the relative position of reporters.
* `--align-fasta` (default: ''): gRNA is aligned to this sequence to infer the offset. Can be used when the exact offset is not provided.


Read structure
* `--guide-start-seq`: Guide starts after this sequence in R1 (default: '')
* `--guide-end-seq`: Guide ends after this sequence in R1 (default: '')
* `--guide-start-seqs-file` (default: `None`): CSV file path with per-sample `guide_start_seq` to be used, if provided. Formatted as `sample_id, guide_start_seq`
* `--guide-end-seqs-file` (default: `None`): CSV file path with per-sample `guide_end_seq` to be used, if provided. Formatted as `sample_id,guide_end_seq`
* `-l`, `--reporter-length` (default: `32`): Length of the reporter sequence.
* `--gstart-reporter` (default: `6`): Start position of the guide sequence in the reporter
* `--guide-bc` (default: `True`): Construct has guide barcode
* `--guide-bc-len` (default: `4`): Guide barcode sequence length at the beginning of the R2

Mapping quality filters
* `-q`, `--min-average-read-quality` (default: `30`): Minimum average quality score
(phred33) to keep a read
* `--keep-intermediate` (default: `False`): Keep all intermediate files generated from filtering.
* `-s`, `--min-single-bp-quality` (default: `0`): Minimum single bp score (phred33) to keep a read (default: 0)
* `--qstart-R1` (default: `0`): Start position of the read when filtering for quality score of the read 1
* `--qend-R1` (default: `47`): End position of the read when filtering for quality score of the read 1
* `--qstart-R2` (default: `0`): Start position of the read when filtering for quality score of the read 2
* `--qend-R2` (default: `36`): End position of the read when filtering for quality score of the read 2
* `--gstart-reporter` (default: `6`): Start position of the guide sequence in the reporter
* `--match-target-pos` (default: `False`): Count the edit in the exact target position.
* `--target-pos-col` (default: `target_pos`): Column name specifying the relative target position within *reporter* sequence.
* `--guide-bc` (default: `True`): Construct has guide barcode
* `--guide-bc-len` (default: `4`): Guide barcode sequence length at the beginning of the R2
* `--offset` (default: `False`): Guide file has `offest` column that will be added to the relative position of reporters.
* `--align-fasta` (default: ''): gRNA is aligned to this sequence to infer the offset. Can be used when the exact offset is not provided.
* `--string-allele` (default: `False`): Store allele as quality filtered string instead of Allele object
* `-g`, `--count-guide-edits` (default: `False`): count the self editing of guides (default: False)
* `-m`, `--count-guide-reporter-alleles`: count the matched allele of guide and reporter edit
* `--tiling`: Specify that the guide library is tiling library
* `-i`, `--input`: List of fastq and sample ids. See above for the format.

Run options
* `-t`, `--threads` (default: `10`): Number of threads to use
* `--guide-start-seqs-file` (default: `None`): CSV file path with per-sample `guide_start_seq` to be used, if provided. Formatted as `sample_id, guide_start_seq`
* `--guide-end-seqs-file` (default: `None`): CSV file path with per-sample `guide_end_seq` to be used, if provided. Formatted as `sample_id,guide_end_seq`
* `--rerun` (default: `False`): Recount each sample. If `False`, existing count for each sample is taken.


Expand Down Expand Up @@ -206,20 +231,33 @@ Above command produces
#### Additional Parameters
* `--tiling` (default: `None`): If set as `True` or `False`, it sets the screen object to be tiling (`True`) or variant (`False`)-targeting screen when calculating editing rate.
* `--replicate-label` (default: `"rep"`): Label of column in `bdata.samples` that describes replicate ID.
* `--condition-label` (default: `"bin"`)": Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time, etc.).
* `--condition-label` (default: `"condition"`)": Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time, etc.).
* `--sample-covariates` (default: `None`): Comma-separated list of column names in `bdata.samples` that describes non-selective experimental condition (drug treatment, etc.). The values in the `bdata.samples` should NOT contain `.`.
* `--no-editing` (default: `False`): Ignore QC about editing. Can be used for QC of other editing modalities.
* `--target-pos-col` (default: `"target_pos"`): Target position column in `bdata.guides` specifying target edit position in reporter.
* `--rel-pos-is-reporter` (default: `False`): Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If `False`, those are relative to spacer position.
* `--edit-start-pos` (default: `2`): Edit start position to quantify editing rate on, 0-based inclusive.
* `--edit-end-pos` (default: `7`): Edit end position to quantify editing rate on, 0-based exclusive.
* `--count-correlation-thres` (default: `0.8`): Threshold of guide count correlation to mask out.
* `--edit-rate-thres` (default: `0.1`): Median editing rate threshold per sample to mask out.

Editing rate quantification
* `--ctrl-cond` (default: `"bulk"`): Value in of column in `ReporterScreen.samples[condition_label]` where guide-level editing rate to be calculated

Editing rate is calculated with following parameters in variant screens:
* `--target-pos-col` (default: `"target_pos"`): Target position column in `bdata.guides` specifying target edit position in reporter.

For tiling screens:
* `--rel-pos-is-reporter` (default: `False`): Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If `False`, those are relative to spacer position.
* `--edit-start-pos` (default: `2`): Edit start position to quantify editing rate on, 0-based inclusive.
* `--edit-end-pos` (default: `7`): Edit end position to quantify editing rate on, 0-based exclusive.

LFC of positive controls
* `--posctrl-col` (default: `group`): Column name in .h5ad.guides DataFrame that specifies guide category.
* `--posctrl-val` (default: `PosCtrl`): Value in .h5ad.guides[`posctrl_col`] that specifies guide will be used as the positive control in calculating log fold change.
* `--lfc-thres` (default: `0.1`): Positive guides' correlation threshold to filter out.
* `--lfc-conds` (default: `"top,bot"`): Values in of column in `ReporterScreen.samples[condition_label]` for LFC will be calculated between, delimited by comma
* `--ctrl-cond` (default: `"bulk"`): Value in of column in `ReporterScreen.samples[condition_label]` where guide-level editing rate to be calculated


Sample filtering thresholds
* `--count-correlation-thres` (default: `0.7`): Threshold of guide count correlation to mask out.
* `--edit-rate-thres` (default: `0.1`): Mean editing rate threshold per sample to mask out.
* `--lfc-thres` (default: `0.1`): Positive guides' correlation threshold to filter out.

Other
* `--recalculate-edits` (default: `False`): Even when `ReporterScreen.layers['edit_count']` exists, recalculate the edit counts from `ReporterScreen.uns['allele_count']`."

<br/><br/>
Expand All @@ -237,7 +275,7 @@ Above command produces
* `my_sorting_screen_filtered.h5ad` with filtered alleles stored in `.uns`,
* `my_sorting_screen_filtered.filtered_allele_stats.pdf`, and `my_sorting_screen_filtered.filter_log.txt` that report allele count stats in each filtering step.

You may want to adjust the flitering parameters to obtain optimal balance between # guides per variant & # variants that are scored. See example outputs of filtering step [here](docs/example_filtering_outputs/).
You may want to adjust the flitering parameters to obtain optimal balance between # guides per variant & # variants that are scored. See example outputs of filtering step [here](docs/example_filtering_output/).


### Translating alleles
Expand Down Expand Up @@ -367,7 +405,7 @@ Above command produces
* `--target-column` (default: `target`): Column key in `bdata.guides` that describes the target element of each guide.
* `--guide-activity-col`: Column in `bdata.guides` DataFrame showing the editing rate estimated via external tools.
* Sample annotations (`bdata.samples` column keys)
* `--condition-column` (default: `bin`): Column key in `bdata.samples` that describes experimental condition.
* `--condition-column` (default: `condition`): Column key in `bdata.samples` that describes experimental condition.
* `-uq`, `--sorting-bin-upper-quantile-column` (default: `upper_quantile`): Column name with upper quantile values of each sorting bin in bdata.samples
* `-lq`, `--sorting-bin-lower-quantile-column` (default: `lower_quantile`): Column name with lower quantile values of each sorting bin in bdata.samples

Expand Down
13 changes: 10 additions & 3 deletions bean/annotate/translate_allele.py
Original file line number Diff line number Diff line change
Expand Up @@ -595,7 +595,8 @@ def strsplit_edit(edit_str):

def annotate_edit(
edit_info: pd.DataFrame,
edit_col="edit",
edit_col: str = "edit",
control_tag: str = "CONTROL",
splice_sites: Collection[
int
] = None, # TODO: may be needed to extended into multi-chromosome case
Expand All @@ -604,6 +605,7 @@ def annotate_edit(

Args
edit_info: pd.DataFrame with at least 1 column of 'edit_col', which has 'Edit' format.
control_tag: String tag identifying non-targeting control guides so their variant signal are not aggregated.
splice_sites: Collection of integer splice site positions. If the edit position matches the positions, it will be annotated as 'splicing'.

"""
Expand All @@ -619,8 +621,13 @@ def annotate_edit(
edit_info.loc[
edit_info.pos.map(lambda s: not s.startswith("A")), "coding"
] = "noncoding"
edit_info.loc[edit_info.pos.map(lambda s: "CONTROL" in s), "group"] = "negctrl"
edit_info.loc[edit_info.pos.map(lambda s: "CONTROL" in s), "coding"] = "negctrl"
if control_tag is not None:
edit_info.loc[
edit_info.pos.map(lambda s: control_tag in s), "group"
] = "negctrl"
edit_info.loc[
edit_info.pos.map(lambda s: control_tag in s), "coding"
] = "negctrl"
edit_info.loc[
(edit_info.coding == "noncoding") & (edit_info.group != "negctrl"), "int_pos"
] = edit_info.loc[
Expand Down
29 changes: 17 additions & 12 deletions bean/model/model.py
Original file line number Diff line number Diff line change
Expand Up @@ -206,18 +206,23 @@ def ControlNormalModel(data, mask_thres=10, use_bcmatch=True):
data.sample_mask,
data.a0_bcmatch,
)
with poutine.mask(
mask=torch.logical_and(
data.X_bcmatch_masked.permute(0, 2, 1).sum(axis=-1)
> mask_thres,
data.repguide_mask,
)
):
pyro.sample(
"guide_bcmatch_counts",
dist.DirichletMultinomial(a_bcmatch, validate_args=False),
obs=data.X_bcmatch_masked.permute(0, 2, 1),
)
try:
with poutine.mask(
mask=torch.logical_and(
data.X_bcmatch_masked.permute(0, 2, 1).sum(axis=-1)
> mask_thres,
data.repguide_mask,
)
):
pyro.sample(
"guide_bcmatch_counts",
dist.DirichletMultinomial(a_bcmatch, validate_args=False),
obs=data.X_bcmatch_masked.permute(0, 2, 1),
)
except RuntimeError:
print(data.X_bcmatch_masked.shape)
print(data.repguide_mask.shape)
exit(1)

return alleles_p_bin

Expand Down
19 changes: 16 additions & 3 deletions bean/model/run.py
Original file line number Diff line number Diff line change
Expand Up @@ -102,7 +102,7 @@ def parse_args():
)
parser.add_argument(
"--condition-col",
default="bin",
default="condition",
type=str,
help="Column key in `bdata.samples` that describes experimental condition.",
)
Expand Down Expand Up @@ -281,8 +281,21 @@ def check_args(args, bdata):
pass
elif args.library_design == "tiling":
if args.allele_df_key is None:
key_to_use = "allele_counts"
n_alleles = len(bdata.uns[key_to_use])
for key, df in bdata.uns.items():
if "allele_counts" not in key or not isinstance(df, pd.DataFrame):
continue
if len(df) < n_alleles:
key_to_use = key
n_alleles = len(df)
warn(
f"--allele-df-key not provided for tiling screen. Using the most filtered allele counts with {n_alleles} alleles stored in {key_to_use}."
)
args.allele_df_key = key_to_use
elif args.allele_df_key not in bdata.uns:
raise ValueError(
"--allele-df-key not provided for tiling screen. Feed in the key then allele counts in screen.uns[allele_df_key] will be used."
f"--allele-df-key {args.allele_df_key} not in ReporterScreen.uns. Check your input."
)
else:
raise ValueError(
Expand All @@ -293,7 +306,7 @@ def check_args(args, bdata):
bdata.guides[args.negctrl_col].map(lambda s: s.lower())
== args.negctrl_col_value.lower()
).sum()
if not n_negctrl >= 20:
if not n_negctrl >= 10:
raise ValueError(
f"Not enough negative control guide in the input data: {n_negctrl}. Check your input arguments."
)
Expand Down
Loading