Change the default behavior of bean-qc to not remove the replicates even when it has bad replicates
pinellolab · Mar 25, 2024 · c858819
1 parent 4316376
Showing 8 changed files with 164 additions and 101 deletions.
diff --git a/README.md b/README.md
@@ -206,9 +206,18 @@ Above command produces `prefix_editing_preference.[html,ipynb]` as editing prefe
 ## `bean-qc`: QC of reporter screen data
 ```bash
 bean-qc \
-  my_sorting_screen.h5ad    `# Input ReporterScreen .h5ad file path` \
+  my_sorting_screen.h5ad             `# Input ReporterScreen .h5ad file path` \
   -o my_sorting_screen_masked.h5ad   `# Output ReporterScreen .h5ad file path` \
-  -r qc_report_my_sorting_screen   `# Prefix for QC report` 
+  -r qc_report_my_sorting_screen     `# Prefix for QC report`
+
+# Inspect the output qc_report_my_sorting_screen.html, then rerun with adjusted QC thresholds if needed
+
+bean-qc \
+  my_sorting_screen.h5ad              \
+  -o my_sorting_screen_masked.h5ad    \
+  -r qc_report_my_sorting_screen      \
+  -b                                  `# Remove bad replicates` \
+  `# Optional threshold overrides go here, e.g. --count-correlation-thres 0.7`
 ```
 
 `bean-qc` supports following quality control and masks samples with low quality. Specifically:  
@@ -229,36 +238,68 @@ Above command produces
 
 
 #### Additional Parameters
-* `--tiling` (default: `None`): If set as `True` or `False`, it sets the screen object to be tiling (`True`) or variant (`False`)-targeting screen when calculating editing rate. 
-* `--replicate-label` (default: `"rep"`): Label of column in `bdata.samples` that describes replicate ID.
-* `--condition-label` (default: `"condition"`)": Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time, etc.).
-* `--sample-covariates` (default: `None`): Comma-separated list of column names in `bdata.samples` that describes non-selective experimental condition (drug treatment, etc.). The values in the `bdata.samples` should NOT contain `.`. 
-* `--no-editing` (default: `False`): Ignore QC about editing. Can be used for QC of other editing modalities.
-
-Editing rate quantification
-* `--ctrl-cond` (default: `"bulk"`): Value in of column in `ReporterScreen.samples[condition_label]` where guide-level editing rate to be calculated
-
-  Editing rate is calculated with following parameters in variant screens: 
-  * `--target-pos-col` (default: `"target_pos"`): Target position column in `bdata.guides` specifying target edit position in reporter.
-
-  For tiling screens:
-  * `--rel-pos-is-reporter` (default: `False`): Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If `False`, those are relative to spacer position.
-  * `--edit-start-pos` (default: `2`): Edit start position to quantify editing rate on, 0-based inclusive.
-  * `--edit-end-pos` (default: `7`): Edit end position to quantify editing rate on, 0-based exclusive.
-
-LFC of positive controls
-* `--posctrl-col` (default: `group`): Column name in .h5ad.guides DataFrame that specifies guide category.
-* `--posctrl-val` (default: `PosCtrl`): Value in .h5ad.guides[`posctrl_col`] that specifies guide will be used as the positive control in calculating log fold change.
-* `--lfc-conds` (default: `"top,bot"`): Values in of column in `ReporterScreen.samples[condition_label]` for LFC will be calculated between, delimited by comma
-
-
-Sample filtering thresholds
-* `--count-correlation-thres` (default: `0.7`): Threshold of guide count correlation to mask out.
-* `--edit-rate-thres` (default: `0.1`): Mean editing rate threshold per sample to mask out.
-* `--lfc-thres` (default: `0.1`): Positive guides' correlation threshold to filter out.
-
-Other
-* `--recalculate-edits` (default: `False`): Even when `ReporterScreen.layers['edit_count']` exists, recalculate the edit counts from `ReporterScreen.uns['allele_count']`."
+##### Optional arguments:
+* `-o OUT_SCREEN_PATH`, `--out-screen-path OUT_SCREEN_PATH`
+                        Path where quality-filtered ReporterScreen object to be written to
+* `-r OUT_REPORT_PREFIX`, `--out-report-prefix OUT_REPORT_PREFIX`
+                        Output prefix of qc report (prefix.html, prefix.ipynb)
+
+##### QC thresholds:
+* `--count-correlation-thres COUNT_CORRELATION_THRES`
+                        Correlation threshold to mask out.
+* `--edit-rate-thres EDIT_RATE_THRES`
+                        Mean editing rate threshold per sample to mask out.
+* `--lfc-thres LFC_THRES`
+                        Positive guides' correlation threshold to filter out.
+
+##### Run options:
+* `-b`, `--remove-bad-replicates`
+                        Remove replicates in which fewer than two samples meet the QC thresholds (bean-run does not support having only one sorting bin sample for a replicate).
+* `-i`, `--ignore-missing-samples`
+                        By default, if the ReporterScreen object does not contain all conditions for
+                        each replicate, fake empty samples are added. If the flag is provided, no dummy samples are added.
+* `--no-editing`          Ignore QC about editing. Can be used for QC of other editing modalities.
+* `--dont-recalculate-edits`
+                        When ReporterScreen.layers['edit_count'] exists, do not recalculate the edit counts from
+                        ReporterScreen.uns['allele_count'].
+
+##### Input `.h5ad` formatting:
+Note that these arguments change how the QC metrics are calculated for guides, samples, or replicates.
+* `--tiling TILING`       Specify that the guide library is tiling library without 'n guides per target' design
+* `--replicate-label REPLICATE_LABEL`
+                        Label of column in `bdata.samples` that describes replicate ID.
+* `--sample-covariates SAMPLE_COVARIATES`
+                        Comma-separated list of column names in `bdata.samples` that describes non-selective
+                        experimental condition. (drug treatment, etc.)
+* `--condition-label CONDITION_LABEL`
+                        Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time,
+                        etc.)
+###### Editing rate calculation
+  * `--ctrl-cond CTRL_COND`
+                        Value in `ReporterScreen.samples[condition_label]` marking the samples for which the
+                        guide-level editing rate is calculated
+  * `--rel-pos-is-reporter`
+                        Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If
+                        `False`, those are relative to spacer position.
+  Editing rate is calculated with the following parameters in:
+    * Variant screens: 
+      * `--target-pos-col TARGET_POS_COL`
+                        Target position column in `bdata.guides` specifying target edit position in reporter
+    * Tiling screens:
+      * `--edit-start-pos EDIT_START_POS`
+                            Edit start position to quantify editing rate on, 0-based inclusive.
+      * `--edit-end-pos EDIT_END_POS`
+                            Edit end position to quantify editing rate on, 0-based exclusive.
+###### LFC of positive controls
+  * `--posctrl-col POSCTRL_COL`
+                          Column name in ReporterScreen.guides DataFrame that specifies guide category. To use all
+                          gRNAs, feed empty string ''.
+  * `--posctrl-val POSCTRL_VAL`
+                          Value in ReporterScreen.guides[`posctrl_col`] that marks the guides used as
+                          positive controls when calculating log fold change.
+  * `--lfc-conds LFC_CONDS`
+                          Comma-delimited values in `ReporterScreen.samples[condition_label]` between which
+                          the LFC is calculated
 
 <br/><br/>
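
The `-b` / `--remove-bad-replicates` rule described in the README can be read as "keep a replicate only if at least two of its samples pass QC". The following is a minimal sketch of that reading, not bean's implementation: the function `replicates_to_keep`, the sample names, and the dictionary layout are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def replicates_to_keep(sample_rep, sample_passes_qc, min_samples=2):
    """Return replicate IDs with at least `min_samples` samples passing QC.

    sample_rep: dict mapping sample name -> replicate ID (hypothetical layout)
    sample_passes_qc: dict mapping sample name -> bool QC outcome
    """
    # Count passing samples per replicate.
    passing = Counter(
        rep for sample, rep in sample_rep.items() if sample_passes_qc[sample]
    )
    return {rep for rep, n in passing.items() if n >= min_samples}


# Hypothetical screen with two replicates and two sorting bins each.
sample_rep = {
    "rep1_top": "rep1",
    "rep1_bot": "rep1",
    "rep2_top": "rep2",
    "rep2_bot": "rep2",
}
sample_passes_qc = {
    "rep1_top": True,
    "rep1_bot": True,
    "rep2_top": True,
    "rep2_bot": False,  # one failing bin leaves rep2 with a single usable sample
}
kept = replicates_to_keep(sample_rep, sample_passes_qc)
```

Under this reading, `rep2` would be dropped only when `-b` is given; without the flag the replicate stays in the output.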
 

diff --git a/bean/framework/Edit.py b/bean/framework/Edit.py
@@ -175,7 +175,11 @@ def get_range(self):
             min(edit.pos for edit in self.edits),
             max(edit.pos for edit in self.edits),
         )
-
+
+    def set_uid(self, uid):
+        self.edits = {edit.set_uid(uid) for edit in self.edits}
+        return self
+
     def get_uid(self):
         uid = None
         if (
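
The new `set_uid` rebuilds `self.edits` with a set comprehension, which only works if each element's `set_uid` returns the edit itself. A minimal sketch of that pattern follows, assuming a hashable edit whose hash includes its uid; the `Edit` and `Allele` classes here are simplified stand-ins, not the classes in `bean/framework/Edit.py`.

```python
class Edit:
    """Simplified stand-in for an edit: a position plus an optional uid."""

    def __init__(self, pos, uid=None):
        self.pos = pos
        self.uid = uid

    def set_uid(self, uid):
        self.uid = uid
        return self  # returning self lets callers collect the updated edits

    def __eq__(self, other):
        return (self.pos, self.uid) == (other.pos, other.uid)

    def __hash__(self):
        return hash((self.pos, self.uid))


class Allele:
    """Simplified stand-in for a container holding a set of edits."""

    def __init__(self, edits):
        self.edits = set(edits)

    def set_uid(self, uid):
        # Rebuild the set so every edit is stored under its new hash.
        self.edits = {edit.set_uid(uid) for edit in self.edits}
        return self


allele = Allele([Edit(3), Edit(5)]).set_uid("g1")
uids = {edit.uid for edit in allele.edits}
```

Rebuilding the set matters when the hash depends on the uid: mutating elements in place would leave them filed under stale hashes, and membership tests could then fail.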

diff --git a/bean/mapping/CRISPResso2Align.cpython-38-x86_64-linux-gnu.so b/bean/mapping/CRISPResso2Align.cpython-38-x86_64-linux-gnu.so
diff --git a/bean/model/run.py b/bean/model/run.py
@@ -222,7 +222,7 @@ def parse_args():
     parser.add_argument(
         "--allele-df-key",
         type=str,
-        default=None,
+        default="allele_counts",
         help="screen.uns[allele_df_key] will be used as the allele count.",
     )
     parser.add_argument(

diff --git a/bean/qc/utils.py b/bean/qc/utils.py
@@ -23,122 +23,138 @@ def parse_args():
     parser.add_argument(
         "bdata_path", help="Path to the ReporterScreen object to run QC on", type=str
     )
+    thres_parser = parser.add_argument_group("QC thresholds")
+    run_parser = parser.add_argument_group("Run options")
+    input_parser = parser.add_argument_group("Input .h5ad formatting")
+
+    thres_parser.add_argument(
+        "--count-correlation-thres",
+        help="Correlation threshold to mask out.",
+        type=float,
+        default=0.7,
+    )
+    thres_parser.add_argument(
+        "--edit-rate-thres",
+        help="Mean editing rate threshold per sample to mask out.",
+        type=float,
+        default=0.1,
+    )
+    thres_parser.add_argument(
+        "--lfc-thres",
+        help="Positive guides' correlation threshold to filter out.",
+        type=float,
+        default=-0.1,
+    )
+
     parser.add_argument(
         "-o",
         "--out-screen-path",
         help="Path where quality-filtered ReporterScreen object to be written to",
         type=str,
     )
     parser.add_argument(
+        "-r",
+        "--out-report-prefix",
+        help="Output prefix of qc report (prefix.html, prefix.ipynb)",
+        type=str,
+    )
+
+    run_parser.add_argument(
+        "-b", "--remove-bad-replicates",
+        help="Remove replicates in which fewer than two samples meet the QC thresholds.",
+        action="store_true",
+    )
+    run_parser.add_argument(
         "-i",
         "--ignore-missing-samples",
         help="If the flag is not provided, if the ReporterScreen object does not contain all conditions for each replicate, make fake empty samples. If the flag is provided, don't add dummy samples.",
         action="store_true",
     )
-    parser.add_argument(
-        "-r",
-        "--out-report-prefix",
-        help="Output prefix of qc report (prefix.html, prefix.ipynb)",
-        type=str,
+    run_parser.add_argument(
+        "--no-editing",
+        help="Ignore QC about editing. Can be used for QC of other editing modalities.",
+        action="store_true",
     )
-    parser.add_argument(
+    run_parser.add_argument(
+        "--dont-recalculate-edits",
+        help="When ReporterScreen.layers['edit_count'] exists, do not recalculate the edit counts from ReporterScreen.uns['allele_count'].",
+        action="store_true",
+    )
+
+    input_parser.add_argument(
         "--tiling",
         dest="tiling",
         type=lambda x: bool(distutils.util.strtobool(x)),
         help="Specify that the guide library is tiling library without 'n guides per target' design",
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--replicate-label",
         help="Label of column in `bdata.samples` that describes replicate ID.",
         type=str,
         default="rep",
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--sample-covariates",
         help="Comma-separated list of column names in `bdata.samples` that describes non-selective experimental condition. (drug treatment, etc.)",
         type=str,
         default=None,
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--condition-label",
         help="Label of column in `bdata.samples` that describes experimental condition. (sorting bin, time, etc.)",
         type=str,
         default="condition",
     )
-    parser.add_argument(
-        "--no-editing",
-        help="Ignore QC about editing. Can be used for QC of other editing modalities.",
-        action="store_true",
-    )
-    parser.add_argument(
+    input_parser.add_argument(
         "--target-pos-col",
         help="Target position column in `bdata.guides` specifying target edit position in reporter",
         type=str,
         default="target_pos",
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--rel-pos-is-reporter",
         help="Specifies whether `edit_start_pos` and `edit_end_pos` are relative to reporter position. If `False`, those are relative to spacer position.",
         action="store_true",
         default=False,
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--edit-start-pos",
         help="Edit start position to quantify editing rate on, 0-based inclusive.",
         default=2,
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--edit-end-pos",
         help="Edit end position to quantify editing rate on, 0-based exclusive.",
         default=7,
     )
-    parser.add_argument(
-        "--count-correlation-thres",
-        help="Correlation threshold to mask out.",
-        type=float,
-        default=0.7,
-    )
-    parser.add_argument(
-        "--edit-rate-thres",
-        help="Mean editing rate threshold per sample to mask out.",
-        type=float,
-        default=0.1,
-    )
-    parser.add_argument(
+
+    input_parser.add_argument(
         "--posctrl-col",
         help="Column name in ReporterScreen.guides DataFrame that specifies guide category. To use all gRNAs, feed empty string ''.",
         type=str,
         default="target_group",
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--posctrl-val",
         help="Value in ReporterScreen.guides[`posctrl_col`] that specifies guide will be used as the positive control in calculating log fold change.",
         type=str,
         default="PosCtrl",
     )
-    parser.add_argument(
-        "--lfc-thres",
-        help="Positive guides' correlation threshold to filter out.",
-        type=float,
-        default=-0.1,
-    )
-    parser.add_argument(
+
+    input_parser.add_argument(
         "--lfc-conds",
         help="Values in of column in `ReporterScreen.samples[condition_label]` for LFC will be calculated between, delimited by comma",
         type=str,
         default="top,bot",
     )
-    parser.add_argument(
+    input_parser.add_argument(
         "--ctrl-cond",
         help="Values in of column in `ReporterScreen.samples[condition_label]` for guide-level editing rate to be calculated",
         type=str,
         default="bulk",
     )
-    parser.add_argument(
-        "--recalculate-edits",
-        help="Even when ReporterScreen.layers['edit_count'] exists, recalculate the edit counts from ReporterScreen.uns['allele_count'].",
-        action="store_true",
-    )
+
+
     args = parser.parse_args()
     if args.out_screen_path is None:
         args.out_screen_path = f"{args.bdata_path.rsplit('.h5ad', 1)[0]}.filtered.h5ad"
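
The regrouping in this hunk relies on `argparse` argument groups, which change only how `--help` is organized; parsed values still land on a single namespace. The sketch below reproduces a small subset of the options from the diff in a standalone parser. The `prog` name and the `build_parser` helper are illustrative and not part of bean.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="bean-qc-sketch")
    parser.add_argument("bdata_path", type=str)

    # Groups affect the help layout only, not how arguments are parsed.
    thres_parser = parser.add_argument_group("QC thresholds")
    run_parser = parser.add_argument_group("Run options")

    thres_parser.add_argument("--count-correlation-thres", type=float, default=0.7)
    thres_parser.add_argument("--lfc-thres", type=float, default=-0.1)
    run_parser.add_argument(
        "-b", "--remove-bad-replicates", action="store_true"
    )
    return parser


# Default run: bad replicates are kept.
args = build_parser().parse_args(["screen.h5ad"])

# Opt-in removal plus a threshold override.
args_b = build_parser().parse_args(["screen.h5ad", "-b", "--lfc-thres=-0.2"])
```

Because `-b` is a `store_true` flag, omitting it leaves `remove_bad_replicates` as `False`, which matches the new default of keeping all replicates.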

diff --git a/bin/bean-qc b/bin/bean-qc
@@ -33,9 +33,9 @@ def main():
             comp_cond2=args.lfc_cond2,
             ctrl_cond=args.ctrl_cond,
             exp_id=args.out_report_prefix,
-            recalculate_edits=args.recalculate_edits,
+            recalculate_edits=not args.dont_recalculate_edits,
             base_edit_data=args.base_edit_data,
-
+            remove_bad_replicates=args.remove_bad_replicates,
         ),
         kernel_name="bean_python3",
     )
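
When a `store_true` flag such as `--dont-recalculate-edits` is inverted before being passed on, logical `not` and bitwise `~` give different results. The short check below illustrates this with a standalone parser; it is a sketch and does not call any bean code.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dont-recalculate-edits", action="store_true")

results = {}
for argv in ([], ["--dont-recalculate-edits"]):
    flag = parser.parse_args(argv).dont_recalculate_edits
    results[flag] = {
        # bool subclasses int, so bitwise inversion treats True as 1 and False as 0.
        "bitwise": ~int(flag),
        # Logical negation yields a proper bool.
        "logical": not flag,
    }

# Both bitwise results are non-zero integers and therefore truthy.
bitwise_truthiness = {flag: bool(r["bitwise"]) for flag, r in results.items()}
```

Bitwise inversion yields `-1` or `-2`, both of which are truthy, so a downstream `if recalculate_edits:` would always take the recalculation branch. Logical `not` gives the intended `True`/`False` toggle.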