This repository contains a pipeline for processing B-cell receptor (BCR) sequencing data with a focus on clonotype analysis and consensus sequence generation.
The pipeline processes BCR sequencing data by:
- Filtering multiplets and non-full-length chains
- Grouping sequences by V(D)J allele combinations
- Generating consensus sequences for clonotypes
- Mapping cell barcodes to clonotypes
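The grouping step above can be sketched in a few lines; the field names here are illustrative stand-ins (the real column names come from `config.yaml`):

```python
from collections import Counter

# Toy sequence records; v/d/j call fields are illustrative placeholders.
records = [
    {"cell": "AAAC", "v_call": "IGHV1-2*02",  "d_call": "IGHD3-10*01", "j_call": "IGHJ4*02"},
    {"cell": "AAAG", "v_call": "IGHV1-2*02",  "d_call": "IGHD3-10*01", "j_call": "IGHJ4*02"},
    {"cell": "AATC", "v_call": "IGHV3-23*01", "d_call": "IGHD3-10*01", "j_call": "IGHJ6*02"},
]

# Group cells by exact V(D)J allele combination: one clonotype per combination.
clonotype_sizes = Counter(
    (r["v_call"], r["d_call"], r["j_call"]) for r in records
)
```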
The required dependencies are specified in the `environment.yml` file. You can create a conda environment using:

```
conda env create -f environment.yml
```

Activate the environment:

```
conda activate TRIBAL_preprocess
```
To run the pipeline, use the following command:

```
python TRIBAL_preprocess.py --config <input_file> [--multiplets <bool>]
```

- `--config`: Path to the configuration YAML file (required)
- `--multiplets`: Boolean flag to include multiplet analysis (optional, default=False)
  - When set to `True`, the pipeline will:
    - Process cells marked as multiplets
    - Analyze cells with multiple heavy/light chain pairs
    - Attempt to match multiplet chains based on transcript counts and existing clonotypes
    - Add multiplet-derived sequences to the clonotype pool
  - When set to `False`, multiplets are filtered out

Example usage with multiplets:

```
python TRIBAL_preprocess.py --config config.yaml --multiplets True
```
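The multiplet-matching step can be illustrated with a minimal sketch. The rule shown here (the chain supported by the most transcripts wins) is an assumption for illustration, not necessarily the pipeline's exact logic, and the field names are hypothetical:

```python
# Hypothetical multiplet cell with two heavy-chain candidates; field names
# are illustrative, and "most transcripts wins" is an assumed selection rule.
chains = [
    {"locus": "IGH", "sequence_id": "seq_A", "transcript_count": 12},
    {"locus": "IGH", "sequence_id": "seq_B", "transcript_count": 3},
]

# Keep the heavy chain supported by the most transcripts.
best_heavy = max(
    (c for c in chains if c["locus"] == "IGH"),
    key=lambda c: c["transcript_count"],
)
```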
The repository implements two different preprocessing approaches, each with its own configuration and command:
The first approach, implemented in `preprocess.py`, groups BCR sequences based on their V(D)J allele combinations:

- Grouping Strategy: Sequences are grouped by exact V, D, and J allele matches
- Advantages:
  - More precise clonotype definition
  - Better handling of somatic hypermutation
  - Maintains allelic information
- Use Case: Preferred when studying allele-specific responses or when high precision in clonotype definition is required

Run with:

```
python preprocess.py --config config_allele.yaml
```
The second approach groups sequences based on V(D)J gene families without considering specific alleles:

- Grouping Strategy: Sequences are grouped by V, D, and J gene families
- Advantages:
  - More lenient grouping
  - Captures broader clonal relationships
  - Less sensitive to allele calling errors
- Use Case: Suitable for general repertoire analysis or when studying broader clonal relationships

Run with:

```
python preprocess.py --config config_gene.yaml
```
Note: Each approach requires its own configuration file (`config_allele.yaml` or `config_gene.yaml`) with appropriate settings for the grouping strategy.
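The practical difference between the two grouping strategies can be sketched as follows. Dropping the `*NN` allele suffix to recover the gene-level call follows the common IMGT-style naming convention; the helper names are illustrative, not part of the pipeline:

```python
def allele_key(v_call, d_call, j_call):
    # Exact allele match: "IGHV1-2*02" stays distinct from "IGHV1-2*04".
    return (v_call, d_call, j_call)

def gene_key(v_call, d_call, j_call):
    # Drop the "*NN" allele suffix so calls from the same gene collapse together.
    return tuple(call.split("*")[0] for call in (v_call, d_call, j_call))
```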
The `config.yaml` file contains several important parameters that control the pipeline's behavior. For a complete example, see the provided `config.yaml` file.
- `base_path`: Base directory containing input files
- `all_genes_file`: CSV file containing gene information
- `metadata_file`: CSV file with metadata
- `clonotype_file`: Tab-separated file with clonotype information
- `annotation_file`: Tab-separated file with annotation data
- `barcodes_file`: Tab-separated file with cell barcode information
- `multiplet_column`: Column name identifying multiplet status
- `barcode_column`: Column name containing cell barcodes
- `full_length_column`: Column indicating full-length chain status
- `cdr3_columns`: List of columns containing CDR3 amino acid sequences
- `chain_columns`:
  - `heavy_full_length`: Column indicating full-length heavy chain
  - `kappa_full_length`: Column for full-length kappa chain
  - `lambda_full_length`: Column for full-length lambda chain
  - `heavy_isotype`: Column containing heavy chain isotype information
- `clonotype_column`: Column containing clonotype information
- `clonotype_id_column`: Column containing clonotype IDs
- `barcode_column`: Column containing cell barcodes
- `clonotype_mapping_columns`:
  - `light_chain`: Column mapping light chain information
  - `heavy_chain`: Column mapping heavy chain information
- `cell_barcode_column`: Column name for cell barcodes
- `locus_column`: Column containing chain locus information
- `sequence_column`: Column containing sequence data
- `v_call_column`: Column for V gene calls
- `d_call_column`: Column for D gene calls
- `j_call_column`: Column for J gene calls
- `germline_alignment_column`: Column with germline alignment information
- `chain_types`:
  - `heavy`: Identifier for heavy chain
  - `kappa`: Identifier for kappa chain
  - `lambda`: Identifier for lambda chain
- `records`: List of column names for sequence-level data output
  - `cellid`: Cell identifier
  - `clonotype`: Clonotype identifier
  - `heavy_chain_isotype`: Heavy chain isotype
  - `heavy_chain_seq`: Heavy chain sequence
  - `heavy_chain_v_allele`: Heavy chain V allele
  - `light_chain_seq`: Light chain sequence
  - `light_chain_v_allele`: Light chain V allele
- `clono_records`: List of column names for clonotype-level data output
  - `clonotype`: Clonotype identifier
  - `heavy_chain_root`: Consensus heavy chain sequence
  - `light_chain_root`: Consensus light chain sequence
- `seq_data_file_script1`: Output path for sequence data (CDR3-based)
- `root_data_file_script1`: Output path for consensus sequences (CDR3-based)
- `clonotype_map_file_script1`: Output path for clonotype mapping (CDR3-based)
- `seq_data_file_script2`: Output path for sequence data (allele-based)
- `root_data_file_script2`: Output path for consensus sequences (allele-based)
- `clonotype_map_file_script2`: Output path for clonotype mapping (allele-based)
- `required_loci`: Minimum number of loci required per cell
- `min_consensus_sequences`: Minimum number of sequences required for consensus generation (default: 1)
- `min_cells_per_clonotype`: Minimum number of cells required to form a clonotype (default: 2)
- `transcript_count_column`: Column name containing transcript count information
- `nucleotides`: List of valid nucleotides for consensus generation: `['A', 'T', 'C', 'G', 'N']`
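A minimal sketch of position-wise consensus over the configured nucleotide alphabet; the pipeline's actual routine may handle ties, gaps, and length differences differently:

```python
from collections import Counter

NUCLEOTIDES = ["A", "T", "C", "G", "N"]  # mirrors the `nucleotides` config list

def consensus(seqs):
    """Majority vote at each position across equal-length sequences (a sketch)."""
    result = []
    for column in zip(*seqs):
        # Count only characters in the configured alphabet, then take the winner.
        counts = Counter(base for base in column if base in NUCLEOTIDES)
        result.append(counts.most_common(1)[0][0])
    return "".join(result)
```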
- `level`: Logging level (e.g., "INFO", "DEBUG", "WARNING")
- `format`: Format string for log messages
- `failed_cells_file`: Output file for recording failed cell processing
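To show the shape such a file might take, here is an abbreviated sketch. The key names match the parameters documented above, but the nesting and all values are placeholders and may not match the real `config.yaml` schema; consult the provided example file for the authoritative layout:

```yaml
# Illustrative fragment only; values are placeholders, nesting is a guess.
base_path: /data/bcr_run1
all_genes_file: all_genes.csv
metadata_file: metadata.csv
clonotype_file: clonotypes.tsv
annotation_file: annotations.tsv
barcodes_file: barcodes.tsv

required_loci: 2
min_consensus_sequences: 1
min_cells_per_clonotype: 2

nucleotides: ["A", "T", "C", "G", "N"]

level: "INFO"
format: "%(asctime)s - %(levelname)s - %(message)s"
failed_cells_file: failed_cells.csv
# (column-mapping and output-path keys omitted for brevity)
```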
The pipeline generates three main output files:

- `seq_data.csv`: Contains sequence-level information for each cell
- `root_data.csv`: Contains consensus sequences for each clonotype
- `clonotype_map.json`: Maps between internal clonotype IDs and their V(D)J definitions
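A short sketch of how downstream code might load these outputs together; the file contents written below are fabricated stand-ins so the example is self-contained:

```python
import csv
import json
import os
import tempfile

# Write tiny stand-ins for two of the outputs (contents are illustrative only).
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "clonotype_map.json"), "w") as fh:
    json.dump({"clonotype_1": {"v_call": "IGHV1-2*02", "j_call": "IGHJ4*02"}}, fh)
with open(os.path.join(tmp, "seq_data.csv"), "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["cellid", "clonotype"])
    writer.writeheader()
    writer.writerow({"cellid": "AAAC", "clonotype": "clonotype_1"})

# Load them back the way a downstream script might.
with open(os.path.join(tmp, "clonotype_map.json")) as fh:
    clonotype_map = json.load(fh)   # internal clonotype ID -> V(D)J definition
with open(os.path.join(tmp, "seq_data.csv")) as fh:
    seq_rows = list(csv.DictReader(fh))  # one row per cell
```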
The pipeline includes comprehensive logging that tracks:
- Processing progress
- Error messages
- Processing statistics
- Success/failure of file operations
The pipeline includes robust error handling for:
- Missing or malformed input files
- Data processing errors
- Invalid configurations
- File I/O operations
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
[Add your chosen license here]
- The pipeline assumes that the input file is a CSV file with the specified columns.
- The pipeline assumes that the input file is sorted by `cell_barcode`.