Creating a native barcoding training set
To create a native barcoding training set, you will need:
- Barcoded read sets made with the EXP-NBD103 native barcoding expansion.
- Non-barcoded read sets made with the SQK-LSK108 ligation sequencing kit 1D. While not required, including these read sets can help Deepbinner to properly categorise barcode-less reads as such.
- Decent reference genomes for each of your samples. They don't have to be perfect – Nanopore-only assemblies should be fine.
- A working build of Deepbinner's C++ DTW component (see the build sketch below).
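If you haven't built the C++ component yet, one possible way is to install Deepbinner from source, which should compile the C++ code as part of the install. This is only a sketch; the repository's main README has the authoritative installation instructions:
git clone https://github.com/rrwick/Deepbinner.git
cd Deepbinner
pip3 install .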
For each of your read sets, you must:
- basecall the reads with Albacore
- produce training samples with deepbinner prep
Then this step applies to all of your read sets together:
- balance the training samples with deepbinner balance
The first step is to basecall the reads with Albacore. While Deepbinner will be trained on the raw (pre-basecall) data, basecalling the reads is necessary for determining the correct barcode label and which reads to use in our training set.
These (and subsequent) commands assume your reads are in a directory named fast5_dir.
For barcoded read sets, we use the --barcoding option so Albacore assigns barcode labels:
read_fast5_basecaller.py -f FLO-MIN106 -k SQK-LSK108 -i fast5_dir -t 16 -s basecalling -o fastq --disable_filtering --barcoding --recursive
For non-barcoded read sets, we don't need that option:
read_fast5_basecaller.py -f FLO-MIN106 -k SQK-LSK108 -i fast5_dir -t 16 -s basecalling -o fastq --disable_filtering --recursive
After basecalling is finished, combine all of the reads into a single file:
cat basecalling/workspace/*.fastq | gzip > reads.fastq.gz
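If you'd like a quick sanity check on the combined file, you can count the reads. This assumes a standard four-lines-per-read FASTQ, which is what Albacore produces:
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))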
The deepbinner prep command takes care of producing training samples from the raw data. It carries out the following logic:
(flowchart image)
Your reference sequences are assumed to be in a single file named ref.fasta. For barcoded samples, this means multiple genome references in the one file.
For native barcodes, we generate two separate training sets, one for the starts of reads:
deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_start --ref ref.fasta --sequencing_summary basecalling/sequencing_summary.txt 1> read_start_training_samples 2> read_start_training.out
and one for the ends of reads:
deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_end --ref ref.fasta --sequencing_summary basecalling/sequencing_summary.txt 1> read_end_training_samples 2> read_end_training.out
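It can be useful to check how many training samples each run produced. Assuming each training sample occupies one line of the output files (an assumption, not something this page guarantees), a simple count looks like this:
wc -l read_start_training_samples read_end_training_samples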
Note that the --sequencing_summary option is only used to inform Deepbinner about Albacore's barcode choice, so it may be omitted when working with non-barcoded read sets.
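For example, a read-start prep command for a non-barcoded run might look like this (the same command as above, minus the --sequencing_summary option):
deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_start --ref ref.fasta 1> read_start_training_samples 2> read_start_training.out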
Having run the above steps for each of your barcoded sequencing runs, you can now use the deepbinner balance command to:
- balance the number of training samples, so each barcode has the same amount.
- add additional no-barcode samples made from various types of random noise.
Assuming your unbalanced samples are in filenames that end with _read_start_training_samples and _read_end_training_samples, you can run these commands to balance them:
deepbinner balance *_read_start_training_samples > training_data_read_start
deepbinner balance *_read_end_training_samples > training_data_read_end
Note that even for a barcoded sequencing run, deepbinner prep will have produced no-barcode training samples. These consist of real read signal taken from parts of the read other than where the barcode is.
If you have also prepared data from non-barcoded sequencing runs, you can now add those to your training sets. In addition to the two types of no-barcode training samples already present in your balanced training data (random noise and real signal from barcoded reads that excludes the barcode), this data is valuable because it includes the adapter region of reads that never received a barcode, helping Deepbinner learn what such reads look like.
How many samples you add is up to you. For example, to add 10000 training samples you might use a command like this:
shuf non_barcoded_read_start_training_samples | head -n 10000 >> training_data_read_start
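The read-end samples are handled the same way. Assuming your non-barcoded read-end samples follow the same naming pattern, the equivalent command would be:
shuf non_barcoded_read_end_training_samples | head -n 10000 >> training_data_read_end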
That's it! You should now have two files (training_data_read_start and training_data_read_end) that are ready to give to deepbinner train.
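If you want to see which options the training step accepts before moving on, the subcommand's built-in help is a reasonable starting point:
deepbinner train --help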