
Creating a native barcoding training set


Prerequisites

  • Barcoded read sets made with the EXP-NBD103 native barcoding expansion.
  • Non-barcoded read sets made with the SQK-LSK108 ligation sequencing kit 1D. While not required, including these read sets can help Deepbinner to properly categorise barcode-less reads as such.
  • Decent reference genomes for each of your samples. They don't have to be perfect – Nanopore-only assemblies should be fine.
  • The C++ DTW component of Deepbinner, built and ready to use (a rough install sketch is shown below).
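
If you installed Deepbinner from source, the DTW component is normally built as part of the install. As a rough sketch (the exact steps may differ, so treat these commands as an assumption and follow the installation instructions in the repository), it might look like this:

git clone https://github.com/rrwick/Deepbinner.git
cd Deepbinner
pip3 install .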

Overview

For each of your read sets, you must:

  • Basecall with Albacore
  • Create training samples

Then this step applies to all of your read sets together:

  • Balance and finalise the data

Basecall with Albacore

The first step is to basecall the reads with Albacore. While Deepbinner will be trained on the raw (pre-basecall) data, basecalling the reads is necessary for determining the correct barcode label and which reads to use in our training set.

Example commands

These (and subsequent) commands assume your reads are in a directory named fast5_dir.

For barcoded read sets, we use the --barcoding option so Albacore assigns barcode labels:

read_fast5_basecaller.py -f FLO-MIN106 -k SQK-LSK108 -i fast5_dir -t 16 -s basecalling -o fastq --disable_filtering --barcoding --recursive

For non-barcoded read sets, we don't need that option:

read_fast5_basecaller.py -f FLO-MIN106 -k SQK-LSK108 -i fast5_dir -t 16 -s basecalling -o fastq --disable_filtering --recursive

After basecalling is finished, combine all of the reads into a single file:

cat basecalling/workspace/*.fastq | gzip > reads.fastq.gz
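
If you want a quick sanity check before moving on, you can count the reads in the combined file (this assumes standard four-line FASTQ records):

# Each FASTQ record is four lines, so divide the line count by four.
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))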

Create training samples

The deepbinner prep command takes care of producing training samples from the raw data. It carries out the logic summarised in the flowchart image on this page.

Your reference sequences are assumed to be in a single file named ref.fasta. For barcoded samples, this means multiple genome references in the one file.
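
For example, if each of your samples has its own reference FASTA (the sample_*.fasta names here are just placeholders), you can concatenate them like this:

cat sample_*.fasta > ref.fasta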

For native barcodes, we generate two separate training sets, one for the starts of reads:

deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_start --ref ref.fasta --sequencing_summary basecalling/sequencing_summary.txt 1> read_start_training_samples 2> read_start_training.out

and one for the ends of reads:

deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_end --ref ref.fasta --sequencing_summary basecalling/sequencing_summary.txt 1> read_end_training_samples 2> read_end_training.out

Note that the --sequencing_summary option is only used to inform Deepbinner about Albacore's barcode choice, so it may be omitted when working with non-barcoded read sets.
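
For example, a non-barcoded read set could be prepared with the same commands minus --sequencing_summary (the output filenames here are just illustrations, chosen to match the balancing step below):

deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_start --ref ref.fasta 1> non_barcoded_read_start_training_samples 2> non_barcoded_read_start_training.out
deepbinner prep --fastq reads.fastq.gz --fast5_dir fast5_dir --kit EXP-NBD103_end --ref ref.fasta 1> non_barcoded_read_end_training_samples 2> non_barcoded_read_end_training.out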

Balance and finalise the data

Balancing the barcoded training samples

Having run the above steps for each of your barcoded sequencing runs, you can now use the deepbinner balance command to:

  • balance the number of training samples, so each barcode has the same amount.
  • add additional no-barcode samples made from various types of random noise.

Assuming your unbalanced samples are in files whose names end with _read_start_training_samples and _read_end_training_samples, you can run these commands to balance them:

deepbinner balance *_read_start_training_samples > training_data_read_start
deepbinner balance *_read_end_training_samples > training_data_read_end
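
To check the results, you can count the samples in each balanced file (one sample per line). The per-barcode count below assumes the barcode label is the first tab-separated field of each line, which may not match your version of Deepbinner:

# Total sample counts
wc -l training_data_read_start training_data_read_end
# Per-barcode counts (assumes a tab-separated label in the first column)
cut -f1 training_data_read_start | sort | uniq -c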

Note that even for a barcoded sequencing run, deepbinner prep will have produced no-barcode training samples. These consist of real read signal, taken from parts of the read other than where the barcode is.

Adding additional non-barcoded training samples

If you have also prepared data from non-barcoded sequencing runs, you can now add those to your training sets. Your balanced training data already contains two types of no-barcode samples (random noise and real signal from barcoded reads that excludes the barcode), but non-barcoded data is still valuable because it includes the adapter region of reads that never received a barcode. It helps Deepbinner learn what such reads look like.

How many samples you add is up to you. For example, to add 10000 training samples you might use a command like this:

shuf non_barcoded_read_start_training_samples | head -n 10000 >> training_data_read_start
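
And the corresponding command for the read-end training set (assuming your non-barcoded read-end samples are in a similarly named file):

shuf non_barcoded_read_end_training_samples | head -n 10000 >> training_data_read_end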

That's it! You should now have two files (training_data_read_start and training_data_read_end) that are ready to give to deepbinner train.