Refining a training set

Ryan Wick edited this page Aug 19, 2018 · 1 revision

Deepbinner's method of building a training set isn't perfect – locating barcodes in raw signal data is hard! Some incorrectly labelled samples can therefore sneak through.

This page describes how you can refine your training set by excluding samples that are possibly labelled incorrectly. Disclaimer: I'm not sure how much (or how little) this actually helps the final model, so it may not be necessary. But regardless, this is what I did when making the models included with Deepbinner.

Partition the data into files

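First shuffle the samples so that each part is a random subset, then split them. Here I used a 4-way split (the + 3 rounds the per-file count up, so split never produces a fifth, leftover file):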

shuf unrefined_samples > temp
mv temp unrefined_samples

total_count=$(wc -l < unrefined_samples)
count_per_file=$(( (total_count + 3) / 4))
split -a 1 -dl $count_per_file unrefined_samples split_samples_
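
As a quick sanity check, the line counts of the pieces should add up to the line count of the original file. Here is a minimal sketch of such a check. Note that check_split is a hypothetical helper name of my own choosing, not part of Deepbinner:

```shell
# Hypothetical helper (not part of Deepbinner): confirm that the pieces
# together contain as many lines as the original file.
# Usage: check_split ORIGINAL PIECE...
check_split() {
    original=$1
    shift
    # The $(( ... )) wrapper strips any padding that wc adds on some systems.
    total=$(( $(wc -l < "$original") ))
    sum=0
    for piece in "$@"; do
        n=$(( $(wc -l < "$piece") ))
        echo "$piece: $n lines"
        sum=$((sum + n))
    done
    if [ "$sum" -eq "$total" ]; then
        echo "OK: $sum lines in total"
    else
        echo "Mismatch: pieces have $sum lines, original has $total" >&2
        return 1
    fi
}
```

For the 4-way split above, it could be called as: check_split unrefined_samples split_samples_0 split_samples_1 split_samples_2 split_samples_3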

I then made smaller versions of these to use as validation sets during training. Because the samples were shuffled, the first 10000 lines of each file are a random subset. Using the full files as validation sets would work too but take longer:

for f in split_samples_*; do
    head -n 10000 "$f" > "$f"_small
done

Train a model using each split

Combine the split samples into three-quarter training sets (each leaving out one quarter of the data):

cat split_samples_1 split_samples_2 split_samples_3 > split_samples_0_train
cat split_samples_0 split_samples_2 split_samples_3 > split_samples_1_train
cat split_samples_0 split_samples_1 split_samples_3 > split_samples_2_train
cat split_samples_0 split_samples_1 split_samples_2 > split_samples_3_train
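
If you use a different number of splits, writing out every cat command by hand gets tedious. The following is a rough sketch of how the same leave-one-out sets could be built in a loop. The function name make_train_sets is hypothetical, and it assumes the pieces are named with a common prefix followed by a number starting at 0, as above:

```shell
# Hypothetical helper (not part of Deepbinner): for each piece i in 0..N-1,
# concatenate all the other pieces into PREFIX<i>_train.
# Usage: make_train_sets PREFIX N
make_train_sets() {
    prefix=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        out="${prefix}${i}_train"
        : > "$out"                      # start from an empty output file
        j=0
        while [ "$j" -lt "$n" ]; do
            if [ "$j" -ne "$i" ]; then  # leave out piece i
                cat "${prefix}${j}" >> "$out"
            fi
            j=$((j + 1))
        done
        i=$((i + 1))
    done
}
```

Calling make_train_sets split_samples_ 4 should give the same four files as the cat commands above.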

Now train a model on each training set, using samples from the left-out quarter as validation. I don't use too many epochs here so it doesn't take that long:

deepbinner train --train split_samples_0_train --val split_samples_0_small --model_out model_0 --epochs 100
deepbinner train --train split_samples_1_train --val split_samples_1_small --model_out model_1 --epochs 100
deepbinner train --train split_samples_2_train --val split_samples_2_small --model_out model_2 --epochs 100
deepbinner train --train split_samples_3_train --val split_samples_3_small --model_out model_3 --epochs 100

Refine the samples

Using each of our models, we can now classify the samples in the quarter that was not included in its training set:

deepbinner classify -s model_0 split_samples_0 > split_samples_0_classification
deepbinner classify -s model_1 split_samples_1 > split_samples_1_classification
deepbinner classify -s model_2 split_samples_2 > split_samples_2_classification
deepbinner classify -s model_3 split_samples_3 > split_samples_3_classification

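Finally, the deepbinner refine command compares the classifications to the labels and outputs only the samples for which they match. The commands below append to refined_samples, so make sure that file doesn't already exist before running them: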

deepbinner refine split_samples_0 split_samples_0_classification >> refined_samples
deepbinner refine split_samples_1 split_samples_1_classification >> refined_samples
deepbinner refine split_samples_2 split_samples_2_classification >> refined_samples
deepbinner refine split_samples_3 split_samples_3_classification >> refined_samples

If all went well, you should now have a refined_samples file that is slightly smaller than your original unrefined_samples file, with most of the mislabelled samples removed.
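
To see how much was actually removed, you can compare the line counts of the two files. This is only a sketch: report_removed is a hypothetical helper name, and it assumes one sample per line in both files:

```shell
# Hypothetical helper (not part of Deepbinner): report how many samples
# refinement removed, assuming one sample per line.
# Usage: report_removed UNREFINED REFINED
report_removed() {
    before=$(( $(wc -l < "$1") ))
    after=$(( $(wc -l < "$2") ))
    if [ "$after" -gt "$before" ]; then
        echo "Warning: refined file is larger than the original (was it appended to twice?)" >&2
        return 1
    fi
    removed=$((before - after))
    echo "before:  $before"
    echo "after:   $after"
    if [ "$before" -gt 0 ]; then
        # Integer arithmetic only: percentage truncated to one decimal place.
        tenths=$((removed * 1000 / before))
        echo "removed: $removed ($((tenths / 10)).$((tenths % 10))%)"
    else
        echo "removed: $removed"
    fi
}
```

For example: report_removed unrefined_samples refined_samples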