Refining a training set
Deepbinner's method of building a training set isn't perfect – locating barcodes in raw signal data is hard! Some incorrectly labelled samples can therefore sneak through.
This page describes how you can refine your training set by excluding samples that are possibly labelled incorrectly. Disclaimer: I'm not sure how much (or how little) this actually helps the final model, so it may not be necessary. But regardless, this is what I did when making the models included with Deepbinner.
Start by shuffling the samples and splitting them into equal parts. Here I used a 4-way split:
shuf unrefined_samples > temp
mv temp unrefined_samples
total_count=$(wc -l < unrefined_samples)
count_per_file=$(( (total_count + 3) / 4))
split -a 1 -d -l "$count_per_file" unrefined_samples split_samples_
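The + 3 in the count_per_file arithmetic is integer ceiling division, so the split never produces more than four files even when the sample count is not a multiple of four. Here is a small sketch on a made-up 10-line file. The toy_* file names and the parts variable are hypothetical, and it assumes GNU split for the -d flag:

```shell
# Use a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Toy stand-in for unrefined_samples: 10 lines, one "sample" per line.
seq 1 10 > "$workdir/toy_samples"

parts=4
total_count=$(wc -l < "$workdir/toy_samples")
# Ceiling division: (10 + 4 - 1) / 4 = 3 lines per file.
count_per_file=$(( (total_count + parts - 1) / parts ))

split -a 1 -d -l "$count_per_file" "$workdir/toy_samples" "$workdir/toy_split_"

# Show the line count of each piece; the last file holds the remainder.
wc -l "$workdir"/toy_split_*
```

With 10 lines this gives three files of 3 lines and a final file of 1 line, so the last quarter can be somewhat smaller than the others.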
I then made smaller versions of these to use as validation sets during training (using the full files as validation sets would also work but would take longer):
for f in split_samples_*; do
head -n 10000 "$f" > "$f"_small
done
Combine the split samples into three-quarter training sets (each leaving out one quarter of the data):
cat split_samples_1 split_samples_2 split_samples_3 > split_samples_0_train
cat split_samples_0 split_samples_2 split_samples_3 > split_samples_1_train
cat split_samples_0 split_samples_1 split_samples_3 > split_samples_2_train
cat split_samples_0 split_samples_1 split_samples_2 > split_samples_3_train
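If you want a different number of parts, the same leave-one-out files can be built with a loop rather than written out by hand. This is a sketch on toy files, not part of Deepbinner. The parts variable and the scratch directory are assumptions, and the single-digit suffixes from split -a 1 -d limit it to at most 10 parts:

```shell
# Scratch directory with toy parts standing in for split_samples_0..3.
workdir=$(mktemp -d)
parts=4
for i in $(seq 0 $((parts - 1))); do
    printf 'sample_from_part_%s\n' "$i" > "$workdir/split_samples_$i"
done

# For each part i, concatenate every other part into split_samples_i_train.
for i in $(seq 0 $((parts - 1))); do
    out="$workdir/split_samples_${i}_train"
    : > "$out"    # start from an empty file
    for j in $(seq 0 $((parts - 1))); do
        if [ "$j" -ne "$i" ]; then    # skip the held-out part
            cat "$workdir/split_samples_$j" >> "$out"
        fi
    done
done

# The training set for part 2 should contain parts 0, 1 and 3 only.
cat "$workdir/split_samples_2_train"
```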
Now train a model on each training set, using samples from the left-out quarter as validation. I don't use too many epochs here, so it doesn't take that long:
deepbinner train --train split_samples_0_train --val split_samples_0_small --model_out model_0 --epochs 100
deepbinner train --train split_samples_1_train --val split_samples_1_small --model_out model_1 --epochs 100
deepbinner train --train split_samples_2_train --val split_samples_2_small --model_out model_2 --epochs 100
deepbinner train --train split_samples_3_train --val split_samples_3_small --model_out model_3 --epochs 100
Using each of our models, we can now classify the samples in the quarter that was not included in its training set:
deepbinner classify -s model_0 split_samples_0 > split_samples_0_classification
deepbinner classify -s model_1 split_samples_1 > split_samples_1_classification
deepbinner classify -s model_2 split_samples_2 > split_samples_2_classification
deepbinner classify -s model_3 split_samples_3 > split_samples_3_classification
And finally, the deepbinner refine command will compare the classifications to the labels, only outputting samples for which these match:
deepbinner refine split_samples_0 split_samples_0_classification >> refined_samples
deepbinner refine split_samples_1 split_samples_1_classification >> refined_samples
deepbinner refine split_samples_2 split_samples_2_classification >> refined_samples
deepbinner refine split_samples_3 split_samples_3_classification >> refined_samples
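The train, classify and refine commands follow the same pattern for every part, so they can be generated by a loop. The sketch below only writes the commands (copied from this page) to a script file for review and does not run Deepbinner. The parts and script variables are hypothetical names:

```shell
parts=4
script=$(mktemp)    # file that collects the generated commands

for i in $(seq 0 $((parts - 1))); do
    {
        echo "deepbinner train --train split_samples_${i}_train --val split_samples_${i}_small --model_out model_${i} --epochs 100"
        echo "deepbinner classify -s model_${i} split_samples_${i} > split_samples_${i}_classification"
        echo "deepbinner refine split_samples_${i} split_samples_${i}_classification >> refined_samples"
    } >> "$script"
done

# Inspect the generated commands before running them.
cat "$script"
```

Once the commands look right, you could run them with bash "$script". They are grouped per part rather than per step, which should be equivalent because each part's classification only depends on that part's own model. Because the refine lines append to refined_samples, delete any old refined_samples file before re-running.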
If all went well, you should now have a refined_samples file that is slightly smaller than your original unrefined_samples file, with most of the mislabelled samples removed.
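A quick way to see how much refinement removed is to compare line counts. This assumes one sample per line, as the wc -l and split -l commands earlier on this page do. The sketch uses made-up files in a scratch directory, so the numbers are illustrative only:

```shell
workdir=$(mktemp -d)
# Toy stand-ins: 100 samples before refinement, 93 after.
seq 1 100 > "$workdir/unrefined_samples"
seq 1 93 > "$workdir/refined_samples"

# Wrapping wc in $(( )) strips the padding some wc versions add.
before=$(( $(wc -l < "$workdir/unrefined_samples") ))
after=$(( $(wc -l < "$workdir/refined_samples") ))
removed=$(( before - after ))

echo "kept $after of $before samples ($removed removed)"
```

On your real data, point the two paths at your own unrefined_samples and refined_samples files. If a large fraction was removed, it may be worth checking the earlier steps before training on the refined set.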