Ema (Emerald) mapper added. #168
Conversation
…pdated tagging policy for read names.
What was the reason for excluding N-base-containing barcodes? Are the N-base-containing barcodes not compatible with ema?
I get an error originating from line 54 in this ema script. It seems the barcode is encoded using A, T, C and G, and thus N is not allowed.
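To illustrate the constraint, a minimal hedged sketch (hypothetical helper, not the actual extract_DBS code), assuming ema accepts only A/T/C/G in the barcode:

# Hypothetical helper, not code from this PR: keep only barcodes that ema
# can encode (A, T, C, G); anything containing N is dropped before mapping.
def is_ema_compatible(barcode: str) -> bool:
    return set(barcode.upper()) <= {"A", "T", "C", "G"}

assert is_ema_compatible("AAAAAAAATATCTACGCTCA")
assert not is_ema_compatible("AAAAAAAANATCTACGCTCA")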
Really nice, I think this is a good start and I think this setting will account for most cases.
But if we accept this, we should create the following issues to investigate the corner cases (which I am assuming they mostly are).
- Add non-barcoded reads to mapping results
- Include reads with N-containing barcode to mapping results
We should probably also add an issue for big files. I've had some problems with this in the metagenomics datasets, where the input files were required to be sorted. There I ended up doing an external sort by building a SQL database and subsequently extracting the reads from it in sorted order (see the sketch after this list). It scales very well in terms of RAM usage, but it took quite some time, so I'd try to avoid it.
- Investigate read sorting on big files & possibly add a solution for problematic sizes
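For reference, a rough sketch of the SQLite-based external sort mentioned above (hypothetical code with an assumed schema and function name, not what was actually used in the metagenomics pipeline):

# Sketch of an external sort via an on-disk SQLite database: records are
# inserted into a temporary database and read back ordered by barcode, so
# memory use stays low even for very large FASTQ files.
# Assumes db_path does not already exist.
import sqlite3

def external_sort_fastq(records, db_path="sort_tmp.sqlite3"):
    # records: iterable of (barcode, header, seq, plus, qual) tuples
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE reads (barcode TEXT, header TEXT, seq TEXT, plus TEXT, qual TEXT)")
    con.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?)", records)
    con.commit()
    con.execute("CREATE INDEX idx_barcode ON reads (barcode)")  # keeps the ORDER BY cheap
    for row in con.execute("SELECT barcode, header, seq, plus, qual FROM reads ORDER BY barcode"):
        yield row
    con.close()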
src/blr/blr.yaml
Outdated
@@ -3,7 +3,7 @@ molecule_tag: MI # Used to store molecule ID, same as 10x default.
 num_mol_tag: MN # Used to store number of molecules per barcode
 sequence_tag: RX # Used to store original barcode sequence in bam file. 'RX' is 10x genomic default
 genome_reference: # Path to indexed reference
-read_mapper: bowtie2 # Choose bwa, bowtie2 or minimap2
+read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 and ema
Suggested change:
-read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 and ema
+read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 or ema
src/blr/Snakefile
Outdated
"pigz -cd {input.fastq} |" | ||
" paste - - - - |" | ||
" awk -F ' ' '{{print $2,$0}}' |" | ||
" sort -t ' ' -k1,1 |" | ||
" cut -d' ' -f 2- |" | ||
" tr '\t' '\n' |" | ||
" gzip > {output.fastq}" |
I think the awk can be removed (sort can key directly on the barcode field of the read name) and I think it makes sense to use pigz for zipping.
"pigz -cd {input.fastq} |"
" paste - - - - |"
" sort -t "_" -k 3 |"
" tr "\t" "\n" |"
" pigz > {output.fastq}"
Results from a 100x blr-testdata-0.2 sorting.
time pigz -cd 100x-testfile.fq.gz | paste - - - - | sort -t "_" -k 3 | tr "\t" "\n" | pigz > sort.fq.gz
real 1m39.001s
user 1m46.161s
sys 0m2.255s
time pigz -cd 100x-testfile.fq.gz | paste - - - - | awk -F ' ' '{{print $2,$0}}' | sort -t ' ' -k1,1 | cut -d' ' -f 2- | tr '\t' '\n' | gzip > sort.fq.gz
real 1m45.886s
user 2m16.831s
sys 0m2.180s
When using pigz for decompression, please use it in single-thread mode (add option -p 1). According to the man page, decompression cannot be parallelized, so it just spawns some extra I/O threads which help to reduce wall-clock time. However, this comes at the cost of higher total CPU time. What actually is faster than gzip, for some weird reason, is pigz -dc -p 1, which seems to use a more efficient (albeit still single-threaded) decompression algorithm.
Thanks for the comments, I will add changes! 👍
This would be quite easy, as we could just output an additional file for non-barcoded reads in
Do you mean to map these separately in ema, or whether to include them at all? Currently they are also quite few; actually there were 72 in my dataset (the same ones that appeared as non-barcoded following
This is probably reasonable to investigate. Just a note on the unix sort: I also want to clarify that "sorted" in this case only requires the barcodes to be grouped together, not that they have to appear in alphabetical order. Possibly one could take advantage of this when writing a more efficient program (see the sketch below). One idea I had was to make use of the
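As a sketch of the grouping idea (hypothetical code, not part of this PR), a plain dict keyed on barcode is enough when the file fits in memory, with no alphabetical sort at all:

# Group FASTQ records so that reads with the same barcode come out
# consecutively; the barcodes themselves end up in no particular order.
# Holds everything in memory, so only suitable for files that fit in RAM.
from collections import defaultdict

def group_by_barcode(records):
    # records: iterable of (barcode, record) pairs
    groups = defaultdict(list)
    for barcode, record in records:
        groups[barcode].append(record)
    for recs in groups.values():
        yield from recs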
I agree it's not much, but since it's easily handled I'd at least map these reads with
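Something along these lines could work for writing the non-barcoded reads to a separate file (hedged sketch with made-up names, not the PR's code; get_barcode stands in for whatever extracts the barcode from a FASTQ header):

# Split a gzipped FASTQ into barcoded and non-barcoded output files so the
# non-barcoded reads can be mapped separately.
import gzip

def split_by_barcode_status(in_fastq, barcoded_out, non_barcoded_out, get_barcode):
    # get_barcode returns the barcode string, or None for non-barcoded reads
    with gzip.open(in_fastq, "rt") as fin, \
         gzip.open(barcoded_out, "wt") as with_bc, \
         gzip.open(non_barcoded_out, "wt") as without_bc:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            out = with_bc if get_barcode(record[0]) else without_bc
            out.writelines(record)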
See issue #114
A few additional things had to be added for implementing the new mapper.
Read names are now tagged with the barcode for ema, e.g.
@ST-E00269:339:H27G2CCX2:7:1102:21186:8060:AAAAAAAATATCTACGCTCA BX:Z:AAAAAAAATATCTACGCTCA
For this, tagfastq had to be updated to take in information about the current mapper and modify read names accordingly. Note that uncorrected barcodes CANNOT be included in this scheme when running tagfastq for ema. extract_DBS now skips barcodes containing N.
Runtime for sorting chr22 testdata (~6 million reads).
Runtimes for mapping chr22 testdata (~6 million reads).
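For illustration, a minimal sketch of the read-name tagging described above (hypothetical function name, not the actual tagfastq implementation), reproducing the example format:

def tag_read_name_for_ema(name, corrected_barcode):
    # Append the corrected barcode to the read name and add it as a BX tag.
    return "{0}:{1} BX:Z:{1}".format(name, corrected_barcode)

print(tag_read_name_for_ema("@ST-E00269:339:H27G2CCX2:7:1102:21186:8060",
                            "AAAAAAAATATCTACGCTCA"))
# -> @ST-E00269:339:H27G2CCX2:7:1102:21186:8060:AAAAAAAATATCTACGCTCA BX:Z:AAAAAAAATATCTACGCTCA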