Ema (Emerald) mapper added. #168

Merged
merged 9 commits into from
Dec 4, 2019

Conversation

@pontushojer (Collaborator) commented Dec 3, 2019

See issue #114

A few additional things had to be added to support the new mapper.

  • Sorting of the barcoded FASTQ file based on barcode. I have done this using a bash one-liner, which is fairly fast.
  • New barcode tagging scheme for ema FASTQs. ema requires the read names to be in the "10x" format, e.g. @ST-E00269:339:H27G2CCX2:7:1102:21186:8060:AAAAAAAATATCTACGCTCA BX:Z:AAAAAAAATATCTACGCTCA. For this, tagfastq had to be updated to take in information about the current mapper and modify read names accordingly. Note that uncorrected barcodes CANNOT be included in this scheme.
  • Removal of non-barcoded reads from FASTQs. These cause errors with ema, so reads for which the barcode could not be identified are skipped in tagfastq when ema is used.
  • Removal of barcodes containing N bases. These also cause errors with ema. The rule extract_DBS now skips barcodes containing N.
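The renaming step can be sketched roughly like this (a minimal illustration only: it assumes the corrected barcode is simply appended to the read name and repeated as a BX:Z: comment, and to_10x_header is a hypothetical helper, not part of tagfastq):

```shell
# Hypothetical helper: rewrite a read name into the "10x"-style header that
# ema expects, with the corrected barcode appended to the name and repeated
# as a BX:Z: tag. For illustration only, not tagfastq's actual code.
to_10x_header() {
  name=$1; bc=$2
  printf '@%s:%s BX:Z:%s\n' "$name" "$bc" "$bc"
}

to_10x_header 'ST-E00269:339:H27G2CCX2:7:1102:21186:8060' 'AAAAAAAATATCTACGCTCA'
# -> @ST-E00269:339:H27G2CCX2:7:1102:21186:8060:AAAAAAAATATCTACGCTCA BX:Z:AAAAAAAATATCTACGCTCA
```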

Runtimes for sorting chr22 testdata (~6 million reads).

  • real: 1m30.065s, 1m28.889s, 1m29.104s
  • user: 1m50.679s, 1m49.464s, 1m49.437s
  • sys: 0m11.854s, 0m11.280s, 0m11.329s

Runtimes for mapping chr22 testdata (~6 million reads).

  • real: 3m2.813s, 3m2.105s, 3m1.974s
  • user: 60m26.098s, 60m24.210s, 60m21.574s
  • sys: 0m16.194s, 0m15.654s, 0m15.309s

@FrickTobias (Owner)
What was the reason for excluding N-base-containing barcodes? Are N-base-containing barcodes not compatible with ema?

@pontushojer (Collaborator, Author)

What was the reason for excluding N-base-containing barcodes? Are N-base-containing barcodes not compatible with ema?

I get an error originating from line 54 in this ema script. It seems the barcode is encoded using only A, T, C and G, so N is not allowed.
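For intuition, the failure mode can be sketched as a toy encoder in which each base must map to one of four values, so any non-ACGT character has no valid encoding (encode_bc is hypothetical and not ema's actual implementation):

```shell
# Toy sketch: map each base to a value 0-3 and pack the barcode into a
# single integer; a barcode containing N (or any non-ACGT character) has
# no valid encoding. Hypothetical, not ema's actual code.
encode_bc() {
  printf '%s\n' "$1" | awk '{
    map["A"] = 0; map["C"] = 1; map["G"] = 2; map["T"] = 3
    v = 0
    for (i = 1; i <= length($0); i++) {
      b = substr($0, i, 1)
      if (!(b in map)) { print "ERROR: invalid base " b; exit 1 }
      v = v * 4 + map[b]
    }
    print v
  }'
}

encode_bc ACGT   # -> 27
encode_bc ACNT   # -> ERROR: invalid base N
```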

@FrickTobias (Owner) left a comment

Really nice. I think this is a good start and that this setting will account for most cases.

But if we accept this, we should create the following issues to investigate the corner cases (which I assume these mostly are).

  • Add non-barcoded reads to mapping results
  • Include reads with N-containing barcode to mapping results

We should probably also add an issue for big files. I've had some problems with this in the metagenomics datasets, where input files have been required to be sorted. There I ended up using external sorting by building a SQL database and subsequently extracting the reads in sorted order. It scales very well with RAM, but it took quite some time, so I'd try to avoid it.

  • Investigate read sorting on big files & possibly add a solution for problematic sizes

src/blr/blr.yaml Outdated
@@ -3,7 +3,7 @@ molecule_tag: MI # Used to store molecule ID, same as 10x default.
 num_mol_tag: MN # Used to store number of molecules per barcode
 sequence_tag: RX # Used to store original barcode sequence in bam file. 'RX' is 10x genomic default
 genome_reference: # Path to indexed reference
-read_mapper: bowtie2 # Choose bwa, bowtie2 or minimap2
+read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 and ema
@FrickTobias (Owner)
Suggested change
read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 and ema
read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 or ema

Comment on lines 43 to 49
"pigz -cd {input.fastq} |"
" paste - - - - |"
" awk -F ' ' '{{print $2,$0}}' |"
" sort -t ' ' -k1,1 |"
" cut -d' ' -f 2- |"
" tr '\t' '\n' |"
" gzip > {output.fastq}"
@FrickTobias (Owner)
I think the awk can be removed, and it makes sense to use pigz for zipping.

"pigz -cd {input.fastq} |"
" paste - - - - |"
" sort -t "_" -k 3 |"
" tr "\t" "\n" |"
" pigz > {output.fastq}"

Results from a 100x blr-testdata-0.2 sorting.

time pigz -cd 100x-testfile.fq.gz | paste - - - - | sort -t "_" -k 3 | tr "\t" "\n" | pigz > sort.fq.gz

real	1m39.001s
user	1m46.161s
sys	0m2.255s
time pigz -cd 100x-testfile.fq.gz | paste - - - - | awk -F ' ' '{print $2,$0}' | sort -t ' ' -k1,1 | cut -d' ' -f 2- | tr '\t' '\n' | gzip > sort.fq.gz

real	1m45.886s
user	2m16.831s
sys	0m2.180s
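To make the paste/sort/tr trick concrete, here is a toy round trip on two records (the read-name layout, with the barcode as the third '_'-separated field, is an assumption for illustration):

```shell
# Toy FASTQ with two records whose names end in _<barcode>: paste flattens
# each 4-line record onto one tab-separated line, sort orders the records
# by the barcode field, and tr restores the 4-line layout.
printf '%s\n' \
  '@r1_x_TTTT' 'ACGT' '+' 'IIII' \
  '@r2_x_AAAA' 'TGCA' '+' 'IIII' \
  | paste - - - - \
  | sort -t '_' -k 3 \
  | tr '\t' '\n'
# The @r2_x_AAAA record is printed first, since AAAA sorts before TTTT.
```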

A collaborator commented:

When using pigz for decompression, please use it in single-thread mode (add option -p 1). According to the man page, decompression cannot be parallelized, so it just spawns some extra I/O threads which help to reduce wall-clock time. However, this comes at the cost of higher total CPU time. What actually is faster than gzip for some weird reason is pigz -dc -p 1, which seems to use a more efficient (albeit still single-threaded) decompression algorithm.

@pontushojer (Collaborator, Author)

Thanks for the comments, I will add changes! 👍

@pontushojer (Collaborator, Author)

But if we accept this, we should create the following issues to investigate the corner cases (which I assume these mostly are).

  • Add non-barcoded reads to mapping results

This would be quite easy, as we could just output an additional file for non-barcoded reads in tagfastq, then implement conditions for how to handle them. Possibly they could be merged into the other BAM file after being mapped separately. However, the number of reads missing barcodes is quite small; in my test file there were just 72 read pairs (0.001% of the total). Possibly it's not worth bothering with.

  • Include reads with N-containing barcode to mapping results

Do you mean to map these separately in ema, or whether to include them at all? Currently they are also quite few; in fact there were 72 in my dataset (the same reads that appeared as non-barcoded following tagfastq).

We should probably also add an issue for big files. I've had some problems with this in the metagenomics datasets, where input files have been required to be sorted. There I ended up using external sorting by building a SQL database and subsequently extracting the reads in sorted order. It scales very well with RAM, but it took quite some time, so I'd try to avoid it.

Investigate read sorting on big files & possibly add a solution for problematic sizes

This is probably reasonable to investigate. Just to note that the unix sort command should be able to handle files of any size: it has built-in handling for running out of memory, at which point it starts writing temporary files. It might, however, become really slow at that point.
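For reference, GNU sort's spill-to-disk behaviour can be steered explicitly with its buffer-size and temp-directory options (a minimal illustration; with input this small everything stays in memory, so this only shows the flags, not an actual spill):

```shell
# GNU sort writes temp files once its memory buffer fills; -S caps the
# buffer size and -T chooses where the temp files go. Tiny demo input,
# so this merely illustrates the flags.
seq 1000 | tac > unsorted.txt
sort -n -S 1M -T . -o sorted.txt unsorted.txt
head -n 1 sorted.txt   # -> 1
tail -n 1 sorted.txt   # -> 1000
```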

I also want to clarify that "sorted" in this case only requires the barcodes to be grouped together, not that they appear in alphabetical order. Possibly one could take advantage of this when writing a more efficient program. One idea I had is to make use of the .clstr file from starcode, which contains the number of reads in each cluster. If one kept track of each cluster as it fills up, one could output reads in groups as clusters become full. I expect, however, that this would be quite memory intensive.

@FrickTobias (Owner)

I agree it's not much, but since it's easily handled I'd at least map these reads with bwa and append them to the file.
