Ema (Emerald) mapper added. #168

Merged
merged 9 commits into from
Dec 4, 2019

Conversation

@pontushojer (Collaborator) commented Dec 3, 2019

See issue #114

A few additional things had to be added to support the new mapper.

  • Sorting of the barcoded FASTQ file based on barcode. I have done this using a bash one-liner, which is fairly fast.
  • New barcode tagging scheme for ema FASTQs. ema requires the read names to be in the "10x" format, e.g. @ST-E00269:339:H27G2CCX2:7:1102:21186:8060:AAAAAAAATATCTACGCTCA BX:Z:AAAAAAAATATCTACGCTCA. For this, tagfastq had to be updated to take in information about the current mapper and modify read names accordingly. Note that uncorrected barcodes CANNOT be included in this scheme.
  • Removal of non-barcoded reads from FASTQs. These cause errors with ema, so reads for which the barcode could not be identified are skipped in tagfastq when ema is used.
  • Removal of barcodes containing N bases. These also cause errors with ema. The rule extract_DBS now skips barcodes containing N.
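The renaming step can be sketched roughly like this (a minimal illustration only: it assumes the corrected barcode is simply appended to the read name and repeated as a BX:Z: comment, and to_10x_header is a hypothetical helper, not part of tagfastq):

```shell
# Hypothetical helper: rewrite a read name into the "10x"-style header that
# ema expects, with the corrected barcode appended to the name and repeated
# as a BX:Z: tag. For illustration only, not tagfastq's actual code.
to_10x_header() {
  name=$1; bc=$2
  printf '@%s:%s BX:Z:%s\n' "$name" "$bc" "$bc"
}

to_10x_header 'ST-E00269:339:H27G2CCX2:7:1102:21186:8060' 'AAAAAAAATATCTACGCTCA'
# -> @ST-E00269:339:H27G2CCX2:7:1102:21186:8060:AAAAAAAATATCTACGCTCA BX:Z:AAAAAAAATATCTACGCTCA
```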

Runtimes for sorting chr22 testdata (~6 million reads).

  • real: 1m30.065s, 1m28.889s, 1m29.104s
  • user: 1m50.679s, 1m49.464s, 1m49.437s
  • sys: 0m11.854s, 0m11.280s, 0m11.329s

Runtimes for mapping chr22 testdata (~6 million reads).

  • real: 3m2.813s, 3m2.105s, 3m1.974s
  • user: 60m26.098s, 60m24.210s, 60m21.574s
  • sys: 0m16.194s, 0m15.654s, 0m15.309s

@FrickTobias (Owner)
What was the reason for excluding N-base-containing barcodes? Are N-base-containing barcodes not compatible with ema?

@pontushojer (Collaborator, Author)

What was the reason for excluding N-base-containing barcodes? Are N-base-containing barcodes not compatible with ema?

I get an error originating from line 54 in this ema script. It seems the barcode is encoded using only A, T, C and G, so N is not allowed.
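For intuition, the failure mode can be sketched as a toy encoder in which each base must map to one of four values, so any non-ACGT character has no valid encoding (encode_bc is hypothetical and not ema's actual implementation):

```shell
# Toy sketch: map each base to a value 0-3 and pack the barcode into a
# single integer; a barcode containing N (or any non-ACGT character) has
# no valid encoding. Hypothetical, not ema's actual code.
encode_bc() {
  printf '%s\n' "$1" | awk '{
    map["A"] = 0; map["C"] = 1; map["G"] = 2; map["T"] = 3
    v = 0
    for (i = 1; i <= length($0); i++) {
      b = substr($0, i, 1)
      if (!(b in map)) { print "ERROR: invalid base " b; exit 1 }
      v = v * 4 + map[b]
    }
    print v
  }'
}

encode_bc ACGT   # -> 27
encode_bc ACNT   # -> ERROR: invalid base N
```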

@FrickTobias (Owner) left a comment

Really nice. I think this is a good start and that this setting will account for most cases.

But if we accept this, we should create the following issues to investigate the corner cases (which I assume these mostly are).

  • Add non-barcoded reads to mapping results
  • Include reads with N-containing barcode to mapping results

We should probably also add an issue for big files. I've had some problems with this in the metagenomics datasets, where input files have been required to be sorted. There I ended up using external sorting by building a SQL database and subsequently extracting the reads in sorted order. It scales very well with RAM, but it took quite some time, so I'd try to avoid it.

  • Investigate read sorting on big files & possibly add a solution for problematic sizes

src/blr/blr.yaml Outdated
@@ -3,7 +3,7 @@ molecule_tag: MI # Used to store molecule ID, same as 10x default.
 num_mol_tag: MN # Used to store number of molecules per barcode
 sequence_tag: RX # Used to store original barcode sequence in bam file. 'RX' is 10x genomic default
 genome_reference: # Path to indexed reference
-read_mapper: bowtie2 # Choose bwa, bowtie2 or minimap2
+read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 and ema
@FrickTobias (Owner)
Suggested change
read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 and ema
read_mapper: bowtie2 # Choose bwa, bowtie2, minimap2 or ema

Comment on lines 43 to 49
"pigz -cd {input.fastq} |"
" paste - - - - |"
" awk -F ' ' '{{print $2,$0}}' |"
" sort -t ' ' -k1,1 |"
" cut -d' ' -f 2- |"
" tr '\t' '\n' |"
" gzip > {output.fastq}"
@FrickTobias (Owner)
I think the awk can be removed, and it makes sense to use pigz for zipping.

"pigz -cd {input.fastq} |"
" paste - - - - |"
" sort -t "_" -k 3 |"
" tr "\t" "\n" |"
" pigz > {output.fastq}"

Results from a 100x blr-testdata-0.2 sorting.

time pigz -cd 100x-testfile.fq.gz | paste - - - - | sort -t "_" -k 3 | tr "\t" "\n" | pigz > sort.fq.gz

real	1m39.001s
user	1m46.161s
sys	0m2.255s
time pigz -cd 100x-testfile.fq.gz | paste - - - - | awk -F ' ' '{print $2,$0}' | sort -t ' ' -k1,1 | cut -d' ' -f 2- | tr '\t' '\n' | gzip > sort.fq.gz

real	1m45.886s
user	2m16.831s
sys	0m2.180s
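To make the paste/sort/tr trick concrete, here is a toy round trip on two records (the read-name layout, with the barcode as the third '_'-separated field, is an assumption for illustration):

```shell
# Toy FASTQ with two records whose names end in _<barcode>: paste flattens
# each 4-line record onto one tab-separated line, sort orders the records
# by the barcode field, and tr restores the 4-line layout.
printf '%s\n' \
  '@r1_x_TTTT' 'ACGT' '+' 'IIII' \
  '@r2_x_AAAA' 'TGCA' '+' 'IIII' \
  | paste - - - - \
  | sort -t '_' -k 3 \
  | tr '\t' '\n'
# The @r2_x_AAAA record is printed first, since AAAA sorts before TTTT.
```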

A collaborator commented:

When using pigz for decompression, please use it in single-thread mode (add option -p 1). According to the man page, decompression cannot be parallelized, so it just spawns some extra I/O threads which help to reduce wall-clock time. However, this comes at the cost of higher total CPU time. What actually is faster than gzip for some weird reason is pigz -dc -p 1, which seems to use a more efficient (albeit still single-threaded) decompression algorithm.

@pontushojer (Collaborator, Author)

Thanks for the comments, I will add changes! 👍

@pontushojer (Collaborator, Author)

But if we accept this, we should create the following issues to investigate the corner cases (which I assume these mostly are).

  • Add non-barcoded reads to mapping results

This would be quite easy, as we could just output an additional file for non-barcoded reads in tagfastq, then implement conditions for how to handle them. Possibly they could be merged into the other BAM file after being mapped separately. However, the number of reads missing barcodes is quite small; in my test file there were just 72 read pairs (0.001% of the total). Possibly it's not worth bothering with.

  • Include reads with N-containing barcode to mapping results

Do you mean to map these separately in ema, or whether to include them at all? Currently they are also quite few; in fact there were 72 in my dataset (the same reads that appeared as non-barcoded following tagfastq).

We should probably also add an issue for big files. I've had some problems with this in the metagenomics datasets, where input files have been required to be sorted. There I ended up using external sorting by building a SQL database and subsequently extracting the reads in sorted order. It scales very well with RAM, but it took quite some time, so I'd try to avoid it.

Investigate read sorting on big files & possibly add a solution for problematic sizes

This is probably reasonable to investigate. Just to note that the unix sort command should be able to handle files of any size: it has built-in handling for running out of memory, at which point it starts writing temporary files. It might, however, become really slow at that point.
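For reference, GNU sort's spill-to-disk behaviour can be steered explicitly with its buffer-size and temp-directory options (a minimal illustration; with input this small everything stays in memory, so this only shows the flags, not an actual spill):

```shell
# GNU sort writes temp files once its memory buffer fills; -S caps the
# buffer size and -T chooses where the temp files go. Tiny demo input,
# so this merely illustrates the flags.
seq 1000 | tac > unsorted.txt
sort -n -S 1M -T . -o sorted.txt unsorted.txt
head -n 1 sorted.txt   # -> 1
tail -n 1 sorted.txt   # -> 1000
```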

I also want to clarify that "sorted" in this case only requires the barcodes to be grouped together, not that they appear in alphabetical order. Possibly one could take advantage of this when writing a more efficient program. One idea I had is to make use of the .clstr file from starcode, which contains the number of reads in each cluster. If one kept track of each cluster as it fills up, one could output reads in groups as clusters become full. I expect, however, that this would be quite memory intensive.

@FrickTobias (Owner)

I agree it's not much, but since it's easily handled I'd at least map these reads with bwa and append them to the file.
