
Add ability to remove internal barcodes #632

Closed
jfy133 opened this issue Dec 11, 2020 · 14 comments · Fixed by #765
Assignees
Labels
enhancement (New feature or request) · pending (Addressed on branch, waiting for related PR) · question (Further information is requested)
Milestone

Comments

@jfy133
Member

jfy133 commented Dec 11, 2020

Is your feature request related to a problem? Please describe

During a workshop today @eg715 asked if we had the ability to also remove internal barcodes, in addition to indices, which we currently don't. We absolutely should, because I know a lot of groups use this now and it is a perfect fit for eager.

However, I have no experience with this, so I need someone with knowledge to help develop it.

Describe the solution you'd like

A secondary FASTQ trimming step to remove internal barcodes after adapter removal.

Describe alternatives you've considered

NA

Additional context

cutadapt may be able to do this for us, and it is used in other nf-core pipelines (as pointed out by @DiegoBrambilla).

@jfy133 jfy133 added the enhancement (New feature or request) and question (Further information is requested) labels Dec 11, 2020
@DiegoBrambilla

Hi,
Thanks for filling me in.
Yes, there is a cutadapt process in nf-core ampliseq:
https://github.com/nf-core/ampliseq/blob/master/main.nf#L444-L471

Note that the above code chunk has been developed for read files produced in the same sequencing run (ampliseq optionally can process read files belonging to multiple sequencing runs).

It is also noteworthy that there is hardly any set of adapters that works for every dataset.
Adapters depend on the sequencing technology used, and even within the same sequencing platform (e.g. Illumina NovaSeq) a broad range of adapters is used.

You might consider using Trim Galore, which automatically detects adapter types by default.
Even so, I recently had to run cutadapt after Trim Galore, since some Nextera transposase adapters were not detected.

I guess one way is to have users optionally provide a list of adapters through a pipeline flag (example in the above code chunk: --FW_primer and --RV_primer, passed in the process script section as -g ${params.FW_primer} -G ${params.RV_primer}).
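For illustration, here is a minimal Python sketch of the exact-match case of what a 5'-adapter option like cutadapt's -g does with a user-supplied primer. The function name and the example sequences are made up for this sketch; real cutadapt additionally tolerates partial occurrences and mismatches, which this does not:

```python
def trim_5prime_primer(read: str, primer: str) -> str:
    """Remove a known primer/barcode from the 5' end of a read, if present.

    Toy, exact-match version of what cutadapt's -g option does; cutadapt
    itself also handles partial matches and mismatches.
    """
    if read.startswith(primer):
        return read[len(primer):]
    return read  # primer not found at the 5' end: leave the read untouched

# e.g. with a hypothetical forward primer ACGT:
print(trim_5prime_primer("ACGTGGGGTTTT", "ACGT"))  # -> GGGGTTTT
print(trim_5prime_primer("GGGGTTTT", "ACGT"))      # -> GGGGTTTT (unchanged)
```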

@jfy133
Member Author

jfy133 commented Dec 11, 2020

Ah, OK. Trim Galore indeed seems to be what we are looking for. What @eg715 was referring to by internal barcodes are sample-specific mini-indices; as these would be custom per sample, the auto-detection is what we need.

@DiegoBrambilla

Yeah, it is definitely easier to insert TrimGalore in the code as long as you are happy with auto-detection.

@apeltzer
Member

Side note on that: the effort that went into evaluating AdapterRemoval v2 for aDNA analysis should be taken into account ;-) I am not fully sure we can achieve the same with cutadapt, as most of the community has pretty much settled on AdapterRemoval v2 :-)

=> Might also be something to ask in the https://github.com/mikkelschubert/adapterremoval repository, Mikkel tends to just implement things if there is a nice use case :-)

@jfy133
Member Author

jfy133 commented Dec 11, 2020

Note this would be secondary to AdapterRemoval, and an alternative would only be used for removing the inline barcodes, but valid point. If it could be integrated into AR (maybe it already is?) that would be awesome.

@apeltzer
Member

I just asked whether this is something that is already possible.

@apeltzer
Member

Can folks please reply on the MikkelSchubert/adapterremoval#50 (comment) issue to explain a bit more for Mikkel - that would be extremely helpful 👍🏼 Thank you!

@jfy133
Member Author

jfy133 commented May 5, 2021

Actually, I had an alternative thought, as AR2 isn't working as we thought.

After talking with some colleagues: barcode removal should only occur once you know you've already got the right data. Separation of samples based on their barcode should ALREADY have happened before you start mapping. Therefore anything in your FASTQ files should already contain only the right barcodes for your sample, and you can simply do a hard trim at both ends of each read. We shouldn't be doing demultiplexing for them (in a sense), and we don't need fancier detection systems.

I believe fastp already includes this sort of 'global' trimming: https://github.com/OpenGene/fastp#global-trimming

Maybe we can do that instead?

Do you think that would work @TCLamnidis ?
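As a sketch of what such 'global' hard trimming amounts to, assuming uniform, known barcode lengths (the function and the example values below are illustrative, not eager's actual implementation; on real FASTQ files fastp's --trim_front1/--trim_tail1 options do this):

```python
def hard_trim(seq: str, qual: str, front: int, tail: int) -> tuple:
    """Hard-clip `front` bases from the 5' end and `tail` bases from the
    3' end of a read, keeping the quality string in sync with the bases.
    """
    end = len(seq) - tail if tail else len(seq)
    return seq[front:end], qual[front:end]

# e.g. a hypothetical 7 bp inline barcode on each end of a 23 bp read:
seq, qual = hard_trim("AACCTGGACGTACGTAGGTCCAA", "I" * 23, 7, 7)
print(seq)  # -> ACGTACGTA (the 9 bp insert in the middle)
```

Note that this blindly clips a fixed number of bases; it never checks that the clipped bases actually match the expected barcode, which is exactly the trade-off discussed below.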

@TCLamnidis
Collaborator

That would work, but it feels like it misses the point.

The reason to include internal barcodes (as I understand it) is to guard against index hopping of reads, and to ensure the reads you do have are genuine. By blindly trimming the internal barcodes just because the external ones exist, we are effectively making the use of internal barcodes moot when processing data with nf-core/eager.

Doing it this way would be a quick "hack" to get that type of data added, but it feels like we would also effectively be discouraging the adoption of nf-core/eager by labs that routinely use internal barcodes?

@jfy133
Member Author

jfy133 commented May 6, 2021

True but this goes back to the following:

Separation of samples based on their barcode should ALREADY have happened before you start mapping.

If you're checking for barcode hopping, this is something that should be done at the demultiplexing level, which eager does not (and will not) do.

If we follow this concept, then we can consider the following 'workflow':

  1. The user gets demultiplexed FASTQ files.
  2. They run their own adapter-removal step (so the barcodes are then exposed at the ends of the reads).
  3. They check that the barcodes in the FASTQ files match the library ID each read should be assigned to, and remove any reads with an incorrect barcode (as this is, again, equivalent to demultiplexing processing).
  • Note: this would require manual work anyway, e.g. checking the level of index hopping etc.
  4. Once they've removed the 'incorrect' reads, they are ready for the eager pipeline. They will skip AdapterRemoval, but one could argue the barcode-checking step may not yet have removed the barcodes themselves, so they can then use fastp to trim them off both ends of the reads.

Another example: @aidaanva knows of ENA/SRA data that still has barcodes on it. So the reads derived from hopping have been removed, and the adapters too, but the inline barcodes are still there (for whatever reason). So it is the same sort of thing.

Or do you disagree, and we should do the sorting for users too? My fear is that if we go down that route, then we have to come up with a way of estimating barcode hopping for the user... unless you think this is worth the effort? But then should we go even further back and include demultiplexing?

@TCLamnidis
Collaborator

I don't believe demultiplexing should be part of eager. Usually it is done more centrally, and hence it is not really something the pipeline should include, imho.
I see the benefit of hard-clipping barcodes, but it then needs to be made quite explicit that we essentially ignore their existence and that they provide no information to eager at all. On that basis I am happy for them to be added to the pipeline.

How do you imagine this being specified? Do we want users to provide the length of the barcodes on a per-sample basis (i.e. a TSV column)? That would be more flexible, but a uniform barcode length for an eager run would be much easier to implement.

@jfy133
Member Author

jfy133 commented May 6, 2021

I would also go for the uniform option. Some people might complain 'but I want to mix public and my own data in one run', but if there is public data with dodgy barcodes, it should be scrutinised more closely anyway and can go in a separate run.

We could maybe reconsider in DSL2 and have a column specifying 'includes barcodes' or not, but then it might vary depending on whether single or double barcodes are used...

Anyway, are you happy to continue, but experimenting with fastp instead? Or would you rather I take over (you could do mpileup instead, as maybe you have more experience with that, whereas I have none)?

@TCLamnidis
Collaborator

I can take over mpileup. <3

@jfy133 jfy133 assigned jfy133 and unassigned TCLamnidis May 21, 2021
@jfy133
Member Author

jfy133 commented May 21, 2021

Added in https://github.com/nf-core/eager/tree/inline-barcode-trimming

but currently running it while skipping AdapterRemoval fails:

$ nextflow run ../../main.nf -profile singularity,test_tsv --run_post_ar_trimming --skip_adapterremoval

Error executing process > 'lanemerge (JK2782)'

Caused by:
  Process `lanemerge (JK2782)` terminated with an error exit status (255)

Command executed:

  cat JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz JK2782_R1_postartrimmed.fq.gz > "JK2782"_R1_lanemerged.fq.gz
  cat JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz > "JK2782"_R2_lanemerged.fq.gz

Command exit status:
  255

Command output:
  (empty)

Command error:
  WARNING: skipping mount of /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq: no such file or directory
  FATAL:   container creation failed: mount /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq->/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq error: while mounting /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq: mount source /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq doesn't exist

@jfy133 jfy133 added this to the 2.4 "Wangen" milestone Jun 4, 2021
@jfy133 jfy133 added the pending (Addressed on branch, waiting for related PR) label Jul 18, 2021
@jfy133 jfy133 linked a pull request Jul 26, 2021 that will close this issue
@jfy133 jfy133 closed this as completed Aug 23, 2021