
Add ability to remove internal barcodes #632

Closed
jfy133 opened this issue Dec 11, 2020 · 14 comments · Fixed by #765
Assignees
Labels
enhancement (New feature or request) · pending (Addressed on branch, waiting for related PR) · question (Further information is requested)
Milestone

Comments

@jfy133
Member

jfy133 commented Dec 11, 2020

Is your feature request related to a problem? Please describe

During a workshop today @eg715 asked if we had the ability to also remove internal barcodes, in addition to indices, which we currently don't. We absolutely should, because I know a lot of groups use this now and it is a perfect fit for eager.

However, I have no experience with this, so I need someone with knowledge to help develop it.

Describe the solution you'd like

A secondary FASTQ trimming step to remove internal barcodes after adapter removal.

Describe alternatives you've considered

NA

Additional context

cutadapt may be able to do this for us, and it is used in other nf-core pipelines (as pointed out by @DiegoBrambilla).

@jfy133 jfy133 added the enhancement (New feature or request) and question (Further information is requested) labels Dec 11, 2020
@DiegoBrambilla

Hi,
Thanks for filling me in.
Yes, there is a cutadapt process in nf-core ampliseq:
https://github.com/nf-core/ampliseq/blob/master/main.nf#L444-L471

Note that the above code chunk has been developed for read files produced in the same sequencing run (ampliseq optionally can process read files belonging to multiple sequencing runs).

It is also noteworthy that there is hardly any set of adapters that works for every dataset.
Adapters depend on the sequencing technology used, and even within the same sequencing platform (e.g. Illumina NovaSeq) a broad range of adapters is used.

You might consider using Trim Galore, which automatically detects adapter types by default.
Even so, I recently had to run cutadapt after Trim Galore, since some Nextera transposase adapters were not detected.

I guess one way is to have users optionally provide a list of adapters through a pipeline flag (example in the above code chunk: --FW_primer and --RV_primer, passed in the process script section as -g ${params.FW_primer} -G ${params.RV_primer}).
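For illustration, here is a minimal Python sketch of the exact-match case of what a 5'-adapter option like cutadapt's -g does with a user-supplied primer. The function name and the example sequences are made up for this sketch; real cutadapt additionally tolerates partial occurrences and mismatches, which this does not:

```python
def trim_5prime_primer(read: str, primer: str) -> str:
    """Remove a known primer/barcode from the 5' end of a read, if present.

    Toy, exact-match version of what cutadapt's -g option does; cutadapt
    itself also handles partial matches and mismatches.
    """
    if read.startswith(primer):
        return read[len(primer):]
    return read  # primer not found at the 5' end: leave the read untouched

# e.g. with a hypothetical forward primer ACGT:
print(trim_5prime_primer("ACGTGGGGTTTT", "ACGT"))  # -> GGGGTTTT
print(trim_5prime_primer("GGGGTTTT", "ACGT"))      # -> GGGGTTTT (unchanged)
```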

@jfy133
Member Author

jfy133 commented Dec 11, 2020

Ah, OK. Trim Galore indeed seems to be what we are looking for. What @eg715 was referring to by internal barcodes are sample-specific mini-indices; as these would be custom per sample, the auto-detection is what we need.

@DiegoBrambilla

Yeah, it is definitely easier to insert TrimGalore in the code as long as you are happy with auto-detection.

@apeltzer
Member

Side note on that: the effort that went into evaluating AdapterRemoval v2 for aDNA analysis should be taken into account ;-) I am not fully sure we can achieve the same with cutadapt, as most of the community has pretty much settled on AdapterRemoval v2 :-)

=> Might also be something to ask in the https://github.com/mikkelschubert/adapterremoval repository, Mikkel tends to just implement things if there is a nice use case :-)

@jfy133
Member Author

jfy133 commented Dec 11, 2020

Note this would be secondary to AdapterRemoval, and an alternative would only be used for removing the inline barcodes, but valid point. If it could be integrated into AR (maybe it already is?) that would be awesome.

@apeltzer
Member

I just asked whether this is something that is already possible.

@apeltzer
Member

Can folks please reply on the MikkelSchubert/adapterremoval#50 (comment) issue to explain a bit more for Mikkel - that would be extremely helpful 👍🏼 Thank you!

@jfy133
Member Author

jfy133 commented May 5, 2021

Actually, I had an alternative thought, as AR2 isn't working as we thought.

After talking with some colleagues: barcode removal should only occur once you know you've already got the right data. Separation of samples based on their barcode should ALREADY have happened before you start mapping. Therefore anything in your FASTQ files should already contain only the right barcodes for your sample, and you can simply do a hard trim at both ends of each read. We shouldn't be doing demultiplexing for them (in a sense), and we don't need fancier detection systems.

I believe fastp already includes this sort of 'global' trimming: https://github.com/OpenGene/fastp#global-trimming

Maybe we can do that instead?

Do you think that would work @TCLamnidis ?
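As a sketch of what such 'global' hard trimming amounts to, assuming uniform, known barcode lengths (the function and the example values below are illustrative, not eager's actual implementation; on real FASTQ files fastp's --trim_front1/--trim_tail1 options do this):

```python
def hard_trim(seq: str, qual: str, front: int, tail: int) -> tuple:
    """Hard-clip `front` bases from the 5' end and `tail` bases from the
    3' end of a read, keeping the quality string in sync with the bases.
    """
    end = len(seq) - tail if tail else len(seq)
    return seq[front:end], qual[front:end]

# e.g. a hypothetical 7 bp inline barcode on each end of a 23 bp read:
seq, qual = hard_trim("AACCTGGACGTACGTAGGTCCAA", "I" * 23, 7, 7)
print(seq)  # -> ACGTACGTA (the 9 bp insert in the middle)
```

Note that this blindly clips a fixed number of bases; it never checks that the clipped bases actually match the expected barcode, which is exactly the trade-off discussed below.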

@TCLamnidis
Collaborator

That would work, but it feels like it misses the point.

The reason to include internal barcodes (as I understand it) is to guard against index hopping of reads, and to ensure the reads you do have are genuine. By blindly trimming the internal barcodes just because the external ones exist, we are effectively making the use of internal barcodes moot when processing data with nf-core/eager.

Doing it this way would be a quick "hack" to get that type of data added, but it feels like we would also effectively be discouraging the adoption of nf-core/eager by labs that routinely use internal barcodes?

@jfy133
Member Author

jfy133 commented May 6, 2021

True but this goes back to the following:

Separation of samples based on their barcode should ALREADY have happened before you start mapping.

If you're checking for barcode hopping, this is something that should be done at the demultiplexing level, which eager does not (and will not) do.

If we follow this concept, then we can consider the following 'workflow':

  1. The user gets demultiplexed FASTQ files.
  2. They run their own adapter-removal step (so the barcodes are then exposed at the ends of the reads).
  3. They check that the barcodes in the FASTQ files match the library ID each read should be assigned to, and remove any reads with an incorrect barcode (as this is, again, equivalent to demultiplexing processing).
  • Note: this would require manual work anyway, e.g. checking the level of index hopping etc.
  4. Once they've removed the 'incorrect' reads, they are ready for the eager pipeline. They will skip AdapterRemoval, but one could argue the barcode-checking step may not yet have removed the barcodes themselves, so they can then use fastp to trim them off both ends of the reads.

Another example: @aidaanva knows of ENA/SRA data that still has barcodes on it. So the reads derived from hopping have been removed, and the adapters too, but the inline barcodes are still there (for whatever reason). So it is the same sort of thing.

Or do you disagree, and we should do the sorting for users too? My fear is that if we go down that route, then we have to come up with a way of estimating barcode hopping for the user... unless you think this is worth the effort? But then should we go even further back and include demultiplexing?

@TCLamnidis
Collaborator

I don't believe demultiplexing should be part of eager. Usually it is done more centrally, and hence it is not really something the pipeline should include, imho.
I see the benefit of hard-clipping barcodes, but it then needs to be made quite explicit that we essentially ignore their existence and that they provide no information to eager at all. On that basis I am happy for them to be added to the pipeline.

How do you imagine this being specified? Do we want users to provide the length of the barcodes on a per-sample basis (i.e. a TSV column)? That would be more flexible, but a uniform barcode length for an eager run would be much easier to implement.

@jfy133
Member Author

jfy133 commented May 6, 2021

I would also go for the uniform option. Some people might complain 'but I want to mix public and my own data in one run', but if there is public data with dodgy barcodes, it should be scrutinised more closely anyway and can go in a separate run.

We could maybe reconsider in DSL2 and have a column specifying 'includes barcodes' or not, but then it might vary depending on whether single or double barcodes are used...

Anyway, are you happy to continue, but experimenting with fastp instead? Or would you rather I take over (you could do mpileup instead, as maybe you have more experience with that, whereas I have none)?

@TCLamnidis
Collaborator

I can take over mpileup. <3

@jfy133 jfy133 assigned jfy133 and unassigned TCLamnidis May 21, 2021
@jfy133
Member Author

jfy133 commented May 21, 2021

Added in https://github.com/nf-core/eager/tree/inline-barcode-trimming

but currently running it while skipping AdapterRemoval fails:

$ nextflow run ../../main.nf -profile singularity,test_tsv --run_post_ar_trimming --skip_adapterremoval

Error executing process > 'lanemerge (JK2782)'

Caused by:
  Process `lanemerge (JK2782)` terminated with an error exit status (255)

Command executed:

  cat JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz JK2782_R1_postartrimmed.fq.gz > "JK2782"_R1_lanemerged.fq.gz
  cat JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz > "JK2782"_R2_lanemerged.fq.gz

Command exit status:
  255

Command output:
  (empty)

Command error:
  WARNING: skipping mount of /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq: no such file or directory
  FATAL:   container creation failed: mount /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq->/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq error: while mounting /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq: mount source /nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq doesn't exist

@jfy133 jfy133 added this to the 2.4 "Wangen" milestone Jun 4, 2021
@jfy133 jfy133 added the pending (Addressed on branch, waiting for related PR) label Jul 18, 2021
@jfy133 jfy133 linked a pull request Jul 26, 2021 that will close this issue
@jfy133 jfy133 closed this as completed Aug 23, 2021