Add ability to remove internal barcodes #632
Comments
Hi, note that the above code chunk was developed for read files produced in the same sequencing run (ampliseq can optionally process read files belonging to multiple sequencing runs). It is also worth noting that there is hardly any set of adapters that works for every dataset. You might consider using TrimGalore, which automatically detects adapter types by default. I guess one way is to let users optionally provide a list of adapters through a pipeline flag (example in the above code chunk: --FW_primer --RV_primer; and in the process script section: -g ${params.FW_primer} -G ${params.RV_primer}). |
Ah ok. TrimGalore indeed seems to be what we are looking for. What @eg715 was referring to by internal barcodes are sample-specific mini-indices, and as these would be custom per sample, auto-detection is what we need. |
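For illustration only, the flag-based approach described in the comment above could be sketched as a small helper that assembles a paired-end cutadapt call. The function name, file names, and primer sequences are placeholders, not anything taken from ampliseq or eager; `-g`/`-G` are cutadapt's 5' adapter options for read 1 and read 2, and `-o`/`-p` name the two output files. The command is only built here, not executed.

```python
from typing import List


def cutadapt_command(fw_primer: str, rv_primer: str,
                     r1: str, r2: str,
                     out_r1: str, out_r2: str) -> List[str]:
    """Build (but do not run) a paired-end cutadapt argument list."""
    return [
        "cutadapt",
        "-g", fw_primer,   # 5' adapter/primer expected on read 1
        "-G", rv_primer,   # 5' adapter/primer expected on read 2
        "-o", out_r1,      # trimmed read 1 output
        "-p", out_r2,      # trimmed read 2 output
        r1, r2,
    ]


# Placeholder sequences and paths, purely for demonstration.
cmd = cutadapt_command("ACGTACGT", "TGCATGCA",
                       "in_R1.fastq.gz", "in_R2.fastq.gz",
                       "out_R1.fastq.gz", "out_R2.fastq.gz")
print(" ".join(cmd))
```

The list could be passed to `subprocess.run` on a system where cutadapt is installed.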
Yeah, it is definitely easier to insert TrimGalore in the code as long as you are happy with auto-detection. |
Side note on that: The effort that went into evaluating AdapterRemoval v2 for aDNA analysis should be taken into account ;-) I am not fully sure we can achieve the same with cutadapt as most of the community almost agrees on AdapterRemoval v2 :-) => Might also be something to ask in the https://github.com/mikkelschubert/adapterremoval repository, Mikkel tends to just implement things if there is a nice use case :-) |
Note this would be secondary to AdapterRemoval, and an alternative would only be used for removing the inline barcodes, but valid point. If it could be integrated into AR (maybe it already is?) that would be awesome. |
I just asked if this is something already possible |
Can folks please reply on the MikkelSchubert/adapterremoval#50 (comment) issue to explain a bit more for Mikkel - that would be extremely helpful 👍🏼 Thank you! |
Actually, I had an alternative thought, as AR2 isn't working as we thought. After talking with some colleagues, barcode removal should only occur once you know you've already got the right data. Separation of samples based on their barcode should ALREADY have happened before you start mapping. Therefore anything in your FASTQ files should already contain only the right barcodes for your sample, and you can simply do a hard trim at both ends of the read. We shouldn't be doing demultiplexing for them (in a sense), and we don't need fancier detection systems. I believe Maybe we can do that instead? Do you think that would work @TCLamnidis ? |
That would work, but it feels like it misses the point. The reason to include internal barcodes (as I understand it) is to avoid hopping of reads between samples, and to ensure the reads you do have are genuine. By blindly trimming the internal barcodes just because the external ones exist, we are effectively making the use of internal barcodes moot when processing data with nf-core/eager. Doing it this way would be a quick "hack" to get that type of data added, but it feels like we are also effectively excluding labs that routinely use internal barcodes from adopting nf-core/eager? |
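As a rough illustration of the hard-trimming idea proposed above (not the eager implementation), the sketch below removes a fixed number of bases from both ends of each read in four-line FASTQ text. The function names and the `trim_5p`/`trim_3p` parameters are hypothetical, and it assumes demultiplexing has already happened upstream.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def hard_trim(record: FastqRecord, trim_5p: int, trim_3p: int) -> FastqRecord:
    """Remove a fixed number of bases from each end of one read."""
    header, seq, qual = record
    end = len(seq) - trim_3p
    # Reads no longer than the combined trim length become empty.
    if end <= trim_5p:
        return header, "", ""
    return header, seq[trim_5p:end], qual[trim_5p:end]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (no wrapped sequences)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it)
        next(it)  # '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


# Toy read with a 4 bp inline barcode assumed at each end.
reads = ["@read1", "ACGTTTGGCCAAACGT", "+", "IIIIFFFFHHHHFFII"]
trimmed = [hard_trim(rec, trim_5p=4, trim_3p=4) for rec in parse_fastq(reads)]
print(trimmed)
```

Because the trim is positional, it makes no attempt to check that the removed bases actually match the expected barcode, which is the concern raised in the reply that follows.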
True but this goes back to the following:
If you're checking for barcode hopping, this is something that should be done at the demultiplexing level, which eager does not (and will not) do. If we follow this concept, then we can consider the following 'workflow'
Another example is that @aidaanva knows of ENA/SRA data that still has barcodes on it. So they've removed the reads derived from hopping, and the adapters, but the reads still have the inline barcode (for whatever reason). So it is the same sort of thing. Or do you disagree and think we should do the sorting for users too? My fear is that if we go down that route then we have to come up with a way of estimating barcode hopping for the user... unless you think this is worth the effort? But then should we go even further back and include demultiplexing? |
I don't believe demultiplexing should be part of eager. Usually it is done more centrally and hence is not really something the pipeline should include, imho. How do you imagine this being specified? Do we want users to provide the length of the barcodes on a per-sample basis (i.e. a TSV column)? That would be more flexible, but a uniform barcode length for an eager run would be much easier to implement. |
I would also go for the uniform one. Some people might complain 'but I want to mix public and my own data in one run', but if there is public data with dodgy barcodes, it should be scrutinised more closely anyway and can go in a separate run. We could maybe reconsider in DSL2, and have a column specifying 'includes barcodes' or not, but then it might vary depending on whether it is single or double barcodes.... Anyway, are you happy to continue but with experimenting with |
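To make the two options in the comment above concrete, here is a minimal sketch that resolves a trim length for each sample either from a single run-wide value or from a per-sample TSV column. The column names `Sample_Name` and `Barcode_Length` and the function name are assumptions made for this example; they are not taken from the eager input specification.

```python
import csv
import io
from typing import Dict, Optional


def barcode_lengths(tsv_text: str, uniform: Optional[int] = None) -> Dict[str, int]:
    """Map each sample to the number of barcode bases to trim.

    If `uniform` is given it applies to every sample; otherwise the
    (hypothetical) Barcode_Length column is read per sample.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    lengths: Dict[str, int] = {}
    for row in reader:
        if uniform is not None:
            lengths[row["Sample_Name"]] = uniform
        else:
            lengths[row["Sample_Name"]] = int(row["Barcode_Length"])
    return lengths


# Toy input sheet with made-up sample names.
tsv = "Sample_Name\tBarcode_Length\nSampleA\t7\nSampleB\t5\n"
print(barcode_lengths(tsv))             # per-sample lengths from the TSV
print(barcode_lengths(tsv, uniform=6))  # one length for the whole run
```

The uniform variant needs only one extra parameter and no change to the input sheet, which is why it is the simpler of the two to implement.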
I can take over |
Added in https://github.com/nf-core/eager/tree/inline-barcode-trimming but currently running it with skipping_AR fails:
|
Is your feature request related to a problem? Please describe
During a workshop today @eg715 asked if we had the ability to also remove internal barcodes, in addition to indices. We currently don't, but absolutely should, because I know a lot of groups use these now and it is a perfect thing for eager.
However, I have no experience with this, so I need someone with knowledge to help develop it.
Describe the solution you'd like
Secondary FASTQ trimming step to remove internal barcodes after adapter removal
Describe alternatives you've considered
NA
Additional context
cutadapt may be able to do this for us, and is used in other nf-core pipelines (as pointed out by @DiegoBrambilla )