-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhanced pipeline logic to support user-defined sample_name
input
#24
Conversation
|
…column based on 'sample' (meta.irida_id) for proper nf-iridanext plugin metadata functionality.
…s not provided in the samplesheet
…ted in the input samplesheet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much @kylacochrane . This is amazing work. Looks great in IRIDA Next 😄
I have a few comments for you below. I will go through some more review of this next week.
// Ensure ID is unique by appending meta.irida_id if needed | ||
while (processedIDs.contains(meta.id)) { | ||
meta.id = "${meta.id}_${meta.irida_id}" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a useful solution to ensure unique keys, but I wonder if all of this is better serviced by unique keys with strict patterns (the same criteria as the previous IDs) handled by the schema_input.json
code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had an idea of about a different way of creating unique keys, so as to not rely on the meta.irida_id
to append to meta.id
.
The idea is, rather than appending meta.irida_id
we append the index of the sample in the samplesheet.csv, so that every sample from {1..n}
has _1
to _n
appended.
My logic is that it might be less confusing to have _n
for every sample rather than samples with that long IRIDA-Next ID appended to them. I'd suggest doing it to every sample, so users are not confused as to why it shows up. The code I was thinking would look something like this (my use of closures is amateur at best, so bear with me here).
Channel.fromSamplesheet("input").toList().flatMap { it.withIndex().collect{ entry, idx ->
entry[0].id = "${entry[0].id}_${idx}"
return [entry[0], entry[1], entry[2]]
}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@emarinier thanks for the comment. The issue with unique keys/strict patterns is that the pipeline will then error if something is passed that is incorrect. Since IRIDA Next is pretty loose on sample names (e.g., allows spaces in names, etc) this could lead to errors in running those samples unless they were renamed
@sgsutcliffe thanks for your suggestion. I know we have already spoken about this for other pipelines/the best way forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for taking the time to show me this running on IRIDA Next
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! One last thing to change before merging is to update the documentation (README.md, CHANGELOG.md, docs/usage.md) with the new samplesheet format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much @kylacochrane . This is amazing work 😄 . Sorry for taking so long to review it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for adding this documentation Kyla. This looks really great 😄
I just have one comment in-line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks so much 😄
|
||
### `Changed` | ||
|
||
- Added the ability to include a `sample_name` column in the input samplesheet.csv. Allows for compatibility with IRIDA-Next input configuration [PR24](https://github.com/phac-nml/speciesabundance/pull/24) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to change descriptions (left it out originally but it was requested to be added). Basically create sub-bullet points with this information:
sample_name
special characters will be replaced with"_"
- If no
sample_name
is supplied in the columnsample
will be used - To avoid repeat values for
sample_name
allsample_name
values will be suffixed with the uniquesample
value from the input file
README.md
Outdated
|
||
`sample_name`, allows more flexibility in naming output files or sample identification. Unlike `sample`, `sample_name` is not required to contain unique values. `Nextflow` requires unique sample names, and therefore in the instance of repeat `sample_names`, `sample` will be suffixed to any `sample_name`. Non-alphanumeric characters (excluding `_`,`-`,`.`) will be replaced with `"_"`. | ||
|
||
An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest putting this example samplesheet under the Input header than put tests/data/samplename_samplesheet.csv
for the IRIDA-Next Optional Input Configuration, as this is the example samplesheet that shows the sample_name
column the way you did in your docs/usage
.
docs/usage.md
Outdated
|
||
An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. | ||
An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline, which includes the `sample_name` column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again with the test samplesheet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor comments (suggestions). Looks good.
Thanks for the suggestions Steven! I made the changes: 1af1a8f |
For IRIDA Next Use
This PR enhances the pipeline configuration and logic, allowing users to specify a
sample_name
for their project samples in IRIDA Next. All output files generated by the pipeline will be named using the user-providedsample_name
instead of the IRIDA Next Identifier.In IRIDA Next, each sample is automatically assigned an identifier by the platform. The changes in this PR ensure that the iridanext.output.json file is generated using this IRIDA Next Identifier (set to
meta.irida_id
in the pipeline), guaranteeing that output files and metadata are accurately linked to the correct sample on the IRIDA Next platform.For local use, the pipeline includes logic to set
sample_name
equal tosample
(i.e., meta.id = meta.irida_id), so local users don't need to provide both sample and sample_name in the input samplesheet.Additional logic is included to rename duplicates and deal with spaces in the
sample_name
parameter.This pipeline has been tested on a local instance of IRIDA Next, which has also been updated to support the additional
sample_name
parameter as detailed here: phac-nml/irida-next#678PR checklist
nf-core lint
).nextflow run . -profile test,docker --outdir <OUTDIR>
).