Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhanced pipeline logic to support user-defined sample_name input #24

Merged
merged 15 commits into from
Sep 25, 2024

Conversation

kylacochrane
Copy link
Collaborator

@kylacochrane kylacochrane commented Aug 7, 2024

For IRIDA Next Use
This PR enhances the pipeline configuration and logic, allowing users to specify a sample_name for their project samples in IRIDA Next. All output files generated by the pipeline will be named using the user-provided sample_name instead of the IRIDA Next Identifier.

In IRIDA Next, each sample is automatically assigned an identifier by the platform. The changes in this PR ensure that the iridanext.output.json file is generated using this IRIDA Next Identifier (set to meta.irida_id in the pipeline), guaranteeing that output files and metadata are accurately linked to the correct sample on the IRIDA Next platform.

For local use, the pipeline includes logic to set sample_name equal to sample (i.e., meta.id = meta.irida_id), so local users don't need to provide both sample and sample_name in the input samplesheet.

Additional logic is included to rename duplicates and deal with spaces in the sample_name parameter.

This pipeline has been tested on a local instance of IRIDA Next, which has also been updated to support the additional sample_name parameter as detailed here: phac-nml/irida-next#678

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).

Copy link

github-actions bot commented Aug 7, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit f6e6b0f

+| ✅ 134 tests passed       |+
#| ❔  26 tests were ignored |#
!| ❗   3 tests had warnings |!

❗ Test warnings:

  • nextflow_config - Config manifest.version should end in dev: 2.1.1
  • schema_lint - Schema $id should be https://raw.githubusercontent.com/phac-nml/speciesabundance/master/nextflow_schema.json
    Found https://raw.githubusercontent.com/phac-nml/speciesabundance/main/nextflow_schema.json
  • nfcore_yml - nf-core version not set in .nf-core.yml

❔ Tests ignored:

  • files_exist - File is ignored: assets/nf-core-speciesabundance_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-speciesabundance_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-speciesabundance_logo_dark.png
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: lib/Utils.groovy
  • files_exist - File is ignored: lib/WorkflowMain.groovy
  • files_exist - File is ignored: lib/NfcoreTemplate.groovy
  • files_exist - File is ignored: lib/WorkflowSpeciesabundance.groovy
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-speciesabundance_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-speciesabundance_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-speciesabundance_logo_dark.png
  • files_unchanged - File ignored due to lint config: docs/README.md
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/speciesabundance/speciesabundance/.github/workflows/awstest.yml
  • actions_awsfulltest - actions_awsfulltest
  • pipeline_name_conventions - pipeline_name_conventions

✅ Tests passed:

Run details

  • nf-core/tools version 2.14.1
  • Run at 2024-09-25 21:25:49

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @kylacochrane . This is amazing work. Looks great in IRIDA Next 😄

image

I have a few comments for you below. I will go through some more review of this next week.

workflows/speciesabundance.nf Outdated Show resolved Hide resolved
workflows/speciesabundance.nf Outdated Show resolved Hide resolved
tests/main.nf.test Show resolved Hide resolved
assets/schema_input.json Outdated Show resolved Hide resolved
tests/main.nf.test Show resolved Hide resolved
workflows/speciesabundance.nf Outdated Show resolved Hide resolved
Comment on lines +85 to +88
// Ensure ID is unique by appending meta.irida_id if needed
while (processedIDs.contains(meta.id)) {
meta.id = "${meta.id}_${meta.irida_id}"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a useful solution to ensure unique keys, but I wonder if all of this is better serviced by unique keys with strict patterns (the same criteria as the previous IDs) handled by the schema_input.json code?

Copy link

@sgsutcliffe sgsutcliffe Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had an idea of about a different way of creating unique keys, so as to not rely on the meta.irida_id to append to meta.id.

The idea is, rather than appending meta.irida_id we append the index of the sample in the samplesheet.csv, so that every sample from {1..n} has _1 to _n appended.

My logic is that it might be less confusing to have _n for every sample rather than samples with that long IRIDA-Next ID appended to them. I'd suggest doing it to every sample, so users are not confused as to why it shows up. The code I was thinking would look something like this (my use of closures is amateur at best, so bear with me here).

Channel.fromSamplesheet("input").toList().flatMap { it.withIndex().collect{ entry, idx ->
        entry[0].id = "${entry[0].id}_${idx}"
        return [entry[0], entry[1], entry[2]]
        }}

Copy link
Member

@apetkau apetkau Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@emarinier thanks for the comment. The issue with unique keys/strict patterns is that the pipeline will then error if something is passed that is incorrect. Since IRIDA Next is pretty loose on sample names (e.g., allows spaces in names, etc) this could lead to errors in running those samples unless they were renamed

@sgsutcliffe thanks for your suggestion. I know we have already spoken about this for other pipelines/the best way forward.

@kylacochrane kylacochrane marked this pull request as ready for review August 12, 2024 18:12
Copy link

@sgsutcliffe sgsutcliffe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for taking the time to show me this running on IRIDA Next

Copy link

@sgsutcliffe sgsutcliffe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! One last thing to change before merging is to update the documentation (README.md, CHANGELOG.md, docs/usage.md) with the new samplesheet format.

Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much @kylacochrane . This is amazing work 😄 . Sorry for taking so long to review it.

modules/local/topN/main.nf Show resolved Hide resolved
Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for adding this documentation Kyla. This looks really great 😄

I just have one comment in-line.

README.md Show resolved Hide resolved
Copy link
Member

@apetkau apetkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks so much 😄


### `Changed`

- Added the ability to include a `sample_name` column in the input samplesheet.csv. Allows for compatibility with IRIDA-Next input configuration [PR24](https://github.com/phac-nml/speciesabundance/pull/24)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to change descriptions (left it out originally but it was requested to be added). Basically create sub-bullet points with this information:

  • sample_name special characters will be replaced with "_"
  • If no sample_name is supplied in the column sample will be used
  • To avoid repeat values for sample_name all sample_name values will be suffixed with the unique sample value from the input file

README.md Outdated

`sample_name`, allows more flexibility in naming output files or sample identification. Unlike `sample`, `sample_name` is not required to contain unique values. `Nextflow` requires unique sample names, and therefore in the instance of repeat `sample_names`, `sample` will be suffixed to any `sample_name`. Non-alphanumeric characters (excluding `_`,`-`,`.`) will be replaced with `"_"`.

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
Copy link

@sgsutcliffe sgsutcliffe Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest putting this example samplesheet under the Input header than put tests/data/samplename_samplesheet.csv for the IRIDA-Next Optional Input Configuration, as this is the example samplesheet that shows the sample_name column the way you did in your docs/usage.

docs/usage.md Outdated

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.
An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline, which includes the `sample_name` column.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again with the test samplesheet

Copy link

@sgsutcliffe sgsutcliffe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor comments (suggestions). Looks good.

@kylacochrane
Copy link
Collaborator Author

Just a few minor comments (suggestions). Looks good.

Thanks for the suggestions Steven! I made the changes: 1af1a8f

@kylacochrane kylacochrane reopened this Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants