split mutect2 VCF to snps, mnps and indels #170
Conversation
module/common.nf (Outdated)
`process index_VCF_bcftools {`
We generally use `bgzip` and `tabix` for VCF indexing, which has been modularized by Mao: https://github.com/uclahs-cds/pipeline-Nextflow-module/tree/main/modules/common/index_VCF_tabix
I've converted it back to using the external `compress_index_vcf`. However, I'm wondering how important it is to use that module: `bcftools` works directly on the zipped files, so it seems inefficient to uncompress them and then call another process to compress and index again. Instead it could all be done within the one process, e.g. for `filter_VCF_bcftools`:
```
set -euo pipefail
bcftools view -f PASS --output-type z --output ${params.output_filename}_pass.vcf.gz $filtered
bcftools index --tbi ${params.output_filename}_pass.vcf.gz
```
For `split_VCF`:
```
set -euo pipefail
bcftools view --types $var_type --output-type z --output ${params.output_filename}_pass_${var_type}.vcf.gz ${passing_vcf}
```
Here it is relevant for those two steps. In the next step, towards creating an intersected list of SNVs, I'm using `bcftools` on the output of each of the tools to standardize the sample names, and again on the MuSE output to reorder the samples. That would add another four rounds of uncompressing and recompressing the VCF. I'm not sure, but I think in this case the VCFs could be left uncompressed throughout these steps; however, many `bcftools` commands require the VCF to be zipped and indexed.
What do you think?
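A minimal sketch of what a combined split-and-index step could look like; the loop, `PASSING_VCF`, and `OUTPUT_PREFIX` names are illustrative, not the pipeline's actual variables, and the `bcftools` calls are guarded so the sketch is a no-op where the tool or input is absent:

```shell
set -euo pipefail
PASSING_VCF="sample_pass.vcf.gz"   # hypothetical input VCF
OUTPUT_PREFIX="sample"             # hypothetical output prefix
for var_type in snps mnps indels; do
    out="${OUTPUT_PREFIX}_pass_${var_type}.vcf.gz"
    # Skip the actual calls if bcftools or the input file is unavailable
    if command -v bcftools >/dev/null 2>&1 && [ -f "${PASSING_VCF}" ]; then
        # Write each type directly as compressed output, then index it,
        # all within the one process (no separate decompress/recompress step)
        bcftools view --types "${var_type}" --output-type z --output "${out}" "${PASSING_VCF}"
        bcftools index --tbi "${out}"
    fi
done
```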
I agree with Sorel about the compression part. If `bcftools` can automatically generate a `vcf.gz`, would it be better to directly save it in compressed form?
I forgot to add updated testing results:
`nftest run a_mini-mutect2`
- output: `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/nftest-output`
- log: `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/nftest-output/log-nftest-20230420T005619Z.log`
Something to consider with BCFtools: `bcftools view` allows output file format specification (compressed, uncompressed, etc.) and also has a `--write-index` option that will automatically generate the index when the compressed output option is chosen.
Given that several steps involve BCFtools and the commands used are capable of automatically generating the index file, we should be able to avoid using any external compression/indexing processes (either from the module or a specific process defined here). How do we feel about using the built-in option from BCFtools? @tyamaguchi-ucla @sorelfitzgibbon
Also, on a side note, it's probably worth de-coupling the compression and indexing processes in the modularized workflow, so it can detect whether an input file is already compressed and only run indexing in that case.
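If we go this route, a single command could produce both the compressed VCF and its index. A sketch, assuming bcftools >= 1.18 (where `bcftools view` gained `--write-index`); note that, to my understanding, `--write-index` defaults to a `.csi` index, so downstream tools expecting `.tbi` may still need `bcftools index --tbi` or a newer bcftools accepting `--write-index=tbi`. Filenames are hypothetical, and the call is guarded so the sketch is a no-op where bcftools or the input is absent:

```shell
set -euo pipefail
in_vcf="input.vcf.gz"            # hypothetical input
out_vcf="sample_pass.vcf.gz"     # hypothetical output
if command -v bcftools >/dev/null 2>&1 && [ -f "${in_vcf}" ]; then
    # -f PASS keeps passing records; --output-type z writes bgzip-compressed
    # output; --write-index creates the index alongside the output file
    bcftools view -f PASS --output-type z --write-index \
        --output "${out_vcf}" "${in_vcf}"
fi
```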
I tested it with a VCF and they produced the same index (identical SHA512 checksums), so we should be OK. They're probably using the same function(s) between `tabix` and `bcftools index`.
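For reference, that comparison can be reproduced along these lines (the filename is hypothetical, and the tool calls are guarded so the sketch is a no-op where `tabix`/`bcftools` or the input is absent):

```shell
set -euo pipefail
vcf="sample.vcf.gz"   # hypothetical bgzip-compressed VCF
if command -v tabix >/dev/null 2>&1 && command -v bcftools >/dev/null 2>&1 \
    && [ -f "${vcf}" ]; then
    tabix -p vcf "${vcf}"             # writes sample.vcf.gz.tbi
    mv "${vcf}.tbi" tabix.tbi         # save the tabix-made index
    bcftools index --tbi "${vcf}"     # writes sample.vcf.gz.tbi again
    sha_tabix=$(sha512sum tabix.tbi | cut -d' ' -f1)
    sha_bcftools=$(sha512sum "${vcf}.tbi" | cut -d' ' -f1)
    [ "${sha_tabix}" = "${sha_bcftools}" ] && echo "identical indices"
fi
```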
Compression seems to produce slightly different files though, even with the same compression level. Based on the bcftools issue, though, `bgzip`/`tabix` is probably the most reliable way to proceed to avoid missing records and to index properly.
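For comparison, the conventional `bgzip`/`tabix` route is just two commands (the filename is hypothetical, and the calls are guarded so the sketch is a no-op where the tools or input are absent):

```shell
set -euo pipefail
vcf="sample.vcf"   # hypothetical uncompressed VCF
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1 \
    && [ -f "${vcf}" ]; then
    bgzip -f "${vcf}"            # produces sample.vcf.gz
    tabix -p vcf "${vcf}.gz"     # produces sample.vcf.gz.tbi
fi
```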
Even though the issue is only for lines exceeding 2GB? As noted in the issue @tyamaguchi-ucla linked, 100,000 samples with 10 alternate alleles are handled correctly, but 120 alternate alleles cause a problem. Are there any ucla-cds pipelines that are going to make sense of 120 alternate alleles? And don't you think that, as data sets grow, `bcftools` will address this if it becomes a notable problem?
@sorelfitzgibbon The issue that I linked was just an example.

> Are there any ucla-cds pipelines that are going to make sense of 120 alternate alleles?

Regenotype gSNP and gSV pipelines. Our Regeneron dataset is going to be ~150K samples minimum. In the regenotype-gSNP pipeline, we previously had this issue, https://github.com/uclahs-cds/pipeline-regenotype-gSNP/pull/48, although it was due to an issue with GATK. Also, I think it's possible for this pipeline (esp. Mutect2 multi-sample mode) to have 120 alt alleles.
In regenotype-gSNP, it looks like @stefaneng used `bcftools norm` and then `tabix -p`. Not sure if he encountered an issue with `bcftools index` or not. https://github.com/uclahs-cds/pipeline-regenotype-gSNP/blame/804ee2956993994a3e967bc3e9178cdacf5830da/module/joint-genotype-processes.nf @stefaneng do you happen to remember why you went with `tabix -p` instead of `bcftools index`?

> And don't you think that as data sets grow, if this becomes a notable problem bcftools will address it?

Maybe, or maybe not. We don't have control over this.
@yashpatel6 @sorelfitzgibbon As we're publishing call-sSNV soon, we would want to use a consistent indexing method across the algorithms if possible. If not, reviewers might ask why certain final VCFs were indexed using bcftools or tabix.
Thanks @tyamaguchi-ucla. I guess my question wasn't so much whether we'd ever see 120 alternate alleles, but whether they would be useful. Every parameter choice and filtering step loses some variants; are these messy variants important to keep in this context, or are they better addressed in pipelines specific to that purpose? Is it worth possibly several extra VCF decompression and compression-to-disk steps to avoid this potential loss here, or in most of the pipelines? I may well be naive, but I have to think that, given the extremely wide use of `bcftools`, if the issue were significant it would be addressed? :-)
Submodule updated and merged main into this branch. Waiting for final approval to merge!
Looks good! Anything else to add @tyamaguchi-ucla @maotian06 ?
I can't find the log for this test case where all algorithms were tested - log_output_dir: `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/nftest-output/a_mini_n2-all-tools-std-input/call-sSNV-6.0.0/TWGSAMIN000001-T001-S01-F/log-call-sSNV-6.0.0-20230504T212122Z`
Can you please point me to it? @sorelfitzgibbon
Ah, sorry! Because of the issue with identical SM tags causing the multiple-tumor test to fail, my log cleanup script deleted it. I've now restored it from the snapshot, but in the meantime I also reran it: `nftest run`
I accidentally overwrote some of the output files while testing on another branch, so I reran the testing to keep it clean for reviewers. Is it possible/desirable to add the branch name to the nftest output path? `nftest run`
Looks like the test failed?
Was this directory created by the test run? Also, it looks like there are some mostly empty log directories in …
No, just the multi-sample test failed, which will be fixed by the other PR.
No, sorry, that was created by the other branch.
I've cleaned up those with no successful runs within them.
I just noticed, but it looks like … Also, it looks like we can move some test directories to corresponding released dev directories (e.g. …).
@yashpatel6 is it possible for …
I do not have permission to move these files. Perhaps @maotian06 can move them?
I don't have permission to move. Seems like @sorelfitzgibbon you are the owner of those files: …
Thanks, I got that one.
It isn't, …
@tyamaguchi-ucla the output directories listed in the Description section, for both CPCG and A-mini samples, are current and ready for review.
Any reason not to use the standardized output structure in `/hot/software`? If not, can you please set up the output path properly? (See https://github.com/uclahs-cds/tool-NF-test#environment-settings)
Hmm, why is the version …
Did the pipeline generate the tmp files?
I don't think we need this directory
As we don't know which branch was tested here, we can probably delete this directory - or is there any reason we shouldn't?
@tyamaguchi-ucla, I have addressed all of your comments.
NF-test is set up to use A-mini. I could move these results to … And/or do you want me to add CPCG options to NF-test? If you want CPCG added to NF-test, are these suitable standard inputs for testing (and should I locate or create a contamination file to be included?):
The testing with CPCG was done a while ago. It was to test the …
No, sorry, that was me testing the identity of indices created by …
removed
There are probably past PRs that refer to specific logs listed in here (with a different path but the same log file name). I'm OK with deleting it.
@sorelfitzgibbon NF-test aside, your comment makes me wonder if we're on the same page regarding the standardized test directory structure. Could I ask why we use this particular directory structure? Or I'm happy to have a quick chat if that works better: https://confluence.mednet.ucla.edu/pages/viewpage.action?spaceKey=BOUTROSLAB&title=Nextflow+pipeline+standardization#Nextflowpipelinestandardization-Pipelinedevelopmenttestrundirectorystructure
I thought we were going to discuss this at the Nextflow working group tomorrow - whether just the final testing of a branch should be archived, or all testing, or something in between. Tangentially, I assume at least this line from that confluence page is out of date now that …
Sure, but I think tests generated for reviewers should in general be final, to avoid confusing reviewers (except perhaps when a PR is a draft and the developer specifically wants somebody to look at a test directory). Of course, developers are welcome to create tests not meant for reviewers. However, test outputs should not be generated under version-controlled directories, to avoid accidental pushes. We can discuss this topic more tomorrow.
Looks good to me!
Description
- `Mutect2` workflow to split the output VCF into three VCFs with SNVs, MNVs and indels
- `Mutect2` pipeline steps in `README`
- `Mutect2` flowchart in `README`
Closes #169
Testing Results
CPCG Tumor/Normal Paired Sample:
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/mutect2.CPCG-chr1.log`
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/CPCG.noContam.yaml`
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/CPCG.chr1.config`
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/CPCG-chr1`
Variant Counts
926 non-SNVs removed from a total of 2617 chr1 variants.
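Counts like these can be sanity-checked from the split VCFs along these lines (the filenames are hypothetical, and the block is guarded so it is a no-op where the files are absent):

```shell
set -euo pipefail
# Count records (non-header lines) in the unsplit and SNV-only VCFs;
# their difference is the number of non-SNV records removed.
if [ -f CPCG-chr1_pass.vcf.gz ] && [ -f CPCG-chr1_pass_snps.vcf.gz ]; then
    total=$(zgrep -cv '^#' CPCG-chr1_pass.vcf.gz)
    snvs=$(zgrep -cv '^#' CPCG-chr1_pass_snps.vcf.gz)
    echo "$(( total - snvs )) non-SNVs removed from a total of ${total} variants"
fi
```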
A-mini:
`nftest run`
nftest log:
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/log-nftest-20230516T003258Z.log`

Three log output directories:
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/a_mini_n2-all-tools-std-input/call-sSNV-6.0.0/TWGSAMIN000001-T001-S01-F/log-call-sSNV-6.0.0-20230516T003303Z`
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/a_mini_n2-mutect2-tumor-only/call-sSNV-6.0.0/TWGSAMIN000001-T001-S01-F/log-call-sSNV-6.0.0-20230516T004011Z`
- `/hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-mv-mutect-indels/a_mini-mutect2-multiple-samples/call-sSNV-6.0.0/TWGSAMIN000001/log-call-sSNV-6.0.0-20230516T004618Z`
Checklist
- I have read the code review guidelines and the code review best practice on GitHub check-list.
- I have reviewed the Nextflow pipeline standards.
- The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
- I have set up or verified the branch protection rule following the GitHub standards before opening this pull request.
- I have added my name to the contributor listings in the `manifest` block in the `nextflow.config` as part of this pull request; I am listed already, or do not wish to be listed. (This acknowledgement is optional.)
- I have added the changes included in this pull request to the `CHANGELOG.md` under the next release version or unreleased, and updated the date.
- I have updated the version number in the `metadata.yaml` and `manifest` block of the `nextflow.config` file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about the new version number and discuss it with the infrastructure team in this PR.)
- I have tested the pipeline on at least one A-mini sample.