seqtransform permanentFail : CFastaReader: Seq-id lcl|0-45901 is a duplicate around line 993 #42

Closed
scorreard opened this issue Jun 20, 2023 · 2 comments


scorreard commented Jun 20, 2023

Describe the bug
seqtransform permanentFail

Hi team! Thanks for the tool. I have used it several times on hifiasm assemblies and it worked perfectly, so this is not an installation issue. This time, I generated an assembly using Flye with both HiFi reads and ONT reads (simplex). I run FCS-adaptor before scaffolding.
I think the error is due to 2 contigs having the same length, even though they have different sequences.
Looking forward to your feedback,

Solenne

To Reproduce

/app/fcs/bin/av_screen_x \
    -o output/ \
    --debug --euk \
    input_ont_fastq_1_assembly_consensus.cut250.tigmint.fa.k32.w100.z100.ntLink.scaffolds_cleaned3.fa

I can share the genome with you if needed, but I am not sure it is necessary.

Software versions :

  • OS CentOS 7
  • Cloud Platform VM : No, local HPC
  • Docker or Singularity version : singularity version 3.8.7-1
  • Docker or Singularity FCS image version : ftp.ncbi.nlm.nih.gov-genomes-TOOLS-FCS-releases-0.4.0-fcs-adaptor.sif

Log Files

Tail of output/fcs_adaptor.log

/projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output/debug.4dgcxzo_/tmp-outdirqnyhvbc2$ seqtransform \
   -out \
   validated.fna_0.cleaned_fa \
   -in \
   /projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output/debug.4dgcxzo_/tmpsbhke6mp/stg38e3a4b6-ca70-4d01-964c-9ca0fad363d8/validated.fna_0.fna \
   -seqaction-xml-file \
   /projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output/debug.4dgcxzo_/tmpsbhke6mp/stg5297ee96-87b6-4f2c-8cd3-c017eb97e817/fcs_calls.xml \
   -report \
   seqtransform.log
[job seqtransform_step] Max memory used: 24MiB
[job seqtransform_step] completed permanentFail
[step seqtransform_step] completed permanentFail
[workflow GenerateCleanedFasta] completed permanentFail
[step GenerateCleanedFasta] completed permanentFail
[workflow ] completed permanentFail
Output will be placed in: /projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output
Executing the workflow
Traceback (most recent call last):
 File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/cgr_fcs/apps/public/av_screen_x/av_screen_x.py", line 270, in <module>
   sys.exit(main())
 File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/cgr_fcs/apps/public/av_screen_x/av_screen_x.py", line 258, in main
   p.launch()
 File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/cgr_fcs/apps/public/av_screen_x/av_screen_x.py", line 181, in launch
   pipeline(**self.pipeline_args)
 File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/pip_deps_pypi__cwltool_3_1_20211107152837/cwltool/factory.py", line 34, in __call__
   raise WorkflowStatus(out, status)
cwltool.factory.WorkflowStatus: Completed permanentFail

Tail of output/debug.4dgcxzo_/tmp-outdirqnyhvbc2/seqtransform.log

	<msg level='info'  code='No edits'  location='0-45901'>success</msg>
	<msg level='error'  code='bad input format'  location='line 994'>NCBI C++ Exception:&#xa;    T0 &quot;/netopt/ncbi_tools64/c++.by-date/20221028/GCC730-Release64MT/../src/objmgr/util/sequence.cpp&quot;, line 2941: Error: (CObjmgrUtilException::eBadLocation) ncbi::objects::CFastaOstream::x_WriteSeqIds() - Duplicate Seq-id lcl|0-45901 in FASTA output&#xa;</msg>
	<msg level='error'  code='bad input format'  location='lcl|0-45901'>CFastaReader: Seq-id lcl|0-45901 is a duplicate around line 993</msg>
</command-line-tool-report>
grep '0-45901' input_ont_fastq_1_assembly_consensus.cut250.tigmint.fa.k32.w100.z100.ntLink.scaffolds_cleaned3.fa 
>contig_3807::contig_3807:0-45901 None-None
>contig_3898::contig_3898:0-45901 None-None

grep '0-45901' -A 1 input_ont_fastq_1_assembly_consensus.cut250.tigmint.fa.k32.w100.z100.ntLink.scaffolds_cleaned3.fa
==> shows that the 2 sequences are different

Additional context
I suspect the tool treats the two sequences as duplicates because both headers carry the same coordinates '0-45901', even though they belong to different contigs.
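
The grep check above can be extended to the whole file. The following is a minimal sketch, not part of FCS-adaptor: it groups header lines by the text after the last ':' of the first whitespace-delimited token. That grouping rule is an assumption made for illustration; this thread does not confirm how the FASTA reader actually arrives at `lcl|0-45901`. The function name `coordinate_collisions` is hypothetical.

```python
from collections import defaultdict


def coordinate_collisions(headers):
    """Return {suffix: [headers]} for suffixes shared by more than one header.

    The suffix is the text after the last ':' in the first token of each
    header. This rule is an assumption, not the documented reader behaviour.
    """
    groups = defaultdict(list)
    for header in headers:
        tokens = header.lstrip(">").split()
        if not tokens:
            continue  # skip empty header lines
        suffix = tokens[0].rsplit(":", 1)[-1]
        groups[suffix].append(header)
    return {suffix: hs for suffix, hs in groups.items() if len(hs) > 1}


# The two header lines reported by grep in this issue.
headers = [
    ">contig_3807::contig_3807:0-45901 None-None",
    ">contig_3898::contig_3898:0-45901 None-None",
]
collisions = coordinate_collisions(headers)
for suffix, shared in collisions.items():
    print(suffix, len(shared))
```

To scan a real file, pass only the lines that start with '>' to the function.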

@scorreard scorreard changed the title [BUG]: <title> seqtransform permanentFail : CFastaReader: Seq-id lcl|0-45901 is a duplicate around line 993 Jun 20, 2023
Contributor

etvedte commented Jun 20, 2023

Hi Solenne,

It seems to be an issue with your FASTA seq-ids/headers. I made a testing FASTA with the exact headers you included above and got a similar error. When I deleted the trailing None-None from both sequences, it worked. When I deleted one None-None, it also worked. When I replaced None-None with two identical strings following the contigid:coordinates, I got the error.

We may need to post some guidelines about FASTA header formatting if we see more similar issues. If you want to move forward now I would just adjust the headers to make them simpler yet distinct.

Eric
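
One way to make the headers "simpler yet distinct", as suggested above, is sketched below. It is an illustration under stated assumptions, not an official FCS-adaptor utility: the function name `simplify_headers`, the choice to keep only the text before '::', and the `_2`, `_3` suffixing for repeated names are all hypothetical. The sequence lines in the usage example are placeholders, not data from the assembly.

```python
import re


def simplify_headers(lines):
    """Yield FASTA lines (without trailing newlines) with short, unique headers."""
    seen = set()
    for line in lines:
        if not line.startswith(">"):
            yield line  # sequence lines pass through unchanged
            continue
        tokens = line[1:].split()
        # Assumption: the text before '::' (e.g. 'contig_3807') identifies the
        # contig; coordinates and the trailing 'None-None' are dropped.
        base = tokens[0].split("::", 1)[0] if tokens else "seq"
        base = re.sub(r"[^A-Za-z0-9_.-]", "_", base)
        new_id, n = base, 1
        while new_id in seen:  # add a numeric suffix if the name repeats
            n += 1
            new_id = f"{base}_{n}"
        seen.add(new_id)
        yield f">{new_id}"


records = [
    ">contig_3807::contig_3807:0-45901 None-None",
    "ACGT",  # placeholder sequence
    ">contig_3898::contig_3898:0-45901 None-None",
    "TTGA",  # placeholder sequence
]
cleaned = list(simplify_headers(records))
print("\n".join(cleaned))
```

For a file on disk, the same generator can be fed with `line.rstrip("\n")` for each line and the results written back out with a newline appended. Keep a mapping of old to new names if later steps depend on the original headers.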

@scorreard
Author

Thanks Eric,
I'll try removing the 'None-None' and will update you later this week!
