[BUG] Annotation fails, cause mysterious #323

dlhuseby29 · 2024-10-01T19:34:41Z

Describe the bug
I can successfully annotate the test genome, and I can run the annotation process on my own sequence, as long as I run it from the command line in the 'for-my-own-use' way, so errors are ignored. As soon as I run it using the yaml file inputs, it fails within a few minutes. I have stripped down the yaml files to the minimum, but it still doesn't complete the annotation.

Expected behavior
I expect the annotation process to complete without errors, or, if it fails, to at least fail in a way that lets me tell what has gone wrong and fix it.

Software versions (please complete the following information):
iOS 13.6.9
pgap.py 2024-07-18.build7555
Docker version 27.2.0, build 3ab4256

Log Files
I've attached the full cwltool.log, but the first permanentFail is here:

[2024-10-01 19:17:12] DEBUG [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] initial work dir {}
[2024-10-01 19:17:12] INFO [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] /pgap/output/debug/tmp-outdir/_g21qgzc$ xml_evaluate
-input
/pgap/output/debug/tmpdir/aviqz9nx/stg49659d99-b614-4b88-8754-a66ec8ce1077/sequences.val
-xpath-fail
'//*[
( @Severity="ERROR" or @Severity="REJECT" )
and not(contains(@code, "GENERIC_MissingPubRequirement"))
and not(contains(@code, "SEQ_DESCR_ChromosomeLocation"))
and not(contains(@code, "SEQ_DESCR_MissingLineage"))
and not(contains(@code, "SEQ_DESCR_NoTaxonID"))
and not(contains(@code, "SEQ_DESCR_OrganismIsUndefinedSpecies"))
and not(contains(@code, "SEQ_DESCR_StrainWithEnvironSample"))
and not(contains(@code, "SEQ_DESCR_BacteriaMissingSourceQualifier"))
and not(contains(@code, "SEQ_DESCR_UnwantedCompleteFlag"))
and not(contains(@code, "SEQ_FEAT_BadCharInAuthorLastName"))
and not(contains(@code, "SEQ_FEAT_ShortIntron"))
and not(contains(@code, "SEQ_INST_InternalNsInSeqRaw"))
and not(contains(@code, "SEQ_INST_ProteinsHaveGeneralID"))
and not(contains(@code, "SEQ_PKG_NucProtProblem"))
and not(contains(@code, "SEQ_PKG_ComponentMissingTitle"))
]
' > /pgap/output/debug/tmp-outdir/_g21qgzc/initial_asnval_diag.xml
[2024-10-01 19:17:12] DEBUG Could not collect memory usage, job ended before monitoring began.
[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] exited with status: 1
[2024-10-01 19:17:12] WARNING [job Prepare_Unannotated_Sequences_asnvalidate_evaluate] completed permanentFail
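For readers who do not read XPath every day, the -xpath-fail filter in the log above can be paraphrased roughly as follows. This is only a sketch of the filter's logic in Python, not PGAP's own code: the function name and the sample messages are made up for illustration, and the ignore list is copied from the logged command.

```python
# Codes that the logged -xpath-fail expression explicitly excludes.
IGNORED_CODES = (
    "GENERIC_MissingPubRequirement",
    "SEQ_DESCR_ChromosomeLocation",
    "SEQ_DESCR_MissingLineage",
    "SEQ_DESCR_NoTaxonID",
    "SEQ_DESCR_OrganismIsUndefinedSpecies",
    "SEQ_DESCR_StrainWithEnvironSample",
    "SEQ_DESCR_BacteriaMissingSourceQualifier",
    "SEQ_DESCR_UnwantedCompleteFlag",
    "SEQ_FEAT_BadCharInAuthorLastName",
    "SEQ_FEAT_ShortIntron",
    "SEQ_INST_InternalNsInSeqRaw",
    "SEQ_INST_ProteinsHaveGeneralID",
    "SEQ_PKG_NucProtProblem",
    "SEQ_PKG_ComponentMissingTitle",
)
FAILING_SEVERITIES = {"ERROR", "REJECT"}


def failing_messages(messages):
    """Return the (severity, code) pairs the filter would treat as fatal.

    `messages` is an iterable of (severity, code) tuples; this helper is
    hypothetical and only mirrors the boolean logic of the XPath.
    """
    return [
        (severity, code)
        for severity, code in messages
        if severity in FAILING_SEVERITIES
        # XPath contains(@code, "X") is a substring test, hence `in`.
        and not any(ignored in code for ignored in IGNORED_CODES)
    ]


if __name__ == "__main__":
    sample = [
        ("ERROR", "SEQ_DESCR_NoTaxonID"),              # ignored code
        ("WARNING", "SEQ_FEAT_ShortIntron"),           # wrong severity
        ("ERROR", "GENERIC_BadSubmissionAuthorName"),  # not on the ignore list
    ]
    print(failing_messages(sample))
```

Any ERROR or REJECT message whose code is not on the ignore list makes the step exit with a non-zero status, which cwltool then reports as permanentFail.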

Additional context
The final message in the crash is always that it hates my name for some reason?

Failer nodes:

Bad last name 'Lastname'

Bad first name 'Firstname'

Any suggestions of how to get around this would be really appreciated.

Kind regards,

cwltool.log

azat-badretdin · 2024-10-02T09:18:25Z

Thank you for your report, user @dlhuseby29 !

Kudos for looking into permanentFail in cwltool.log. Right move!

The output says

Bad last name 'Lastname'

Bad first name 'Firstname'

This is from our QA which needed to prevent users from submitting genomes to Genbank under the default name from our examples.

I would recommend using a different fictitious name in the input file.

dlhuseby29 · 2024-10-02T09:45:04Z

It gave me this name error even when I had a submol yaml file with my name in it....

It seems like there is something going wrong with the processing of the yaml files, since it appears that I can successfully run the pipeline with the following command:

./pgap.py -r --debug -o test_annotation_2 -g EN1740_complete.fasta -s 'Pseudomonas aeruginosa'

But not at all if I use this command:

./pgap.py -r --debug -o result input.yaml

With extremely stripped down input and submol yaml files:

fasta:
  class: File
  location: EN1740_complete.fasta
submol:
  class: File
  location: submol2.yaml

and

organism:
    genus_species: Pseudomonas aeruginosa

azat-badretdin · 2024-10-02T09:54:53Z

> It gave me this name error even when I had a submol yaml file with my name in it....

Could you please post the relevant portion of cwltool.log file (as you did before) for this case?

Meanwhile, if the -g/-s command-line method works for you, feel free to use it as a workaround.

azat-badretdin · 2024-10-02T09:55:29Z

Also: does the Quick Start example work with the -s/-g option combination?

azat-badretdin · 2024-10-02T09:58:52Z

I am confused. The cwltool.log you attached does not complain about "Bad name". Instead, it ends with a failure at xml_evaluate.

Please have a look at initial_asnval_diag.xml in the output directory. Is it there? If yes, it might have additional clues: messages with ERROR in them

dlhuseby29 · 2024-10-02T10:10:57Z

Yes, I have confused the issue.

The bad name is just a side error that pops up every time I run it which makes it seem like the problem I am having is with the name. I added this below the additional context heading, but I realize that this was confusing.

The actual problem is posted above that and it is that whenever I have any yaml files as inputs, the annotation fails within the first 20 minutes or so. If I just run everything from the command line with a manual input of the fasta file and no metadata besides the organism, everything seems to run fine.

Sorry for the confusing post.

dlhuseby29 · 2024-10-02T10:12:38Z

I just had it fail in this way using the yaml file inputs. I have posted the cwltool.log for this failure.
cwltool.log

azat-badretdin · 2024-10-02T11:12:15Z

Please have a look at initial_asnval_diag.xml in the output directory. Is it there? If yes, it might have additional clues: messages with ERROR in them

dlhuseby29 · 2024-10-02T12:33:21Z

I don't see that file in the output folder of any of these runs.

azat-badretdin · 2024-10-02T12:43:22Z

Do you have any .xml files?

azat-badretdin · 2024-10-02T12:43:36Z

Could you please post the listing of output folder?

dlhuseby29 · 2024-10-02T12:46:59Z

Here is a screenshot of everything in there and a zipped version of the entire folder.

Archive.zip

azat-badretdin · 2024-10-02T12:51:39Z

OK, you have a debug/ folder already. This is good. Please find this file using

find debug/ -type f -name initial_asnval_diag.xml

and post the output of `grep ERROR` on that file here.

Thanks!
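The find and grep steps above can also be combined into one small script. The sketch below is an illustration, not part of PGAP: the function names are invented, and it assumes the diagnostics file contains <message ...> elements with severity and code attributes, as in the excerpts posted in this thread. It uses regular expressions rather than an XML parser because the file holds several XML declarations plus a plain-text "Failer nodes:" line, so it is not a single well-formed document.

```python
import re
import sys
from pathlib import Path

# One <message ...>text</message> element; attributes are parsed separately
# so that their order does not matter.
MESSAGE_RE = re.compile(r"<message\b([^>]*)>(.*?)</message>", re.DOTALL)
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def extract_errors(text):
    """Return (code, message) pairs for ERROR and REJECT messages in `text`."""
    errors = []
    for attrs, body in MESSAGE_RE.findall(text):
        fields = dict(ATTR_RE.findall(attrs))
        if fields.get("severity") in ("ERROR", "REJECT"):
            errors.append((fields.get("code", ""), body.strip()))
    return errors


def scan_debug_dir(debug_dir):
    """Map each initial_asnval_diag.xml found under `debug_dir` to its errors."""
    root = Path(debug_dir)
    if not root.is_dir():
        return {}
    results = {}
    for path in sorted(root.rglob("initial_asnval_diag.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[str(path)] = extract_errors(text)
    return results


if __name__ == "__main__":
    # Default to ./debug, the folder name used in this thread.
    debug_dir = sys.argv[1] if len(sys.argv) > 1 else "debug"
    for path, errors in scan_debug_dir(debug_dir).items():
        print(path)
        for code, message in errors:
            print(f"  {code}: {message}")
```

Unlike a plain grep, this prints the code attribute next to each message, which makes it easier to see which validation rule fired.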

dlhuseby29 · 2024-10-02T12:57:25Z

Maybe I did this wrong, but...

(base) xxxxxxx@UUC-02V8279HTD5 alpkek6w % grep ERROR initial_asnval_diag.xml
Bad last name 'Lastname'
Bad first name 'Firstname'

The whole file is attached (with txt appended for upload purposes), because maybe I did the search wrong.
initial_asnval_diag.txt

azat-badretdin · 2024-10-02T13:01:03Z

Thanks. I just found it myself as well, using your Archive.zip link (sorry did not notice right away)

gpipedev21:debug$ less ./debug/tmp-outdir/alpkek6w/initial_asnval_diag.xml
Failer nodes:
<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad last name 'Lastname'</message>

<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad first name 'Firstname'</message>

This is becoming interesting.

So what happens when you run the example in Quick Start? It should have the same problem. In the absence of submission parameters for authors, it falls back to the same Lastname/Firstname combination.

Also:

You posted the cwltool.log snippet without "code" markup. As a result, all XML elements were stripped from the cwltool.log output and only their content remained. This caused confusion: I did not immediately realize that these "Bad last name" lines were reported in the form of XML.

dlhuseby29 · 2024-10-02T13:04:03Z

Sorry about that, I know just enough of this stuff to get myself into trouble, but not really enough to solve any problems. I will try to do the code markup next time.

If I run this as a quick start annotation, it runs cleanly without any trouble. Here is a screenshot of the output folder:

dlhuseby29 · 2024-10-02T13:06:01Z

And as I said, this is the entire text of the submol.yaml:

organism:
    genus_species: Pseudomonas aeruginosa

I've run it with other versions of the submol.yaml, but everything seems to generate a similar crash.

azat-badretdin · 2024-10-02T13:18:32Z

When you run it with -g/-s options you do not need any submol.yaml files.

Did you try


./pgap.py -r -o mg37_results -g $HOME/.pgap/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta -s 'Mycoplasmoides genitalium'

the example from Quick Start? This example is part of our regular testing of software before the release.

I cannot see what specifically differs between this example and yours.

dlhuseby29 · 2024-10-02T13:21:22Z

Yes, I have tried the test annotation, and it works fine.

If I annotate this genome (my data) using the command line (-g/-s), it also works fine.

If I annotate this genome with an input of yaml files, it fails.

I'm currently running a '-g/-s' annotation on one of my fasta files and it has been happily churning for about an hour. If I input a yaml file it dies within 10 minutes or so.

azat-badretdin · 2024-10-02T13:31:15Z

> If I annotate this genome with an input of yaml files, it fails.

OK. So let's work on this one. This is better. For this case, I think you have to specify the correct first and last name. If you already did that, as you said, and it still failed, I need both input.yaml and submol.yaml (please post them using "code" markup).

Thanks!

dlhuseby29 · 2024-10-02T14:30:26Z

input.yaml

fasta:
  class: File
  location: EN1740_complete.fasta
submol:
  class: File
  location: submol.yaml

submol.yaml

organism:
    genus_species: Pseudomonas aeruginosa

These are the files that I have been trying to use. Originally, I had a submol.yaml with more metadata, but if the basic form above doesn't work, then the longer version certainly won't work?

Anyway, here is the more complete version:

submol2.yaml

topology: 'circular'
organism:
    genus_species: 'Pseudomonas aeruginosa' 
    strain: 'EN1740'
contact_info:
    last_name: 'Huseby'
    first_name: 'Douglas'
    email: '[email protected]'
    organization: 'Uppsala University'
    department: 'Department of Medical Biochemistry and Microbiology'
    street: 'Husargatan 3, Box 582'
    city: 'Uppsala'
    postal_code: '75124'
    country: 'Sweden'

azat-badretdin · 2024-10-02T14:37:57Z

Thanks

Could you please post the relevant portion of the cwltool.log output? Does it complain about the last name Lastname or about your own name?

Also, if the failure persists, I recommend following the example of the submol file that comes in the test_genomes/ directories and including an authors: section in your submol.yaml file.

dlhuseby29 · 2024-10-02T15:14:52Z

I'm not sure if you wanted me to actually find it in the cwltool.log or not, since it isn't super obvious there. I've attached that file so you can see if you can find anything.

Finding the initial_asnval_diag.xml file in the debug folder had the same thing in it though:

Failer nodes:
<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad last name 'Lastname'</message>

<?xml version="1.0" encoding="UTF-8"?>
<message severity="ERROR" seq-id="lcl|contig001" code="GENERIC_BadSubmissionAuthorName">Bad first name 'Firstname'</message>

[cwltool.log](https://github.com/user-attachments/files/17231854/cwltool.log)

azat-badretdin · 2024-10-02T15:24:16Z

I think what is happening is that you need to fill in both the contact_info and authors sections of submol.yaml. Could you please try this?

dlhuseby29 · 2024-10-02T15:44:47Z

So I took your suggestion and grabbed the submol.yaml from the MG37 test folder.

I ran it as-is only changing the organism name. This seemed to run as if it was going to do the whole annotation. I didn't run it the full time, but if it fails, it actually usually fails in under 5 minutes and this went for 15 minutes before I killed it.

If I delete the 'authors' portion and everything after it in this submol.yaml, I recreate the error and it fails in 3 minutes due to a GENERIC_BadSubmissionAuthorName error.

azat-badretdin · 2024-10-02T15:50:15Z

It looks like there is a problem in the YAML input scenario when only contact_info: is specified but not authors:.

So the workaround is to always specify both.
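Until this is fixed, a quick pre-flight check on the submol file can catch the problem before a run is started. The sketch below is hypothetical and not part of pgap.py: it takes the submol content as an already parsed dictionary (for example from PyYAML's yaml.safe_load, which is an assumption about your setup) and assumes the authors layout used in the test_genomes examples, that is, a list of entries that each hold an author mapping.

```python
# Default names from the examples that the validation rejects in this thread.
PLACEHOLDER_NAMES = {"Firstname", "Lastname"}


def author_problems(submol):
    """Return a list of human-readable problems with the authors section.

    `submol` is the parsed content of submol.yaml as a dict. An empty list
    means no problem was found by this (deliberately simple) check.
    """
    authors = submol.get("authors")
    if not authors:
        return ["no 'authors' section found; contact_info alone was not enough in this thread"]

    problems = []
    for index, entry in enumerate(authors):
        author = entry.get("author") if isinstance(entry, dict) else None
        if not isinstance(author, dict):
            problems.append(f"authors[{index}] has no 'author' mapping")
            continue
        for key in ("first_name", "last_name"):
            value = author.get(key)
            if not value:
                problems.append(f"authors[{index}] is missing {key}")
            elif value in PLACEHOLDER_NAMES:
                problems.append(f"authors[{index}].{key} is still the placeholder {value!r}")
    return problems


if __name__ == "__main__":
    # Mirrors the failing case: contact_info is present, authors is not.
    only_contact = {
        "organism": {"genus_species": "Pseudomonas aeruginosa"},
        "contact_info": {"first_name": "Douglas", "last_name": "Huseby"},
    }
    for problem in author_problems(only_contact):
        print(problem)
```

The check only looks at the structure of the file; whether PGAP accepts a given name is still decided by its own validation.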

dlhuseby29 · 2024-10-02T16:20:49Z

Looks like I don't even have to do that much. This worked for the submol.yaml:

organism:
    genus_species: Pseudomonas aeruginosa 
authors:
    -     author:
            first_name: 'Arnold'
            last_name: 'Schwarzenegger'

And it would not work with just the organism specified, so it seems like specifying an author is an absolute requirement now?

So, is this just a 'me' problem?

Thanks for working through this with me. Such an oddly specific problem.

azat-badretdin · 2024-10-02T19:05:27Z

> And it would not work with just the organism specified, so it seems like specifying an author is an absolute requirement now?

It looks this way. Without this field, Dr. Firstname Lastname sneaks into your submissions and that triggers our validation guards that do not like this hyperactive scientist.
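For anyone scripting many runs, the minimal file pair that worked in this thread could be generated along these lines. This is a sketch under assumptions: the function names are invented, the keys and nesting simply mirror the input.yaml and submol.yaml posted above, and nothing here is part of PGAP itself.

```python
def yaml_quote(value):
    """Single-quote a YAML scalar, doubling any embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def render_submol(genus_species, first_name, last_name):
    """Render a minimal submol.yaml with an organism and one author."""
    return (
        "organism:\n"
        f"    genus_species: {yaml_quote(genus_species)}\n"
        "authors:\n"
        "    - author:\n"
        f"        first_name: {yaml_quote(first_name)}\n"
        f"        last_name: {yaml_quote(last_name)}\n"
    )


def render_input(fasta_path, submol_path):
    """Render the input.yaml that points at the FASTA and submol files."""
    return (
        "fasta:\n"
        "  class: File\n"
        f"  location: {yaml_quote(fasta_path)}\n"
        "submol:\n"
        "  class: File\n"
        f"  location: {yaml_quote(submol_path)}\n"
    )


if __name__ == "__main__":
    # Values taken from the working example earlier in the thread.
    print(render_submol("Pseudomonas aeruginosa", "Arnold", "Schwarzenegger"))
    print(render_input("EN1740_complete.fasta", "submol.yaml"))
```

The two strings can be written to submol.yaml and input.yaml and then passed to pgap.py in the same way as the hand-written files above.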

azat-badretdin added the PGAPX-1434 label Oct 7, 2024
