Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agat_sp_filter_by_ORF_size.pl is not isoform aware #512

Open
GallVp opened this issue Dec 11, 2024 · 7 comments
Open

agat_sp_filter_by_ORF_size.pl is not isoform aware #512

GallVp opened this issue Dec 11, 2024 · 7 comments

Comments

@GallVp
Copy link

GallVp commented Dec 11, 2024

Thank you for your amazing work. I am a big fan of AGAT.

Describe the bug

agat_sp_filter_by_ORF_size.pl includes a multi-isoform gene to the good_gene_list if the first isoform passes the test filter even if the second isoform was to fail the test filter. This behaviour is either unintended or not documented.

General (please complete the following information):

  • AGAT version: quay.io/biocontainers/agat:1.4.2--pl5321hdfd78af_0
  • AGAT installation/use: docker
  • OS: macOS

To Reproduce

Please use the below GFF to do,

agat_sp_filter_by_ORF_size.pl -g t.gff -s 24 -o output

mRNA gene19851.t2 of gene gene19851 should fail the test and, therefore, gene gene19851 should be removed. But that is not what happens. Rather, gene19851 is retained in the output_sup24.gff file.

t.gff >

##gff-version 3
###
chr23	AUGUSTUS	gene	16515075	16516672	.	-	.	ID=gene19849;description=Protein%20of%20unknown%20function%20%28DUF1635%29
chr23	AUGUSTUS	mRNA	16515075	16516597	1	-	.	ID=gene19849.t1;Parent=gene19849;description=Protein%20of%20unknown%20function%20%28DUF1635%29
chr23	AUGUSTUS	exon	16515075	16515794	.	-	.	ID=gene19849.t1.exon1;Parent=gene19849.t1
chr23	AUGUSTUS	CDS	16515075	16515794	1	-	0	ID=gene19849.t1.cds1;Parent=gene19849.t1
chr23	AUGUSTUS	exon	16516562	16516597	.	-	.	ID=gene19849.t1.exon2;Parent=gene19849.t1
chr23	AUGUSTUS	CDS	16516562	16516597	1	-	0	ID=gene19849.t1.cds2;Parent=gene19849.t1
chr23	gmst	mRNA	16515075	16516672	.	-	.	ID=gene19849.t2;Parent=gene19849;description=Protein%20of%20unknown%20function%20%28DUF1635%29
chr23	gmst	exon	16515075	16515794	50.2	-	0	ID=gene19849.t2.exon1;Parent=gene19849.t2
chr23	gmst	CDS	16515075	16515794	50.2	-	0	ID=gene19849.t2.cds1;Parent=gene19849.t2
chr23	gmst	exon	16516562	16516672	50.2	-	0	ID=gene19849.t2.exon2;Parent=gene19849.t2
chr23	gmst	CDS	16516562	16516672	50.2	-	0	ID=gene19849.t2.cds2;Parent=gene19849.t2
###
chr23	gmst	gene	16530414	16531453	.	-	.	ID=gene19850;description=Myb-like%20DNA-binding%20domain
chr23	gmst	mRNA	16530414	16531453	.	-	.	ID=gene19850.t1;Parent=gene19850;description=Myb-like%20DNA-binding%20domain
chr23	gmst	exon	16530414	16531041	42.7	-	1	ID=gene19850.t1.exon1;Parent=gene19850.t1
chr23	gmst	CDS	16530414	16531041	42.7	-	1	ID=gene19850.t1.cds1;Parent=gene19850.t1
chr23	gmst	exon	16531197	16531453	42.7	-	0	ID=gene19850.t1.exon2;Parent=gene19850.t1
chr23	gmst	CDS	16531197	16531453	42.7	-	0	ID=gene19850.t1.cds2;Parent=gene19850.t1
###
chr23	AUGUSTUS	gene	16530414	16531542	.	-	.	ID=gene19851;description=Differing%20isoform%20descriptions
chr23	AUGUSTUS	mRNA	16530414	16531542	1	-	.	ID=gene19851.t1;Parent=gene19851;description=Myb-like%20DNA-binding%20domain
chr23	AUGUSTUS	exon	16530414	16530721	.	-	.	ID=gene19851.t1.exon1;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16530414	16530721	1	-	2	ID=gene19851.t1.cds1;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16530824	16531041	.	-	.	ID=gene19851.t1.exon2;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16530824	16531041	1	-	1	ID=gene19851.t1.cds2;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16531197	16531326	.	-	.	ID=gene19851.t1.exon3;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16531197	16531326	1	-	2	ID=gene19851.t1.cds3;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16531428	16531542	.	-	.	ID=gene19851.t1.exon4;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16531428	16531542	1	-	0	ID=gene19851.t1.cds4;Parent=gene19851.t1
chr23	GeneMark.hmm3	mRNA	16531514	16531542	.	-	.	ID=gene19851.t2;Parent=gene19851;description=Hypothetical%20protein%20%7C%20no%20eggnog%20hit
chr23	GeneMark.hmm3	exon	16531514	16531542	.	-	0	ID=gene19851.t2.exon1;Parent=gene19851.t2
chr23	GeneMark.hmm3	CDS	16531514	16531542	.	-	0	ID=gene19851.t2.cds1;Parent=gene19851.t2
###
chr23	AUGUSTUS	gene	16539401	16545431	.	+	.	ID=gene19852;description=nuclease%20HARBI1
chr23	AUGUSTUS	mRNA	16539401	16545431	1	+	.	ID=gene19852.t1;Parent=gene19852;description=nuclease%20HARBI1
chr23	AUGUSTUS	exon	16539401	16539509	.	+	.	ID=gene19852.t1.exon1;Parent=gene19852.t1
chr23	AUGUSTUS	CDS	16539401	16539509	1	+	0	ID=gene19852.t1.cds1;Parent=gene19852.t1
chr23	AUGUSTUS	exon	16544386	16545431	.	+	.	ID=gene19852.t1.exon2;Parent=gene19852.t1
chr23	AUGUSTUS	CDS	16544386	16545431	1	+	2	ID=gene19852.t1.cds2;Parent=gene19852.t1
###
chr23	AUGUSTUS	gene	16556338	16556796	.	+	.	ID=gene19853;description=Zinc%20finger%20protein
chr23	AUGUSTUS	mRNA	16556338	16556796	1	+	.	ID=gene19853.t1;Parent=gene19853;description=Zinc%20finger%20protein
chr23	AUGUSTUS	exon	16556338	16556796	.	+	.	ID=gene19853.t1.exon1;Parent=gene19853.t1
chr23	AUGUSTUS	CDS	16556338	16556796	1	+	0	ID=gene19853.t1.cds1;Parent=gene19853.t1
###

< END

@Juke34
Copy link
Collaborator

Juke34 commented Dec 11, 2024

Currently, in such case the gene should be reported in both outputs output_NOT_sup24.gff and output_sup24.gff
We can add a flag e.g. --all_iso to report in output_sup24.gff only if all isoforms pass the test.

@Juke34
Copy link
Collaborator

Juke34 commented Dec 11, 2024

@GallVp
What do you think should happen if one isoform pass the test but an other isoform do not have CDS (cases already observed)?

Juke34 added a commit that referenced this issue Dec 11, 2024
@GallVp
Copy link
Author

GallVp commented Dec 11, 2024

Thank you @Juke34 for the prompt response.

I think one of the following outcomes make sense,

  1. Filter test is applied at the transcript level so that in the case of multiple transcripts, the one which passes the test goes to the good_gene_list and the others which fail the filter are added to the bad_gene_list. This sort of behaviour can be achieved with Gffread -l parameter. Please see a test here: https://github.com/Plant-Food-Research-Open/genepal/blob/f9807724bace7b0d2b3a9b3ce892dc49c887996d/modules/local/tests/gffread/main.nf.test#L28-L32
  2. Filter the gene if any one of its isoforms fails the filter test. I believe this is what you have implemented with --all_iso parameter? This is also a good solution. My suggestion is that this should be the default behaviour as it is in accordance with the principle of least surprise. But at the same time I do understand the need to stay backwards compatible.

@GallVp
Copy link
Author

GallVp commented Dec 12, 2024

@GallVp

What do you think should happen if one isoform pass the test but an other isoform do not have CDS (cases already observed)?

I think a transcript without CDS should pass into the good_gene_list as I understand this is the current behaviour of agat_sp_filter_by_ORF_size.pl. If there is no ORF, there is no application of the filter test. Removing such transcripts would imply that the tool assumes the absence of CDSs as an ORF of length 0. Doing so will be also remove all non-protein coding transcripts.

@Juke34
Copy link
Collaborator

Juke34 commented Dec 13, 2024

Ok I finally updated to get the original expected behaviour. Any transcript that does not pas the test is discarded.

So

chr23	AUGUSTUS	gene	16530414	16531542	.	-	.	ID=gene19851;description=Differing%20isoform%20descriptions
chr23	AUGUSTUS	mRNA	16530414	16531542	1	-	.	ID=gene19851.t1;Parent=gene19851;description=Myb-like%20DNA-binding%20domain
chr23	AUGUSTUS	exon	16530414	16530721	.	-	.	ID=gene19851.t1.exon1;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16530414	16530721	1	-	2	ID=gene19851.t1.cds1;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16530824	16531041	.	-	.	ID=gene19851.t1.exon2;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16530824	16531041	1	-	1	ID=gene19851.t1.cds2;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16531197	16531326	.	-	.	ID=gene19851.t1.exon3;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16531197	16531326	1	-	2	ID=gene19851.t1.cds3;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16531428	16531542	.	-	.	ID=gene19851.t1.exon4;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16531428	16531542	1	-	0	ID=gene19851.t1.cds4;Parent=gene19851.t1
chr23	GeneMark.hmm3	mRNA	16530414	16531542	.	-	.	ID=gene19851.t2;Parent=gene19851;description=Hypothetical%20protein%20%7C%20no%20eggnog%20hit
chr23	GeneMark.hmm3	exon	16530414	16531542	.	-	0	ID=gene19851.t2.exon1;Parent=gene19851.t2
chr23	GeneMark.hmm3	CDS	16531514	16531542	.	-	0	ID=gene19851.t2.cds1;Parent=gene19851.t2

becomes with agat_sp_filter_by_ORF_size.pl -g in.gff -s 24 -o output

output_sup24.gff

##gff-version 3
chr23	AUGUSTUS	gene	16530414	16531542	.	-	.	ID=gene19851;description=Differing isoform descriptions
chr23	AUGUSTUS	mRNA	16530414	16531542	1	-	.	ID=gene19851.t1;Parent=gene19851;description=Myb-like DNA-binding domain
chr23	AUGUSTUS	exon	16530414	16530721	.	-	.	ID=gene19851.t1.exon1;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16530824	16531041	.	-	.	ID=gene19851.t1.exon2;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16531197	16531326	.	-	.	ID=gene19851.t1.exon3;Parent=gene19851.t1
chr23	AUGUSTUS	exon	16531428	16531542	.	-	.	ID=gene19851.t1.exon4;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16530414	16530721	1	-	2	ID=gene19851.t1.cds1;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16530824	16531041	1	-	1	ID=gene19851.t1.cds2;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16531197	16531326	1	-	2	ID=gene19851.t1.cds3;Parent=gene19851.t1
chr23	AUGUSTUS	CDS	16531428	16531542	1	-	0	ID=gene19851.t1.cds4;Parent=gene19851.t1

and output_NOT_sup24.gff

##gff-version 3
chr23	AUGUSTUS	gene	16530414	16531542	.	-	.	ID=gene19851;description=Differing isoform descriptions
chr23	GeneMark.hmm3	mRNA	16530414	16531542	.	-	.	ID=gene19851.t2;Parent=gene19851;description=Hypothetical protein | no eggnog hit
chr23	GeneMark.hmm3	exon	16530414	16531542	.	-	0	ID=gene19851.t2.exon1;Parent=gene19851.t2
chr23	GeneMark.hmm3	CDS	16531514	16531542	.	-	0	ID=gene19851.t2.cds1;Parent=gene19851.t2
chr23	AGAT	three_prime_UTR	16530414	16531513	.	-	.	ID=agat-three_prime_utr-1;Parent=gene19851.t2

@Juke34
Copy link
Collaborator

Juke34 commented Dec 13, 2024

Would you think I should add:

  • an option to say, if one of the isoforms do not pass the test we remove the whole gene (original behaviour)
  • an option to say, if one of the isoforms do pass the test we keep the whole gene even if other isoforms do pass the test (original behaviour)
    ?

@GallVp
Copy link
Author

GallVp commented Dec 15, 2024

Thank you @Juke34

This is amazing.

  • an option to say, if one of the isoforms do not pass the test we remove the whole gene (original behaviour)
  • an option to say, if one of the isoforms do pass the test we keep the whole gene even if other isoforms do pass the test (original behaviour)
    ?

I can't think of a use-case where these options will be useful.

Your new implementation where filter is applied at the transcript level and the passing and failing isoforms are separated out works perfectly for all my needs. Thank you very much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants