Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(IPVC-2276): skip accessions from gff that are not in the tx_info file, log missing acs to file #18

Merged
merged 10 commits into from
Apr 5, 2024

Conversation

sptaylor
Copy link

@sptaylor sptaylor commented Apr 2, 2024

The exonset file, derived from gff files, contains transcript accessions that are not present in the transcript info file/Seqrepo, and this causes downstream issues. We are adding a step to filter out these missing transcripts from the exonsets file.
Demo:

+ sbin/filter_exonset_transcripts.py --tx-info /workdir/loading/gbff.txinfo.gz --exonsets /workdir/loading/gff.exonsets.gz --missing-ids /workdir/loading/filtered_tx_acs.txt
+ tee /workdir/logs/filter_exonset_transcripts.log
+ gzip -c
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_000853.3 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_003491.3 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_033319.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_033320.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_033321.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_156186.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284286.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284289.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284288.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001002837.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284287.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001350317.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 INFO     [__main__] Filtered out exon sets for 12 transcripts: NR_033320.2,NM_001284286.1,NM_001350317.2,NR_033319.2,NR_156186.2,NM_000853.3,NM_001284287.1,NM_001284288.1,NM_001002837.2,NR_003491.3,NM_001284289.1,NR_033321.2

The filtered exonsets are printed to stdout and the missing transcript ids are printed to file.

@sptaylor sptaylor requested review from bsgiles73 and nvta1209 April 2, 2024 15:39
@sptaylor sptaylor marked this pull request as ready for review April 2, 2024 15:39
@@ -115,7 +115,7 @@ def _get_exon_number_from_id(alignment_id: str) -> int:
return int(alignment_id.split("-")[-1])


def parse_gff_file(file_paths: List[str]) -> dict[str, List[GFFRecord]]:
def parse_gff_files(file_paths: List[str]) -> dict[str, List[GFFRecord]]:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the name to reflect the function takes a list of files

Copy link

@bsgiles73 bsgiles73 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My description in the ticket really made it look like this was an issue with ncbi_parse_genomic_gff.py, but it is not. Sorry for the confusion. But now I have a question. If we have extra transcripts in the GFF files, that are missing from SeqRepo, how many places might break? You definitely found one exonset-to-seqinfo. If the extra transcripts are in the GFF. That means they should be in the intermediate file. Which means they get into the databaes, via the uta load-exonset step? Or are they skipped because the transcripts are missing from the transcript table?

@@ -49,6 +49,8 @@ if __name__ == "__main__":
ac_re = re.compile("[NX][CGMPR]_")

opts = parse_args(sys.argv[1:])
input_dir = os.path.dirname(opts.FILES[0])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an output directory to the supported arguments. So we can have this file directed to the directory the user chooses?

@sptaylor sptaylor marked this pull request as draft April 2, 2024 20:51
@sptaylor sptaylor changed the title feat(IPVC-2276): skip accessions from gff that are not in seqrepo, log to file feat(IPVC-2276): skip accessions from gff that are not in the tx_info file, log missing acs to file Apr 4, 2024
@sptaylor sptaylor marked this pull request as ready for review April 4, 2024 22:53
Copy link

@bsgiles73 bsgiles73 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM



def filter_exonset(exonset_file, transcript_ids, missing_ids_file):
with open_file(exonset_file) as es_f, open(missing_ids_file, 'w') as missing_f:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@sptaylor sptaylor merged commit 49e08e9 into main Apr 5, 2024
1 check passed
@sptaylor sptaylor deleted the IPVC-2276-filter-accessions branch April 5, 2024 19:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants