feat(IPVC-2276): skip accessions from gff that are not in the tx_info file, log missing acs to file #18

sptaylor · 2024-04-02T15:38:40Z

The exonsets file, which is derived from the GFF files, contains transcript accessions that are not present in the transcript info file/SeqRepo, and this causes downstream issues. This PR adds a step that filters these missing transcripts out of the exonsets file.
Demo:

+ sbin/filter_exonset_transcripts.py --tx-info /workdir/loading/gbff.txinfo.gz --exonsets /workdir/loading/gff.exonsets.gz --missing-ids /workdir/loading/filtered_tx_acs.txt
+ tee /workdir/logs/filter_exonset_transcripts.log
+ gzip -c
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_000853.3 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_003491.3 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_033319.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_033320.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_033321.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NR_156186.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284286.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284289.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284288.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001002837.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001284287.1 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 WARNING  [__main__] Exon set transcript NM_001350317.2 not found in txinfo file. Filtering out.
2024-04-04 20:42:22 INFO     [__main__] Filtered out exon sets for 12 transcripts: NR_033320.2,NM_001284286.1,NM_001350317.2,NR_033319.2,NR_156186.2,NM_000853.3,NM_001284287.1,NM_001284288.1,NM_001002837.2,NR_003491.3,NM_001284289.1,NR_033321.2

The filtered exonsets are printed to stdout, and the missing transcript accessions are written to the file given by --missing-ids.
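For readers who want the gist of the filtering step without opening the script, here is a minimal sketch. It is not the code from sbin/filter_exonset_transcripts.py. It assumes the exonsets file is tab-separated with a header row and a tx_ac column, and the function name filter_exonset_rows, the accessions, and the coordinates are all made up for illustration.

```python
import csv
import io
from typing import Iterable, List, Set, Tuple


def filter_exonset_rows(
    exonset_lines: Iterable[str], known_tx_acs: Set[str]
) -> Tuple[List[dict], List[str]]:
    """Split exon set rows into kept rows and missing transcript accessions.

    Hypothetical helper: assumes a tab-separated input with a `tx_ac` column.
    """
    kept: List[dict] = []
    missing: List[str] = []  # first-seen order, no duplicates
    seen_missing: Set[str] = set()
    for row in csv.DictReader(exonset_lines, delimiter="\t"):
        tx_ac = row["tx_ac"]
        if tx_ac in known_tx_acs:
            kept.append(row)
        elif tx_ac not in seen_missing:
            seen_missing.add(tx_ac)
            missing.append(tx_ac)
    return kept, missing


# Made-up example data; the accessions and coordinates are placeholders.
exonsets = io.StringIO(
    "tx_ac\talt_ac\tmethod\tstrand\texons_se_i\n"
    "TX_A.1\tCHR_1.1\tdemo\t1\t0,10;20,30\n"
    "TX_B.2\tCHR_1.1\tdemo\t-1\t5,15\n"
    "TX_B.2\tCHR_2.1\tdemo\t1\t7,17\n"
)
kept, missing = filter_exonset_rows(exonsets, known_tx_acs={"TX_A.1"})
```

In a real run the kept rows would go to stdout and the missing accessions to the --missing-ids file; the sketch returns both so the logic can be checked in isolation.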

sptaylor · 2024-04-02T15:40:53Z

sbin/ncbi_parse_genomic_gff.py

@@ -115,7 +115,7 @@ def _get_exon_number_from_id(alignment_id: str) -> int:
    return int(alignment_id.split("-")[-1])


-def parse_gff_file(file_paths: List[str]) -> dict[str, List[GFFRecord]]:
+def parse_gff_files(file_paths: List[str]) -> dict[str, List[GFFRecord]]:


Updated the name to reflect that the function takes a list of files.

bsgiles73

My description in the ticket really made it look like this was an issue with ncbi_parse_genomic_gff.py, but it is not. Sorry for the confusion. But now I have a question: if the GFF files contain extra transcripts that are missing from SeqRepo, how many places might break? You definitely found one, exonset-to-seqinfo. If the extra transcripts are in the GFF, they should also be in the intermediate file, which means they would get into the database via the uta load-exonset step. Or are they skipped because the transcripts are missing from the transcript table?

bsgiles73 · 2024-04-02T17:13:50Z

sbin/exonset-to-seqinfo

@@ -49,6 +49,8 @@ if __name__ == "__main__":
    ac_re = re.compile("[NX][CGMPR]_")

    opts = parse_args(sys.argv[1:])
+    input_dir = os.path.dirname(opts.FILES[0])


Can we add an output directory to the supported arguments, so that this file can be written to a directory the user chooses?
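One way this request could look, as a rough sketch rather than the actual exonset-to-seqinfo code: an optional output directory argument that falls back to the directory of the first input file, which matches the input_dir behavior in the diff above. The flag name --output-dir and the body of parse_args are assumptions; only the FILES attribute comes from the diff.

```python
import argparse
import os
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Hypothetical argument parser with an optional output directory."""
    ap = argparse.ArgumentParser(description="exonset-to-seqinfo (sketch)")
    ap.add_argument("FILES", nargs="+", help="input exonset files")
    ap.add_argument(
        "--output-dir",
        default=None,
        help="directory for the output file (default: directory of the first input)",
    )
    opts = ap.parse_args(argv)
    if opts.output_dir is None:
        # Fall back to the behavior in the diff: use the first input's directory.
        opts.output_dir = os.path.dirname(opts.FILES[0])
    return opts


# Placeholder paths, only used to show the two cases.
default_opts = parse_args(["/data/in/demo.exonsets.gz"])
explicit_opts = parse_args(["--output-dir", "/data/out", "/data/in/demo.exonsets.gz"])
```

Keeping the default tied to the first input file means existing invocations would behave the same as before.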

bsgiles73

LGTM

bsgiles73 · 2024-04-05T17:55:24Z

sbin/filter_exonset_transcripts.py

+
+
+def filter_exonset(exonset_file, transcript_ids, missing_ids_file):
+    with open_file(exonset_file) as es_f, open(missing_ids_file, 'w') as missing_f:
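The open_file helper used above is not shown in this thread. Since the pipeline inputs are .gz files, a plausible reading is that it opens gzip and plain-text files transparently. The sketch below is a guess at that behavior, not the helper from this PR, and the demo file name is a placeholder.

```python
import gzip
import os
import tempfile
from typing import IO


def open_file(path: str) -> IO[str]:
    """Assumed behavior: open a file as text, decompressing if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Round trip on a throwaway gzip file to show the helper in use.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "demo.exonsets.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as out:
        out.write("tx_ac\talt_ac\n")
    with open_file(gz_path) as fh:
        first_line = fh.readline()
```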


sptaylor added 3 commits April 2, 2024 08:21

feat(IPVC-2276): skip acs missing from seqrepo and save them to a file

4ecb9b4

feat(IPVC-2276): update fx name

dc48fa0

feat(IPVC-2276): update fx name

5b77ba6

sptaylor requested review from bsgiles73 and nvta1209 April 2, 2024 15:39

sptaylor marked this pull request as ready for review April 2, 2024 15:39

style(IPVC-2276): add empty new line

523f06c

bsgiles73 reviewed Apr 2, 2024

sptaylor marked this pull request as draft April 2, 2024 20:51

sptaylor added 6 commits April 2, 2024 14:14

Merge branch 'main' into IPVC-2276-filter-accessions

2eea2a0

Revert "feat(IPVC-2276): skip acs missing from seqrepo and save them to a file"

This reverts commit 4ecb9b4.

50ff903

refactor(IPVC-2276): pull out open_file into module

0e15485

feat(IPVC-2276): filter exonsets by transcript info file

18b0443

feat(IPVC-2276): add exonset filtering to build pipeline

9327609

style(IPVC-2276): add newline

348d942

sptaylor changed the title ~~feat(IPVC-2276): skip accessions from gff that are not in seqrepo, log to file~~ feat(IPVC-2276): skip accessions from gff that are not in the tx_info file, log missing acs to file Apr 4, 2024

sptaylor marked this pull request as ready for review April 4, 2024 22:53

bsgiles73 approved these changes Apr 5, 2024

sptaylor merged commit 49e08e9 into main Apr 5, 2024
1 check passed

sptaylor deleted the IPVC-2276-filter-accessions branch April 5, 2024 19:07
