feat(IPVC-2276): skip accessions from gff that are not in the tx_info file, log missing acs to file #18
@@ -115,7 +115,7 @@ def _get_exon_number_from_id(alignment_id: str) -> int:
     return int(alignment_id.split("-")[-1])
 
 
-def parse_gff_file(file_paths: List[str]) -> dict[str, List[GFFRecord]]:
+def parse_gff_files(file_paths: List[str]) -> dict[str, List[GFFRecord]]:
Updated the name to reflect that the function takes a list of files.
My description in the ticket really made it look like this was an issue with ncbi_parse_genomic_gff.py, but it is not. Sorry for the confusion. But now I have a question: if the GFF files contain extra transcripts that are missing from SeqRepo, how many places might break? You definitely found one, exonset-to-seqinfo. If the extra transcripts are in the GFF, they should also be in the intermediate file. Does that mean they get into the database via the uta load-exonset step? Or are they skipped because the transcripts are missing from the transcript table?
sbin/exonset-to-seqinfo
Outdated
@@ -49,6 +49,8 @@ if __name__ == "__main__":
     ac_re = re.compile("[NX][CGMPR]_")
 
     opts = parse_args(sys.argv[1:])
+    input_dir = os.path.dirname(opts.FILES[0])
Can we add an output directory to the supported arguments, so that this file can be written to a directory the user chooses?
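One way the suggestion above could look, as a minimal sketch rather than the script's actual argument parser: the `--output-dir` / `-o` option name and the function name `parse_args_sketch` are hypothetical, and the fallback to the directory of the first input file mirrors the `input_dir` line in the diff.

```python
import argparse
import os
from typing import List


def parse_args_sketch(argv: List[str]) -> argparse.Namespace:
    """Parse input files plus an optional output directory (hypothetical option)."""
    parser = argparse.ArgumentParser(description="exonset-to-seqinfo argument sketch")
    parser.add_argument("FILES", nargs="+", help="input exonset files")
    parser.add_argument(
        "--output-dir", "-o", default=None,
        help="directory for the missing-accessions file (assumed option name)",
    )
    opts = parser.parse_args(argv)
    if opts.output_dir is None:
        # Fall back to the directory of the first input file, or the
        # current directory when that file has no directory component.
        opts.output_dir = os.path.dirname(opts.FILES[0]) or "."
    return opts
```

With this shape, existing invocations keep writing next to the input file, and callers who want the file elsewhere pass the directory explicitly.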
…to a file" This reverts commit 4ecb9b4.
LGTM
def filter_exonset(exonset_file, transcript_ids, missing_ids_file):
    with open_file(exonset_file) as es_f, open(missing_ids_file, 'w') as missing_f:
Nice!
The exonset file, derived from GFF files, contains transcript accessions that are not present in the transcript info file/SeqRepo, and this causes downstream issues. We are adding a step that filters these missing transcripts out of the exonset file.
Demo:
The filtered exonsets are printed to stdout and the missing transcript IDs are written to a file.
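The filtering step described above can be sketched roughly as follows. This is not the PR's `filter_exonset` implementation: it assumes the exonset file is tab-separated with a header row and a `tx_ac` column holding the transcript accession, and it takes open file-like objects instead of paths so it can run standalone. The name `filter_exonset_sketch` is hypothetical.

```python
import csv
from typing import Set, TextIO


def filter_exonset_sketch(exonset_in: TextIO, transcript_ids: Set[str],
                          filtered_out: TextIO, missing_out: TextIO) -> int:
    """Copy rows whose tx_ac is known; log each unknown accession once.

    Returns the number of distinct missing accessions.
    """
    reader = csv.DictReader(exonset_in, delimiter="\t")
    if reader.fieldnames is None:
        # Empty input: nothing to filter.
        return 0
    writer = csv.DictWriter(filtered_out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    seen_missing = set()
    for row in reader:
        tx_ac = row["tx_ac"]  # assumed column name
        if tx_ac in transcript_ids:
            writer.writerow(row)
        elif tx_ac not in seen_missing:
            # Record each missing accession only once, even if it
            # appears in several exonset rows.
            seen_missing.add(tx_ac)
            missing_out.write(tx_ac + "\n")
    return len(seen_missing)
```

Passing `sys.stdout` as `filtered_out` and an opened file as `missing_out` matches the behavior described in the demo note: filtered rows go to stdout, missing accessions go to a file.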