[Feature request] Add number of ambiguous characters in seqkit stats #490

Closed
apcamargo opened this issue Oct 30, 2024 · 7 comments

Comments

@apcamargo

The number of ambiguous characters is often a good metric for genome quality. As far as I know, seqkit stats has no way to report the count of a given character. I'm currently passing -G "Nn" to trick seqkit into treating N as a gap character, but it would be nice to have this as an additional (optional) column.

@shenwei356
Owner

Do ambiguous characters mean "N" for DNA/RNA and "N/X" for amino acids?

@apcamargo
Author

Yes. Sorry, the title is indeed confusing.

@shenwei356
Owner

OK, so it means we just need to count "N" for DNA/RNA, and "N"+"X" for proteins.

@apcamargo
Author

Yep!

@shenwei356
Owner

shenwei356 commented Oct 30, 2024

Added. It just counts N/n/X/x for any kind of sequence.
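
For readers who want to sanity-check that number outside seqkit, the same tally (N/n/X/x, regardless of sequence type) can be approximated with plain awk. This is only a minimal sketch: `count_ambiguous` is a hypothetical helper name, not a seqkit command, and it assumes ordinary FASTA input where every non-header line is sequence.

```shell
# count_ambiguous: hypothetical helper, not part of seqkit.
# Reads FASTA on stdin and prints the total number of N/n/X/x characters
# found in sequence lines (header lines starting with ">" are skipped).
count_ambiguous() {
    awk '
        /^>/ { next }                      # skip header lines
        { total += gsub(/[NnXx]/, "") }    # gsub returns the number of matches
        END { print total + 0 }            # "+ 0" prints 0 for empty input
    '
}

# Example: 2 + 3 ambiguous characters in s1, 2 in s2
printf '>s1\nACGTNN\nnnxA\n>s2\nMKXXL\n' | count_ambiguous
# prints 7
```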

@kakuk9

kakuk9 commented Nov 14, 2024

Hi Wei. Could you also add a feature for filtering nt/aa sequences in FASTA files by the number and/or proportion of ambiguous characters (i.e. N/X) in each sequence? Thanks :D

@shenwei356
Owner

@kakuk9

$ echo -ne ">s\nactgn\n" | seqkit fx2tab -B NX
s       actgn           20.00

$ echo -ne ">s\nactgn\n" | seqkit fx2tab -B NX | awk -F'\t' '$4 >= 20' | seqkit tab2fx 
>s
actgn
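
The pipeline above filters on the percentage column. If an absolute count is also needed, or seqkit is not at hand, the same idea can be sketched in plain awk. Everything here is an assumption for illustration: `filter_by_ambiguous` is a hypothetical helper, not a seqkit feature, and unlike the `$4 >= 20` demo it keeps records at or below the given thresholds, which is the more common quality filter.

```shell
# filter_by_ambiguous: hypothetical helper, not part of seqkit.
# Usage: filter_by_ambiguous MAX_COUNT MAX_PERCENT < in.fa
# Keeps records whose N/n/X/x count is <= MAX_COUNT and whose
# N/n/X/x percentage is <= MAX_PERCENT. Sequences are written on one line.
filter_by_ambiguous() {
    awk -v max_count="$1" -v max_pct="$2" '
        # Print the buffered record if it passes both thresholds.
        function emit(   n, len, pct, tmp) {
            if (header == "") return
            len = length(seq)
            tmp = seq
            n = gsub(/[NnXx]/, "", tmp)           # count ambiguous characters
            pct = (len > 0) ? 100 * n / len : 0
            if (n <= max_count + 0 && pct <= max_pct + 0)
                print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # join wrapped lines
        END { emit() }
    '
}

# Example: keep records with at most 1 ambiguous character and at most 25%
printf '>a\nACGTN\n>b\nACGTACGTAC\n>c\nNNNNA\nCGT\n' | filter_by_ambiguous 1 25
# keeps a (1 of 5, 20%) and b (0), drops c (4 of 8, 50%)
```

Passing a very large MAX_COUNT or a MAX_PERCENT of 100 effectively disables that threshold, so either criterion can be used alone.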


3 participants