Set the Z value dynamically according to the database used #103

Open
carlosribas opened this issue Jan 9, 2020 · 5 comments

@carlosribas
Contributor

If someone searches only in miRBase, the Z value should be miRBase-specific rather than based on the full dataset.

@carlosribas carlosribas self-assigned this Jan 9, 2020
@carlosribas
Contributor Author

Hi @blakesweeney. Just for the record, I added the esl-seqstat command to rnacentral-import-pipeline. The idea is to put its output file somewhere I can download it and parse the results.
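
For illustration, here is a minimal sketch of the parsing step described above. It assumes the esl-seqstat summary contains a line of the form `Total # residues: N`, and that the Z value is expressed in megabases, as nhmmer's `-Z` option expects. The function names and the sample text are hypothetical and are not taken from rnacentral-import-pipeline.

```python
import re


def total_residues(seqstat_output: str) -> int:
    """Return the residue count from an esl-seqstat text summary.

    Assumes the summary has a line like 'Total # residues:    12345'.
    """
    match = re.search(r"^Total # residues:\s+(\d+)", seqstat_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Total # residues' line found")
    return int(match.group(1))


def z_in_megabases(residues: int) -> float:
    """Convert a residue count to megabases (assumed unit for Z)."""
    return residues / 1_000_000


# Made-up sample summary, for illustration only.
sample = """Format:              FASTA
Alphabet type:       RNA
Number of sequences: 3
Total # residues:    2500000
Smallest:            20
Largest:             1500000
Average length:      833333.3
"""

if __name__ == "__main__":
    print(z_in_megabases(total_residues(sample)))
```

Running one such summary per database would give a per-database Z that the search service could look up when a query is restricted to that database.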

@carlosribas
Contributor Author

Hey @blakesweeney! There is a problem running the esl-seqstat command on the pdbe file:

$ esl-seqstat pdbe-0.fasta
Parse failed (sequence file pdbe-0.fasta):
Line 6316: illegal character F

The same F character also appears on lines 7466 and 12603. Any suggestions on how to solve this without editing the file manually?

@blakesweeney
Member

Without looking at those sequences, I'm betting they are tRNAs and the F character is the amino acid attached to them. There are likely other cases with different characters as well. The easiest thing to do would be to exclude those sequences from search, but I'm not sure that is a good idea. Another option is to strip those characters from the sequences, which has its own possible issues. I'd lean toward a very crude modification that strips anything other than ACGU from the end of tRNA sequences only, but that is something @AntonPetrov would need to weigh in on.
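
A rough sketch of the "crude modification" option, assuming sequences are plain RNA strings and that only a trailing run of non-ACGU characters should be removed. The function name and the example sequence are made up, and deciding which records are tRNAs is left to the caller.

```python
import re

# Matches a run of characters other than A, C, G, U at the very end.
TRAILING_NON_ACGU = re.compile(r"[^ACGU]+$", re.IGNORECASE)


def strip_trailing_non_acgu(sequence: str) -> str:
    """Remove trailing characters that are not A, C, G or U.

    Characters inside the sequence are left untouched, so a sequence
    with an internal unexpected character would still fail to parse.
    """
    return TRAILING_NON_ACGU.sub("", sequence)


if __name__ == "__main__":
    # Hypothetical tRNA-like sequence ending in an amino acid letter.
    print(strip_trailing_non_acgu("GCGGAUUUACCAF"))
```

Note that this would also strip trailing ambiguity codes such as N, which is one of the possible issues with this approach.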

@AntonPetrov
Member

This is not a new problem: in previous releases we generated a special fasta file for the old search (the _excluded file contained all the exceptional sequences): http://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/13.0/sequences/.internal/

Is it possible to continue excluding some sequences from sequence search as before?

@blakesweeney
Member

Sure, we can exclude them like we do currently. I'll add that filtering step to this export as well.
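
As a sketch of what such a filtering step could look like (not the pipeline's actual implementation), the following splits FASTA records into a searchable set and an excluded set. The allowed alphabet here is an assumption: upper-case IUPAC nucleotide codes plus T. The function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]

# Assumed alphabet: A, C, G, U, T and IUPAC ambiguity codes.
ALLOWED = set("ACGUTRYSWKMBDHVN")


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_parsable(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records into (kept, excluded) by alphabet check."""
    kept, excluded = [], []
    for header, seq in records:
        ok = bool(seq) and set(seq.upper()) <= ALLOWED
        (kept if ok else excluded).append((header, seq))
    return kept, excluded


if __name__ == "__main__":
    # Made-up records for illustration.
    example = [">seq1", "ACGUACGU", ">seq2", "GCGGAUUUACCAF"]
    kept, excluded = split_parsable(read_fasta(example))
    print(len(kept), len(excluded))
```

Writing the excluded records to their own file would mirror the `_excluded` file from previous releases mentioned above.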

blakesweeney added a commit to RNAcentral/rnacentral-import-pipeline that referenced this issue Feb 7, 2020
This is for RNAcentral/rnacentral-sequence-search#103. We should only
have parsable sequences in the sequence search dataset. This should
select only the sequences that nhmmer can work with.