Select reduced transcriptome from clusters #18
Comments
Hi @mpaya, The clustering methodology of CD-HIT is considerably different from that of RapClust. Specifically, in CD-HIT, selecting a single cluster member as a representative is often reasonable because the clusters are formed from sequences that are generally very similar. RapClust, however, aims to cluster together multiple transcript isoforms of the same gene, which can vary considerably in their length and sequence composition (e.g. through the inclusion or exclusion of alternatively-spliced exons). Hence, selecting a single representative sequence from the cluster isn't as straightforward. While it is true that the longest transcript is likely to be the one that contains much of the sequence in the cluster, it is not necessarily pairwise-similar to all cluster members. More generally, how you select a representative might depend on which type of analysis you hope to do. One approach to representative generation that is compatible with RapClust is the Lace method from the Oshlack group. It's probably worth taking a look at that paper if you're not already familiar with it, to see whether it suits your needs.
Hi @rob-p, For this current project, the analysis I was expecting to do was just a comparison of the results of CD-HIT and RapClust. After selecting a single representative per cluster to produce the reduced assemblies, the plan is to compare them using Transrate, Transdecoder and BUSCO results. So our concern was whether this naive representative selection on RapClust clusters, used to generate an artificial reduced assembly, would be acceptable or not. I wasn't familiar with Lace. Would you recommend using its output instead for this purpose? Thank you for your kind help
Hi,
I am comparing clustering results from CD-HIT and RapClust. One of the characteristics of CD-HIT is that it selects one representative transcript per cluster, while RapClust doesn't. Would it be reasonable to also select the longest transcript from each RapClust cluster as its representative, to generate assemblies with reduced redundancy?
Thank you
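For reference, the naive selection described above (keep the longest transcript of each cluster) could be sketched roughly as below. This is only an illustration, not part of RapClust or CD-HIT. It assumes the cluster file is whitespace-delimited with one `cluster_id transcript_id` pair per line, which should be checked against the actual RapClust output. The function names are hypothetical. Transcripts that do not appear in the cluster file are not handled here.

```python
from collections import defaultdict


def read_fasta(lines):
    """Parse FASTA lines into {transcript_id: sequence}.

    The transcript id is taken as the first whitespace-delimited
    token of the header line (an assumption about the headers).
    """
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def read_clusters(lines):
    """Parse 'cluster_id transcript_id' lines into {cluster_id: [ids]}.

    ASSUMPTION: two whitespace-separated columns per line; lines with
    fewer than two fields are skipped.
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        clusters[fields[0]].append(fields[1])
    return clusters


def longest_per_cluster(clusters, seqs):
    """Return {cluster_id: id of the longest member found in seqs}.

    Ties are resolved in favour of the member listed first.
    """
    reps = {}
    for cid, members in clusters.items():
        present = [m for m in members if m in seqs]
        if present:
            reps[cid] = max(present, key=lambda m: len(seqs[m]))
    return reps


if __name__ == "__main__":
    # Tiny made-up example, not real data.
    fasta = [">t1 isoform a", "ACGT", "AC", ">t2", "ACG", ">t3", "A"]
    clust = ["c1\tt1", "c1\tt2", "c2\tt3"]
    seqs = read_fasta(fasta)
    reps = longest_per_cluster(read_clusters(clust), seqs)
    for cid, tid in reps.items():
        print(cid, tid, len(seqs[tid]))
```

Note that this only picks by length. As pointed out in the reply, the longest isoform is not guaranteed to be similar to every other member of a RapClust cluster, so the resulting reduced assembly is not directly equivalent to CD-HIT representatives.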