Script needed for assembly evaluation #36

Open
jorvis opened this issue Jan 29, 2016 · 1 comment

Comments

@jorvis
Owner

jorvis commented Jan 29, 2016

We need a script that simulates fragmented sequences from more-complete input sequences. This is perhaps best illustrated with a current use case.

We are aligning unsheared, paired-end reads to transcriptome assemblies to find real evidence supporting each assembled transcript, and possibly to group transcripts further. We expect overlapping transcripts like these to be assembled:

5'---------------------3'
               5'----------------------------------3'

But paired-end grouping might also be able to pull these together, even inserting Ns given a known library insert size, if read mate pairs span the gap between them:

5'---------------------3'
                                          5'----------------------------------3'
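To make the N-insertion idea concrete, here is a minimal sketch of how a gap might be sized from a single spanning mate pair. It assumes "insert size" means the full fragment length measured from the outer ends of both reads, and that the two transcripts are in the same orientation. The function names and coordinate conventions are hypothetical, chosen only for illustration.

```python
def estimate_gap(insert_size, left_len, left_read_start, right_mate_end):
    """Estimate the gap (in bp) between two transcripts from one mate pair.

    Assumptions (hypothetical conventions for this sketch):
      insert_size      full fragment length, outer end to outer end
      left_len         length of the left (5') transcript
      left_read_start  0-based start of the forward read on the left transcript
      right_mate_end   0-based exclusive end of its mate on the right transcript
    """
    # Portion of the fragment that lies on the two transcripts themselves.
    covered = (left_len - left_read_start) + right_mate_end
    # Whatever remains of the fragment must fall in the gap.
    return insert_size - covered


def join_with_gap(left_seq, right_seq, gap):
    """Join two sequences with a run of Ns representing the estimated gap."""
    if gap < 1:
        raise ValueError("gap must be at least 1 bp; the sequences may overlap")
    return left_seq + "N" * gap + right_seq
```

In practice one would probably combine the estimates from many spanning pairs (for example, by taking the median) rather than trust a single pair.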

The proposed script would let me take a known set of transcripts and artificially fragment them, generating some fragments that overlap and others that are separated from one another. This could be controlled with user-configurable options such as:

--min_overlap_distance=-200
--max_overlap_distance=100
--fragmentation_factor=6

Note the negative value for --min_overlap_distance, which allows for the second case above, where sequence fragments do not overlap. With these options, the script would transform a FASTA file of 1000 sequences into one of around 6000 sequences, with neighboring fragments overlapping by up to 100 bp or lying as far as 200 bp apart, measured on their parent sequence.

Data should be appended to the header descriptions of the output sequences to indicate each fragment's source sequence and coordinates.
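A rough sketch of how the fragmentation could work, not a finished implementation. It assumes that the fragmentation factor is the number of fragments made per input sequence, that cut points are chosen uniformly at random, and that each neighboring pair draws its overlap distance uniformly from the min/max range (negative values leave a gap). The function names and the `source=`/`start=`/`end=` header fields are hypothetical; coordinates are written 1-based and inclusive.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fragment_coords(length, factor, min_overlap, max_overlap, rng):
    """Return 0-based, end-exclusive (start, end) pairs for each fragment."""
    if length == 0:
        return []
    k = min(factor, length)  # cannot make more fragments than bases
    if k < 2:
        return [(0, length)]
    # k - 1 distinct random cut points split the parent into k pieces.
    cuts = sorted(rng.sample(range(1, length), k - 1))
    bounds = [0] + cuts + [length]
    coords = [(0, bounds[1])]
    for i in range(1, k):
        cut, end = bounds[i], bounds[i + 1]
        overlap = rng.randint(min_overlap, max_overlap)
        # Positive overlap starts the fragment before the cut (overlapping
        # its left neighbor); negative overlap starts it after (a gap).
        # Clamp so the fragment is non-empty and starts stay in order.
        lowest = coords[-1][0] + 1
        start = min(max(cut - overlap, lowest), end - 1)
        coords.append((start, end))
    return coords


def fragment_records(records, factor, min_overlap, max_overlap, rng):
    """Yield (header, sequence) for every fragment of every input record."""
    for header, seq in records:
        seq_id, _, desc = header.partition(" ")
        coords = fragment_coords(len(seq), factor, min_overlap, max_overlap, rng)
        for n, (start, end) in enumerate(coords, 1):
            # Append source and 1-based inclusive coordinates to the header.
            parts = [f"{seq_id}_frag{n}", desc,
                     f"source={seq_id}", f"start={start + 1}", f"end={end}"]
            yield " ".join(p for p in parts if p), seq[start:end]


if __name__ == "__main__":
    demo = [">tx1 example transcript", "ACGT" * 300]
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for header, seq in fragment_records(read_fasta(demo), 6, -200, 100, rng):
        print(">" + header, len(seq))
```

Because of the clamping, a drawn overlap or gap can be shortened when a piece is too small to accommodate it, so short parent sequences will not always honor the requested distances exactly.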

@jorvis
Owner Author

jorvis commented Jan 29, 2016

As a side note, we should evaluate this against what the following publication does: http://www.ncbi.nlm.nih.gov/pubmed/22962361
