Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for circular genomes? #25

Closed
teojcryan opened this issue Sep 19, 2024 · 5 comments · Fixed by #31 or #34
Closed

Support for circular genomes? #25

teojcryan opened this issue Sep 19, 2024 · 5 comments · Fixed by #31 or #34
Assignees
Labels
enhancement New feature or request

Comments

@teojcryan
Copy link
Contributor

I'm using MeSS to simulate reads from circular genomes and want to ensure that the reads span the arbitrary breakpoint so that the resulting assembly retains as much of its circularity. I noticed that another tool, readSimulator, has a feature for handling circular genomes, and I was wondering if MeSS has something similar or if it might be in the works?

@teojcryan teojcryan changed the title Support for Circular Genomes? Support for circular genomes? Sep 19, 2024
@farchaab farchaab self-assigned this Sep 19, 2024
@farchaab farchaab added the enhancement New feature or request label Sep 19, 2024
@farchaab
Copy link
Collaborator

Hello @teojcryan, thank you for using MeSS !

The current MeSS version does not handle circular genomes yet !

We plan to implement readSimulator's solution to shuffle genome breakpoints in a future update.

This is in the works and we will let you know as soon as possible.

@farchaab
Copy link
Collaborator

farchaab commented Oct 17, 2024

Hello @teojcryan, we merged #31 to support circular genomes !

We added a new --rotate parameter to randomly shuffle genome start points as described by readSimulator .

Using test.tar.gz we simulated reads for a single circular phage genome, with the commands below:

mess simulate -i test.tsv --fasta fastas -o mess_out/no-rotate --sdm apptainer --threads 10
mess simulate -i test.tsv --fasta fastas -o mess_out/rotate --sdm apptainer --threads 10 --rotate 10

Assemblies were done with unicycler:

unicycler -1 mess_out/no-rotate/fastq/sample1_R1.fq.gz -2 mess_out/no-rotate/fastq/sample1_R2.fq.gz -o unicycler/no-rotate --threads 10
unicycler -1 mess_out/rotate/fastq/sample1_R1.fq.gz -2 mess_out/rotate/fastq/sample1_R2.fq.gz -o unicycler/rotate --threads 10

Assembly graphs with Bandage:

  • --rotate 1 (default, linear)
  • --rotate 10 (circular)

@teojcryan
Copy link
Contributor Author

Thank you so much—this is incredibly appreciated! The added functionality will be very helpful, especially for bacterial communities.

I was thinking, would it be possible to take this a step further by controlling the rotation at a more granular level? For example, for more complex communities with linear and circular genomes, it could be interesting to allow the option of providing a rotate parameter for each row in an input file. Something like this:

sample taxon reads rotate
sample1 487 174840 1
sample1 727 90679 10
sample1 729 13129 10
sample2 28132 147863 1
sample2 199 147545 1
sample2 729 131300 10

Just a thought for future iterations in case this seems like something others might find useful too!

@teojcryan
Copy link
Contributor Author

I've managed to get rotate to work by reproducing your test example, but ran into an error while generating synthetic data using the example input table.

[Fri Oct 18 16:19:16 2024]
localrule rotate_contigs:
    input: mess_test_rotate/processing/split/GCF_001298465.1_NZ_CP012541.1.fna, mess_test_rotate/processing/cov.tsv
    output: mess_test_rotate/processing/rotate/GCF_001298465.1_NZ_CP012541.1_2.fna
    log: mess_test_rotate/logs/seqkit/restart/GCF_001298465.1_NZ_CP012541.1_2.log
    jobid: 270
    reason: Missing output files: mess_test_rotate/processing/rotate/GCF_001298465.1_NZ_CP012541.1_2.fna
    wildcards: fasta=GCF_001298465.1, contig=NZ_CP012541.1, n=2
    resources: tmpdir=/tmp, mem_mb=2000, mem_mib=1908, mem=2000MB, time=00:00:10

path/tools/MeSS/mess/workflow/rules/preflight/functions.smk:131: PerformanceWarning: indexing past lexsort depth may impact performance.
  

            seqkit restart -i fasta            contig         n
GCF_001298465.1  NZ_CP012541.1  2    415443
Name: random_start, dtype: int64 mess_test_rotate/processing/split/GCF_001298465.1_NZ_CP012541.1.fna |             seqkit seq -i |             seqkit replace -p .+ -r NZ_CP012541.1_2 > mess_test_rotate/processing/rotate/GCF_001298465.1_NZ_CP012541.1_2.fna
            
Activating conda environment: .snakemake/conda/73d4154dff2dca37454187e4d514b264_
Error: invalid argument "fasta" for "-i, --new-start" flag: strconv.ParseInt: parsing "fasta": invalid syntax
Usage:
  seqkit restart [flags] 

Aliases:
  restart, rotate

Flags:
  -h, --help            help for restart
  -i, --new-start int   new start position (1-base, supporting negative value counting from the end)
                        (default 1)

Global Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on
                                        which seqkit guesses the sequence type (0 for whole seq)
                                        (default 10000)
      --compress-level int              compression level for gzip, zstd, xz and bzip2. type "seqkit -h"
                                        for the range and default value for each format (default -1)
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2|
                                        Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -X, --infile-list string              file of input files list (one file per line), if given, they are
                                        appended to files from cli arguments
  -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it
                                        automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. can also set with environment variable
                                        SEQKIT_THREADS) (default 4)

invalid argument "fasta" for "-i, --new-start" flag: strconv.ParseInt: parsing "fasta": invalid syntax
[Fri Oct 18 16:19:19 2024]
Error in rule rotate_contigs:
    jobid: 270
    input: mess_test_rotate/processing/split/GCF_001298465.1_NZ_CP012541.1.fna, mess_test_rotate/processing/cov.tsv
    output: mess_test_rotate/processing/rotate/GCF_001298465.1_NZ_CP012541.1_2.fna
    log: mess_test_rotate/logs/seqkit/restart/GCF_001298465.1_NZ_CP012541.1_2.log (check log file(s) for error details)
    conda-env: path/.snakemake/conda/73d4154dff2dca37454187e4d514b264_
    shell:
        
            seqkit restart -i fasta            contig         n
GCF_001298465.1  NZ_CP012541.1  2    415443
Name: random_start, dtype: int64 mess_test_rotate/processing/split/GCF_001298465.1_NZ_CP012541.1.fna |             seqkit seq -i |             seqkit replace -p .+ -r NZ_CP012541.1_2 > mess_test_rotate/processing/rotate/GCF_001298465.1_NZ_CP012541.1_2.fna
            
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Logfile mess_test_rotate/logs/seqkit/restart/GCF_001298465.1_NZ_CP012541.1_2.log not found.

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-18T161726.025648.snakemake.log
WorkflowError:
At least one job did not complete successfully.
[2024:10:18 16:19:20] ERROR: Snakemake failed

@farchaab farchaab reopened this Oct 19, 2024
@farchaab
Copy link
Collaborator

Hello @teojcryan, thanks for your valuable tests and suggestions !

I fixed the seqkit bug when using rotate on the test data on the dev branch

Additionally, I added the possiblity to set rotate values in the input table !

Using samples.tsv.gz , the command below ran without issues:

mess run -i samples.tsv --threads 10

Let me know if you encounter other bugs or have other suggestions !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants