AmpliconSuite · mattjones315 · Mar 4, 2024 · Mar 4, 2024 · Mar 4, 2024 · Apr 16, 2024
diff --git a/README.md b/README.md
@@ -6,14 +6,14 @@ of origin. Users can invoke a read simulator on the simulated structures to gene
 
 It takes as input a reference genome fasta and a yaml configuration file.
 
-The latest version of ecSimulator is **0.6.0**.
+The latest version of ecSimulator is **0.7.1**.
 
 ## Requirements and Installation
-### Basic requirements:
-ecSimulator requires python3 and the `numpy` and `intervaltree` python libraries.
+### Basic requirements
+ecSimulator requires python >=3 and the `numpy` and `intervaltree` python libraries.
 
 ```shell
-pip3 install numpy intervaltree  # or conda install intervaltree, etc.
+conda install numpy intervaltree  # or pip3 install numpy intervaltree
 git clone https://github.com/jluebeck/ecSimulator.git
 ```
 
@@ -35,34 +35,53 @@ ecSimulator also requires the AmpliconArchitect data repo to be present, which c
 A reference genome fasta file from which to extract genome sequences is also required.
 These are available in the [AmpliconArchitect data repo](https://datasets.genepattern.org/?prefix=data/module_support_files/AmpliconArchitect/) for users who may already have that tool installed.
 
-### Optional dependencies for read simulation:
+### Optional dependencies for read simulation
+ecSimulator enables users to simulate reads from the resulting structures, using Nanopore or Illumina read simulators.
 
-ecSimulator allows users to simulate nanopore reads from the resulting structures, using [NanoSim](https://github.com/bcgsc/NanoSim).
-Nanosim can be installed by performing
+#### Nanopore: NanoSim
+[NanoSim](https://github.com/bcgsc/NanoSim) can be installed by running
 ```shell
 conda install -c bioconda nanosim
 ```
 
 We provide a pre-generated NanoSim read simulation model generated using input data from a Promethion with version 10.4.1 flow cell and Guppy as the basecaller (version [email protected]).
 No additional configuration of the simulator is required by the user.
 
-### Computing requirements:
+
+#### Illumina: Mason2
+[Mason2](https://github.com/seqan/seqan/tree/main/apps/mason2) can be installed by running
+```shell
+conda install -c bioconda mason
+```
+
+
+
+### Computing requirements
 The memory an CPU requirements of ecSimulator are minimal, but please try to have more than 4Gb RAM available. The simulator should finish within 30s-1min for typical simulation runs.
-Simulating thousands of structures may take slightly longer.
+Simulating thousands of structures may take slightly longer. Read simulation is more resource intensive, and timing varies by the tool used and depth of coverage given.
 
 ## Usage
-### Focal amplification simulation:
+### Focal amplification simulation
 An example command to generate an ecDNA of episomal origin with at least two non-overlapping genomic segments
 with default SV frequencies:
 >`ecSimulator/src/ecSimulator.py --ref_name GRCh38 --ref_fasta /path/to/hg38.fa -o test`
 
 Or with customized simulation parameters:
 >`ecSimulator/src/ecSimulator.py --ref_name GRCh38 --ref_fasta /path/to/hg38.fa --config_file your_config.yaml -o test`
 
-### Nanopore read simulation:
+### Nanopore read simulation
+
+To simulate Nanopore reads from the simulated focal amplification using NanoSim, you can do the following
+>`ecSimulator/src/run_nanosim.py -t [threads] -o test_amplicon_1_nanosim --amplicon_fasta test_amplicon1.fasta --amplicon_coverage 1`
+
+Check with `-h` to see other arguments for setting background regions, and adjusting other read simulation parameters.
+
+### Illumina read simulation
+To simulate Illumina reads from the simulated focal amplification using Mason, you can do the following
+>`ecSimulator/src/run_mason.py -t [threads] -o test_amplicon_1_mason --amplicon_fasta test_amplicon1.fasta --amplicon_coverage 1`
+
+Check with `-h` to see other arguments for setting background regions, and adjusting other read simulation parameters.
 
-To simulate Nanopore reads from the simulated focal amplification (using NanoSim), you can do the following
->`ecSimulator/src/run_nanosim.py -o test_amplicon_1_read_sim --amplicon_fasta test_amplicon1.fasta --amplicon_coverage 1`
 
 ## Options
 ecSimulator takes the following command-line arguments:
@@ -120,8 +139,10 @@ This can be useful if you later want to create mixtures of heterogeneous but hig
 
 **Users can use the `cycles_file_to_fasta.py` script to convert any intermediate (or generally, any) AA-formatted `_cycles.txt` file to a fasta.**
 
-The Nanosim process will create two fastq files (aligned.fastq and unaligned.fastq) and a file describing the locations of the errors in the reads. A complete description of these files is available [here](https://github.com/bcgsc/NanoSim#2-simulation-stage-1).
+### NanoSim outputs
+
+The NanoSim process will create two fastq files (aligned.fastq and unaligned.fastq) and a file describing the locations of the errors in the reads. A complete description of these files is available [here](https://github.com/bcgsc/NanoSim#2-simulation-stage-1).
 
-To combine the background and amplicon fastq files, users an simply do
+To combine the background and amplicon fastq files, users can simply do
 >`cat [sample]_amplicon_unaligned_reads.fastq [sample]_amplicon_aligned_reads.fastq
 [sample]_background_unaligned_reads.fastq [sample]_background_aligned_reads.fastq`
diff --git a/src/_version.py b/src/_version.py
@@ -1,2 +1,2 @@
-__version__ = "0.7.0"
+__version__ = "0.7.1"
 __author__ = "Jens Luebeck (jluebeck [ at] ucsd.edu)"
diff --git a/src/run_mason.py b/src/run_mason.py
@@ -0,0 +1,145 @@
+#!/usr/bin/env python3
+"""
+Simulates short-read Illumina data from amplicon sequences using Mason.
+"""
+
+import argparse
+from subprocess import call
+import os
+import sys
+
+from utilities import compute_number_of_reads_to_simulate, get_fasta_length, pseudocircularize_fasta
+
+
+def check_mason_path(mason_path=None):
+    if not os.path.isfile(mason_path):
+
+        raise Exception("Cannot find Mason. Specify an existing file path for " 
+                        " Mason.")
+
+def run_mason(orig_fasta, mason_path, output_prefix, read_length, fragment_max_size, fragment_mean_size, fragment_size_std,
+              coverage, circ_or_linear, nthreads, seed=None):
+    """Simulates short-read data with Mason.
+
+    Args:
+        fasta: Path to FASTA file containing amplicon.
+        mason_path: Path to Mason executable.
+        output_prefix: Prefix for file path to write out simulated read data.
+        read_length: Read length for simulated Illumina data (e.g., 150).
+        coverage: Target coverage for amplicon.
+        circ_or_linear: Whether to simulate reads from a circular or
+            linear amplicon. (Note used currently.)
+        nthreads: Number of threads to utilize.
+        seed: Random seed to utilize.
+
+    Returns:
+        None. Call Mason to simulate data.
+    """
+
+    if circ_or_linear == 'linear':
+        fasta = orig_fasta
+
+    elif circ_or_linear == "circular":
+        # modify the fasta (concatenate to itself 2*c times)
+        fasta = pseudocircularize_fasta(orig_fasta, coverage)
+        # reduce coverage to 0.5*c to account for extension of fasta length by 2
+        coverage *= 0.5
+
+    else:
+        raise Exception("Invalid value for circ_or_linear.")
+
+    fasta_length = get_fasta_length(fasta)
+    num_reads = str(compute_number_of_reads_to_simulate(coverage, fasta_length, 2 * read_length))
+
+    R1 = f"{output_prefix}_R1.fastq.gz"
+    R2 = f"{output_prefix}_R2.fastq.gz"
+    cmd = ("{} -ir {} -n {} --illumina-read-length {} --fragment-max-size {} --fragment-mean-size {} --fragment-size-std-dev {} "
+           "--seq-technology illumina --num-threads {} -o {} -or {}").format(
+        mason_path, fasta, num_reads, read_length, fragment_max_size, fragment_mean_size, fragment_size_std, nthreads, R1, R2
+    )
+
+    if seed:
+        cmd+=(" --seed " + str(seed))
+
+    print(cmd)
+    call(cmd, shell=True)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="Call Mason to simulate nanopore reads from amplicon genome "
+                                                 "structures and background genome regions.")
+    parser.add_argument("-o", "--output_prefix", help="Prefix for output filenames.", required=True)
+    parser.add_argument("--amplicon_fasta", help="Fasta file of simulated amplicon structure.", required=True)
+    parser.add_argument("--amplicon_coverage", type=float, help="Coverage for amplicon region - scale up to simulate "
+                                                                "higher amplicon CNs", required=True)
+
+    parser.add_argument("-l", "--read_length", help="Read length", type=int, default=150)
+    parser.add_argument("--background_fasta", help="Fasta file of background genome regions. Do not provide if you want"
+                                                   " reads from amplicon only")
+    parser.add_argument("--background_coverage", type=float, help="Coverage for background region - scale up to "
+                                                                  "simulate lower amplicon CNs")
+    parser.add_argument("-t", "--num_threads", type=int, help="Number of threads to use in all stages (default=1).",
+                        default=1)
+    parser.add_argument("--seed", help="Manually seed the pseudo-random number generator for the simulation.")
+    parser.add_argument("--mason_path", help="Custom path to Mason executable (mason_simulator).", default='mason_simulator')
+    parser.add_argument("--amp_is_linear", action='store_true', help="Set if the simulated amplicon is linear instead"
+                                                                     " of circular.")
+    parser.add_argument("--fragment_max_size", type=int, help="Largest fragment size to use when using uniform fragment size simulation."
+                                                              " In range [1..inf]", default="600")
+    parser.add_argument("--fragment_mean_size", type=int, help="Mean fragment size to use when using uniform fragment size simulation."
+                                                                 " In range [1..inf]", default="300")
+    parser.add_argument("--fragment_size_std_dev", type=int, help="Fragment size standard deviation when using normally distributed fragment size simulation."
+          " In range [1..inf].", default=100)
+
+
+    # Note: this parameter is not used for now.
+    parser.add_argument("--amp_is_linear", action='store_true', help="Set if the simulated amplicon is linear instead"
+                                                                     " of circular.")
+    args = parser.parse_args()
+    if args.background_fasta and not args.background_coverage:
+        print("--background_coverage required if --background_fasta is provided!")
+        sys.exit(1)
+
+    circular_or_linear = "linear" if args.amp_is_linear else "circular"
+
+    # set the path of the mason read simulation tool
+    if not args.mason_path.endswith('mason_simulator') and args.mason_path != 'mason_simulator':
+        args.mason_path += "/mason_simulator"
+
+    check_mason_path(args.mason_path)
+
+    # simulate reads from the amplicon
+    print("Simulating reads from amplicon...")
+    amplicon_prefix = args.output_prefix + "_amplicon_reads"
+    background_prefix = args.output_prefix + "_background_reads"
+    run_mason(
+        args.amplicon_fasta,
+        args.mason_path,
+        amplicon_prefix,
+        args.read_length,
+        args.fragment_max_size,
+        args.fragment_mean_size,
+        args.fragment_size_std_dev,
+        args.amplicon_coverage,
+        circular_or_linear,
+        args.num_threads,
+        args.seed
+    )
+
+    # simulate reads from the background
+    if args.background_fasta:
+        print("\nSimulating reads from background...")
+        background_prefix = args.output_prefix + "_background_reads"
+        run_mason(
+            args.background_fasta,
+            args.mason_path,
+            background_prefix,
+            args.read_length,
+            args.fragment_max_size,
+            args.fragment_mean_size,
+            args.fragment_size_std_dev,
+            args.background_coverage,
+            "linear",
+            args.num_threads,
+            args.seed
+        )
diff --git a/src/run_nanosim.py b/src/run_nanosim.py
@@ -6,6 +6,8 @@
 import os
 import sys
 
+from utilities import compute_number_of_reads_to_simulate, get_fasta_length
+
 # this script will run the nanosim pipeline to simulate nanopore reads from ecDNA and background genome.
 # it assumes nanosim is installed and its scripts are on the system path (true if installed by conda).
 
@@ -16,29 +18,10 @@ def extract_nanosim_model(model_path):
     if not os.path.isdir(model_path):
         print("Extracting model files...")
         cmd = "tar -xzf {}".format(model_path + ".tar.gz")
-        print(cmd)
+        print(cmd) 
         call(cmd, shell=True)
 
 
-def get_fasta_length(fasta):
-    tlen = 0
-    with open(fasta) as infile:
-        for line in infile:
-            if line.startswith(">"):
-                continue
-
-            tlen+=len(line.rstrip())
-
-    print("Fasta size is " + str(tlen) + "bp")
-    return tlen
-
-
-def compute_number_of_reads_to_simulate(coverage, fasta_len, mean_rl):
-    nreads = int(ceil(coverage * fasta_len / mean_rl))
-    print("Number of reads to simulate for {}x coverage: {}".format(str(coverage), str(nreads)))
-    return nreads
-
-
 def run_nanosim(fasta, model_pre, output_pre, coverage, circ_or_linear, read_length_tuple, nthreads, seed=None):
     fasta_length = get_fasta_length(fasta)
     num_reads = str(compute_number_of_reads_to_simulate(coverage, fasta_length, model_mean_read_length))

diff --git a/src/utilities.py b/src/utilities.py
@@ -0,0 +1,56 @@
+"""
+File for basic utilities.
+"""
+from math import ceil
+
+
+def get_fasta_length(fasta):
+    tlen = 0
+    with open(fasta) as infile:
+        for line in infile:
+            if line.startswith(">"):
+                continue
+
+            tlen+=len(line.rstrip())
+
+    print("Fasta size is " + str(tlen) + "bp")
+    return tlen
+
+
+def pseudocircularize_fasta(fasta, c):
+    # read a fasta and makes c copies of each entry concatenated with itself
+    # writes a new fasta and return the path of that file
+    base, ext = os.path.splitext(fasta)
+    output_fasta = f"{base}_circularized{ext}"
+
+    with open(fasta, 'r') as infile, open(output_fasta, 'w') as outfile:
+        seq_name = ""
+        seq_data = ""
+
+        for line in infile:
+            if line.startswith('>'):
+                # Write the previous entry if it exists
+                if seq_name and seq_data:
+                    pseudocircularized_seq = seq_data * c
+                    outfile.write(f"{seq_name}\n")
+                    outfile.write(f"{pseudocircularized_seq}\n")
+
+                # Start a new entry
+                seq_name = line.strip()
+                seq_data = ""
+            else:
+                seq_data += line.strip()
+
+        # Write the last entry
+        if seq_name and seq_data:
+            pseudocircularized_seq = seq_data * c
+            outfile.write(f"{seq_name}\n")
+            outfile.write(f"{pseudocircularized_seq}\n")
+
+    return output_fasta
+
+
+def compute_number_of_reads_to_simulate(coverage, fasta_len, mean_rl):
+    nreads = int(ceil(coverage * fasta_len / mean_rl))
+    print("Number of reads to simulate for {}x coverage: {}".format(str(coverage), str(nreads)))
+    return nreads