Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mason sim #2

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
51 changes: 36 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,14 +6,14 @@ of origin. Users can invoke a read simulator on the simulated structures to gene

It takes as input a reference genome fasta and a yaml configuration file.

The latest version of ecSimulator is **0.6.0**.
The latest version of ecSimulator is **0.7.1**.

## Requirements and Installation
### Basic requirements:
ecSimulator requires python3 and the `numpy` and `intervaltree` python libraries.
### Basic requirements
ecSimulator requires python >=3 and the `numpy` and `intervaltree` python libraries.

```shell
pip3 install numpy intervaltree # or conda install intervaltree, etc.
conda install numpy intervaltree # or pip3 install numpy intervaltree
git clone https://github.com/jluebeck/ecSimulator.git
```

Expand All @@ -35,34 +35,53 @@ ecSimulator also requires the AmpliconArchitect data repo to be present, which c
A reference genome fasta file from which to extract genome sequences is also required.
These are available in the [AmpliconArchitect data repo](https://datasets.genepattern.org/?prefix=data/module_support_files/AmpliconArchitect/) for users who may already have that tool installed.

### Optional dependencies for read simulation:
### Optional dependencies for read simulation
ecSimulator enables users to simulate reads from the resulting structures, using Nanopore or Illumina read simulators.

ecSimulator allows users to simulate nanopore reads from the resulting structures, using [NanoSim](https://github.com/bcgsc/NanoSim).
Nanosim can be installed by performing
#### Nanopore: NanoSim
[NanoSim](https://github.com/bcgsc/NanoSim) can be installed by running
```shell
conda install -c bioconda nanosim
```

We provide a pre-generated NanoSim read simulation model generated using input data from a Promethion with version 10.4.1 flow cell and Guppy as the basecaller (version [email protected]).
No additional configuration of the simulator is required by the user.

### Computing requirements:

#### Illumina: Mason2
[Mason2](https://github.com/seqan/seqan/tree/main/apps/mason2) can be installed by running
```shell
conda install -c bioconda mason
```



### Computing requirements
The memory an CPU requirements of ecSimulator are minimal, but please try to have more than 4Gb RAM available. The simulator should finish within 30s-1min for typical simulation runs.
Simulating thousands of structures may take slightly longer.
Simulating thousands of structures may take slightly longer. Read simulation is more resource intensive, and timing varies by the tool used and depth of coverage given.

## Usage
### Focal amplification simulation:
### Focal amplification simulation
An example command to generate an ecDNA of episomal origin with at least two non-overlapping genomic segments
with default SV frequencies:
>`ecSimulator/src/ecSimulator.py --ref_name GRCh38 --ref_fasta /path/to/hg38.fa -o test`

Or with customized simulation parameters:
>`ecSimulator/src/ecSimulator.py --ref_name GRCh38 --ref_fasta /path/to/hg38.fa --config_file your_config.yaml -o test`

### Nanopore read simulation:
### Nanopore read simulation

To simulate Nanopore reads from the simulated focal amplification using NanoSim, you can do the following
>`ecSimulator/src/run_nanosim.py -t [threads] -o test_amplicon_1_nanosim --amplicon_fasta test_amplicon1.fasta --amplicon_coverage 1`

Check with `-h` to see other arguments for setting background regions, and adjusting other read simulation parameters.

### Illumina read simulation
To simulate Illumina reads from the simulated focal amplification using Mason, you can do the following
>`ecSimulator/src/run_mason.py -t [threads] -o test_amplicon_1_mason --amplicon_fasta test_amplicon1.fasta --amplicon_coverage 1`

Check with `-h` to see other arguments for setting background regions, and adjusting other read simulation parameters.

To simulate Nanopore reads from the simulated focal amplification (using NanoSim), you can do the following
>`ecSimulator/src/run_nanosim.py -o test_amplicon_1_read_sim --amplicon_fasta test_amplicon1.fasta --amplicon_coverage 1`

## Options
ecSimulator takes the following command-line arguments:
Expand Down Expand Up @@ -120,8 +139,10 @@ This can be useful if you later want to create mixtures of heterogeneous but hig

**Users can use the `cycles_file_to_fasta.py` script to convert any intermediate (or generally, any) AA-formatted `_cycles.txt` file to a fasta.**

The Nanosim process will create two fastq files (aligned.fastq and unaligned.fastq) and a file describing the locations of the errors in the reads. A complete description of these files is available [here](https://github.com/bcgsc/NanoSim#2-simulation-stage-1).
### NanoSim outputs

The NanoSim process will create two fastq files (aligned.fastq and unaligned.fastq) and a file describing the locations of the errors in the reads. A complete description of these files is available [here](https://github.com/bcgsc/NanoSim#2-simulation-stage-1).

To combine the background and amplicon fastq files, users an simply do
To combine the background and amplicon fastq files, users can simply do
>`cat [sample]_amplicon_unaligned_reads.fastq [sample]_amplicon_aligned_reads.fastq
[sample]_background_unaligned_reads.fastq [sample]_background_aligned_reads.fastq`
2 changes: 1 addition & 1 deletion src/_version.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
__version__ = "0.7.0"
__version__ = "0.7.1"
__author__ = "Jens Luebeck (jluebeck [ at] ucsd.edu)"
145 changes: 145 additions & 0 deletions src/run_mason.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
#!/usr/bin/env python3
"""
Simulates short-read Illumina data from amplicon sequences using Mason.
"""

import argparse
from subprocess import call
import os
import sys

from utilities import compute_number_of_reads_to_simulate, get_fasta_length, pseudocircularize_fasta


def check_mason_path(mason_path=None):
if not os.path.isfile(mason_path):

raise Exception("Cannot find Mason. Specify an existing file path for "
" Mason.")

def run_mason(orig_fasta, mason_path, output_prefix, read_length, fragment_max_size, fragment_mean_size, fragment_size_std,
coverage, circ_or_linear, nthreads, seed=None):
"""Simulates short-read data with Mason.

Args:
fasta: Path to FASTA file containing amplicon.
mason_path: Path to Mason executable.
output_prefix: Prefix for file path to write out simulated read data.
read_length: Read length for simulated Illumina data (e.g., 150).
coverage: Target coverage for amplicon.
circ_or_linear: Whether to simulate reads from a circular or
linear amplicon. (Note used currently.)
nthreads: Number of threads to utilize.
seed: Random seed to utilize.

Returns:
None. Call Mason to simulate data.
"""

if circ_or_linear == 'linear':
fasta = orig_fasta

elif circ_or_linear == "circular":
# modify the fasta (concatenate to itself 2*c times)
fasta = pseudocircularize_fasta(orig_fasta, coverage)
# reduce coverage to 0.5*c to account for extension of fasta length by 2
coverage *= 0.5

else:
raise Exception("Invalid value for circ_or_linear.")

fasta_length = get_fasta_length(fasta)
num_reads = str(compute_number_of_reads_to_simulate(coverage, fasta_length, 2 * read_length))

R1 = f"{output_prefix}_R1.fastq.gz"
R2 = f"{output_prefix}_R2.fastq.gz"
cmd = ("{} -ir {} -n {} --illumina-read-length {} --fragment-max-size {} --fragment-mean-size {} --fragment-size-std-dev {} "
"--seq-technology illumina --num-threads {} -o {} -or {}").format(
mason_path, fasta, num_reads, read_length, fragment_max_size, fragment_mean_size, fragment_size_std, nthreads, R1, R2
)

if seed:
cmd+=(" --seed " + str(seed))

print(cmd)
call(cmd, shell=True)


if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Call Mason to simulate nanopore reads from amplicon genome "
"structures and background genome regions.")
parser.add_argument("-o", "--output_prefix", help="Prefix for output filenames.", required=True)
parser.add_argument("--amplicon_fasta", help="Fasta file of simulated amplicon structure.", required=True)
parser.add_argument("--amplicon_coverage", type=float, help="Coverage for amplicon region - scale up to simulate "
"higher amplicon CNs", required=True)

parser.add_argument("-l", "--read_length", help="Read length", type=int, default=150)
parser.add_argument("--background_fasta", help="Fasta file of background genome regions. Do not provide if you want"
" reads from amplicon only")
parser.add_argument("--background_coverage", type=float, help="Coverage for background region - scale up to "
"simulate lower amplicon CNs")
parser.add_argument("-t", "--num_threads", type=int, help="Number of threads to use in all stages (default=1).",
default=1)
parser.add_argument("--seed", help="Manually seed the pseudo-random number generator for the simulation.")
parser.add_argument("--mason_path", help="Custom path to Mason executable (mason_simulator).", default='mason_simulator')
parser.add_argument("--amp_is_linear", action='store_true', help="Set if the simulated amplicon is linear instead"
" of circular.")
parser.add_argument("--fragment_max_size", type=int, help="Largest fragment size to use when using uniform fragment size simulation."
" In range [1..inf]", default="600")
parser.add_argument("--fragment_mean_size", type=int, help="Mean fragment size to use when using uniform fragment size simulation."
" In range [1..inf]", default="300")
parser.add_argument("--fragment_size_std_dev", type=int, help="Fragment size standard deviation when using normally distributed fragment size simulation."
" In range [1..inf].", default=100)


# Note: this parameter is not used for now.
parser.add_argument("--amp_is_linear", action='store_true', help="Set if the simulated amplicon is linear instead"
" of circular.")
args = parser.parse_args()
if args.background_fasta and not args.background_coverage:
print("--background_coverage required if --background_fasta is provided!")
sys.exit(1)

circular_or_linear = "linear" if args.amp_is_linear else "circular"

# set the path of the mason read simulation tool
if not args.mason_path.endswith('mason_simulator') and args.mason_path != 'mason_simulator':
args.mason_path += "/mason_simulator"

check_mason_path(args.mason_path)

# simulate reads from the amplicon
print("Simulating reads from amplicon...")
amplicon_prefix = args.output_prefix + "_amplicon_reads"
background_prefix = args.output_prefix + "_background_reads"
run_mason(
args.amplicon_fasta,
args.mason_path,
amplicon_prefix,
args.read_length,
args.fragment_max_size,
args.fragment_mean_size,
args.fragment_size_std_dev,
args.amplicon_coverage,
circular_or_linear,
args.num_threads,
args.seed
)

# simulate reads from the background
if args.background_fasta:
print("\nSimulating reads from background...")
background_prefix = args.output_prefix + "_background_reads"
run_mason(
args.background_fasta,
args.mason_path,
background_prefix,
args.read_length,
args.fragment_max_size,
args.fragment_mean_size,
args.fragment_size_std_dev,
args.background_coverage,
"linear",
args.num_threads,
args.seed
)
23 changes: 3 additions & 20 deletions src/run_nanosim.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,8 @@
import os
import sys

from utilities import compute_number_of_reads_to_simulate, get_fasta_length

# this script will run the nanosim pipeline to simulate nanopore reads from ecDNA and background genome.
# it assumes nanosim is installed and its scripts are on the system path (true if installed by conda).

Expand All @@ -16,29 +18,10 @@ def extract_nanosim_model(model_path):
if not os.path.isdir(model_path):
print("Extracting model files...")
cmd = "tar -xzf {}".format(model_path + ".tar.gz")
print(cmd)
print(cmd)
call(cmd, shell=True)


def get_fasta_length(fasta):
tlen = 0
with open(fasta) as infile:
for line in infile:
if line.startswith(">"):
continue

tlen+=len(line.rstrip())

print("Fasta size is " + str(tlen) + "bp")
return tlen


def compute_number_of_reads_to_simulate(coverage, fasta_len, mean_rl):
nreads = int(ceil(coverage * fasta_len / mean_rl))
print("Number of reads to simulate for {}x coverage: {}".format(str(coverage), str(nreads)))
return nreads


def run_nanosim(fasta, model_pre, output_pre, coverage, circ_or_linear, read_length_tuple, nthreads, seed=None):
fasta_length = get_fasta_length(fasta)
num_reads = str(compute_number_of_reads_to_simulate(coverage, fasta_length, model_mean_read_length))
Expand Down
56 changes: 56 additions & 0 deletions src/utilities.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
"""
File for basic utilities.
"""
from math import ceil


def get_fasta_length(fasta):
tlen = 0
with open(fasta) as infile:
for line in infile:
if line.startswith(">"):
continue

tlen+=len(line.rstrip())

print("Fasta size is " + str(tlen) + "bp")
return tlen


def pseudocircularize_fasta(fasta, c):
# read a fasta and makes c copies of each entry concatenated with itself
# writes a new fasta and return the path of that file
base, ext = os.path.splitext(fasta)
output_fasta = f"{base}_circularized{ext}"

with open(fasta, 'r') as infile, open(output_fasta, 'w') as outfile:
seq_name = ""
seq_data = ""

for line in infile:
if line.startswith('>'):
# Write the previous entry if it exists
if seq_name and seq_data:
pseudocircularized_seq = seq_data * c
outfile.write(f"{seq_name}\n")
outfile.write(f"{pseudocircularized_seq}\n")

# Start a new entry
seq_name = line.strip()
seq_data = ""
else:
seq_data += line.strip()

# Write the last entry
if seq_name and seq_data:
pseudocircularized_seq = seq_data * c
outfile.write(f"{seq_name}\n")
outfile.write(f"{pseudocircularized_seq}\n")

return output_fasta


def compute_number_of_reads_to_simulate(coverage, fasta_len, mean_rl):
nreads = int(ceil(coverage * fasta_len / mean_rl))
print("Number of reads to simulate for {}x coverage: {}".format(str(coverage), str(nreads)))
return nreads