-
Notifications
You must be signed in to change notification settings - Fork 1
/
workflow.txt
46 lines (37 loc) · 3.64 KB
/
workflow.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
**************************************************************************
* Workflow for functional profiling of metagenomic sample with Carnelian *
**************************************************************************
A. Training Carnelian with gold standard: (Needs to be done only once)
carnelian train -k 8 --num_hash <number-of-hash-functions-be-used> -l 30 -c 5 data/EC-2192-DB <model_dir>
Note: We recommend using 2 or 4 hash functions. Even with one hash function, the results are reasonably good.
B. Functional binning:
B1. Preprocessing reads: (Needs to be done for every sample/individual. Other third-party tools can also be used for pre-processing.)
Paired-end:
i) Quality filtering and adapter trimming:
java -jar trimmomatic-0.36.jar PE <input_forward_fastq> <input_reverse_fastq> <output_paired_forward_fastq> <output_unpaired_forward_fastq> <output_paired_reverse_fastq> <output_unpaired_reverse_fastq> ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:90
ii) Deconvolving:
perl deconseq.pl -f <output_paired_forward_fasta_from_trimming> -dbs hsref -i 90 -c 90 -out_dir <forward_dir>
perl deconseq.pl -f <output_paired_reverse_fasta_from_trimming> -dbs hsref -i 90 -c 90 -out_dir <reverse_dir>;
iii) Concatenate reads into a multi-fasta file:
cat <forward_dir/*_clean.fa> <reverse_dir/*_clean.fa> > <sample.fasta>
Single-end:
i) Quality filtering and adapter trimming:
java -jar trimmomatic-0.35.jar SE <input_fastq> <output_fastq> ILLUMINACLIP:adapters/TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:90
ii) Deconvolving:
perl deconseq.pl -f <output_fasta_from_trimming> -dbs hsref -i 90 -c 90 -out_dir <output_dir>
iii) Rename the clean fasta file:
mv <output_dir/*_clean.fa> <sample.fasta>
B2. Annotating one sample:
carnelian annotate -k 8 -n <number_of_cpus> <path_to_directory_containing_input_fasta> <trained_model_directory> <path_to_sample_output_directory> <path_to_FragGeneScan>
Note on paired-end inputs:
--------------------------
If you have paired-end inputs, you can process either treat the reads individually. In that case, you need to convert the reads
to fasta format and put them in a single file before running Carnelian on it. If you want to use the paired-end relationships,
please use the merge_pairs.py script under utils folder to generate a merged fasta file. The script will construct a longer
read if the forward and reverse reads (reverse complemented) have overlaps. If they don't overlap, the script will link them by
filling the inner distance with 'N's. The inner distance can be specified by the user (default:50).
C. Functional profiling and abundance analysis: (Needs to be performed once all the samples have been labeled.)
i) Move the label files from individual sample directories to one single directory <labels_dir>. The files should be named after the same sample identifiers that appear in the <sampleinfo_file> file.
ii) Run the following command for creating the raw and effective counts matrices:
carnelian abundance <labels_dir> <abundance_matrix_dir> <sampleinfo_file> data/EC-2192-DB/ec_info.tsv;
Note: The sampleinfo_file is a tab-separated file with three columns: sample_id, group, fraglen. sample_id column will contain the unique sample identifiers, group will indicate the classes the samples belong to. For example, if it is a case-control study, this column will contain case/control, or if it is a population study, it could contain the ancestry of the samples. The fraglen column will contain the average read length of sequences in amino acids for each sample (say the sequences have an average length of 120 bp, then the entry for it will be 40).