This page lists all scripts used to analyze data for the plant microbiome project managed by the Reich lab at the Weizmann Institute of Science in Rehovot, Israel.
The scripts folder contains all the scripts used for the different analysis steps. Below, each script's purpose is listed together with an example of how to run it properly; please read carefully before starting.
All relevant scripts are in the scripts/db_enrichment_all_steps folder.
Use the: 01_prepare_raw_files.sh script
This script filters the long-read fastq file and keeps only reads that match the primers listed in the primer file you provide. It also reverse complements reads as needed so that, in the end, all reads have the same directionality.
Steps to perform:
- Create a primer file
- Change the paths to the primer file and to the R script called reads_per_region.R
Run this script from the root folder where all long-read files are found (see the sketch below).
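A minimal sketch of a run, assuming the primer-file path and the reads_per_region.R path are edited inside the script first (all file names and variable names below are hypothetical, not taken from the script itself):

# Edit the hard-coded paths inside 01_prepare_raw_files.sh before running, e.g. something like:
#   PRIMER_FILE=/path/to/my_primers.fasta              # illustrative variable name
#   READS_PER_REGION_SCRIPT=/path/to/reads_per_region.R
cd /path/to/long_read_runs                             # root folder with all long-read fastq files
bash /path/to/scripts/db_enrichment_all_steps/01_prepare_raw_files.sh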
Use the: enrich_db.sh script
Two files are required:
- Long-read file in fasta/fastq format (can be compressed)
- Database to enrich
Packages needed: cd-hit version 4.8.1. If working on the Weizmann WEXAC cluster, just use:
module load cd-hit/4.8.1
To use the script:
./enrich_db.sh -I "fasta/fastq input file" -D "database to compare to in fasta format" -N "sample name" -c "OPTIONAL - clustering identity to use with cd-hit"
NOTE - the three arguments -I, -D and -N are mandatory; otherwise the script won't run
NOTE - make sure to allocate enough memory; the larger the database, the more memory is needed
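For example, a typical invocation might look like this (file names and the clustering identity value are hypothetical; -c is optional):

module load cd-hit/4.8.1
./enrich_db.sh -I long_reads.fastq.gz -D silva_db.fasta -N sample01 -c 0.97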
Outputs:
- First output is a fasta file containing non-redundant sequences that do not match the database
- Second output is a log.txt file containing all verbose output
Use the: 03_Edit_following_cdhit_2d.sh script
NOTE - Make sure to use the correct seeds file in the first grep argument; this should be the seeds output from the previous step, which contains the seeds to add to the database
This script changes the header names to running numbers (in this example from 2000000 onwards; if other numbers are needed, change them in the script). It also creates a table mapping each original header to the number it was changed to. In the last step it linearizes all sequences and creates the final fasta file to give to Noam, which will be added to the database for SMURF analysis.
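The renaming step amounts to something like the following sketch (file names are hypothetical and this is not the script itself; the real script also linearizes the sequences):

awk -v start=2000000 '
  /^>/ { new = start + n; n++
         print substr($0, 2) "\t" new > "header_map.tsv"   # old header -> new number
         print ">" new; next }
  { print }                                                # sequence lines unchanged
' seeds_to_add.fasta > seeds_renumbered.fasta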
The scripts for the following steps are found in the approximate_taxonomy folder:
- First, create a blast database from the original SILVA database (without any of our additions) using the script blast_db_build.sh. NOTE - make sure to cut the headers to only the ID, otherwise the blast database build command will output an error because the headers are too long
- Next, run the seeds that you want to add to the updated database against the SILVA database to find the best matching taxonomy (the script blast_run.sh)
- In the last step, run the arrange_blast.R script to create a table with the best matching taxon for each new header (see the sketch below)
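A rough sketch of these three steps, assuming BLAST+ is available (all file names are hypothetical; the actual commands live in blast_db_build.sh, blast_run.sh and arrange_blast.R):

# 1. Trim SILVA headers to the ID only (drop everything after the first space), then build the blast database
cut -d ' ' -f 1 silva_original.fasta > silva_ids_only.fasta
makeblastdb -in silva_ids_only.fasta -dbtype nucl -out silva_blast_db
# 2. Blast the new seeds against the SILVA database, keeping the best hit per seed
blastn -query new_seeds.fasta -db silva_blast_db -outfmt 6 -max_target_seqs 1 -out seeds_vs_silva.tsv
# 3. Summarize the best matching taxon per new header (arguments, if any, depend on the script)
Rscript arrange_blast.R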
Use the: UMI_pipeline.sh script
Five inputs are required:
- Forward primers to use for region demultiplexing (fasta format, see in sample data)
- Reverse primers to use for region demultiplexing (fasta format, see in sample data)
- Read1 fastq file (can be compressed)
- Read2 fastq file (can be compressed)
- Sample name
Packages needed: if working on the Weizmann WEXAC cluster, just use:
module load cutadapt/4.2-GCCcore-11.3.0
module load cd-hit/4.8.1
module load BBMap/38.90-GCC-10.2.0
module load seqtk/1.2
module load seqkit/2.3.1
module load R/3.5.1
To use the script:
./UMI_pipeline.sh "forward_primers" "reverse_primers" "read1_fastq" "read2_fastq" "sample_name"
NOTE - make sure to allocate enough memory and cores; the cd-hit step can take some time, and more cores make it much faster
Outputs:
- Two fastq files, R1 and R2, named ${sample_name}_L001_R1_001.fastq.gz and ${sample_name}_L001_R2_001.fastq.gz
- Log files, cutadapt_log.txt and clustering_df.txt, where you can see the reads per region in the raw data and in the final files
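For example, a full run on the WEXAC cluster might look like this (file names are hypothetical):

module load cutadapt/4.2-GCCcore-11.3.0 cd-hit/4.8.1 BBMap/38.90-GCC-10.2.0 seqtk/1.2 seqkit/2.3.1 R/3.5.1
./UMI_pipeline.sh fwd_primers.fasta rev_primers.fasta sample01_R1.fastq.gz sample01_R2.fastq.gz sample01
# Expected outputs: sample01_L001_R1_001.fastq.gz, sample01_L001_R2_001.fastq.gz, cutadapt_log.txt, clustering_df.txt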