Linux Commands Cheat Sheet

This cheat sheet is aimed particularly at working with and handling genomic data and files. Genomic data can be large and convoluted. Open-source bioinformatics programs sometimes require very particular file formats, or sometimes you simply have 600 sequence files to deal with >.<" Hope this is useful!

Files

File Extensions

You will come across MANY different file extensions in bioinformatics. File extensions are arbitrary (you can use anything), but they should be informative about the data inside them. E.g. I will often name a file with a single list inside "something.list" so that I know it's a list just by looking at the file name.

| Ext. | Name | Meaning |
| --- | --- | --- |
| .tar.gz | zipped tar archive | very large data or a program inside |
| .fa, .fasta | fasta file (generic) | contains sequence data with a sequence header like ">seq1"; at this stage, a fasta is assembled reads |
| .fna | fasta file (nucleotide) | contains nucleotide data (ATGC...) |
| .faa | fasta file (amino acid) | contains peptide or protein sequences (MSQL...) |
| .fastq | unprocessed reads file (nucleotide) | contains header, sequence and read quality info; each read is 4 lines of info |
| .tsv | tab-separated values | column1  column2  column3, separated by tabs |
| .csv | comma-separated values | column1,column2,column3 |
| .gbk | GenBank file | has gene/protein-specific annotation information; NCBI annotation files |
| .gff | general feature format | has gene/protein-specific annotation information per contig/chromosome, tab-separated |
| .sam, .bam | mapping files | sam is the human-readable mapping file, bam is the indexed binary; use samtools or bamtools to work with these (see the sketch below the table) |
| .sh | shell script | has commands to run consecutively using bash |
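For .sam/.bam files specifically, a minimal samtools sketch looks something like this (assuming samtools is installed; the file names are placeholders):

# convert a sam to bam, then sort and index it
$ samtools view -b alignments.sam > alignments.bam
$ samtools sort alignments.bam -o alignments.sorted.bam
$ samtools index alignments.sorted.bam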

Download from internet and unzip Files

Genomic data/databases and bioinformatics programs often come zipped in a tar archive (tar = "tape archive"). To extract:

#download database from NCBI ftp site - downloads to current dir
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.faa.tar.gz
file: all.faa.tar.gz

#to extract zipped tar files (.tar.gz)
$ tar -zvxf all.faa.tar.gz 	#-z unzip, -v verbose, -x extract, -f file

#extract to particular location
$ tar -zvxf all.faa.tar.gz -C ./Viral_genes/

#only zipped, not tar
$ gunzip File1.txt.gz

Find files

Listing files
#List the current directory
$ ls
#list a random directory
$ ls ./random/
#list the current directory in chronological order with ownership details
$ ls -lrt
-rw-rw-r-- 1 vmkhot vmkhot  12462377 Feb 16 04:00 viral-fullseq-trim.fa
#list ALL files (including hidden files that start with .)
$ ls -a
.  ..  .bash_history  .bashrc  .cpan  .gnupg  .profile  .ssh  .viminfo
#list one entry per line
$ ls -1
data
viral-fullseq-trim.fa
gtdbtk_test

Find and Execute

The "find" command can help you search recursively through directories and subdirectories for particular file names, extensions or patterns

The general format is:

find [options] path/to/start expression

The simplest type of find command:

#finds all files with the file extension ".fa" in the random directory and its subdirectories
$ find ./random -name "*.fa"

Other useful find commands/options

#iname makes the search case-insensitive
$ find ./random -iname "*.fa"

# -regex finding regex patterns
$ find -regex ".*\.fa"

# regex also lets you search multiple extensions at once; e.g. if you didn't know whether your file was called .fasta or .fa
$ find -regex ".*\.\(fa\|fasta\)"

# find FILES (-type f) with pattern* in the random directory, printing each file name followed by a \n line break
$ find ./random -name "pattern*.fa" -type f -printf "%p\n"

# find and copy all files to new location
find ./ -iname "slurm*.sh" -exec cp {} ./scripts/ \;

# find files listed in a txt file and move them to another directory
while IFS= read -r f; do echo "$f"; mv "$f".fa ./good_bins ;done < good_bins.list

File movement

These are basic file utilities

Copy
cp ./location/to/file1 ./new/location/file1
Move or rename
mv ./current/file1 ./destination/file1

Renaming a file is the same "mv" command, but the current and destination locations are the same directory.

E.g. To rename all ".fa" files to ".fna"

for f in *.fa; do mv -- "$f" "${f%.fa}.fna"; done

To split large directory into subdirectories. Change 300 to wanted number

i=0; for f in *; do d=dir_$(printf %03d $((i/300+1))); mkdir -p $d; mv "$f" $d; let i++; done
Remove

!!STOP!! Make sure you are absolutely sure about what you are deleting. Linux does not have a "recycle bin". Deleted files and directories are gone FOREVER.

#basic
$ rm file.txt

#Delete an empty directory
$ rm -d /path/to/directory

#PARTICULAR DANGER!!!!
#Delete a directory with files and other directories. This will wipe "random" and everything random contains recursively
$ rm -rf path/to/random
Move a file to your local desktop

There's an "scp" command that you can use. IMHO it's annoying to type in all the time.

I use MobaXterm on Windows, which has a terminal, text editor and SFTP connection (Secure File Transfer Protocol), where you can drag and drop files in and out to your desktop. Alternatively you can use WinSCP (also Windows, also SFTP).

For Mac users: CyberDuck

Look inside files

To peek
$ less file.txt

$ more file.txt

#To EXIT the less/more view
q

#head will print you FIRST 10 lines from file.txt
$ head file.txt
$ head -n 100 file.txt #prints first 100 lines instead; n = number

#tail will print you the LAST 10 lines from file.txt
$ tail file.txt
$ tail -n 100 file.txt
To edit

use nano or vim - I prefer vim but it's more annoying to use/learn for sure
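Both editors open a file straight from the terminal:

$ nano file.txt		# ctrl+O to save, ctrl+X to exit
$ vim file.txt		# press i to insert text, then Esc and :wq to save and quit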

You can also use WinSCP or MobaXterm which have in-built text editors.

Search Inside and Edit Files

Very often you can search inside and edit files right from the command line, without opening the file at all. This is very unintuitive for Windows/Mac users who are used to seeing the changes as they edit, like in Word.

You might be wondering why one would ever do this. Well, imagine a sequence file with 1000 sequences where you need to change every header starting with "bin03". This would be painful to do manually, but can be done easily and in seconds with a sed one-liner. You are likely to encounter these kinds of scenarios all the time!
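For example, a one-liner for that exact scenario could look something like this (the "bin03" prefix, the new prefix and the file names are just placeholders):

# change every fasta header starting with ">bin03" to start with ">Sample1_bin03" instead
$ sed 's/^>bin03/>Sample1_bin03/' seqs.fa > seqs_renamed.fa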

Grep, awk and sed are the champions of file editing from the command line. They each have lots of options; I've only included here what I use commonly.

Regular Expressions (regex) are a way to describe a pattern. Worth learning for quick changes. A similar regex syntax is used across multiple scripting languages like python and perl as well. Google for more info and how to use these. You can test out your regex patterns here: Regex101
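For instance, a quick sketch of a regex in action with grep (the file name is a placeholder):

# print fasta headers that start with ">bin" followed by one or more digits, e.g. ">bin03"
$ grep -E "^>bin[0-9]+" file.fa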

Grep - Pattern find

You can print lines from a file if you know a pattern you are looking for using "grep"

Grep is very useful, has tonnes of options (refer to the manual), and I use it basically every day. Grep is a pattern matcher, so it uses regex patterns. Useful grep one-liners:

#basic usage
$ grep ">" SeqFile.faa #will print all lines which have > in them (e.g. fasta headers)

#prints all lines with > in them + 2 trailing lines after (fasta header + 2 sequence lines)
$ grep -A 2 ">" SeqFile.faa 
$ grep -B 2    # 2 lines before pattern
$ grep -C 2    # 2 lines before and after pattern

$ grep -v "xxx" #print lines NOT matching xxx

$ grep -i #case insensitive

$ grep -c #print number of lines matching

$ grep -n #print line numbers

$ grep -w #match whole words only

#match multiple patterns
$ grep -A1 "VC37\|VC38\|VC7\|VC36" genes.faa > top4_genes.faa

#My favourite
$ grep -Ff patterns.txt file.txt #this grabs the patterns from a file with a list of patterns and searches for them inside file.txt

#you can combine options
#this will look for headers specified in "headers.list" in an amino acid fasta file "file.faa" and print you the matching headers and 1 sequence line after it
$ grep -A1 -Ff headers.list file.faa

To search large zipped files without opening them!

$ zgrep   #zgrep can be used with all above commands
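For example (the zipped fasta name below is a placeholder):

# count fasta headers inside a gzipped file without unzipping it
$ zgrep -c ">" SeqFile.faa.gz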

Awk - Pattern find

Awk is effectively a programming language. It's used to search for patterns (line by line) and perform some action on them. It's mega useful for fasta files as well as tsv/csv types.

Basic usage:

This will search for "pattern" in your file and print the entire line "$0" from file.txt

awk '/pattern/ {print $0}' file.txt
CSV/TSV

For the "for loops", see the section on loops.

# This will split your file by the field separator "-F" (whatever you choose) and print column 1
#tab-separated
$ awk -F'\t' '{print $1}' file.tsv
# comma separated
$ awk -F',' '{print $1}' file.csv

# if statements. Prints full line(row) where column 1 >= 100 in a tab-separated file to output.tsv
$ awk -F'\t' '{if($1 >= 100) print $0}' file.tsv > output.tsv

# count highest number of columns
$ awk '{print NF}' file | sort -nu | tail -n 1

# AWK can be used to filter tsv files quite effectively. e.g. outputs from blast
# filter column3 (% id) >= 95 AND column11 (evalue) <= 1E-5 and print the full row
$ awk -F'\t' '{if($3 >= 95 && $11 <= 1E-5) print $0}' blast.tsv > filteredblast.tsv

# print rows of data.tsv whose column 2 does NOT match any string in filter.txt
$ awk -F "\t" 'FNR==NR {hash[$0]; next} !($2 in hash)' filter.txt data.tsv > output.tsv

# in rows where column 1 contains "VC", replace every "_" in column 1 with "|", then print all rows
awk 'BEGIN{FS=OFS="\t"} $1~/VC/ {gsub(/_/, "|", $1)} 1' temp.out

# filter by column 11 >= 0.8, print the file name as the first column, and cat all files together
$ for f in *deduped.tblout; do file=$(basename $f .tblout); awk -v a=${file} '{if ($11 >= 0.8) print a "\t" $0}' < ${file}.tblout >> cat_file.tblout ; done
Fasta

Useful one-liners taken from the internet, so I'm not going to explain these in detail.

#SPLIT MULTILINE FASTA INTO INDIVIDUAL FILES
$ awk -F '|' '/^>/ {F=sprintf("%s.fasta",$2); print > F;next;} {print >> F;}' < cyanoWGS_CRISPRs.fasta

#MULTILINE FASTA TO SINGLELINE FASTA
$ awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < input.faa > output.singleline.faa

#SPLIT GIANT FASTA INTO PIECES OF 1000 SEQUENCES (ONLY FOR SINGLELINE)
$ awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1000==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < sequences.fa

#UPDATE FASTA HEADERS WITH FILE NAME
$ for f in *.fasta; do file=$(basename $f .fasta); awk -v a=${file} '/^>/{print ">" a "." ++i ; next}{print}' < ${file}.fasta > ${file}.new.fasta; done

Sed - Search and Replace

"Stream Editor"

I think of sed primarily as a "search and replace" utility. There are many file changes one can make with sed; they require learning regex, and sed is very fast on larger files (much faster than grep imho).

Commonly used sed one-liners:

#Basic. the "g" at the end means "global", i.e. replace every instance in the file
sed 's/find/replace/g' input.txt > output.txt

#inplace file editing (!!!DANGER!!!, make sure your command is perfect before using "-i")
sed -i 's/find/replace/g' input.txt

#replace the entire first line (here with "1620")
sed -i '1s/.*/1620/'

#replace last "_" (underscore) in each line with xx
sed 's/\(.*\)_/\1xx/'

#print lines between two patterns
sed -n '/>pattern1/,/>pattern2/p' file.fasta

# print lines between two patterns, excluding the last pattern
sed -n '/>contig_xx/,/>/p' contigs.fasta | head -n -1

# sed if line starts with ">VC"
sed '/^>VC/s/search_string/replace_string/'

# delete line matching pattern (prints the remainder)
sed '/pattern/d'

# delete from matching pattern(#) to end of line, incl pattern (e.g. shorten fasta headers)
sed -i 's/#.*$//' file.fna

# delete line matching pattern (exact match with \b) + the line after
# e.g. singleline fasta - it will match the header and delete both header and sequence
sed '/k141_57368\b/,+1 d' file.fa
#OR
sed -e '/k141_57368\b/{N;d;}' file.fa
#multiple
sed '/k141_57368\b\|k141_88513\b/,+1 d' file.fa

# add filenames to fasta headers
for f in *.fa; do sed -i "s/^>/>${f}_/" "$f"; done

Other File Utilities

perl

To replace newline+tab with just a tab

 perl -0777pe 's/\n\t/\t/g' xaa_corrected.tsv > temp.tsv

Concatenate

To join files by rows. E.g. concatenate a bunch of fastas into 1 giant fasta

$ cat file1.fa file2.fa file3.fa > allfiles.fa

wc -l

To count the total number of lines

$ wc -l file.tsv

#multiple files
$ wc -l *.tsv
  80 file1.tsv
 800 file2.tsv
 880 total

Combine with other commands

#count number of hits with scores above 1000
$ awk '{if ($12 >= 1000) print $0}' blastoutput.tsv | wc -l 

Sort

Sort will help you sort a list, .tsv or .csv by a particular column. Sort is incredibly useful and probably one of my most used commands. I typically use sort to get a quick idea of my data, e.g. blast outputs.

#basic and writes to output.txt
$ sort file.txt > output.txt
Apples
Beaches
Cats

#reverse order
$ sort -r file.txt
Cats
Beaches
Apples

#data is numerical
$ sort -n file.txt
10
20
30

#by 2nd column number
$ sort -k 2n file.txt
Apples	100
Cats	200
Beaches	300

#You can also use sort to first sort then deduplicate your rows
$ sort -u file.txt  

Sort the top blast hits by bitscore (12th column in outfmt 6).

#this reverse sorts by 12th column (bitscore) of a blast output file and prints top 10 rows 
$ sort -rk 12n blastoutput.tsv | head -n 10

Or I use it in a pipe from awk to filter data, e.g. if I want to keep all bitscores >= 1000 and have them reverse sorted for me:

$ awk '{if ($12 >= 1000) print $0}' blastoutput.tsv | sort -rk 12n > filteredblastout.tsv

Uniq

Sort is often used with uniq to count or remove duplicates like so:

$ sort file.txt | uniq

The above command is the same as "sort -u", but uniq has more options too. It's almost always used together with sort because uniq only compares adjacent lines, so it doesn't work properly on unsorted data.

Uniq is a specialty utility that I don't use very much; I prefer "sort -u".
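A quick illustration of why uniq needs sorted input (it only collapses adjacent duplicate lines):

$ printf "a\nb\na\n" | uniq			# prints a, b, a - the two "a" lines are not adjacent
$ printf "a\nb\na\n" | sort | uniq	# prints a, b - sorting makes the duplicates adjacent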

#count unique lines
$ sort file.txt | uniq -c

#count duplicates
$ sort file.txt | uniq -d #prints first instance
$ sort file.txt | uniq -D #prints all instances

#print only lines that occur exactly once (drops every line that has a duplicate)
$ sort file.txt | uniq -u

-i : ignore case

Cut

Cut is useful for extracting columns from tab-separated (or otherwise delimited) files. Cut requires options to be useful.

$ cut file.tsv	#this produces an error

# extract by bytes
$ cut -b 1-10 file.tsv	#bytes 1 to 10 ("1,10" would mean only bytes 1 and 10)

# extract by character
$ cut -c 1-10 file.tsv	#characters 1 to 10

#extract by field (-f) and delimited (-d)
$ cut -d ' ' -f 2	#delimit file by space and print 2nd field

#cut in action with other commands to get info from headers
input: >contig1_gene1_annotation
$ grep ">" file.fa | cut -d '_' -f 1	
output: >contig1
$ grep ">" file.fa | cut -d '_' -f 1,2,3 --output-delimiter='	'
output: >contig1	gene1	annotation

Diff

Stands for "difference".

Diff tells you which lines have to be changed for the files to become identical. Useful for comparing seemingly identical fasta files or blast outputs.

Diff outputs 3 potential messages: a (add), c (change), d (delete).

$ diff File1.tsv File2.tsv
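For example, if line 3 differed between the two files, the output would look something like this (the line contents are made up for illustration):

3c3
< geneA	100
---
> geneA	150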

I don't use this command regularly, I find it easier to compare files by counting rows instead.

Loops

For loop

Sometimes you have to run the same command on many files and you don't want to type it out 100x. You can use a "for loop" to run something on all your files.

You would put the following code into a shell script file (e.g. do_something.sh) and run it via "bash do_something.sh". Example to adapt:

for fn in ./*.fa					# iterate through all .fa files in current directory
do							
    echo $fn							# print fasta file name
    newname=$(basename $fn .fa)			# assign name w/o the extension to variable "newname" 
    echo $newname						# print name without the extension
    sample="${newname:0:-6}" 			# "sample" = a substring of "newname": keep everything
										# up to 6 characters from the end (strips "_bin11")
    echo $sample
done

output:
./Sample1_bin11.fa	# $fn
Sample1_bin11		# $newname
Sample1			# $sample

Then run by doing:

$ bash do_something.sh

The same loop can also be written as a one-liner

for fn in ./*.fa; do echo $fn; newname=$(basename $fn .fa); echo $newname; done

An actual example to run a mapping command using a for loop

for fn in ../01_qc/qc_files/CAT*_R1_001.qc.fastq;
do
    base="${fn:0:-16}"
    newname=$(basename $fn .qc.fastq)
    sample="${newname:4:-7}"
    bbmap.sh ref= in=${base}_R1_001.qc.fastq in2=${base}_R2_001.qc.fastq out=./${sample}_bbmap.sam covstats=./${sample}_bbmap_covstats.txt scafstats=./${sample}_bbmap_scafstats.txt threads=20 minid=0.95 ambiguous=toss
done

While loop

A slightly less useful pattern: the while loop, to iterate over the contents of a file line by line.

while IFS= read -r line; do echo "$line"; done < file.txt

while IFS= read -r line; do echo "$line"; find ./ -iname ${line}_MSA* -exec mv {} ./files_to_redo \; ; done < files_to_redo.txt