We will learn some useful commands that will help you use Linux. We will also use an editor to create text files. Editors help you manipulate text files and create bash scripts that run commands in a more organized way.
There are a few more commands you need to learn to be comfortable in a Linux environment. In this session we will cover process management, pipes, the grep, awk and sed commands, which are useful for manipulating and searching text in files, and the vi editor.
When you run a command, it is called a process, and there are two ways you can run it:
- Foreground Processes
- Background Processes
By default, every process that you start runs in the foreground. For example, when you start the "sleep 300" command, it waits 300 seconds and you cannot do anything else in that terminal while it is running.
$ sleep 300
To send a running process to the background, start the command and press Ctrl+Z. This suspends (stops) the process. Then run the bg command to let it continue running in the background.
$ sleep 300
Press Ctrl+Z on your keyboard
$ bg
The bg command resumes the suspended process in the background.
Or you can start the process directly in the background by adding & to the end of the command.
$ sleep 300 &
This sends the process to the background without needing Ctrl+Z and the bg command.
To bring a process that is running in the background back to the foreground, use the fg command.
$ fg
- To stop a command that is running in the foreground, press "Ctrl+C".
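The background workflow above can also be used inside a script. The sketch below is a minimal illustration, assuming a bash-like shell; the variable names bg_pid and bg_status are made up for this example. It starts a short sleep in the background, lists it with jobs, and then waits for it to finish.

```shell
# Start a short job in the background; the shell continues immediately.
sleep 2 &
bg_pid=$!                 # $! holds the PID of the most recent background job
echo "Started sleep in the background with PID $bg_pid"

jobs                      # list the background jobs of this shell

# wait blocks until the given background job finishes and returns its exit status.
wait "$bg_pid"
bg_status=$?
echo "Background job finished with exit status $bg_status"
```

In an interactive terminal you would normally use fg instead of wait; wait is used here because it also works in scripts.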
To see running processes you can use the ps command, or ps -f to show them with full details.
$ ps
or
$ ps -f
UID PID PPID C STIME TTY TIME CMD
501 1384 1383 0 2:29AM ttys004 0:00.23 -bash
501 4373 1384 0 1:56PM ttys004 0:00.00 sleep 300
501 1395 1394 0 2:29AM ttys005 0:00.02 -bash
Column | Description |
---|---|
UID | User ID that this process belongs to (the person running it) |
PID | Process ID |
PPID | Parent process ID (the ID of the process that started it) |
C | CPU utilization of process |
STIME | Process start time |
TTY | Terminal type associated with the process |
TIME | CPU time taken by the process |
CMD | The command that started this process |
If the process is running in the foreground, you can press Ctrl+C to stop it. Otherwise, you can use the ps command to find the PID of that process and the kill command to stop it. For example, to kill the sleep 300 command above, first check its PID, which is reported above as 4373. The command below will kill it.
$ kill 4373
Terminated
If a process ignores a regular kill command, you can use kill -9 followed by the process ID as follows.
$ kill -9 4373
Killed
- Note that a PID (process ID) is different from a JOBID in the LSF scheduler.
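The kill steps above can be tried safely on a dummy process. This is a small sketch, assuming a bash-like shell; target_pid and kill_status are names invented for the example. Instead of reading the PID from ps, it takes it from $!, which holds the PID of the last background process.

```shell
# Start a long-running dummy process in the background and remember its PID.
sleep 300 &
target_pid=$!

# kill -0 sends no signal; it only checks that the process exists.
if kill -0 "$target_pid" 2>/dev/null; then
    echo "Process $target_pid is running"
fi

# A plain kill sends SIGTERM (signal 15), asking the process to stop.
kill "$target_pid"

# wait collects the exit status: 128 + 15 = 143 for a process ended by SIGTERM.
kill_status=0
wait "$target_pid" 2>/dev/null || kill_status=$?
echo "Exit status after kill: $kill_status"
```

If the status were 137 (128 + 9) instead, the process would have been ended by kill -9.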
The top command is a very useful tool for quickly showing processes sorted by various criteria.
It is an interactive diagnostic tool that updates frequently and shows information about physical and virtual memory, CPU usage, load averages, and your busy processes.
$ top
- You can exit from the "top" command by pressing "q" or "Ctrl+C"
You can connect two commands together so that the output from one program becomes the input of the next program. Two or more commands connected in this way form a pipe.
To make a pipe, put a vertical bar | on the command line between two commands.
If you want to count the number of lines in the output of the ls -l command, you can use the command below:
$ ls -l | wc -l
You can also count the number of lines in the fastq files that we copied into the directory below. These are paired-end files, so control_rep1.1.fq and control_rep1.2.fq must have the same number of lines. You can check that using the commands below:
$ cd ~/bootcamp/RNA-Seq/reads/
$ cat control_rep1.1.fq | wc -l
99152
$ cat control_rep1.2.fq | wc -l
99152
If they aren't equal, you might need to re-download the files.
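The comparison can also be done by the shell instead of by eye. The following is only a sketch: it builds two tiny made-up FASTQ files in a temporary directory (sample.1.fq and sample.2.fq are hypothetical names, not the bootcamp files) and compares their line counts.

```shell
# Create two tiny paired-end FASTQ files (made-up reads, 2 reads each).
tmpdir=$(mktemp -d)
printf '@read1/1\nACGTACGT\n+\niiiiiiii\n@read2/1\nTTGACCAA\n+\niiiiiiii\n' > "$tmpdir/sample.1.fq"
printf '@read1/2\nCCGTAGGT\n+\niiiiiiii\n@read2/2\nGGTTAACC\n+\niiiiiiii\n' > "$tmpdir/sample.2.fq"

# Count lines through a pipe; tr removes the padding some wc versions add.
lines1=$(cat "$tmpdir/sample.1.fq" | wc -l | tr -d ' ')
lines2=$(cat "$tmpdir/sample.2.fq" | wc -l | tr -d ' ')

# Each read takes 4 lines, so the read count is the line count divided by 4.
if [ "$lines1" -eq "$lines2" ]; then
    echo "OK: both files have $lines1 lines ($((lines1 / 4)) reads)"
else
    echo "Mismatch: $lines1 vs $lines2 lines"
fi
rm -r "$tmpdir"
```

To use it on the real data, replace the two temporary files with control_rep1.1.fq and control_rep1.2.fq.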
The grep command searches a file or files for lines that have a certain pattern.
$ grep pattern file(s)
The simplest use of grep is to look for a pattern consisting of a single word.
$ ls -l > ~/mylist.txt
$ grep Apr ~/mylist.txt
The same search could be run with a pipe, without writing the output of the "ls -l" command to a file.
$ ls -l | grep Apr
Option | Description |
---|---|
-v | Prints all lines that do not match pattern. |
-n | Prints the matched line and its line number. |
-l | Prints only the names of files with matching lines (letter "l") |
-c | Prints only the count of matching lines. |
-i | Case insensitive search. Matches either upper or lowercase. |
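To see what the options in the table do, here is a short sketch that runs some of them on a small made-up file. The file content and the variable names are invented for illustration only.

```shell
# Small made-up file to search.
tmpfile=$(mktemp)
printf 'Apr report\nmay notes\nAPR summary\nJun data\n' > "$tmpfile"

count_exact=$(grep -c Apr "$tmpfile")       # count lines containing "Apr" exactly
count_nocase=$(grep -i -c apr "$tmpfile")   # same count, ignoring upper/lower case
numbered=$(grep -n Jun "$tmpfile")          # matching line with its line number
inverted=$(grep -v -i apr "$tmpfile")       # lines that do NOT contain "apr"

echo "count with -c: $count_exact"
echo "count with -i -c: $count_nocase"
echo "match with -n: $numbered"
echo "lines kept by -v -i:"
echo "$inverted"
rm "$tmpfile"
```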
Let's play with grep on our sequence files.
If the output takes more than one page, you can pipe it to the more command.
$ cd ~/bootcamp/RNA-Seq/reads
$ grep AAA --color=always control_rep1.1.fq | more
- Tip: Press q to exit from "more".
You can also use the shorter --color option. Let's search for another sequence, CAGAGTTC, and put the results into a separate file.
$ cd ~/bootcamp/RNA-Seq/reads
$ grep CAGAGTTC --color control_rep1.1.fq > ~/bootcamp/seq.txt
To learn how many reads in the control_rep1.1.fq file include CAGAGTTC, you could use the "-c" option:
$ grep CAGAGTTC -c control_rep1.1.fq
or use a pipe with the wc command:
$ grep CAGAGTTC control_rep1.1.fq | wc -l
Not every command has a -c option, so the second method is worth learning because it also works with the output of other commands such as cat.
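Suppose you want to search your reads for a motif that starts with CAGAGT and ends with TGA, with anything in between. To represent any number of nucleotides between these two sequences, you can use .* (a dot matches any single character and the star repeats it zero or more times). So the pattern will be CAGAGT.*TGA. Quote the pattern so that the shell does not try to expand it as a file name wildcard.
$ grep 'CAGAGT.*TGA' --color control_rep1.1.fq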
Suppose you want to eliminate the reads that include the pattern above and, among the remaining reads, keep those that include another sequence, GCT...CGCG. Here "..." means any three nucleotides.
$ grep -v 'CAGAGT.*TGA' control_rep1.1.fq | grep 'GCT...CGCG' --color
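The two patterns are easier to follow on a handful of reads. The sketch below uses four made-up sequences (not taken from the bootcamp data) and stores the results in variables whose names are invented for this example.

```shell
# Four made-up reads: motif only, GCT...CGCG only, both, and neither.
tmpreads=$(mktemp)
printf 'CAGAGTAAATGA\nGCTAAACGCG\nCAGAGTTGAGCTAAACGCG\nACGTACGT\n' > "$tmpreads"

# .* matches any run of characters (including none) between CAGAGT and TGA.
motif_hits=$(grep 'CAGAGT.*TGA' "$tmpreads")

# Drop the motif reads first, then keep reads with GCT, any 3 characters, CGCG.
filtered=$(grep -v 'CAGAGT.*TGA' "$tmpreads" | grep 'GCT...CGCG')

echo "Reads with the motif:"
echo "$motif_hits"
echo "Reads without the motif but with GCT...CGCG:"
echo "$filtered"
rm "$tmpreads"
```

The third read contains both patterns, so the first grep -v removes it before the second grep ever sees it.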
To learn more about grep:
https://ryanstutorials.net/linuxtutorial/grep.php
Cheat sheet:
https://ryanstutorials.net/linuxtutorial/cheatsheetvi.php
awk is an interpreted language designed to process text files. awk is especially good for the files and command outputs that have columns. You can access any column of a file easily with awk.
Content | Awk variable |
---|---|
Entire Line | $0 |
Column 1 | $1 |
Column 2 | $2 |
. | |
. | |
Column i | $i |
Line Number | NR |
For example, if we only want to list the user name and the names of the directories, we can use these commands:
$ cd ~/bootcamp/RNA-Seq
$ ls -l
total 38
drwxrwxr-x 2 your_user your_group 51 Apr 7 23:44 mm10
drwxrwxr-x 2 your_user your_group 34 Apr 7 23:46 progs
drwxrwxr-x 2 your_user your_group 408 Apr 7 23:32 reads
Let's list column 3, which is the user name, and column 9, and put a space between these columns.
$ ls -l > mylist.txt
$ awk '{print $3" "$9}' mylist.txt
Or, without writing the list into a file, we could pipe the output of "ls -l" directly to the awk command.
$ ls -l | awk '{print $3" "$9}'
To put a space between the columns, I used a space in quotes, " ". If you want to put a tab character between the columns, you can use "\t". This is called an "escape character", and you might see it in other tutorials:
$ ls -l | awk '{print $3"\t"$9}'
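Another common way to get the same tab-separated output is awk's built-in OFS (output field separator) variable. The sketch below is one possible variant, not a replacement for the commands above; it uses two lines copied from the ls -l listing as fixed input so the result does not depend on your directory.

```shell
# Two lines copied from the ls -l listing above, used as fixed sample input.
listing='drwxrwxr-x 2 your_user your_group 51 Apr 7 23:44 mm10
drwxrwxr-x 2 your_user your_group 34 Apr 7 23:46 progs'

# With a comma, print joins the fields using OFS (here set to a tab).
user_and_name=$(printf '%s\n' "$listing" | awk -v OFS='\t' '{print $3, $9}')
echo "$user_and_name"
```

Setting OFS once is convenient when you print many columns, because you do not have to repeat "\t" between each pair.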
In fastq files, there are 4 lines for each read. The nucleotide sequence of each read is on the second line of every group of four lines. We can get them using a very simple modular arithmetic operation:
$ cd ~/bootcamp/RNA-Seq/reads/
$ head -n 8 control_rep1.1.fq
@HWI-ST333_0273_FC:7:1205:9843:15143#CCGAGT/1
CAAGGAAGCACATGACCGAGCAGAAAATACCCAGTTTGTC
+
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
@HWI-ST333_0273_FC:7:2116:8217:37625#CCGAGT/1
GTCAGCTTCCTGATGTTCTCCAGGCCACTGTACACTACAT
+
iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
In order to get the nucleotide sequences of the reads, we can use the awk command. NR here is the line number, and NR % 4 == 2 uses the modulus operator to select the second line of every four lines in a file.
$ awk '{if(NR % 4 == 2)print($0)}' control_rep1.1.fq | less
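The same selection can be written in a shorter form, because awk prints every line for which a bare condition is true. The sketch below shows this on a tiny made-up FASTQ file with two reads; the file and variable names are invented for the example.

```shell
# Tiny made-up FASTQ file with two reads (4 lines per read).
tmpfq=$(mktemp)
printf '@read1\nCAAGGAAG\n+\niiiiiiii\n@read2\nGTCAGCTT\n+\niiiiiiii\n' > "$tmpfq"

# A condition with no action block prints the matching lines.
sequences=$(awk 'NR % 4 == 2' "$tmpfq")

# Changing the remainder selects another line of each read (1 = header line).
headers=$(awk 'NR % 4 == 1' "$tmpfq")

echo "Sequences:"
echo "$sequences"
echo "Headers:"
echo "$headers"
rm "$tmpfq"
```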
Let's submit three dummy jobs
$ bsub -J job1 "sleep 300"
$ bsub -J job2 "sleep 300"
$ bsub -J job3 "sleep 300"
If you need more time than 300 seconds to finish this section, you can increase it. Now we can build the necessary command to kill those dummy jobs.
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
4358754 alper.k RUN interactiv hpcc04 r640c26 job1 Apr 14 17:36
4358755 alper.k RUN interactiv hpcc04 r640c26 job2 Apr 14 17:36
4358756 alper.k RUN interactiv hpcc04 r640c26 job3 Apr 14 17:36
To print the jobs whose names include "job", we can use the command below:
$ bjobs | grep job
4358754 alper.k RUN interactiv hpcc04 r640c26 job1 Apr 14 17:36
4358755 alper.k RUN interactiv hpcc04 r640c26 job2 Apr 14 17:36
4358756 alper.k RUN interactiv hpcc04 r640c26 job3 Apr 14 17:36
As you can see, it skipped the first line (header line). Now we can print the job IDs.
$ bjobs | grep job | awk '{print $1}'
4358754
4358755
4358756
To prepare a kill command for each job:
$ bjobs | grep job | awk '{print "bkill "$1}'
bkill 4358754
bkill 4358755
bkill 4358756
The command above only writes the bkill commands to the screen; it does not execute them. To execute them, we need to send the output to bash as below:
$ bjobs | grep job | awk '{print "bkill "$1}' | bash
Job <4358754> is being terminated
Job <4358755> is being terminated
Job <4358756> is being terminated
This is one of the many ways to do the same operation.
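As one such alternative, awk can do the filtering itself, which avoids matching the word "job" in other columns. The sketch below is a dry run on saved text rather than on a live cluster: the first four lines are copied from the bjobs output above, and the last line (job name align1) is a hypothetical extra job added to show the filter. It assumes the job name is in column 7, as in that output; the column layout of your own bjobs output may differ.

```shell
# Saved bjobs-style text; the align1 line is hypothetical, added for illustration.
bjobs_output='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
4358754 alper.k RUN interactiv hpcc04 r640c26 job1 Apr 14 17:36
4358755 alper.k RUN interactiv hpcc04 r640c26 job2 Apr 14 17:36
4358756 alper.k RUN interactiv hpcc04 r640c26 job3 Apr 14 17:36
4358757 alper.k RUN interactiv hpcc04 r640c26 align1 Apr 14 17:36'

# NR > 1 skips the header; $7 ~ /^job/ keeps rows whose job name starts with "job".
kill_commands=$(printf '%s\n' "$bjobs_output" |
    awk 'NR > 1 && $7 ~ /^job/ {print "bkill " $1}')

# Dry run: only show the commands; nothing is executed here.
echo "$kill_commands"
```

Printing the commands first and checking them before piping to bash is a useful habit, since a pattern that is too broad would kill the wrong jobs.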
You can check other tutorials and cheat sheets online;
https://www.grymoire.com/Unix/Awk.html
http://www.awklang.org/asset/krumnisCheatSheet.pdf
There are many ways to edit text files in Linux. vi is one of the most powerful editors you can use from the command line. To start the vi editor, use:
$ vi testfile
The above command will show the following screen:
|
~
~
~
~
~
~
~
~
~
~
~
~
"testfile" [New File]
A tilde represents an unused line.
- Command mode: When you enter vi, command mode is active. This mode enables you to perform administrative tasks such as saving files, executing commands, moving the cursor, copying (yanking), cutting and pasting lines or words, as well as finding and replacing. In this mode, whatever you type is interpreted as a command.
- Insert mode: If you press "i" in command mode, vi switches to insert mode. This mode enables you to insert text into the file. Everything that's typed in this mode is interpreted as input and placed in the file. Press the Esc key twice to return to command mode after you finish editing.
Hint: If you are not sure which mode you are in, press the Esc key twice; this will take you to the command mode.
To exit vi, first return to command mode by pressing the Esc key twice.
- If you haven't edited the file, type :q (press Enter to exit)
- If you've edited the text and want to quit without saving the changes, type :q! (press Enter to exit)
- To save and exit, type :wq (press Enter to exit)
- To only save the text, type :w (press Enter)
To edit the file, you need to be in insert mode. There are many ways to get into insert mode, but the easiest is to press i. To exit from insert mode, press ESC.
When you are in insert mode you will see -- INSERT -- at the bottom of the page:
|
~
~
~
~
~
~
~
~
~
-- INSERT --
You can start writing now. To save and exit, press ESC and type :wq
Let's write the command ls -l into a file with vi.
$ cd ~
$ vi myls
Go into insert mode, write ls -l into this file, then save and exit (press ESC and type :wq).
Let's make this file executable.
$ chmod +x myls
Now it is ready to execute.
$ ./myls
Or use the full path where the file is located. If this file was created in your home directory:
$ ~/myls
It will run the "ls -l" command.
- Tip: If your myls file is located somewhere else, you can learn the full path of the current directory by running the pwd command.
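For comparison, the same kind of script can be created without an editor. This is only a sketch under a few assumptions: it writes the file with printf into a temporary directory (script_dir is a made-up name) so it does not touch your home directory, and it adds a first line, #!/bin/sh, which tells the system which shell should run the script. The vi example above works without that line when started from bash, but including it is a common convention.

```shell
# Work in a temporary directory so nothing in your home directory is changed.
script_dir=$(mktemp -d)

# Write a two-line script: the interpreter line, then the command to run.
printf '#!/bin/sh\nls -l\n' > "$script_dir/myls"

# Make it executable and run it through its full path.
chmod +x "$script_dir/myls"
"$script_dir/myls"
script_status=$?
echo "myls finished with exit status $script_status"
```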
There are many commands to search, replace, delete the lines etc. in vi. To learn more about vi, you can use the tutorial below;
https://ryanstutorials.net/linuxtutorial/vi.php
- While using vi, make sure the file is a small text file, bash script or program. Don't open big files (e.g. fastq) with vi. It can open them, but it takes a lot of time and is not advised.
- Submit two dummy jobs to the long queue and three dummy jobs to the short queue. Then get a list of your jobs that have been submitted to the long queue only.
- Hint 1: Use sleep 300 to create a dummy job.
- Hint 2: Use bsub to submit a job. Remember that -q parameter is used to specify the queue.
- Hint 3: Recall that bjobs can be used to list your jobs in the cluster.
- Hint 4: Use what you have learned so far and put the pieces together.
- Print the "start codon" location of the Fgf21 gene using the annotation file (gtf) in the following format and send the command to one of us.
The annotation file location:
/share/data/umw_biocore/genome_data/mouse/mm10/ucsc.gtf
The output format:
chr7 "Fgf21"; start_codon 45615304
- Hint-1: Use grep and awk commands with pipe.
- Hint-2: You can find the "start codon" by searching for start_codon.