For the sake of clarity, the concepts mentioned in this tutorial have been simplified significantly. Linux is not Unix, but it is a Unix-like operating system. Linux system is derived from Unix. Similar commands you will learn in this tutorial may be applicable to other unix systems. Most of the command line tools and programs we explain have many more features that cannot be found here. Linux command line tools are very well documented and their manual pages are freely available. These notes are prepared for the users of UMMS Cluster. UMMS Cluster system has features that might not exist in standard Unix/Linux systems such as loading modules. A rigorous treatment of topics on Unix and Bash can be found in various books in several levels.
To understand the basics of Linux environment, how to use command line and to familiarize yourself with some standard tools and commands using a terminal.
- Introduction
- Class Materials
- Before you start
- Getting Started
- Basic Commands
- File transfer from your laptop to cluster using FileZilla
- Session1 Homework
You can follow the class materials below.
Session 1:
- Bootcamp Overview
- Introduction to Sequencing Analysis and File Formats
- Introduction to Sequencing Analysis and File Formats - Record
Linux is an operation system like Microsoft Windows or OSX. An operating system is a collection of software that manages system resources (such as memory, cpu(s), display and etc.) and provide applications a simpler interface to the system hardware.
A shell is a software that runs inside a terminal and interprets and executes user commands. One of the most popular shells being used today is called BASH (Bourne Again Shell). In this tutorial we will use bash. There are other shells like zsh or sh. They can also be used if you need for a specific functionality that is supported by that shell.
There might be slightly differences according the operating system you use such as Windows, Linux or MacOS.
In order to make an SSH connection to your account, you need to use program like PuTTY. Use the following info to configure your connection and click Open to start the SSH session.
- Host Name: hpc.umassmed.edu
- Port: 22
- Connection Type: SSH
Caution: * If you need to copy and paste any text/command, you should right-click (or use middle mouse button) to paste your text/command.
If you have entered the correct password, the prompt responds with a shell prompt:
[yourusername@hpcc04 ~]$
If you are a MacOS or Linux user, you can make an SSH connection by using Terminal, a console program included with the operating system.
$ denotes a prompt for the command line. It is not meant to be typed while running the commands in this tutorial. All Bash commands will appear in a gray box that you will run in your terminal. In this example below, you will only write "your_command" in the terminal an press enter. It will then execute the command in your terminal. (Please only run commands in the lines starts with "$" symbol.)
$ your_command
Let's start running some commands. First command will be "ssh" to connect to the UMASS cluster. If you're using windows, please use PuTTY or similar program to make SSH connection.
$ ssh [email protected]
Let’s verify that we are at the right place. In this case only run "hostname", not the output of the command (ghpcc06).
$ hostname
hpcc04
You need to see "hpcc04" or "hpcc03" in your terminal as an output. hpcc03 or hpcc04 are our "head nodes". You will learn what head or child nodes are in these tutorials later.
To print something on the screen. We use echo command.
$ echo Hello World
If you just run
$ Hello World
-bash: Hello: command not found
You will get this error "command not found". Because there is no any command called "Hello". As you can see, this command is interpreted by "bash" and when there is an error, bash reports the error.
- List a directory (folder)
- Locating Applications and Software Packages
- Modular system
- Print working directory
- Accessing Directories
- Exploring the Filesystem
- Creating a directory
- Copying Files
- Copying Directories
- Removing a file
- Removing a directory
- Moving a directory or a file
- Rename a directory or a file
- Files
- Redirecting Standard Output to a File
- Viewing text files
- Creating an empty file
- File and Directory Security
- File Permissions
- Wildcards
- Working with Compressed Files
- Downloading a file from web
Before we explore the commands used to manipulate the Linux environment, we should take a quick look at the structure of the environment itself.
You can think of a Linux file system as an upside-down tree. See the diagram below. In this diagram, the boxes are directories or folders. "script.sh" is a file.
At top you will see the following symbol; "/". It is called "root" directory. For example if user1 wants to access Documents directory, user will use "/home/user1/Documents" when it is needed. This directory structure is called "Full Path" or "Absolute Path" of a directory. Full path of script.sh in this example will be "/home/user1/Documents/script.sh".
Probably the most often used command in Linux is the ls command. It is used to list the contents of a directory. Unlike many other operating systems, Linux is case-sensitive. In other words, if you type "LS" instead of "ls", Linux will not recognize the command. This applies to directory and file names, like "home" and "script.sh", as well
Caution: Linux is case-sensitive, in order to avoid typing errors, you might copy and paste commands.
Lets run some commands below one at a time and list the content of the directories.
$ ls /
$ ls /pi
$ ls /pi/alper.kucukural-umw/umw_biocore/class/
For example, last command result will be
$ ls /pi/alper.kucukural-umw/umw_biocore/class/
class.html class.R data.tsv funcs.R ...
We will visit ls command later to learn about the sizes or permissions, creation times, or owner of the files or directories.
Blue: Directory Green: Executable or recognized data file Sky Blue: Symbolic link file Yellow with black background: Device Pink: Graphic image file Red: Archive file Red with black background: Broken link
Depending on the specific distribution, programs and software packages can be installed in various directories. In general, executable programs should live in the following directories
/bin
/usr/bin
/sbin
/usr/sbin
/opt.
One way to locate programs is to employ the which
utility. For example, to find out exactly where the diff program resides on the filesystem:
$ which diff
/usr/bin/diff
If which does not find the program, whereis is a good alternative because it looks for packages in a broader range of system directories:
$ whereis diff
diff: /usr/bin/diff /usr/share/man/man1/diff.1.gz
The software packages in our cluster are installed in a modular structure to support multiple versions of the software in a single system. For example, "STAR" is a splice aware genomic aligner. It is installed as a module and initially it is not loaded in the system.
$ which STAR
/usr/bin/which: no STAR in (...)
which command couldn't find the STAR in any of our paths.
Please write the command below and press "tab button" in your keyboard. It will list all available versions of star. (Tab can also be used for auto-completion of any other commands.)
To load the desired version just type the rest of the version and press enter.
$ module load star/2.7.10a
Run "which" command again to see the location of the command;
$ which STAR
To unload a module (eg. when you want to use another STAR version), you can use unload command;
$ module unload star/2.7.10a
To list all available modules in the cluster;
$ module avail
If a version of a software package you want to use is not installed in our cluster. You can always send an email to hpcc-admins to have it installed. We will use these and other modules, in other sessions.
pwd command will show what directory you are in.
To see the working directory,
$ pwd
The following commands are useful for directory navigation:
Command | Result |
---|---|
cd | Change to your home directory |
cd ~ | Change to your home directory (same like above) |
cd .. | Change to parent directory |
cd - | Change to previous directory |
cd / | Changes your current directory to the root (/) directory |
cd /pi/alper.kucukural-umw/umw_biocore/class | Changes your current directory to a specific directory |
~ : Home directory
. : current directory
.. : parent directory
If a file path begins with /, then it is an absolute path. It doesn’t depend on the current working directory. If a file path begins WITHOUT /, then it is a relative path. The path is defined according to the current directory. So, the path depends on the current working directory.
Please use all the commands in the table above and run "pwd" command after that to see which directory you are in and ls to see the content of the directory.
# Important: Please execute these commands to understand the structure of directory system
$ cd
$ pwd
$ ls
$ cd ..
$ pwd
$ ls
$ cd -
$ pwd
$ ls
$ cd /
$ pwd
$ ls
Now we have to take a time to explain pathnames.
So far we have only been listed files that are in our current directory. Sometimes you might want to manipulate or list files that are not in your current directory. For example, you may be doing a lot of work in the ~/bootcamp directory, but you remember that you wanted to work on "/pi/alper.kucukural-umw/umw_biocore/class". You could accomplish this by using these commands:
$ ls /pi/alper.kucukural-umw/umw_biocore/class
Please remember if it starts with "/". It is called full path (or absolute path). You can use these paths while copying or moving the files or folders.
The tree command is a good way to get a bird’s-eye view of the filesystem tree. The following commands can help in exploring the filesystem:
Command | Result |
---|---|
ls | List the contents of the present working directory |
ls -a | List all files including hidden files and directories |
ls -l | Detailed list of files and directories |
ls -lh | Detailed list of files and directories where the file sizes are reported in human readable format |
tree | Displays a tree view of the filesystem |
tree -d | Just list the directories and suppress listing file names |
Please use the commands in the table using the example directory "/pi/alper.kucukural-umw/umw_biocore". You can use these command with any other directories too.
To see the complete list of options of a command.
$ man ls
To exit from manual of a command, please press "q" button in your keyboard.
Try commands below;
$ ls /share/data/umw_biocore/
$ ls -a /share/data/umw_biocore/
Which file is different in the output of both commands above?
The hidden files start with "."
$ ls -l /pi/alper.kucukural-umw/umw_biocore
$ ls -lh /pi/alper.kucukural-umw/umw_biocore
You can also run -l -a command together
$ ls -a -l /pi/alper.kucukural-umw/umw_biocore
ls command can also recognize the parameters together like in the example ls -lh
, in any order
$ ls -al /pi/alper.kucukural-umw/umw_biocore
tree command shows all files and folders recursively in the directory you want to check. If you run the command in the directory that too many files are located, printing files to terminal might take a long time. In that case, you can press Ctrl+C (Control+C) in your keyboard to stop the execution.
$ tree /pi/alper.kucukural-umw/umw_biocore
Caution:If any of your commands run too long. You can always use Ctrl+C in your keyboard to
stop the execution.
List the tree with only the folder names
$ tree -d /pi/alper.kucukural-umw/umw_biocore/genome_data/human/
Let's first go to our home directory.
$ cd
Create a directory called "bootcamp"
$ mkdir bootcamp
Lets go into this directory and print where we are.
$ cd bootcamp
$ pwd
You can try going to another directory using the full path
$ cd /pi/alper.kucukural-umw/umw_biocore
$ pwd
$ ls
Please go back using full path, or using "~" symbol. "~" means your home directory.
$ cd ~/bootcamp
$ pwd
Or you could have used
$ cd /home/your_user_id/bootcamp
The directory could also be created like below.
$ mkdir ~/bootcamp
Or using the full path
$ mkdir /home/your_user_id/bootcamp
However, if you try the commands above, bash will give an error, since directory is already created.
If you want to create a directory inside of another directory which is not created before, you can use "-p" command. For example, creating ~/bootcamp/dir1/dir2/dir3 without -p, This command will give an error.
$ mkdir ~/bootcamp/dir1/dir2/dir3
mkdir: cannot create directory `/home/your_user/bootcamp/dir1/dir2/dir3': No such file or directory
When you use -p option, it will create all those directories;
$ mkdir -p ~/bootcamp/dir1/dir2/dir3
$ cd ~/bootcamp/dir1/dir2/dir3
$ pwd
You can also create multiple directories with one command. Please execute the command below to create "first" and "second"
$ mkdir ~/bootcamp/first ~/bootcamp/second
If you go to that directory and get a list;
$ cd ~/bootcamp
$ ls
dir1 first second
You can also use tree command
$ tree ~/bootcamp
/home/your_user/bootcamp
├── dir1
│ └── dir2
│ └── dir3
├── first
└── second
You can copy the files from the source to the destination using a command like below.
cp sourcefile(s) destination_path
Lets copy funcs.R file to our ~/bootcamp/first directory and compare them. Let's first create an empty funcs.R file as an example and copy that file.
$ touch ~/bootcamp/funcs.R
$ cp ~/bootcamp/funcs.R ~/bootcamp/first
Now lets check the file sizes:
$ ls -l ~/bootcamp/funcs.R
$ ls -l ~/bootcamp/first
Permissions | # of links | user name | user group | size | date | filename |
---|---|---|---|---|---|---|
-rwxr-xr-x | 1 | alper.kucukural-umw | alper.kucukural-umw | 0 | Mar 1 17:17 | funcs.R |
To copy a directory with the files included, we use -R option
Before copying;
$ tree ~/bootcamp
/home/your_user/bootcamp
├── dir1
│ └── dir2
│ └── dir3
├── first
│ └── funcs.R
└── second
Lets copy first directory into the second
$ cp -R ~/bootcamp/first ~/bootcamp/second
After copying check how second directory content looks like;
$ tree ~/bootcamp
/home/your_user/bootcamp
├── dir1
│ └── dir2
│ └── dir3
├── first
│ └── funcs.R
└── second
└── first
└── funcs.R
You can use rm
command to remove a file. Lets remove funcs.R file under ~/bootcamp/second/first/funcs.R file. You can either use the full path or change the directory and remove the file in that directory.
$ cd ~/bootcamp/second/first
$ ls
$ rm funcs.R
$ ls
If the directory is empty, you can use rmdir function. Let's remove ~/bootcamp/second/first
$ cd ~/bootcamp/second
$ rmdir first
Compare the tree
$ tree ~/bootcamp
/home/your_user/bootcamp
├── dir1
│ └── dir2
│ └── dir3
├── first
│ └── funcs.R
└── second
If the directory is not empty, you can use rm -rf target_directory
command.
Please, be careful using rm commands. When a file or a directory deleted, there is no way to go back unless you have a backup
$ rm -rf ~/bootcamp/first
$ tree ~/bootcamp
/home/your_user/bootcamp
├── dir1
│ └── dir2
│ └── dir3
└── second
All "first" directory is gone!
Lets move all dir1 directory included with all sub directories into "second" folder.
$ mv ~/bootcamp/dir1 ~/bootcamp/second
$ tree ~/bootcamp
/home/your_user/bootcamp
└── second
└── dir1
└── dir2
└── dir3
Lets move dir3 to the same level of dir1 which is two directory above using relative paths. First, I will just go to dir2.
$ cd ~/bootcamp/second/dir1/dir2
When you get the list; you will only see dir3 here;
$ ls
dir3
I will move dir3 to two directory above
$ mv dir3 ../../
$ tree ~/bootcamp
/home/your_user/bootcamp
└── second
├── dir1
│ └── dir2
└── dir3
If it was one level above, I would just use ../
You can use these relative paths with ls, cp or other commands too.
First, please change your working directory to ~/bootcamp
$ cd ~/bootcamp
$ ls -l
You will see only "second" directory. Let's change its name to "first" using mv source destination_path
.
$ mv second first
$ tree
.
└── first
├── dir1
│ └── dir2
└── dir3
Or you could have used full paths (absolute paths) or starting with ~/
Technically, every file is a binary file.
Yet, it is common practice to group them into two:
- Text Files : Contains only printable characters (txt, bed, sam)
- Binary Files : Files that are not text files. (e.g. gz, bam)
You can direct an output of a command into a file using ">" symbol.
Lets put the tree output into a file under bootcamp directory.
$ cd ~/bootcamp
$ tree first > mytree.txt
$ ls
first mytree.txt
$ cat mytree.txt
Lets put ls -l output to "list.txt" file.
$ ls -l > ~/bootcamp/list.txt
$ cat list.txt
Text files are of extreme importance in bioinformatics. There are many tools to view and edit text files. "less" is a very useful program that you can use. You check the manual (man less) for the details.
$ cd ~/bootcamp
$ less mytree.txt
Exit with "q" button in your keyboard.
You can also use "more" command.
$ more mytree.txt
It is called "more" because after it has displayed a page of text, if the text is more than one page, it pauses and puts "-- More --" at the bottom of the screen to let you know that there is more text yet to be shown. To see the next page of text, you just hit the spacebar.
"head" and "tail" commands are also good to print head and tail of a file. If they used without a parameter, they will print 10 lines. When it is used with "-n X" parameter, it will print X # of lines.
$ head -n 5 ~/bootcamp/list.txt
$ tail -n 5 ~/bootcamp/list.txt
"cat" is another program that you can view the files. It dumps all file into the screen. So, if the file is big, prefer using "more" or "less".
$ cat ~/bootcamp/list.txt
"touch" command creates a file.
$ cd ~/bootcamp
$ touch my_new_file.txt
$ touch my_new_file2.txt
$ ls -l
total 31
drwxrwxr-x 4 your_user your_group 44 Apr 7 17:54 first
-rw-rw-r-- 1 your_user your_group 0 Apr 7 18:49 my_new_file2.txt
-rw-rw-r-- 1 your_user your_group 0 Apr 7 18:49 my_new_file.txt
-rw-rw-r-- 1 your_user your_group 83 Apr 7 18:01 mytree.txt
The first column here shows permissions and first character represents the type of the line.
- : file
d : directory
l : symbolic link (check below)
If you don't want to copy a file but use it in another directory you can use "ln" command. The command below will create a link to the current directory. Check how "." is used to represent current directory. This usage can be used in "cp" command too.
$ cd ~/bootcamp
$ ln -s /pi/alper.kucukural-umw/umw_biocore/genome_data/human/hg19/chrName.txt .
When you use ls -l, funcs.R symbolic link will point the location of the original file.
$ ls -l
total 38
lrwxrwxrwx 1 your_user your_group 34 Apr 7 23:36 chrName.txt -> /pi/alper.kucukural-umw/umw_biocore/genome_data/human/hg19/chrName.txt
Now it's time to talk about security. Linux is a multi-user operating system, so it has security to prevent people from accessing each other's confidential files. Please check File Permissions below to learn how to set the permissions for desired files and directories.
Deciphering the security characters will take a bit more work.
First, you must think of those nine characters as three sets of three characters (see below). Each of the three "rwx" characters refers to a different operation you can perform on the file or directory.
$ ls -l
-rw-r--r-- 1 your_user your_group 83 Apr 7 18:01 mytree.txt
- The 'r' means you can "read" the file's contents.
- The 'w' means you can "write", or modify, the file's contents.
- The 'x' means you can "execute" the file. This permission is given only if the file is a program.
- If any of the "rwx" characters is replaced by a '-', then that permission has been revoked.
For example, the owner's permissions for our three primate files are "rw-". This means that the owner of the file ("mytree.txt", i.e. you) can "read" it (look at its contents) and "write" it (modify its contents). You cannot execute it because it is not a program; it is a text file.
Members of the group "your_group" can only read the files ("r--"). We want to change the second permission character from '-' to 'w' so that those people in your group can modify the contents of these files as well.
The final three characters show the permissions allowed to anyone who has a UserID on this Linux system. We prefer to refer to this set as "world" or "others". Our three files are "world-readable", that is, anyone in our Linux world can read their contents, but they cannot modify the contents of the files. This is the way we want to leave it.
chmod: Change file mode bits. $ chmod filemode file
To give only read access to other users,
$ chmod o=r my_new_file.txt
To give read,write and execution access together,
$ chmod o=rwx my_new_file.txt
$ ls -l
total 31
drwxrwxr-x 4 ak97w umw_manuel_garber 44 Apr 7 17:54 first
-rw-rw-r-- 1 ak97w umw_manuel_garber 0 Apr 7 18:49 my_new_file2.txt
-rw-rw-rwx 1 ak97w umw_manuel_garber 0 Apr 7 18:49 my_new_file.txt
-rw-rw-r-- 1 ak97w umw_manuel_garber 83 Apr 7 18:01 mytree.txt
For more details and learning binary system check the link below;
https://www.linux.com/training-tutorials/understanding-linux-file-permissions/
We could get you to type two more "chmod" commands to modify the permissions of all text files one by one, but there's an easier way using "wildcards".
A wildcard allows you to specify more than one file at the same time. The *
matches any number of characters. For example, if you want to execute a command on all txt files in the ~/bootcamp directory, you would specify *
as the filename. If you want to be more selective and match only files which end in "txt", you would use *.txt
. Note that the *
can even match zero characters, so *txt
would match "txt" as well as "mytree.txt".
The other wildcard, ?
, is not used very often, but it can be useful. It matches exactly one character. For example, if you want to match "mytree", but not "my_new_file2.txt", you would use my????.txt
. The first ?
matches the 't' in "tree", but the second 'r' and so on. Since there are 4 characters there in the "tree", it will match that but it will fail to match to my_new_file2.txt file.
So to copy only txt files to dir1 folder.
$ cd ~/bootcamp
$ mkdir -p dir1
$ cp *.txt dir1/.
dir/.
usage also very common, the files will go under dir1 folder. However, it is not mandantory.
We use gzip to compress/decompress files
tar to pack files into one file for archiving purposes without compression.
As an example to create a packed archive.tar without any compression (e.g. first and mytree.txt), use -cvf
argument.
$ cd ~/bootcamp
$ tar -cvf packed.tar first mytree.txt
$ ls -l packed.tar
If you want to pack and compress (with gzip) at the same time, use -czvf
argument.
$ tar -czvf archive.tar.gz first my_new_file.txt mytree.txt
$ ls -l archive.tar.gz
Now, lets create a new directory called back
and move our archive.tar.gz
file into back
directory. Then we can extract the content of archive.tar.gz by using -xvzf
argument.
$ mkdir ~/bootcamp/back
$ cd ~/bootcamp/back
$ mv ~/bootcamp/archive.tar.gz ~/bootcamp/back
$ tar -xvzf archive.tar.gz
$ ls
$ pwd
$ tree
Usually we get the backups using this method. However, making these files too big would be a problematic since, if this file is corrupted (eg. after copying to another place), you cannot unzip the file. So, having backups with smaller portions would reduce backup problems.
To see how much disk space you have left on your system, you need to use the "df" command, which stands for "disk free".
df -h will show the space in human readable format. In this example
$ df -h ~/.
Filesystem Size Used Avail Use% Mounted on
panfs://172.26.44.2/pi/alper.kucukural-umw 19T 12T 6.3T 66% /pi/alper.kucukural-umw
"Size" is the disk allocated to you.
"Used" is the used portion.
"Avail" is the free space.
The "Use%" column shows the percentage of user space that is currently used.
In it’s simplest form when used without any option, "wget" will download the resource specified in the [url] to the current directory.
In the following example we are downloading an annotation file from https://bioinfo.umassmed.edu/pub/data/ucsc.gtf
$ cd ~/bootcamp/
$ mkdir genome
$ cd genome
$ wget --secure-protocol=TLSv1 https://bioinfo.umassmed.edu/pub/data/ucsc.gtf
$ ls -l
Print first 10 lines using head command to see the content of the file.
$ head ucsc.gtf
Please install FileZilla to your computer from the link below;
https://filezilla-project.org/download.php
Please check the link below to learn how to use FileZilla client.
https://wiki.filezilla-project.org/FileZilla_Client_Tutorial_(en)
The information you need is below;
Hostname:
Username: Your_cluster_user
Logon Type: Key File
Key File*: /Users/your_user_in_your_laptop/.ssh/id_rsa
Port: 22
*You need to locate the private key in your laptop.
- Please create RNA-Seq folder under bootcamp. Then you need to download, copy or create links of the files into that directories to get a tree as shown at below.
- mm10.fa file is located /share/data/umw_biocore/genome_data/mousetest/mm10/mm10.fa
- ucsc.gtf file is located /share/data/umw_biocore/genome_data/mousetest/mm10/ucsc.gtf
- rsem.to.table.pl file is located at https://web.dolphinnext.com/umw_biocore/class/bootcamp/RNA-Seq/progs/rsem.to.table.pl
- fastq files are located /share/data/umw_biocore/genome_data/mousetest/mm10/fastq.quantification/
When you run tree function the output should look like below;
$ tree RNA-Seq/
RNA-Seq/
├── mm10
│ ├── mm10.fa -> /share/data/umw_biocore/genome_data/mousetest/mm10/mm10.fa
│ └── ucsc.gtf
├── progs
│ └── rsem.to.table.pl
└── reads
├── control_rep1.1.fq
├── control_rep1.2.fq
├── control_rep2.1.fq
├── control_rep2.2.fq
├── control_rep3.1.fq
├── control_rep3.2.fq
├── exper_rep1.1.fq
├── exper_rep1.2.fq
├── exper_rep2.1.fq
├── exper_rep2.2.fq
├── exper_rep3.1.fq
└── exper_rep3.2.fq
- Please compress the RNA-Seq directory and create RNA-Seq.tar.gz file under your bootcamp directory. To check if it worked, create a ~/bootcamp/test directory, copy this RNA-Seq.tar.gz file there and extract. Use tree command to compare the same tree output like in #1. Get the size of RNA-Seq.tar.gz file and message it to one of us on slack (Alper, Onur or Elisa)