05. Toolkit Configuration
If you are using SRA Toolkit version 2.4 or higher, you should run the configuration tool located within the bin subdirectory of the Toolkit package. This page describes how to use version 2.11.2 of vdb-config.
If you work with dbGaP data and want to download it into AWS or GCP buckets, please update to the latest version first.
Go to the "bin" subdirectory for the Toolkit and run the following command line:
./vdb-config -i
A window will open and present the screen below.
The tool has 6 pages for different aspects of settings.
The top line of buttons is visible on all 6 pages.
You operate the buttons by pressing the letter highlighted in red, or by pressing the tab key until the desired button is reached and then pressing the space or enter key.
[save] ... save the current settings, but do not exit
[exit] ... exit the tool, asking to save if settings have been changed
[verify] ... not yet implemented
[discard] ... discard all changes made ( since start or last save )
[default] ... set values to default
The first page: MAIN, contains just these 3 settings:
[Enable Remote Access] ... if enabled, instructs the sra-tools to fetch data from remote locations via https. Remote locations are either servers at NCBI, AWS, or GCP. If you turn this checkbox off and you have no local data available ( via download ), the SRA Toolkit cannot find any data.
[Use Site Installation] ... this setting is only visible if your admin has configured a local site-repository.
[Prefer SRA Lite files with simplified base quality scores] ... if enabled, instructs sra-tools to fetch the SRA Lite version of data when available. SRA Lite format is smaller and contains simplified quality scores.
This page controls the caching behavior.
The checkbox "enable local file-caching" enables caching into files. This should only be disabled in special cases ( e.g. compute-clusters ) if a common location cannot be configured.
There are 2 different locations you can set: "public user-repository" and "process-local." If both are configured, "process-local" is ignored.
The difference is that "public user-repository" is persistent. This means that if a tool has used a particular accession, the cache-file remains at this location. If the same or another tool uses this accession again, there is no need to access the remote location, which improves the speed. There is a special tool "cache-mgr" to remove cache-files if no longer needed or you are running out of space.
The "process-local" cache is automatically removed the moment the tool has finished its job. This will also improve speed because large chunks are requested from the remote location. Remote locations such as AWS and GCP operate much faster if larger chunks are requested.
Please note that only an empty directory can be chosen as the location of the "public user-repository." The "public user-repository" is also the location where the prefetch-tool puts downloads, unless you choose otherwise. ( See last page "TOOLS" )
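Because only an empty directory is accepted for the "public user-repository," it can help to check a candidate directory before opening the tool. The following is a minimal sketch; `is_empty_dir` is a hypothetical helper name and not part of the SRA Toolkit.

```shell
# Return 0 if the directory exists and has no entries (hidden files included).
# is_empty_dir is a hypothetical helper, not an SRA Toolkit command.
is_empty_dir() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Directory to check: first script argument, or the current directory.
candidate="${1:-$PWD}"
if is_empty_dir "$candidate"; then
    echo "$candidate is empty and can be selected"
else
    echo "$candidate is missing or not empty"
fi
```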
In addition to cache-files, the caching code also uses RAM. This is the last setting on this page. You can adjust the amount of RAM used with the [+] and [-] buttons. If local file-caching is disabled, the RAM-cache will still be used. If you select zero Megabytes, the code will use a default value.
This page controls how to access data at AWS.
The first button "accept charges for AWS," if enabled, allows you to access data on AWS if payment is necessary. The second button "report cloud instance identity," if enabled, allows you to access public data on AWS for free. To make that work, the tool needs to report your location.
If payment is necessary, you need to provide credentials. With the "choose"-button you select a file containing these credentials. If the file contains multiple profiles, you can select the one you want to use in the "profile" input field.
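To find out which names you can type into the "profile" field, you can list the profiles in the credentials file. This sketch assumes the file uses the usual AWS credentials layout, where each profile starts with a name in square brackets, and that it lives at the AWS CLI default path `~/.aws/credentials`; both are assumptions, and `list_profiles` is a hypothetical helper.

```shell
# Print the profile names (section headers in square brackets) of a credentials file.
# list_profiles is a hypothetical helper, not an SRA Toolkit command.
list_profiles() {
    sed -n 's/^[[:space:]]*\[\([^]]*\)].*$/\1/p' "$1"
}

# Assumed location (AWS CLI default); pass another path as the first argument.
cred_file="${1:-$HOME/.aws/credentials}"
if [ -r "$cred_file" ]; then
    list_profiles "$cred_file"
else
    echo "no readable credentials file at $cred_file" >&2
fi
```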
This page controls how to access data at GCP.
The first button "accept charges for GCP," if enabled, allows you to access data on GCP if payment is necessary. The second button "report cloud instance identity," if enabled, allows you to access public data on GCP for free. To make that work, the tool needs to report your location.
If payment is necessary, you need to provide credentials. With the "choose"-button you select a file containing these credentials.
On this page you can enter a proxy, if your network configuration requires it.
The button "use http-proxy" enables the use of a proxy. Enter in the proxy-field just the DNS-name or IP-address of your proxy. Enter the port-number in the port-field. Ask your network-admin about the data for your environment.
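The proxy field expects a bare host name or IP address, with the port kept separate. The sketch below checks both values before you enter them; `proxy_spec` is a hypothetical helper and the host and port are placeholder values. The commented-out last line assumes your vdb-config build accepts a `--proxy` option on the command line; confirm that with `vdb-config --help` before relying on it.

```shell
# Build "host:port", rejecting a URL scheme in the host and a non-numeric port.
# proxy_spec is a hypothetical helper, not part of the SRA Toolkit.
proxy_spec() {
    case "$1" in
        ""|*://*) echo "host must be a bare DNS name or IP address" >&2; return 1 ;;
    esac
    case "$2" in
        ""|*[!0-9]*) echo "port must be a number" >&2; return 1 ;;
    esac
    printf '%s:%s\n' "$1" "$2"
}

# Placeholder values; ask your network admin for the real ones.
spec=$(proxy_spec "proxy.example.org" "3128") && echo "proxy setting: $spec"

# Assumption: --proxy is available in your version (see `vdb-config --help`).
# vdb-config --proxy "$spec"
```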
This page controls settings for specific tools. So far there is only one setting here, for the prefetch-tool: you can select where the prefetch-tool downloads accessions to.
The default setting is "public user-repository". With this selection the prefetch-tool stores downloaded accessions in the directory chosen on the cache-page above. As a result, tools find the downloaded accessions automatically, without the need to specify a directory. For instance, you can type "fastq-dump SRR000001" ( without .sra ! ) and the tool will find the accession in the file "SRR000001.sra" in the public user-repository for you.
If you select "current directory", the prefetch-tool stores the downloaded accessions in the current working directory. In this case you have to give the tools the correct path to the accession, for example "fastq-dump /home/user/my_dir/SRR000001.sra" ( with .sra ! ).
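The difference between the two prefetch settings comes down to what you pass to the tool afterwards. As a rough illustration, the sketch below prints the matching fastq-dump command for each case; `accession_arg` and the labels `user-repo` and `cwd` are invented for this example and are not vdb-config values.

```shell
# Print the argument a tool such as fastq-dump needs for a downloaded accession.
# accession_arg and the mode labels are hypothetical, used only in this sketch.
accession_arg() {
    acc="$1"; mode="$2"; dir="${3:-$PWD}"
    case "$mode" in
        user-repo) printf '%s\n' "$acc" ;;                # bare accession, no .sra
        cwd)       printf '%s/%s.sra\n' "$dir" "$acc" ;;  # full path, with .sra
        *)         echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Only prints the commands; nothing is downloaded or converted here.
echo "fastq-dump $(accession_arg SRR000001 user-repo)"
echo "fastq-dump $(accession_arg SRR000001 cwd /home/user/my_dir)"
```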
If you did not update, the tool shows a single page that looks like this:
For your commands to complete successfully, the toolkit needs sufficient free space. Genomics datasets are quite large; you may need hundreds of GB of free space. This is the primary concern when choosing the Workspace Location: do you have enough free space there for what you intend to do?
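One way to answer that question before changing the Workspace Location is to ask the filesystem directly. This is a minimal sketch using the standard `df` utility; `free_gb` is a hypothetical helper name and the 100 GB threshold is only an example figure.

```shell
# Print free space, in whole GB, on the filesystem that holds the given directory.
# free_gb is a hypothetical helper, not an SRA Toolkit command.
free_gb() {
    # -P selects the POSIX one-line-per-filesystem format, -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

workspace="${1:-$PWD}"   # directory you intend to use as the Workspace Location
needed_gb=100            # example threshold; size it to your own datasets
avail_gb=$(free_gb "$workspace")

if [ "$avail_gb" -ge "$needed_gb" ]; then
    echo "$workspace: ${avail_gb} GB free, enough for ${needed_gb} GB"
else
    echo "$workspace: only ${avail_gb} GB free, ${needed_gb} GB wanted"
fi
```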
If you need to change the Workspace Location, use the tab key to move the cursor (shown red here) to the change button and press space or enter.
This will bring up the file navigation dialog (see below).
If you already know the path to the directory, you may use the Goto button to enter that path directly. Once you have entered or navigated to the correct directory, press tab to get to the OK button and return to the previous screen.
Once you are happy with the settings, use the tab key to get to the Save button and press enter or space.
Press enter or space one more time, then tab to the Exit button and press enter or space. You will then be returned to your shell command prompt.