sratools driver tool
With the move of SRA data to various cloud platforms, there is now more than one source for data. Depending on the location of the user, some sources may be fast and cheap while others may be slow or costly. Sources are not required to provide exactly the same data, e.g. some sources might not provide original spot names and quality scores. This creates a complicated and arbitrary matrix of choices, and the sra-toolkit is changing in response.
Rather than change all the tools, we have created a single tool to deal with these changes and to interact with the user. This new tool determines the proper objects to satisfy users' requests. It drives the worker tools with the correct URLs for the runs they are to work on.
For dbGaP users: if you are accessing the data from the same cloud in which the data is stored, you will need no decryption to access your permitted data sets.
The sratools driver tool is designed to work transparently. So if you wished to run fastq-dump, you would still type fastq-dump, but you would actually get sratools running as fastq-dump. After sratools examines the command line, it runs the original fastq-dump with the information it will need to accomplish its tasks.
The SRA currently supports Amazon's EC2 and Google's GCP platforms. These are the platforms on which we have copies of the SRA. This list is open-ended. Additional cloud providers and/or regions may be added in the future.
You must run configuration at least once; if sratools cannot find your configuration, it will print instructions and quit. If you are running in a cloud environment, you will need to configure your cloud settings, and if you want access to data that is located in the same cloud, you will need to allow the toolkit to send your cloud identity token to NCBI.
There are some new command line options that all tools get and that are handled by sratools itself. These are related to cloud location and permissions.
- --ngc <file>
  Needed in order to read encrypted dbGaP data that is stored at NCBI. NB: this is mutually exclusive with --perm.
- --perm <file>
  Needed in order to access protected data that is stored in the cloud. NB: this is mutually exclusive with --ngc.
- --location <string>
  Needed in order to access data that is stored in a different cloud or region, e.g. 's3.us-east-1', 'gs.us'. This is a hint. If the data doesn't exist at the requested location, you will get a location at which the data does exist. NB: accessing data in a different cloud/region may incur additional costs to you.
- --cart <file>
  Needed in order to use a cart file you may have downloaded from dbGaP.
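The relationship between these options, in particular the mutual exclusion of --ngc and --perm, can be pictured with a small Python sketch. It is not how sratools parses its command line; it only models the four options listed above with the standard argparse module. The function name build_parser and the file names and accessions used as examples are made up for illustration.

```python
import argparse


def build_parser():
    """Model the driver-level options described above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="fastq-dump", add_help=False)
    # --ngc and --perm may not be given together.
    creds = parser.add_mutually_exclusive_group()
    creds.add_argument("--ngc", metavar="<file>")
    creds.add_argument("--perm", metavar="<file>")
    parser.add_argument("--location", metavar="<string>")
    parser.add_argument("--cart", metavar="<file>")
    # Anything left over is treated as a potential accession.
    parser.add_argument("accessions", nargs="*")
    return parser


args = build_parser().parse_args(["--location", "s3.us-east-1", "SRR000001"])
print(args.location, args.accessions)
```

Passing both --ngc and --perm to this parser makes argparse report an error and exit, which mirrors the documented restriction.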
Additionally, sratools can handle multiple accessions at once; even if the underlying tool does not support this, sratools will make it work.
If a parameter is not an option and is not an argument to an option, sratools treats it as a potential SRA accession and requests information about it from NCBI. This replaces the old (pre-2.10) name resolution process.
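The rule for telling accessions apart from options might look roughly like the following Python sketch. It is an illustration under assumptions, not the code sratools uses: the function name split_parameters is hypothetical, and the set of options that take a value contains only the four driver options named in this document, whereas a real tool has many more.

```python
# Options that consume the next parameter as their argument. Only the four
# driver options named in this document are listed (an assumption).
OPTIONS_WITH_VALUE = {"--ngc", "--perm", "--location", "--cart"}


def split_parameters(params):
    """Separate options (with their arguments) from potential accessions."""
    options, accessions = [], []
    it = iter(params)
    for p in it:
        if p in OPTIONS_WITH_VALUE:
            # Assumes the value, if present, is the next parameter.
            options.append((p, next(it, None)))
        elif p.startswith("-"):
            options.append((p, None))
        else:
            # Neither an option nor an option's argument:
            # a potential SRA accession.
            accessions.append(p)
    return options, accessions


print(split_parameters(["--ngc", "key.ngc", "--split-files", "SRR000001"]))
```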
Some options may be removed, particularly any options which are processed by sratools itself.
Since data can now be located in multiple locations, the new name resolution process aims to locate data that is closest to the user. For users running from a cloud location, this means resolving to data that is stored in the same cloud and region. For users not running in a supported cloud and region, this means resolving to data that is stored at NCBI, as before.
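As a rough picture of this "closest copy" rule, consider the Python sketch below. The function name choose_source, the location strings, and the label "ncbi" are all hypothetical, and falling back to NCBI when the user's own region has no copy is an assumption; the actual resolution is performed by NCBI's services and may weigh other factors.

```python
def choose_source(user_location, available):
    """Pick where to read a run from (illustrative only).

    user_location: e.g. "s3.us-east-1", or None when the user is not
                   running in a supported cloud and region.
    available:     set of locations holding a copy of the run.
    """
    if user_location is not None and user_location in available:
        return user_location  # same cloud and region as the user
    return "ncbi"  # otherwise resolve to NCBI, as before (assumed fallback)


print(choose_source("s3.us-east-1", {"s3.us-east-1", "gs.us", "ncbi"}))
print(choose_source(None, {"s3.us-east-1", "ncbi"}))
```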
If you have permission and are accessing protected data, e.g. data from dbGaP, and are in a supported cloud, and the data is in the same cloud, name resolution will give you direct access and no decryption will be needed. Otherwise, decryption will still be needed. You will need an NGC file from dbGaP to decrypt the data. The toolkit team continues to work on ways to make this easier while still safeguarding the data, so this is subject to change.
Many of the tools in the bin directory have been replaced by symlinks to sratools, with the originals having been renamed to *-orig.
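Because every symlink points at the same program, the driver has to work out which tool was requested from the name it was invoked under. A minimal Python sketch of that idea follows; the function name worker_for is hypothetical, and the way it honours the SRATOOLS_IMPERSONATE variable is an assumption based only on the description of that variable in this document.

```python
import os


def worker_for(argv0, environ=None):
    """Return the name of the original tool for this invocation."""
    environ = os.environ if environ is None else environ
    # SRATOOLS_IMPERSONATE, when set, stands in for argv[0].
    invoked_as = environ.get("SRATOOLS_IMPERSONATE", argv0)
    # The originals were renamed to *-orig.
    return os.path.basename(invoked_as) + "-orig"


print(worker_for("/usr/local/bin/fastq-dump", {}))
```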
sratools runs the requested tool as a sub-process. For each command line parameter that is not an option or an option's argument, it:
- performs name resolution.
- sets environment variables with any additional information from name resolution.
- runs the requested tool with the appropriate options.
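The three steps above can be sketched in Python as a simple loop. Everything named here is a stand-in: run_for_each and fake_resolve are hypothetical, EXAMPLE_RUN_URL is not a variable sratools is known to set, and the "tool" is a one-line Python script so that the sketch runs without the toolkit installed.

```python
import os
import subprocess
import sys


def run_for_each(tool_argv, accessions, resolve):
    """Run tool_argv once per accession, passing resolver output via env."""
    codes = []
    for acc in accessions:
        info = resolve(acc)          # 1. name resolution
        env = dict(os.environ)
        env.update(info)             # 2. extra information as env variables
        proc = subprocess.run(tool_argv + [acc], env=env)  # 3. run the tool
        codes.append(proc.returncode)
    return codes


def fake_resolve(acc):
    # Hypothetical resolver: a real one would query NCBI.
    return {"EXAMPLE_RUN_URL": "https://example.invalid/" + acc}


# Stand-in worker that prints its argument and the variable it received.
tool = [sys.executable, "-c",
        "import os, sys; print(sys.argv[1], os.environ['EXAMPLE_RUN_URL'])"]
print(run_for_each(tool, ["SRR000001", "SRR000002"], fake_resolve))
```

Running the worker once per accession is also what would let a tool that accepts only a single run be used with several.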
Running the original tools directly is neither recommended nor supported; it may work, or it may fail. The purpose of sratools is to handle this for you. Please allow it to do its job. If sratools is not working for you, it is probably a bug, and we would like the opportunity to fix it.
sratools pays attention to some environment variables, which may be helpful in certain situations.
- SRATOOLS_VERBOSE
  setting this equal to a number between 1 and 9 will cause sratools to print verbose messages. NB. sratools does not process --verbose itself, it's sent to the child processes. SRATOOLS_VERBOSE is how to set the verbosity level for sratools itself.
- SRATOOLS_DRY_RUN
  setting this equal to 1 is equivalent to setting SRATOOLS_TESTING=3.
- SRATOOLS_TESTING
  setting this will enable various test modes.
  1. Runs internal tests and quits. There is no output on success.
  2. Skips name resolution and tersely prints the commands it would have issued.
  3. Does name resolution and verbosely prints the commands it would have issued along with any environment variables it would have set.
  4. Does the same as 3 but prints in a format that should be directly executable if put into a shell script.
- SRATOOLS_IMPERSONATE
  will cause sratools to run as-if argv[0] == $SRATOOLS_IMPERSONATE.
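The relationship between SRATOOLS_DRY_RUN and SRATOOLS_TESTING can be summarised in a few lines of Python. This is only a sketch: the function name effective_test_mode is hypothetical, and both the precedence of SRATOOLS_DRY_RUN over SRATOOLS_TESTING and the handling of unset or non-numeric values are assumptions that this document does not specify.

```python
def effective_test_mode(environ):
    """Return the test mode implied by the environment (0 = normal run)."""
    if environ.get("SRATOOLS_DRY_RUN") == "1":
        return 3  # documented: same as SRATOOLS_TESTING=3
    try:
        return int(environ.get("SRATOOLS_TESTING", "0"))
    except ValueError:
        return 0  # assumption: ignore values that are not numbers


print(effective_test_mode({"SRATOOLS_DRY_RUN": "1"}))
```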