diff --git a/docs/Bio4jAWSReleases.md b/docs/Bio4jAWSReleases.md
index e3fb4c1..1f40bab 100644
--- a/docs/Bio4jAWSReleases.md
+++ b/docs/Bio4jAWSReleases.md
@@ -1,45 +1,91 @@
-## Pre-built Bio4j releases in AWS
-
-This is the best way to go if you don't have the resources for importing Bio4j in your cluster or you simply think that you will be getting a better deal by using it as a service.
-
-There isn't any kind of fee required for using Bio4j in AWS, you will just be paying for the data transfer _(and obviously the instances, volumes, deployed by yourself)_
-If you have any doubt please refer to this section on **[Requester Pays Buckets](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html)**.
-
-We provide two different options:
-
-### Lite version
-
-This database weighs **~460 GB** and includes the following modules:
-
-- UniProt (both SwissProt and TrEMBL)
-- Gene Ontology (GO)
-- NCBI taxonomy
-- Enzyme DB
-- UniProtGO
-- UniProtEnzymeDB
-- UniProtNCBITaxonomy
-- UniProtInteractions
-- UniProtIsoforms
-
-The database files are provided as a **tar** file in the following address:
-
-> s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_lite.tar
-
-### Whole version
-
-This version includes all modules and weighs **~1.2 TB**. Apart from the aforementioned, the following modules are available:
-
-- UniRef
-- UniProtUniRef
-- GenInfo
-- NCBITaxonomyGenInfo
-
-The database files are provided as a **tar** file in the following address:
-
-> s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_whole.tar
-
-### Source files for the import
-
-The **XML/TXT** files from the different data sources used for this specific import were downloaded on the date **12/03/2014** and can be retrieved from the following bucket folder _(also stored in a Requester Pays Bucket)_:
-
-> s3://eu-west-1.raw.bio4j.com/
+# Pre-built Bio4j releases in AWS
+
+> This is the recommended way of using bio4j-titan. For this you need a working Amazon Web Services (AWS) account.
+
+We offer two pre-imported bio4j-titan distributions:
+
+- **bio4j-lite**: the binaries are approximately `500 GB`. It includes the following modules:
+  - UniProt (both SwissProt and TrEMBL)
+  - Gene Ontology (GO)
+  - NCBI taxonomy
+  - Enzyme DB
+  - UniProtGO
+  - UniProtEnzymeDB
+  - UniProtNCBITaxonomy
+  - UniProtInteractions
+  - UniProtIsoforms
+- **bio4j-full**: the binaries are approximately `1.2 TB`. This version includes all modules:
+  - UniProt (both SwissProt and TrEMBL)
+  - Gene Ontology (GO)
+  - NCBI taxonomy
+  - Enzyme DB
+  - UniProtGO
+  - UniProtEnzymeDB
+  - UniProtNCBITaxonomy
+  - UniProtInteractions
+  - UniProtIsoforms
+  - UniRef
+  - UniProtUniRef
+  - GenInfo
+  - NCBITaxonomyGenInfo
+
+They are available from S3 through a [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket `s3://eu-west-1.releases.bio4j.com/`, in the `eu-west-1` (Ireland) region. The object addresses are:
+
+- **bio4j-lite** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_but_uniref_and_gi_index.tar`
+- **bio4j-full** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_plus_isoforms.tar`
+
+The expected workflow is:
+
+1. Launch an EC2 instance in the `eu-west-1` region.
+    - Note that you need either EBS volumes or an instance type with enough ephemeral storage (for example `i2.xlarge`).
+2. Download the binary files for either bio4j-lite or bio4j-full. You can use the AWS CLI for that. For example, for bio4j-lite:
+
+    ```bash
+    aws s3api get-object --request-payer requester --bucket eu-west-1.releases.bio4j.com --key 2014_12_03/bio4j_all_but_uniref_and_gi_index.tar bio4j.tar
+    ```
+
+3. Extract the downloaded archive:
+
+    ```bash
+    tar xvf bio4j.tar
+    ```
+
+4. Enjoy! Check the [TitanDB documentation](http://s3.thinkaurelius.com/docs/titan/0.5.2/) to learn how to connect to the database and query it.
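
The download and extract steps above could be wrapped in a small script along these lines. This is only a sketch: the helper names `key_for` and `fetch_bio4j` and the `RUN` switch are hypothetical, and the object keys are taken from the addresses listed above. By default it only prints the `aws s3api get-object` command; it runs the download and extraction only when `RUN=1` is set, on a machine with the AWS CLI configured.

```bash
#!/usr/bin/env bash
# Sketch only: key_for, fetch_bio4j and RUN are illustrative names, not part of Bio4j.
BUCKET="eu-west-1.releases.bio4j.com"
OUT="bio4j.tar"

# Map a distribution name (lite|full) to its object key in the bucket.
key_for() {
  case "$1" in
    lite) echo "2014_12_03/bio4j_all_but_uniref_and_gi_index.tar" ;;
    full) echo "2014_12_03/bio4j_all_plus_isoforms.tar" ;;
    *) echo "unknown distribution: $1" >&2; return 1 ;;
  esac
}

# Print the download command; with RUN=1, download and extract instead.
fetch_bio4j() {
  local key
  key="$(key_for "$1")" || return 1
  if [ "${RUN:-0}" = "1" ]; then
    # Requester pays: the caller's AWS account is billed for the request.
    aws s3api get-object --request-payer requester \
      --bucket "$BUCKET" --key "$key" "$OUT" && tar xvf "$OUT"
  else
    echo "aws s3api get-object --request-payer requester --bucket $BUCKET --key $key $OUT"
  fi
}

# Dry run for the lite distribution.
fetch_bio4j lite
```

Keeping the dry run as the default makes it easy to check the bucket and key before starting a transfer of several hundred gigabytes.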
+
+#### IMPORTANT: AWS cost and fees
+
+AWS charges fees for downloading S3 objects: [AWS S3 pricing - data transfer](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). However, the transfer is free _if you download from an EC2 instance within the same region_. Thus, you won't incur any data transfer cost if you download bio4j from an EC2 instance in the `eu-west-1` region. In this case your AWS costs are just those associated with the compute infrastructure: the EC2 instance(s) and, if you use them, EBS volumes.
+
+**IMPORTANT:** If you download it from your local computer you will incur sizable costs: around **$50** for bio4j-lite and **$120** for bio4j-full.
+
+#### IAM user configuration
+
+You need to grant permissions to the user/role that you will use to download the bio4j distribution: read access to `s3://eu-west-1.releases.bio4j.com/` is enough. The following IAM policy is more than sufficient for that (it grants all S3 actions on the bucket):
+
+``` json
+{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Sid": "Stmt1434711865000",
+            "Effect": "Allow",
+            "Action": [
+                "s3:*"
+            ],
+            "Resource": [
+                "arn:aws:s3:::eu-west-1.releases.bio4j.com"
+            ]
+        },
+        {
+            "Sid": "Stmt1434711990000",
+            "Effect": "Allow",
+            "Action": [
+                "s3:*"
+            ],
+            "Resource": [
+                "arn:aws:s3:::eu-west-1.releases.bio4j.com/*"
+            ]
+        }
+    ]
+}
+```
diff --git a/docs/raw-input-sources.md b/docs/raw-input-sources.md
new file mode 100644
index 0000000..00ad36d
--- /dev/null
+++ b/docs/raw-input-sources.md
@@ -0,0 +1,5 @@
+# Source files for the import
+
+The **XML/TXT** files from the different data sources used for this specific import were downloaded on the date **12/03/2014** and can be retrieved from the following [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket:
+
+- `s3://eu-west-1.raw.bio4j.com/`
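
Since the raw bucket is also requester pays, listing its contents needs the same `--request-payer requester` flag as the downloads. A minimal sketch follows, assuming the AWS CLI is installed and configured; the helper name `list_raw_cmd` and the optional prefix argument are hypothetical, and the layout of the bucket is not documented here. The function only prints the command so that it can be inspected before it is run.

```bash
#!/usr/bin/env bash
# Sketch only: list_raw_cmd is an illustrative helper, not part of Bio4j.
RAW_BUCKET="eu-west-1.raw.bio4j.com"

# Print the command that lists objects in the raw bucket,
# optionally restricted to a key prefix given as the first argument.
list_raw_cmd() {
  local prefix="${1:-}"
  local cmd="aws s3api list-objects-v2 --request-payer requester --bucket $RAW_BUCKET"
  if [ -n "$prefix" ]; then
    cmd="$cmd --prefix $prefix"
  fi
  echo "$cmd"
}

list_raw_cmd
```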