Improve the AWS distribution docs #66

Merged 4 commits on Jul 1, 2015
136 changes: 91 additions & 45 deletions docs/Bio4jAWSReleases.md
@@ -1,45 +1,91 @@
## Pre-built Bio4j releases in AWS

This is the best way to go if you don't have the resources for importing Bio4j in your cluster or you simply think that you will be getting a better deal by using it as a service.

There is no fee for using Bio4j in AWS; you pay only for the data transfer _(and, of course, for any instances and volumes you deploy yourself)_.
If in doubt, please refer to this section on **[Requester Pays Buckets](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html)**.

We provide two different options:

### Lite version

This database weighs **~460 GB** and includes the following modules:

- UniProt (both SwissProt and TrEMBL)
- Gene Ontology (GO)
- NCBI taxonomy
- Enzyme DB
- UniProtGO
- UniProtEnzymeDB
- UniProtNCBITaxonomy
- UniProtInteractions
- UniProtIsoforms

The database files are provided as a **tar** file at the following address:

> s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_lite.tar

### Whole version

This version includes all modules and weighs **~1.2 TB**. Apart from the aforementioned, the following modules are available:

- UniRef
- UniProtUniRef
- GenInfo
- NCBITaxonomyGenInfo

The database files are provided as a **tar** file at the following address:

> s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_whole.tar

### Source files for the import

The **XML/TXT** files from the different data sources used for this specific import were downloaded on the date **12/03/2014** and can be retrieved from the following bucket folder _(also stored in a Requester Pays Bucket)_:

> s3://eu-west-1.raw.bio4j.com/
# Pre-built Bio4j releases in AWS

> This is the recommended way of using bio4j-titan. For this you need a working Amazon Web Services (AWS) account.

We offer two pre-imported bio4j-titan distributions:

- **bio4j-lite**: the size of the binaries is approximately `500 GB`. It includes the following modules:
  - UniProt (both SwissProt and TrEMBL)
  - Gene Ontology (GO)
  - NCBI taxonomy
  - Enzyme DB
  - UniProtGO
  - UniProtEnzymeDB
  - UniProtNCBITaxonomy
  - UniProtInteractions
  - UniProtIsoforms
- **bio4j-full**: the size of the binaries is approximately `1.2 TB`. This version includes all modules:
  - UniProt (both SwissProt and TrEMBL)
  - Gene Ontology (GO)
  - NCBI taxonomy
  - Enzyme DB
  - UniProtGO
  - UniProtEnzymeDB
  - UniProtNCBITaxonomy
  - UniProtInteractions
  - UniProtIsoforms
  - UniRef
  - UniProtUniRef
  - GenInfo
  - NCBITaxonomyGenInfo

They are available from S3 through a [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket, `s3://eu-west-1.releases.bio4j.com/`, in the `eu-west-1` (Ireland) region. The object addresses are:

- **bio4j-lite** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_but_uniref_and_gi_index.tar`
- **bio4j-full** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_plus_isoforms.tar`

The expected workflow is:

1. Launch an EC2 instance in the `eu-west-1` region
    - note that you need to use either EBS volumes or an instance type with enough ephemeral storage (for example `i2.xlarge`)
2. Download the binary files for either bio4j-lite or bio4j-full. You can use aws-cli for that. For example:

```bash
aws s3api get-object --request-payer requester --bucket eu-west-1.releases.bio4j.com --key <key> bio4j.tar
```

3. Extract the downloaded archive:

```bash
tar xvf bio4j.tar
```

4. Enjoy! Check the [TitanDB documentation](http://s3.thinkaurelius.com/docs/titan/0.5.2/) to learn how to connect to the database and query it.
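
The download and extraction steps above can be sketched as a small shell helper. The bucket and keys are copied from the object addresses listed earlier; the function names (`bio4j_key`, `bio4j_download_cmd`, `bio4j_fetch`) and the `DRY_RUN` switch are hypothetical, introduced only for illustration. It assumes aws-cli is installed and configured with suitable credentials.

```bash
#!/usr/bin/env bash
# Sketch: download and unpack a bio4j-titan distribution on an EC2 instance.
# BUCKET and the keys come from the object addresses in this document;
# the function names and the DRY_RUN switch are illustrative assumptions.

BUCKET="eu-west-1.releases.bio4j.com"

# Map a distribution name (lite|full) to its S3 object key.
bio4j_key() {
  case "$1" in
    lite) echo "2014_12_03/bio4j_all_but_uniref_and_gi_index.tar" ;;
    full) echo "2014_12_03/bio4j_all_plus_isoforms.tar" ;;
    *)    echo "unknown distribution: $1" >&2; return 1 ;;
  esac
}

# Print the aws-cli command that downloads the chosen distribution.
bio4j_download_cmd() {
  local key
  key="$(bio4j_key "$1")" || return 1
  echo "aws s3api get-object --request-payer requester --bucket $BUCKET --key $key bio4j.tar"
}

# Download and extract; with DRY_RUN=1 only print the commands.
bio4j_fetch() {
  local cmd
  cmd="$(bio4j_download_cmd "$1")" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"
    echo "tar xvf bio4j.tar"
  else
    # Word splitting is intentional: the command contains no quoted arguments.
    $cmd && tar xvf bio4j.tar
  fi
}
```

Running `DRY_RUN=1 bio4j_fetch lite` prints the two commands without touching the network, which is a cheap way to check the key before starting a download of several hundred gigabytes.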

#### IMPORTANT: AWS cost and fees

AWS charges fees for downloading S3 objects: [AWS S3 pricing - data transfer](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). However, this is free _if you download it from an EC2 instance within the same region_. Thus, you won't incur any data transfer cost if you download bio4j from an EC2 instance in the `eu-west-1` region. In this case your AWS costs are just those of the compute infrastructure: the EC2 instances and, if you use them, EBS volumes.

**IMPORTANT:** If you download it from your local computer you will incur sizable costs: around **$50** for bio4j-lite and **$120** for bio4j-full.
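
As a rough sanity check on those figures, the estimate can be reproduced with a one-line calculation. The rate of $0.10 per GB is an assumption back-derived from the numbers above ($50 for roughly 500 GB), not a quoted AWS price, and `estimate_transfer_cost` is a hypothetical helper name. Check the pricing page for the current rates.

```bash
# Rough data-transfer estimate for downloads that leave eu-west-1.
# RATE_PER_GB is an assumption derived from this document's own figures
# ($50 for ~500 GB); it is not an official AWS price.
RATE_PER_GB="0.10"

# Print the estimated cost in USD for a download of $1 gigabytes.
estimate_transfer_cost() {
  awk -v gb="$1" -v rate="$RATE_PER_GB" 'BEGIN { printf "%.2f\n", gb * rate }'
}
```

With that assumed rate, `estimate_transfer_cost 500` prints `50.00` and `estimate_transfer_cost 1200` prints `120.00`, matching the approximate costs quoted for bio4j-lite and bio4j-full.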

#### IAM user configuration

You need to grant permissions to the user or role that you will use to download the bio4j distribution: read access to `s3://eu-west-1.releases.bio4j.com/` is enough. The following IAM policy is more than sufficient for that, since it allows every S3 action (`s3:*`) on the bucket and on its objects:

``` json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1434711865000",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::eu-west-1.releases.bio4j.com"
      ]
    },
    {
      "Sid": "Stmt1434711990000",
      "Effect": "Allow",
      "Action": [
        "s3:*"
      ],
      "Resource": [
        "arn:aws:s3:::eu-west-1.releases.bio4j.com/*"
      ]
    }
  ]
}
```
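
Since read access is enough, a narrower policy is likely to work as well. The sketch below writes one to a file. It is an untested assumption that `s3:ListBucket` and `s3:GetObject` cover everything needed for this bucket, and both the function name and the output file name are hypothetical.

```bash
# Write a read-only variant of the policy to the file given as $1.
# Assumption: listing the bucket and getting objects is all that the
# download needs; this has not been verified against the bio4j bucket.
write_bio4j_readonly_policy() {
  cat > "$1" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::eu-west-1.releases.bio4j.com"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::eu-west-1.releases.bio4j.com/*"]
    }
  ]
}
EOF
}
```

For example, `write_bio4j_readonly_policy bio4j-readonly.json` produces a file that can be pasted into the IAM console as an inline policy for the downloading user or role.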
5 changes: 5 additions & 0 deletions docs/raw-input-sources.md
@@ -0,0 +1,5 @@
# Source files for the import

The **XML/TXT** files from the different data sources used for this specific import were downloaded on **12/03/2014** and can be retrieved from the following [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket:

- `s3://eu-west-1.raw.bio4j.com/`