From 5551ae4867ad87ea49001a7acc36f8937e64c195 Mon Sep 17 00:00:00 2001
From: Eduardo Pareja-Tobes
Date: Wed, 1 Jul 2015 15:25:32 +0200
Subject: [PATCH 1/4] rewrite using bio4j titan with AWS

---
 docs/Bio4jAWSReleases.md  | 124 ++++++++++++++++++++++++--------------
 docs/raw-input-sources.md |   5 ++
 2 files changed, 84 insertions(+), 45 deletions(-)
 create mode 100644 docs/raw-input-sources.md

diff --git a/docs/Bio4jAWSReleases.md b/docs/Bio4jAWSReleases.md
index e3fb4c1..3bffc13 100644
--- a/docs/Bio4jAWSReleases.md
+++ b/docs/Bio4jAWSReleases.md
@@ -1,45 +1,79 @@
-## Pre-built Bio4j releases in AWS
-
-This is the best way to go if you don't have the resources for importing Bio4j in your cluster or you simply think that you will be getting a better deal by using it as a service.
-
-There isn't any kind of fee required for using Bio4j in AWS, you will just be paying for the data transfer _(and obviously the instances, volumes, deployed by yourself)_
-If you have any doubt please refer to this section on **[Requester Pays Buckets](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html)**.
-
-We provide two different options:
-
-### Lite version
-
-This database weighs **~460 GB** and includes the following modules:
-
-- UniProt (both SwissProt and TrEMBL)
-- Gene Ontology (GO)
-- NCBI taxonomy
-- Enzyme DB
-- UniProtGO
-- UniProtEnzymeDB
-- UniProtNCBITaxonomy
-- UniProtInteractions
-- UniProtIsoforms
-
-The database files are provided as a **tar** file in the following address:
-
-> s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_lite.tar
-
-### Whole version
-
-This version includes all modules and weighs **~1.2 TB**. Apart from the aforementioned, the following modules are available:
-
-- UniRef
-- UniProtUniRef
-- GenInfo
-- NCBITaxonomyGenInfo
-
-The database files are provided as a **tar** file in the following address:
-
-> s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_whole.tar
-
-### Source files for the import
-
-The **XML/TXT** files from the different data sources used for this specific import were downloaded on the date **12/03/2014** and can be retrieved from the following bucket folder _(also stored in a Requester Pays Bucket)_:
-
-> s3://eu-west-1.raw.bio4j.com/
+# Pre-built Bio4j releases in AWS
+
+> This is the recommended way of using bio4j-titan. For this you need a working Amazon Web Services (AWS) account.
+
+We offer two pre-imported bio4j-titan distributions:
+
+- **bio4j lite** the size of the binaries is approximately `500GB`. It includes the following modules:
+  - UniProt (both SwissProt and TrEMBL)
+  - Gene Ontology (GO)
+  - NCBI taxonomy
+  - Enzyme DB
+  - UniProtGO
+  - UniProtEnzymeDB
+  - UniProtNCBITaxonomy
+  - UniProtInteractions
+  - UniProtIsoforms
+- **bio4j full** the size of the binaries is approximately `1.2 TB`. This version includes all modules:
+  - UniProt (both SwissProt and TrEMBL)
+  - Gene Ontology (GO)
+  - NCBI taxonomy
+  - Enzyme DB
+  - UniProtGO
+  - UniProtEnzymeDB
+  - UniProtNCBITaxonomy
+  - UniProtInteractions
+  - UniProtIsoforms
+  - UniRef
+  - UniProtUniRef
+  - GenInfo
+  - NCBITaxonomyGenInfo
+
+They are available from S3 through a [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket `s3://eu-west-1.releases.bio4j.com/`, in the `eu-west-1` (Ireland) region. The object addresses are
+
+- **bio4j lite** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_but_uniref_and_gi_index.tar`
+- **bio4j full** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_plus_isoforms.tar`
+
+The way this is expected to be used is:
+
+1. launch an EC2 instance in the `eu-west-1` region
+2. download the binary files for either bio4j lite or bio4j full.
+3. enjoy!
+
+#### IMPORTANT: AWS cost and fees
+
+AWS charges fees for downloading S3 objects: [AWS S3 pricing - data transfer](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). However, this is free _if you download it from an EC2 instance within the same region_. Thus, you won't incur in any data transfer cost if you download bio4j from an EC2 instance in the `eu-west-1` region. Your AWS costs would be in this case just those associated to the compute inrastructure: the EC2 instance/s and, if using them, EBS volumes.
+
+**IMPORTANT:** If you download it from your local computer you will incur in sizable costs: around **50$** for bio4j lite and **120$** for bio4j full.
+
+#### IAM user configuration
+
+You need to grant permissions to the user/role which you will use to download the bio4j distribution: read access to `s3://eu-west-1.releases.bio4j.com/` is enough. The following is an IAM policy which is more than sufficient for that:
+
+``` json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Sid": "Stmt1434711865000",
+      "Effect": "Allow",
+      "Action": [
+        "s3:*"
+      ],
+      "Resource": [
+        "arn:aws:s3:::eu-west-1.releases.bio4j.com"
+      ]
+    },
+    {
+      "Sid": "Stmt1434711990000",
+      "Effect": "Allow",
+      "Action": [
+        "s3:*"
+      ],
+      "Resource": [
+        "arn:aws:s3:::eu-west-1.releases.bio4j.com/*"
+      ]
+    }
+  ]
+}
+```
diff --git a/docs/raw-input-sources.md b/docs/raw-input-sources.md
new file mode 100644
index 0000000..00ad36d
--- /dev/null
+++ b/docs/raw-input-sources.md
@@ -0,0 +1,5 @@
+# Source files for the import
+
+The **XML/TXT** files from the different data sources used for this specific import were downloaded on the date **12/03/2014** and can be retrieved from the following [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket:
+
+- `s3://eu-west-1.raw.bio4j.com/`

From b869f64682d82c11422c248e2bb952a707313a2b Mon Sep 17 00:00:00 2001
From: Alexey Alekhin
Date: Wed, 1 Jul 2015 15:58:36 +0200
Subject: [PATCH 2/4] Minor improvements

---
 docs/Bio4jAWSReleases.md | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/docs/Bio4jAWSReleases.md b/docs/Bio4jAWSReleases.md
index 3bffc13..9823b1b 100644
--- a/docs/Bio4jAWSReleases.md
+++ b/docs/Bio4jAWSReleases.md
@@ -4,7 +4,7 @@
 
 We offer two pre-imported bio4j-titan distributions:
 
-- **bio4j lite** the size of the binaries is approximately `500GB`. It includes the following modules:
+- **bio4j-lite** the size of the binaries is approximately `500GB`. It includes the following modules:
   - UniProt (both SwissProt and TrEMBL)
   - Gene Ontology (GO)
   - NCBI taxonomy
@@ -14,7 +14,7 @@ We offer two pre-imported bio4j-titan distributions:
   - UniProtNCBITaxonomy
   - UniProtInteractions
   - UniProtIsoforms
-- **bio4j full** the size of the binaries is approximately `1.2 TB`. This version includes all modules:
+- **bio4j-full** the size of the binaries is approximately `1.2 TB`. This version includes all modules:
   - UniProt (both SwissProt and TrEMBL)
   - Gene Ontology (GO)
   - NCBI taxonomy
@@ -31,20 +31,25 @@ We offer two pre-imported bio4j-titan distributions:
 
 They are available from S3 through a [requester pays](http://docs.aws.amazon.com/AmazonS3/latest/dev/RequesterPaysBuckets.html) bucket `s3://eu-west-1.releases.bio4j.com/`, in the `eu-west-1` (Ireland) region. The object addresses are
 
-- **bio4j lite** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_but_uniref_and_gi_index.tar`
-- **bio4j full** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_plus_isoforms.tar`
+- **bio4j-lite** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_but_uniref_and_gi_index.tar`
+- **bio4j-full** `s3://eu-west-1.releases.bio4j.com/2014_12_03/bio4j_all_plus_isoforms.tar`
 
 The way this is expected to be used is:
 
-1. launch an EC2 instance in the `eu-west-1` region
-2. download the binary files for either bio4j lite or bio4j full.
-3. enjoy!
+1. Launch an EC2 instance in the `eu-west-1` region
+2. Download the binary files for either bio4j-lite or bio4j-full. Using aws-cli you can do it like this:
+
+    ```bash
+    aws s3api get-object --request-payer requester --bucket --key
+    ```
+
+3. Enjoy! Now you should check [TitanDB documentation](http://s3.thinkaurelius.com/docs/titan/0.5.2/) to learn how to connect to the database and query it.
 
 #### IMPORTANT: AWS cost and fees
 
 AWS charges fees for downloading S3 objects: [AWS S3 pricing - data transfer](https://aws.amazon.com/s3/pricing/#Data_Transfer_Pricing). However, this is free _if you download it from an EC2 instance within the same region_. Thus, you won't incur in any data transfer cost if you download bio4j from an EC2 instance in the `eu-west-1` region. Your AWS costs would be in this case just those associated to the compute inrastructure: the EC2 instance/s and, if using them, EBS volumes.
 
-**IMPORTANT:** If you download it from your local computer you will incur in sizable costs: around **50$** for bio4j lite and **120$** for bio4j full.
+**IMPORTANT:** If you download it from your local computer you will incur in sizable costs: around **$50** for bio4j-lite and **$120** for bio4j-full.
 
 #### IAM user configuration
 

From d559f224fea8c5e8efd4190a237be7b50032097e Mon Sep 17 00:00:00 2001
From: Alexey Alekhin
Date: Wed, 1 Jul 2015 16:12:43 +0200
Subject: [PATCH 3/4] Added the tar step

---
 docs/Bio4jAWSReleases.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/docs/Bio4jAWSReleases.md b/docs/Bio4jAWSReleases.md
index 9823b1b..ab03d42 100644
--- a/docs/Bio4jAWSReleases.md
+++ b/docs/Bio4jAWSReleases.md
@@ -37,13 +37,19 @@ They are available from S3 through a [requester pays](http://docs.aws.amazon.com
 The way this is expected to be used is:
 
 1. Launch an EC2 instance in the `eu-west-1` region
-2. Download the binary files for either bio4j-lite or bio4j-full. Using aws-cli you can do it like this:
+2. Download the binary files for either bio4j-lite or bio4j-full. You can use aws-cli for that. For example:
 
     ```bash
-    aws s3api get-object --request-payer requester --bucket --key
+    aws s3api get-object --request-payer requester --bucket eu-west-1.releases.bio4j.com --key bio4j.tar
     ```
 
-3. Enjoy! Now you should check [TitanDB documentation](http://s3.thinkaurelius.com/docs/titan/0.5.2/) to learn how to connect to the database and query it.
+3. Extract the downloaded archive:
+
+    ```bash
+    tar xvf bio4j.tar
+    ```
+
+4. Enjoy! Now you should check [TitanDB documentation](http://s3.thinkaurelius.com/docs/titan/0.5.2/) to learn how to connect to the database and query it.
 
 #### IMPORTANT: AWS cost and fees
 

From bc819ec0fd861676471cdd1518433e43822c11cc Mon Sep 17 00:00:00 2001
From: Alexey Alekhin
Date: Wed, 1 Jul 2015 16:16:13 +0200
Subject: [PATCH 4/4] Added a note about instance storage

---
 docs/Bio4jAWSReleases.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/Bio4jAWSReleases.md b/docs/Bio4jAWSReleases.md
index ab03d42..1f40bab 100644
--- a/docs/Bio4jAWSReleases.md
+++ b/docs/Bio4jAWSReleases.md
@@ -37,6 +37,7 @@ They are available from S3 through a [requester pays](http://docs.aws.amazon.com
 The way this is expected to be used is:
 
 1. Launch an EC2 instance in the `eu-west-1` region
+    - note that you need to use either EBS volumes or an instance type with enough of ephemeral storage (for example `i2.xlarge`)
 2. Download the binary files for either bio4j-lite or bio4j-full. You can use aws-cli for that. For example:
 
     ```bash