Merge pull request #419 from jmmshn/master
Updated documentation.
Shyam Dwaraknath authored Mar 22, 2021
2 parents 1897d9a + 6a9b7f5 commit 672c6ae
Showing 3 changed files with 71 additions and 4 deletions.
67 changes: 67 additions & 0 deletions docs/getting_started/advanced_stores.md
@@ -0,0 +1,67 @@
# Configuration and Usage of Advanced `Store`s

## S3Store

### Configuration

The S3Store interfaces with S3 object storage via [boto3](https://pypi.org/project/boto3/).
For this to work properly, you have to set your basic configuration in `~/.aws/config`:
```buildoutcfg
[default]
source_profile = default
```

Then, you have to set up your credentials in `~/.aws/credentials`:
```buildoutcfg
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
```

For more information on the configuration, please see the AWS [documentation](https://docs.aws.amazon.com/credref/latest/refdocs/settings-global.html).
Note that although these configuration files live in the `~/.aws` folder, they are also used for other S3-compatible services such as the self-hosted [minio](https://min.io/) service.
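
Because the credentials file uses INI syntax, it can be sanity-checked with Python's standard `configparser` before a store is constructed. The sketch below is illustrative only: `missing_credentials` is a hypothetical helper (not part of maggma or boto3), and it parses an inline string rather than reading `~/.aws/credentials`.

```python
import configparser

# Keys that the credentials example above defines for a profile
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credentials(text, profile="default"):
    """Return the required keys absent from `profile` (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]


example = """
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
"""
print(missing_credentials(example))
```

An empty list means the profile defines both keys; it says nothing about whether the credentials are actually valid for the service.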

### Basic Usage

MongoDB is not designed to handle large object storage.
As such, we created an abstract object that combines the large object storage capabilities of Amazon S3 with the easy, Python-friendly query language of MongoDB.
Every `S3Store` includes an `index` store that holds only the queryable fields plus the object key used to retrieve the data from the S3 bucket. The object key is stored under the `key` attribute (called `'fs_id'` by default).

An entry in the `index` may look something like this:
```
{
    "fs_id": "5fc6b87e99071dfdf04ca871",
    "task_id": "mp-12345"
}
```
Please note that since we give users the ability to reconstruct the index store from the object metadata, the size of each entry in the `index` is limited by the metadata limit of the S3 service and not by MongoDB.
Different S3 services might have different rules, but the limit is typically much smaller than MongoDB's: on [aws](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html), the PUT request header is limited to 8 KB, of which user-defined metadata may occupy at most 2 KB.
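
To make the split between index entry and stored object concrete, here is a minimal sketch under stated assumptions. `split_document`, `metadata_size`, and `fits_in_metadata` are hypothetical helpers, not maggma functions, and the size estimate (UTF-8 bytes of all keys and values) is only an approximation of how a service counts metadata. The limit is passed explicitly because it differs between services.

```python
def split_document(doc, searchable_fields, key="fs_id"):
    """Split `doc` into a small index entry and the full payload (illustrative)."""
    index_entry = {key: doc[key]}
    index_entry.update({f: doc[f] for f in searchable_fields if f in doc})
    return index_entry, doc


def metadata_size(entry):
    """Approximate size: UTF-8 byte count of every key and stringified value."""
    return sum(
        len(str(k).encode("utf-8")) + len(str(v).encode("utf-8"))
        for k, v in entry.items()
    )


def fits_in_metadata(entry, limit_bytes):
    """True if the entry's approximate size is within the given limit."""
    return metadata_size(entry) <= limit_bytes


doc = {
    "fs_id": "5fc6b87e99071dfdf04ca871",
    "task_id": "mp-12345",
    "data": [0.0] * 100_000,  # large payload that belongs in the bucket
}
index_entry, payload = split_document(doc, searchable_fields=["task_id"])
print(index_entry)
# 2 KB is the AWS cap on user-defined metadata; other services may differ
print(fits_in_metadata(index_entry, limit_bytes=2 * 1024))
```

Only the small `index_entry` needs to fit in the metadata; the full document, including the large `data` field, is what goes into the bucket.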

The `S3Store` should be constructed as follows:

```python
from maggma.stores import MongograntStore, S3Store

# Index store holding the queryable fields and the object key of each entry
index = MongograntStore(
    "ro:mongodb03/js_cathodes",
    "atomate_aeccar0_fs_index",
    key="fs_id",
)
s3store = S3Store(
    index=index,
    bucket="<<BUCKET_NAME>>",
    s3_profile="<<S3_PROFILE_NAME>>",
    compress=True,
    endpoint_url="<<S3_URL>>",
    sub_dir="atomate_aeccar0_fs",
    s3_workers=4,
)
```

The `sub_dir` field creates subdirectories in the bucket to help the user organize their data.
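
One way to picture the resulting layout is an object key made of the sub-directory followed by the document's key value. The sketch below assumes that layout; `object_key` is a hypothetical helper, and the exact key format used inside `S3Store` may differ.

```python
def object_key(doc_key, sub_dir=None):
    """Build a bucket object key, optionally prefixed by a sub-directory (assumed layout)."""
    prefix = sub_dir.strip("/") + "/" if sub_dir else ""
    return prefix + str(doc_key)


print(object_key("5fc6b87e99071dfdf04ca871", sub_dir="atomate_aeccar0_fs"))
print(object_key("5fc6b87e99071dfdf04ca871"))
```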

### Parallelism

Once you start working with large quantities of data, the speed at which you process this data will often be limited by database I/O.
The upload is usually the most time-consuming part of the process, so we have implemented thread-level parallelism in the `update` member function.
The `update` function receives an entire chunk of processed data, as defined by `chunk_size`.
However, `Store.update` is typically called in the `update_targets` part of a builder, where builder execution is no longer multi-threaded.
As such, we multithread the execution inside of `update`, using `s3_workers` threads to perform the write operations.
As a general rule of thumb, if you notice that your update step is taking too long, try changing the `s3_workers` field; its optimal value depends on server-side resources.
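
The general pattern can be sketched with the standard library's `ThreadPoolExecutor`. This is not the code inside `S3Store.update`; `write_chunk` and `fake_upload` are hypothetical names, and `fake_upload` stands in for the network write that a real store would perform.

```python
from concurrent.futures import ThreadPoolExecutor


def write_chunk(docs, upload_one, s3_workers=1):
    """Call `upload_one` on every document in `docs` using `s3_workers` threads."""
    with ThreadPoolExecutor(max_workers=s3_workers) as pool:
        # map preserves input order and re-raises any exception from a worker
        return list(pool.map(upload_one, docs))


def fake_upload(doc):
    """Stand-in for a network write; returns the key that was 'written'."""
    return doc["fs_id"]


chunk = [{"fs_id": str(i)} for i in range(8)]
written = write_chunk(chunk, fake_upload, s3_workers=4)
print(written)
```

Threads suit this workload because each write spends most of its time waiting on the network, so several uploads can be in flight at once even though the surrounding builder step runs in a single thread.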
6 changes: 3 additions & 3 deletions docs/getting_started/stores.md
@@ -15,15 +15,15 @@ Current working and tested Stores include:
- VaultStore: uses Vault to get credentials for a MongoDB database
- AliasingStore: aliases keys from the underlying store to new names
- SandboxStore: provides permission control to documents via a `_sbxn` sandbox key
- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions
- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- JointStore: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection
- ConcatStore: concatenates several MongoDB collections in series so they look like one collection

## The `Store` interface

### Initializing a Store

All `Store`s have a few basic arguments that are critical to understand. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents part. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`) `Store`s can also take a `Validator` object to make sure the data going into obeys some schema.
All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells the `Store` how to order the documents by date; the value is typically a `datetime`, but can also be an ISO 8601-formatted string (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema.

### Using a Store

@@ -40,7 +40,7 @@ Stores provide a number of basic methods that make easy to use:
- query: Standard mongo style `find` method that lets you search the store.
- query_one: Same as above but limits returned results to just the first document that matches your query.
- update: Update the documents into the collection. This will override documents if the key field matches.
- ensure_index: This creates an index the underlying data-source for fast querying.
- ensure_index: This creates an index for the underlying data-source for fast querying.
- distinct: Gets distinct values of a field.
- groupby: Similar to query but performs a grouping operation and returns sets of documents.
- remove_docs: Removes documents from the underlying data source.
2 changes: 1 addition & 1 deletion src/maggma/stores/aws.py
@@ -40,7 +40,7 @@ def __init__(
endpoint_url: str = None,
sub_dir: str = None,
s3_workers: int = 1,
key: str = "task_id",
key: str = "fs_id",
searchable_fields: Optional[List[str]] = None,
**kwargs,
):