updated documentaion. #419

Merged
merged 5 commits into from
Mar 22, 2021
67 changes: 67 additions & 0 deletions docs/getting_started/advanced_stores.md
@@ -0,0 +1,67 @@
# Configuration and Usage of Advanced `Store`s

## S3Store

### Configuration

The `S3Store` interfaces with S3 object storage via [boto3](https://pypi.org/project/boto3/).
For this to work properly, you have to set your basic configuration in `~/.aws/config`:
```buildoutcfg
[default]
source_profile = default
```

Then, you have to set up your credentials in `~/.aws/credentials`:
```buildoutcfg
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
```

For more information on these settings, see the AWS [documentation](https://docs.aws.amazon.com/credref/latest/refdocs/settings-global.html).
Note that although these files live in the `~/.aws` folder, they are also used for other S3-compatible services, such as a self-hosted [minio](https://min.io/) instance.
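
As a quick sanity check before handing things over to boto3, the credentials file can be parsed with Python's standard `configparser`. This is only a sketch: the helper name `missing_credential_keys` is hypothetical, and it inspects the INI text itself rather than reproducing how boto3 resolves credentials.

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credential_keys(ini_text: str, profile: str = "default") -> list:
    """Return the required keys absent from `profile` (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(profile):
        # The whole profile is missing, so every required key is missing too.
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(profile, key)]


example = """
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
"""
print(missing_credential_keys(example))
```

To check a real file, the text could be read with `pathlib.Path("~/.aws/credentials").expanduser().read_text()` and passed to the same helper.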

### Basic Usage

MongoDB is not designed to handle large object storage.
As such, we created an abstract object that combines the large object storage capabilities of Amazon S3 with the easy, Python-friendly query language of MongoDB.
Every `S3Store` includes an `index` store that holds only selected queryable fields plus the object key used to retrieve the data from the S3 bucket. The name of that key field is set by the `key` attribute (`'fs_id'` by default).
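
The split between a small queryable index and a large stored object can be pictured with a toy in-memory model. This is not maggma's implementation: `ToyObjectStore` and its two dictionaries are hypothetical stand-ins for the MongoDB index and the S3 bucket, and `zlib` with JSON merely illustrates one way a payload could be compressed.

```python
import json
import zlib


class ToyObjectStore:
    """Hypothetical stand-in: one dict as the index, one dict as the bucket."""

    def __init__(self, key="fs_id", searchable_fields=("task_id",)):
        self.key = key
        self.searchable_fields = searchable_fields
        self.index = {}   # small, queryable entries (MongoDB in the real store)
        self.bucket = {}  # large compressed payloads (S3 in the real store)

    def update(self, doc):
        object_key = doc[self.key]
        # Only the key and the searchable fields go into the index.
        entry = {self.key: object_key}
        entry.update({f: doc[f] for f in self.searchable_fields if f in doc})
        self.index[object_key] = entry
        # The full document is serialized, compressed, and stored under the key.
        self.bucket[object_key] = zlib.compress(json.dumps(doc).encode("utf-8"))

    def query(self, criteria):
        # Match against the index first, then fetch the payload by its key.
        for object_key, entry in self.index.items():
            if all(entry.get(k) == v for k, v in criteria.items()):
                yield json.loads(zlib.decompress(self.bucket[object_key]))


store = ToyObjectStore()
store.update({"fs_id": "5fc6b87e99071dfdf04ca871", "task_id": "mp-12345", "data": [0.0] * 1000})
match = next(store.query({"task_id": "mp-12345"}))
print(len(match["data"]))  # 1000
```

The query never touches the large payload until an index entry matches, which is the point of keeping the index small.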

An entry in the `index` may look something like this:
```
{
    "fs_id": "5fc6b87e99071dfdf04ca871",
    "task_id": "mp-12345"
}
```
Note that because the index store can be reconstructed from the object metadata, the size of an `index` entry is limited by what the object metadata can hold, not by MongoDB.
Different S3 services may have different rules, but the limit is typically much smaller than MongoDB's 16 MB document limit: [AWS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html), for example, caps the PUT request header at 8 KB, of which user-defined metadata may use at most 2 KB.
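
To see whether an index entry would fit in object metadata, its size can be estimated before uploading. A minimal sketch, assuming the AWS rule that user-defined metadata is measured as the sum of the UTF-8 byte lengths of each key and value, with a 2 KB cap on that user-defined part. The helper names `metadata_size_bytes` and `fits_in_metadata` are illustrative, and other S3-compatible services may count differently.

```python
def metadata_size_bytes(metadata: dict) -> int:
    """Sum of the UTF-8 byte lengths of every key and value (values stringified)."""
    return sum(
        len(str(key).encode("utf-8")) + len(str(value).encode("utf-8"))
        for key, value in metadata.items()
    )


def fits_in_metadata(metadata: dict, limit: int = 2 * 1024) -> bool:
    # The 2 KB default is an assumption taken from AWS's user-defined metadata limit;
    # pass a different `limit` for other services.
    return metadata_size_bytes(metadata) <= limit


entry = {"fs_id": "5fc6b87e99071dfdf04ca871", "task_id": "mp-12345"}
print(metadata_size_bytes(entry), fits_in_metadata(entry))
```

An entry like the one above is far below the cap; long lists or nested documents are what usually push an entry over it.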

The `S3Store` should be constructed as follows:

```python
from maggma.stores import MongograntStore, S3Store

# Index store: holds the queryable fields and the object key
index = MongograntStore(
    "ro:mongodb03/js_cathodes",
    "atomate_aeccar0_fs_index",
    key="fs_id",
)

# Object store backed by the S3 bucket
s3store = S3Store(
    index=index,
    bucket="<<BUCKET_NAME>>",
    s3_profile="<<S3_PROFILE_NAME>>",
    compress=True,
    endpoint_url="<<S3_URL>>",
    sub_dir="atomate_aeccar0_fs",
    s3_workers=4,
)
```

The `sub_dir` field creates subdirectories in the bucket to help the user organize their data.

### Parallelism

Once you start working with large quantities of data, the speed at which you process this data will often be limited by database I/O.
For the most time-consuming part of the process, the upload, we have implemented thread-level parallelism in the `update` member function.
The `update` function receives an entire chunk of processed data, as defined by `chunk_size`.
However, `Store.update` is typically called in the `update_targets` part of a builder, where builder execution is no longer multi-threaded.
As such, we multithread the execution inside of `update`, using `s3_workers` threads to perform the write operations.
As a general rule of thumb, if you notice that your update step is taking too long, you should adjust the `s3_workers` field; its optimal value depends on server-side resources.
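
The idea of fanning out writes inside a single `update` call can be sketched with the standard library's `ThreadPoolExecutor`. This illustrates the pattern only and is not the code in `S3Store.update`: `write_one`, `threaded_update`, and the in-memory `bucket` dictionary are hypothetical stand-ins for the per-document upload.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

bucket = {}                      # stand-in for the remote bucket
bucket_lock = threading.Lock()   # protects the shared dict in this toy example


def write_one(doc: dict) -> str:
    """Pretend upload of a single document; returns the key that was written."""
    with bucket_lock:
        bucket[doc["fs_id"]] = doc
    return doc["fs_id"]


def threaded_update(docs: list, s3_workers: int = 4) -> list:
    """Write every document in the chunk using `s3_workers` threads."""
    with ThreadPoolExecutor(max_workers=s3_workers) as pool:
        # `map` preserves input order and re-raises any worker exception.
        return list(pool.map(write_one, docs))


chunk = [{"fs_id": f"id-{i}", "payload": i} for i in range(100)]
written = threaded_update(chunk, s3_workers=4)
print(len(written), len(bucket))
```

Threads help in this situation because uploads spend most of their time waiting on the network, so the best worker count is worth measuring against the server in use.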
6 changes: 3 additions & 3 deletions docs/getting_started/stores.md
@@ -15,15 +15,15 @@ Current working and tested Stores include:
- VaulStore: uses Vault to get credentials for a MongoDB database
- AliasingStore: aliases keys from the underlying store to new names
- SandboxStore: provides permission control to documents via a `_sbxn` sandbox key
-- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions
+- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
- JointStore: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection
- ConcatStore: concatenates several MongoDB collections in series so they look like one collection

## The `Store` interface

### Initializing a Store

-All `Store`s have a few basic arguments that are critical to understand. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents part. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`) `Store`s can also take a `Validator` object to make sure the data going into obeys some schema.
+All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-formatted string (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema.

### Using a Store

@@ -40,7 +40,7 @@ Stores provide a number of basic methods that make easy to use:
- query: Standard mongo style `find` method that lets you search the store.
- query_one: Same as above but limits returned results to just the first document that matches your query.
- update: Update the documents into the collection. This will override documents if the key field matches.
-- ensure_index: This creates an index the underlying data-source for fast querying.
+- ensure_index: This creates an index for the underlying data-source for fast querying.
- distinct: Gets distinct values of a field.
- groupby: Similar to query but performs a grouping operation and returns sets of documents.
- remove_docs: Removes documents from the underlying data source.
2 changes: 1 addition & 1 deletion src/maggma/stores/aws.py
@@ -40,7 +40,7 @@ def __init__(
endpoint_url: str = None,
sub_dir: str = None,
s3_workers: int = 1,
-key: str = "task_id",
+key: str = "fs_id",
searchable_fields: Optional[List[str]] = None,
**kwargs,
):