diff --git a/docs/getting_started/advanced_stores.md b/docs/getting_started/advanced_stores.md
new file mode 100644
index 000000000..409806408
--- /dev/null
+++ b/docs/getting_started/advanced_stores.md
@@ -0,0 +1,67 @@
+# Configuration and Usage of Advanced `Store`s
+
+## S3Store
+
+### Configuration
+
+The S3Store interfaces with S3 object storage via [boto3](https://pypi.org/project/boto3/).
+For this to work properly, you have to set your basic configuration in `~/.aws/config`:
+```buildoutcfg
+[default]
+source_profile = default
+```
+
+Then, you have to set up your credentials in `~/.aws/credentials`:
+```buildoutcfg
+[default]
+aws_access_key_id = YOUR_KEY
+aws_secret_access_key = YOUR_SECRET
+```
+
+For more information on the configuration, please see the following [documentation](https://docs.aws.amazon.com/credref/latest/refdocs/settings-global.html).
+Note that while these configurations are in the `~/.aws` folder, they are shared by other similar services like the self-hosted [MinIO](https://min.io/) service.
+
+### Basic Usage
+
+MongoDB is not designed to handle large object storage.
+As such, we created an abstract object that combines the large object storage capabilities of Amazon S3 and the easy, Python-friendly query language of MongoDB.
+These `S3Store`s all include an `index` store that only stores specific queryable data and the object key for retrieving the data from an S3 bucket using the `key` attribute (called `'fs_id'` by default).
+
+An entry in the `index` may look something like this:
+```
+{
+    fs_id : "5fc6b87e99071dfdf04ca871",
+    task_id : "mp-12345"
+}
+```
+Please note that since we are giving users the ability to reconstruct the index store using the object metadata, the size of each entry in the `index` is limited by the S3 metadata limit and not by MongoDB.
+Different S3 services might have different rules, but the limit is typically smaller: on [AWS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html), for example, the PUT request header is limited to 8 KB, of which user-defined metadata may take up at most 2 KB.
+
+The `S3Store` should be constructed as follows:
+
+```python
+from maggma.stores import MongograntStore, S3Store
+index = MongograntStore("ro:mongodb03/js_cathodes",
+                        "atomate_aeccar0_fs_index",
+                        key="fs_id")
+s3store = S3Store(index=index,
+                  bucket="<>",
+                  s3_profile="<>",
+                  compress=True,
+                  endpoint_url="<>",
+                  sub_dir="atomate_aeccar0_fs",
+                  s3_workers=4
+                  )
+```
+
+The `sub_dir` field creates subdirectories in the bucket to help the user organize their data.
+
+### Parallelism
+
+Once you start working with large quantities of data, the speed at which you process this data will often be limited by database I/O.
+For the upload step, which is the most time-consuming part of the process, we have implemented thread-level parallelism in the `update` member function.
+The `update` function receives an entire chunk of processed data, as defined by `chunk_size`.
+However, `Store.update` is typically called in the `update_targets` part of a builder, where builder execution is no longer multi-threaded.
+As such, we multithread the execution inside of `update` using `s3_workers` threads to perform the database write operation.
+As a general rule of thumb, if you notice that your update step is taking too long, you should change the `s3_workers` field, whose
+optimal value depends on server-side resources.
diff --git a/docs/getting_started/stores.md b/docs/getting_started/stores.md
index 76cddf654..e21b65311 100644
--- a/docs/getting_started/stores.md
+++ b/docs/getting_started/stores.md
@@ -15,7 +15,7 @@ Current working and tested Stores include:
 - VaulStore: uses Vault to get credentials for a MongoDB database
 - AliasingStore: aliases keys from the underlying store to new names
 - SandboxStore: provides permission control to documents via a `_sbxn` sandbox key
-- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions
+- S3Store: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md))
 - JointStore: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection
 - ConcatStore: concatenates several MongoDB collections in series so they look like one collection
 
@@ -23,7 +23,7 @@ Current working and tested Stores include:
 
 ### Initializing a Store
 
-All `Store`s have a few basic arguments that are critical to understand. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents part. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`) `Store`s can also take a `Validator` object to make sure the data going into obeys some schema.
+All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart.
Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema.
 
 ### Using a Store
 
@@ -40,7 +40,7 @@ Stores provide a number of basic methods that make easy to use:
 - query: Standard mongo style `find` method that lets you search the store.
 - query_one: Same as above but limits returned results to just the first document that matches your query.
 - update: Update the documents into the collection. This will override documents if the key field matches.
-- ensure_index: This creates an index the underlying data-source for fast querying.
+- ensure_index: This creates an index for the underlying data-source for fast querying.
 - distinct: Gets distinct values of a field.
 - groupby: Similar to query but performs a grouping operation and returns sets of documents.
 - remove_docs: Removes documents from the underlying data source.
diff --git a/src/maggma/stores/aws.py b/src/maggma/stores/aws.py
index 9e1b050bc..64e690675 100644
--- a/src/maggma/stores/aws.py
+++ b/src/maggma/stores/aws.py
@@ -40,7 +40,7 @@ def __init__(
         endpoint_url: str = None,
         sub_dir: str = None,
         s3_workers: int = 1,
-        key: str = "task_id",
+        key: str = "fs_id",
         searchable_fields: Optional[List[str]] = None,
         **kwargs,
     ):