Updated GroupBuilder #79

Merged
merged 17 commits into from
Jan 20, 2020
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -14,6 +14,7 @@ nav:
- Advanced Builders: getting_started/advanced_builder.md
- Running a Builder Pipeline: getting_started/running_builders.md
- Working with MapBuilder: getting_started/map_builder.md
- Working with GroupBuilder: getting_started/group_builder.md
- Reference:
Core:
Store: reference/core_store.md
17 changes: 14 additions & 3 deletions src/docs/concepts.md
@@ -1,12 +1,23 @@
# Concepts

## MSONable

Maggma objects implement the `MSONable` pattern which enables these objects to serialize and deserialize to python dictionaries or even JSON. The MSONable encoder injects `@module` and `@class` info so that the object can be deserialized without manual intervention. This enables much of Maggma to operate like a plugin system.
One challenge in building complex data-transformation codes is keeping track of all the settings necessary to build some output database. One bad solution is to hard-code these settings, but then any modification becomes difficult to track.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters, in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
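As a rough sketch of what this looks like in practice (the exact `@module` path in the comment is illustrative, and `MontyDecoder` comes from the `monty` package that provides `MSONable`):

``` python
from maggma.stores import MongoStore
from monty.json import MontyDecoder

store = MongoStore(database="test_db", collection_name="test_collection")

# Every Maggma object can serialize itself to a dictionary
doc = store.as_dict()
# doc holds the configuration plus "@class" and "@module" keys, e.g.
# {"@class": "MongoStore", "@module": "maggma.stores.mongolike", "database": "test_db", ...}

# The dictionary can be turned back into a working object
# without knowing its class ahead of time
store_again = MontyDecoder().process_decoded(doc)
```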

## Store

Stores are document-based data sources and data sinks. They are modeled around the MongoDB collection although they can represent more complex data sources as well. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma: the `key` and the `last_updated_field`. `key` is the field that is used to index the underlying data source. `last_updated_field` is the timestamp of when that document was last updated.
Another challenge is dealing with all the different types of databases out there. Maggma was originally built on MongoDB, so its interface looks a lot like `PyMongo`. Still, there are a number of useful new `object` databases that can be used to store large quantities of data you don't need to search in, such as Amazon S3 and Google Cloud. It would be nice to have a single interface to all of these so you could write your data pipeline only once.

Stores are databases containing organized document-based data. They represent either a data source or a data sink. They are modeled around the MongoDB collection, although they can represent more complex data sources that auto-alias keys without the user knowing, or even provide concatenation or joining of Stores. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma that help in efficient document processing: the `key` and the `last_updated_field`. The `key` is the field that is used to uniquely index the underlying data source. The `last_updated_field` is the timestamp of when that document was last modified.
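Because every Store exposes this same interface, the code that talks to a Store does not need to change when the backend does. A minimal sketch (the database, collection, and field names are illustrative):

``` python
from maggma.stores import MongoStore

store = MongoStore(database="inventory_db", collection_name="items", key="name")

store.connect()
fruit_docs = list(store.query(criteria={"type": "fruit"}))
item_types = store.distinct("type")
store.close()
```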

## Builder

Builders represent a data transformation step. Builders break down each transformation into 3 key steps: `get_items`, `process_item`, and `update_targets`. Both `get_items` and `update_targets` can perform IO to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system.
Builders represent a data processing step. Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`:

1. `get_items`: Retrieve items from the source Store(s) for processing in the next phase.
2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage.
3. `update_targets`: Add the processed items to the target Store(s).

Both `get_items` and `update_targets` can perform IO (input/output) to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system.
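A minimal sketch of this three-phase structure (the class, the field names, and the use of `self.sources`/`self.targets` from the base `Builder` are illustrative assumptions, not a prescribed pattern):

``` python
from maggma.core import Builder


class ExampleBuilder(Builder):
    def get_items(self):
        # Phase 1: IO allowed -- stream documents out of the source Store
        for doc in self.sources[0].query():
            yield doc

    def process_item(self, item):
        # Phase 2: no IO -- a pure transformation so Maggma can parallelize it
        return {"name": item.get("name"), "processed": True}

    def update_targets(self, items):
        # Phase 3: IO allowed -- write the processed documents to the target Store
        self.targets[0].update(items)
```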
11 changes: 6 additions & 5 deletions src/docs/getting_started/advanced_builder.md
@@ -22,9 +22,9 @@ One of the most important features in a builder is incremental building which al
new_ids = self.target.newer_in(self.source)
```
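A sketch of how this might be wired into `get_items` (assuming the keys returned by `newer_in` can be queried directly against the source's `key` field):

``` python
def get_items(self):
    # Only pull documents that are newer in the source than in the target
    new_ids = self.target.newer_in(self.source)
    for doc in self.source.query(criteria={self.source.key: {"$in": new_ids}}):
        yield doc
```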

## Speeding up IO
## Speeding up Data Transfers

Since `maggma` is designed around Mongo style data sources and sinks, building indexes or in-memory copies of fields you want to search on is critical to get the fastest possible IO. Since this is very builder and document style dependent, `maggma` provides a direct interface to `ensure_indexes` on a Store. A common paradigm is to do this in the beginning of `get_items`:
Since `maggma` is designed around Mongo style data sources and sinks, building indexes or in-memory copies of fields you want to search on is critical to get the fastest possible data input/output (IO). Since this is very builder and document style dependent, `maggma` provides a direct interface to `ensure_indexes` on a Store. A common paradigm is to do this in the beginning of `get_items`:

``` python
def ensure_indexes(self):
@@ -37,8 +37,9 @@ Since `maggma` is designed around Mongo style data sources and sinks, building i
```


## Getting Advanced Features for Free
## Built in Templates for Advanced Builders

`maggma` implements standard builders that implement many of these advanced features:
`maggma` implements templates for builders that have many of the advanced features listed above:

- [MapBuilder](map_builder.md)
- [MapBuilder](map_builder.md) Creates a one-to-one document mapping of items in the source Store to the transformed documents in the target Store.
- [GroupBuilder](group_builder.md) Creates a many-to-one document mapping of items in the source Store to transformed documents in the target Store.
84 changes: 84 additions & 0 deletions src/docs/getting_started/group_builder.md
@@ -0,0 +1,84 @@
# Group Builder

Another advanced template in `maggma` is the `GroupBuilder`, which groups documents together before applying your function to the group of items. Just like `MapBuilder`, `GroupBuilder` also handles incremental building, keeping track of errors, getting only the data you need, and managing timeouts. `GroupBuilder` won't delete orphaned documents since that reverse relationship isn't valid.

Let's create a simple `ResupplyBuilder`, which will look at the inventory of items and determine what items need resupply. The source document will look something like this:

``` JSON
{
  "name": "Banana",
  "type": "fruit",
  "quantity": 20,
  "minimum": 10,
  "last_updated": "2019-11-03T19:09:45"
}
```

Our builder should give us documents that look like this:

``` JSON
{
  "names": ["Grapes", "Apples", "Bananas"],
  "type": "fruit",
  "resupply": {
    "Apples": 10,
    "Bananas": 0,
    "Grapes": 5
  },
  "last_updated": "2019-11-03T19:09:45"
}
```

To begin, we define our `GroupBuilder`:

``` python
from typing import Dict, List

from maggma.builders import GroupBuilder
from maggma.core import Store


class ResupplyBuilder(GroupBuilder):
    """
    Simple builder that determines which items to resupply
    """

    def __init__(self, inventory: Store, resupply: Store, resupply_percent: int = 100, **kwargs):
        """
        Arguments:
            inventory: current inventory information
            resupply: target resupply information
            resupply_percent: the percent of the minimum to include in the resupply
        """
        self.inventory = inventory
        self.resupply = resupply
        self.resupply_percent = resupply_percent
        self.kwargs = kwargs

        super().__init__(source=inventory, target=resupply, grouping_properties=["type"], **kwargs)
```

Note that unlike the previous `MapBuilder` example, we didn't name the source and target stores `source` and `target`. Providing more useful names is a good idea when writing builders, as it makes it clearer what the underlying data should look like.

`GroupBuilder` inherits from `MapBuilder`, so it has the same configuration parameters:

- projection: list of the fields you want to project. This can reduce the data transfer load if you only need certain fields or sub-documents from the source documents
- timeout: optional timeout on the process function
- store_process_timeout: adds the process time into the target document for profiling
- retry_failed: retries running the process function on previously failed documents

One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the many-to-one relationship makes determining orphaned documents very difficult.
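As a rough sketch of wiring these options up (assuming the option names above map directly onto keyword arguments, and using hypothetical store settings):

``` python
from maggma.stores import MongoStore

inventory = MongoStore("supply_db", "inventory", key="name", last_updated_field="last_updated")
resupply = MongoStore("supply_db", "resupply", last_updated_field="last_updated")

builder = ResupplyBuilder(
    inventory,
    resupply,
    resupply_percent=50,
    projection=["name", "type", "quantity", "minimum"],  # only pull the fields we actually use
    timeout=60,         # give up on a slow group after 60 seconds
    retry_failed=True,  # re-attempt documents that errored on a previous run
)
```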

Finally, let's get to the hard part, which is running our function. We do this by defining `unary_function`:

``` python
def unary_function(self, items: List[Dict]) -> Dict:
    resupply = {}

    for item in items:
        if item["quantity"] < item["minimum"]:
            # Stock has fallen below the minimum, so order resupply_percent of the minimum
            resupply[item["name"]] = int(item["minimum"] * self.resupply_percent / 100)
        else:
            resupply[item["name"]] = 0
    return {"resupply": resupply}
```

Just as in `MapBuilder`, we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`; the same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version. `GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values will be put together and kept in `names`.
6 changes: 4 additions & 2 deletions src/docs/getting_started/map_builder.md
@@ -2,7 +2,7 @@

`maggma` has a built-in builder called `MapBuilder` which handles a number of tedious tasks in writing a builder. This class is designed to be used like a map operator in other frameworks, or even the built-in `map` function in Python. `MapBuilder` will take each document in the source store, apply the function you give it, and then store the result in the target store. It handles incremental building, keeping track of errors, getting only the data you need, managing timeouts, and deleting orphaned documents through configuration options.

Let's rebuild the `MultiplierBuilder` we wrote earlier using `MapBuilder`:
Let's create the same `MultiplierBuilder` we wrote earlier using `MapBuilder`:

``` python
from maggma.builders import MapBuilder
@@ -41,7 +41,7 @@ Just like before we define a new class, but this time it should inherit from `Ma

MapBuilder has a number of configuration options that you can hardcode as above or expose as properties for the user through `**kwargs`:

- projection: list of the fields you want to project. This can reduce the IO load if you only need certain keys from the source documents
- projection: list of the fields you want to project. This can reduce the data transfer load if you only need certain fields or sub-documents from the source documents
- delete_orphans: this will delete documents in the target which don't have a corresponding document in the source
- timeout: optional timeout on the process function
- store_process_timeout: adds the process time into the target document for profiling
@@ -55,3 +55,5 @@ Finally let's get to the hard part which is running our function. We do this by
    return {"a": item["a"] * self.multiplier}

```

Note that we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`; the same goes for the `last_updated_field`. `MapBuilder` takes care of this, while also recording errors, processing time, and the Builder version.
2 changes: 2 additions & 0 deletions src/docs/getting_started/simple_builder.md
@@ -106,6 +106,8 @@ We could have also returned a list of items:
docs = list(self.source.query())
```

One advantage of using the generator approach is that it is less memory intensive than returning a list of items. For large datasets, building a list of all items for processing may be prohibitive due to memory constraints.
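A small sketch of the two styles side by side (assuming the builder's source Store is available as `self.source`):

``` python
def get_items(self):
    # Generator approach: documents are yielded one at a time,
    # so only the current item needs to be held in memory
    for doc in self.source.query():
        yield doc


def get_items_returning_list(self):
    # List approach: every document is loaded into memory up front
    return list(self.source.query())
```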

## `process_item`

`process_item` just has to do the parallelizable work on each item. Since the item is whatever comes out of `get_items`, you know exactly what it should be. It may be a single document, a list of documents, a mapping, a set, etc.
8 changes: 4 additions & 4 deletions src/docs/getting_started/stores.md
@@ -1,6 +1,6 @@
# Using `Store`

A `Store` is just a wrapper to access data from somewhere. That somewhere is typically a MongoDB collection, but it could also be GridFS which lets you keep large binary objects. `maggma` makes GridFS and MongoDB collections feel the same. Beyond that it adds in something that looks like GridFS but is actually using AWS S3 as the storage space. Finally, `Store` can actually perform logic, concatenating two or more `Stores` together to make them look like one data source for instance. This means you only have to write a `Builder` for one scenario and the choice of `Store` lets you control how and where the data goes.
A `Store` is just a wrapper to access data from somewhere. That somewhere is typically a MongoDB collection, but it could also be GridFS which lets you keep large binary objects. `maggma` makes GridFS and MongoDB collections feel the same. Beyond that it adds in something that looks like GridFS but is actually using AWS S3 as the storage space. Finally, `Store` can actually perform logic, concatenating two or more `Stores` together to make them look like one data source for instance. This means you only have to write a `Builder` once for a given data transformation, and the choice of `Store` lets you control where the data comes from and where it goes.

## List of Stores

@@ -22,15 +22,15 @@ Current working and tested Stores include:

### Initializing a Store

All `Store`s have a few basic arguments that are critical to understand. Any `Store` has two special fields: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field. `last_updated_field` tells `Store` how to order the documents by a date field. `Store`s can also take a `Validator` object to make sure the data going in obeys some schema.
All `Store`s have a few basic arguments that are critical to understand. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-formatted string (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going in obeys some schema.
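A short sketch of initializing a Store with these arguments (the database, collection, and field names are illustrative):

``` python
from maggma.stores import MongoStore

store = MongoStore(
    database="inventory_db",
    collection_name="items",
    key="name",                         # field that uniquely identifies each document
    last_updated_field="last_updated",  # field used to order documents by date
)
```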

### Using a Store

Stores provide a number of basic methods that make them easy to use:

- query: Standard mongo style `find` method that lets you search the store.
- query_one: Same as above but limits to the first document.
- update: Update the documents into the collection. This will override documents if the key field matches. You can temporarily provide extra fields to key these documents on if you don't want to maintain duplicates, for instance keying on both `key` and `last_updated_field`.
- query_one: Same as above but limits returned results to just the first document that matches your query.
- update: Update the documents into the collection. This will override documents if the key field matches.
- ensure_index: This creates an index on the underlying data source for fast querying.
- distinct: Gets distinct values of a field.
- groupby: Similar to query but performs a grouping operation and returns sets of documents.
2 changes: 1 addition & 1 deletion src/docs/index.md
@@ -17,7 +17,7 @@ versions of Python.
Open your terminal and run the following command.

``` shell
pip install -U maggma
pip install --upgrade maggma
```

## Installation from source
2 changes: 2 additions & 0 deletions src/maggma/builders/__init__.py
@@ -0,0 +1,2 @@
from maggma.builders.map_builder import MapBuilder, CopyBuilder
from maggma.builders.group_builder import GroupBuilder