Commit
Merge pull request #79 from materialsproject/refactor
Updated GroupBuilder
Showing 17 changed files with 409 additions and 113 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,23 @@ | ||
# Concepts | ||
|
||
## MSONable | ||
|
||
Maggma objects implement the `MSONable` pattern, which enables these objects to serialize to and deserialize from Python dictionaries or even JSON. The MSONable encoder injects `@module` and `@class` info so that the object can be deserialized without manual intervention. This enables much of Maggma to operate like a plugin system. | ||
One challenge in building complex data-transformation codes is keeping track of all the settings necessary to make some output database. One bad solution is to hard-code these settings, but then any modification is difficult to keep track of. | ||
|
||
Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a Python dictionary with its configuration parameters in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original Python code for that Maggma object. | ||
|
||
## Store | ||
|
||
Stores are document-based data sources and data sinks. They are modeled around the MongoDB collection although they can represent more complex data sources as well. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma: the `key` and the `last_updated_field`. `key` is the field that is used to index the underlying data source. `last_updated_field` is the timestamp of when that document was last modified. | ||
Another challenge is dealing with all the different types of databases out there. Maggma was originally built on MongoDB, so its interface looks a lot like `PyMongo`. Still, there are a number of useful new `object` databases, such as Amazon S3 and Google Cloud, that can be used to store large quantities of data you don't need to search. It would be nice to have a single interface to all of these so you could write your data pipeline only once. | ||
|
||
Stores are databases containing organized document-based data. They represent either a data source or a data sink. They are modeled around the MongoDB collection, although they can represent more complex data sources that auto-alias keys without the user knowing, or even provide concatenation or joining of Stores. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma that help in efficient document processing: the `key` and the `last_updated_field`. `key` is the field that is used to uniquely index the underlying data source. `last_updated_field` is the timestamp of when that document was last modified. | ||
|
||
## Builder | ||
|
||
Builders represent a data transformation step. Builders break down each transformation into 3 key steps: `get_items`, `process_item`, and `update_targets`. Both `get_items` and `update_targets` can perform IO to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system. | ||
Builders represent a data processing step. Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`: | ||
|
||
1. `get_items`: Retrieve items from the source Store(s) for processing by the next phase. | ||
2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage. | ||
3. `update_targets`: Add the processed items to the target Store(s). | ||
|
||
Both `get_items` and `update_targets` can perform IO (input/output) to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system. |
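
The three phases described above can be sketched in plain Python, independent of Maggma. This is a minimal illustration under stated assumptions, not Maggma's implementation: `SquareBuilder`, `run`, and the use of plain lists of dictionaries as stand-in Stores are all hypothetical names invented for the example. It shows why keeping `process_item` free of IO allows the middle phase to be handed to a worker pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


class SquareBuilder:
    """Hypothetical builder; plain lists of dicts stand in for Stores."""

    def __init__(self, source: List[Dict], target: List[Dict]):
        self.source = source
        self.target = target

    def get_items(self) -> Iterable[Dict]:
        # Phase 1: IO is allowed here (reading from the source).
        return iter(self.source)

    def process_item(self, item: Dict) -> Dict:
        # Phase 2: pure computation with no IO, so it is safe to parallelize.
        return {"key": item["key"], "square": item["value"] ** 2}

    def update_targets(self, items: List[Dict]) -> None:
        # Phase 3: IO is allowed here (writing to the target).
        self.target.extend(items)


def run(builder: SquareBuilder, workers: int = 2) -> None:
    """Hypothetical runner: only the middle phase runs in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs.
        processed = list(pool.map(builder.process_item, builder.get_items()))
    builder.update_targets(processed)


source = [{"key": i, "value": i} for i in range(4)]
target: List[Dict] = []
run(SquareBuilder(source, target))
```

Because the reading and writing phases stay on either side of the pool, only `process_item` needs to be safe to run concurrently.
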
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,84 @@ | ||
# Group Builder | ||
|
||
Another advanced template in `maggma` is the `GroupBuilder`, which groups documents together before applying your function to the group of items. Just like `MapBuilder`, `GroupBuilder` also handles incremental building, keeping track of errors, getting only the data you need, and managing timeouts. `GroupBuilder` won't delete orphaned documents since that reverse relationship isn't valid. | ||
|
||
Let's create a simple `ResupplyBuilder`, which will look at the inventory of items and determine what items need resupply. The source document will look something like this: | ||
|
||
``` JSON | ||
{ | ||
    "name": "Bananas", | ||
    "type": "fruit", | ||
    "quantity": 20, | ||
    "minimum": 10, | ||
    "last_updated": "2019-11-03T19:09:45" | ||
} | ||
``` | ||
|
||
Our builder should give us documents that look like this: | ||
|
||
``` JSON | ||
{ | ||
    "names": ["Grapes", "Apples", "Bananas"], | ||
    "type": "fruit", | ||
    "resupply": { | ||
        "Apples": 10, | ||
        "Bananas": 0, | ||
        "Grapes": 5 | ||
    }, | ||
    "last_updated": "2019-11-03T19:09:45" | ||
} | ||
``` | ||
|
||
To begin, we define our `GroupBuilder`: | ||
|
||
``` python | ||
|
||
from maggma.builders import GroupBuilder | ||
from maggma.core import Store | ||
|
||
class ResupplyBuilder(GroupBuilder): | ||
    """ | ||
    Simple builder that determines which items to resupply | ||
    """ | ||
|
||
    def __init__(self, inventory: Store, resupply: Store, resupply_percent: int = 100, **kwargs): | ||
        """ | ||
        Arguments: | ||
            inventory: current inventory information | ||
            resupply: target resupply information | ||
            resupply_percent: the percent of the minimum to include in the resupply | ||
        """ | ||
        self.inventory = inventory | ||
        self.resupply = resupply | ||
        self.resupply_percent = resupply_percent | ||
        self.kwargs = kwargs | ||
|
||
        super().__init__(source=inventory, target=resupply, grouping_properties=["type"], **kwargs) | ||
``` | ||
|
||
Note that unlike the previous `MapBuilder` example, we didn't name the stores `source` and `target`. Providing more useful names is a good idea when writing builders, since it makes clearer what the underlying data should look like. | ||
|
||
`GroupBuilder` inherits from `MapBuilder`, so it has the same configuration parameters: | ||
|
||
- projection: list of the fields you want to project. This can reduce the data transfer load if you only need certain fields or sub-documents from the source documents | ||
- timeout: optional timeout on the process function | ||
- store_process_timeout: adds the process time into the target document for profiling | ||
- retry_failed: retries running the process function on previously failed documents | ||
|
||
One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the many-to-one relationship makes determining orphaned documents very difficult. | ||
|
||
Finally, let's get to the hard part, which is running our function. We do this by defining `unary_function`: | ||
|
||
``` python | ||
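    def unary_function(self, items: List[Dict]) -> Dict: | ||
        # Requires `from typing import Dict, List` at the top of the module | ||
        resupply = {} | ||
|
||
        for item in items: | ||
            # Resupply only the items whose stock has fallen below the minimum | ||
            if item["quantity"] < item["minimum"]: | ||
                # resupply_percent is a percentage, so divide by 100 | ||
                resupply[item["name"]] = int(item["minimum"] * self.resupply_percent / 100) | ||
            else: | ||
                resupply[item["name"]] = 0 | ||
        return {"resupply": resupply} | ||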
``` | ||
|
||
Just as in `MapBuilder`, we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`. The same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version. `GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values will be put together and kept in `names`. |
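
The grouping behaviour described in this page can be illustrated with a standalone sketch that does not use `maggma` at all. Everything here is an assumption made for illustration: `group_and_process` is a hypothetical stand-in for what `GroupBuilder` does internally, the inventory values for apples and grapes are invented, and an item is assumed to need resupply when its quantity falls below its minimum, which is what the example output implies.

```python
from collections import defaultdict
from typing import Dict, List

# Invented inventory documents; only the banana values come from the example.
inventory = [
    {"name": "Apples", "type": "fruit", "quantity": 5, "minimum": 10},
    {"name": "Bananas", "type": "fruit", "quantity": 20, "minimum": 10},
    {"name": "Grapes", "type": "fruit", "quantity": 2, "minimum": 5},
]


def unary_function(items: List[Dict], resupply_percent: int = 100) -> Dict:
    """Order a percentage of the minimum for items that are below it."""
    resupply = {}
    for item in items:
        if item["quantity"] < item["minimum"]:
            resupply[item["name"]] = int(item["minimum"] * resupply_percent / 100)
        else:
            resupply[item["name"]] = 0
    return {"resupply": resupply}


def group_and_process(docs: List[Dict], grouping_property: str, key: str) -> List[Dict]:
    """Hypothetical stand-in for the grouping step; not maggma's code."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[grouping_property]].append(doc)

    results = []
    for value, items in groups.items():
        out = unary_function(items)
        out[grouping_property] = value
        # Keep a plural version of the source key, e.g. "name" -> "names".
        out[key + "s"] = [item[key] for item in items]
        results.append(out)
    return results


results = group_and_process(inventory, grouping_property="type", key="name")
```

With this input, the single `fruit` group yields a `resupply` mapping of 10 apples, 0 bananas, and 5 grapes, alongside a `names` list collecting the source keys.
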
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
from maggma.builders.map_builder import MapBuilder, CopyBuilder | ||
from maggma.builders.group_builder import GroupBuilder |