Updated GroupBuilder #79

Merged
merged 17 commits into from
Jan 20, 2020
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -14,6 +14,7 @@ nav:
- Advanced Builders: getting_started/advanced_builder.md
- Running a Builder Pipeline: getting_started/running_builders.md
- Working with MapBuilder: getting_started/map_builder.md
- Working with GroupBuilder: getting_started/group_builder.md
- Reference:
Core:
Store: reference/core_store.md
17 changes: 14 additions & 3 deletions src/docs/concepts.md
@@ -1,12 +1,23 @@
# Concepts

## MSONable

Maggma objects implement the `MSONable` pattern which enables these objects to serialize and deserialize to python dictionaries or even JSON. The MSONable encoder injects `@module` and `@class` info so that the object can be deserialized without manual intervention. This enables much of Maggma to operate like a plugin system.
One challenge in building complex data-transformation codes is keeping track of all the settings necessary to build some output database. One bad solution is to hard-code these settings, but then any modification becomes difficult to track.

Maggma solves this by putting the configuration with the pipeline definition in JSON or YAML files. This is done using the `MSONable` pattern, which requires that any Maggma object (the databases and transformation steps) can convert itself to a python dictionary with its configuration parameters, in a process called serialization. These dictionaries can then be converted back to the original Maggma object without having to know what class it belonged to. `MSONable` does this by injecting `@class` and `@module` keys that tell it where to find the original python code for that Maggma object.
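As a rough sketch of what this looks like in practice (the exact `@module` path in the comment is illustrative, and `MontyDecoder` comes from the `monty` package that provides `MSONable`):

``` python
from maggma.stores import MongoStore
from monty.json import MontyDecoder

store = MongoStore(database="test_db", collection_name="test_collection")

# Every Maggma object can serialize itself to a dictionary
doc = store.as_dict()
# doc holds the configuration plus "@class" and "@module" keys, e.g.
# {"@class": "MongoStore", "@module": "maggma.stores.mongolike", "database": "test_db", ...}

# The dictionary can be turned back into a working object
# without knowing its class ahead of time
store_again = MontyDecoder().process_decoded(doc)
```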

## Store

Stores are document-based data sources and data sinks. They are modeled around the MongoDB collection although they can represent more complex data sources as well. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma: the `key` and the `last_updated_field`. `key` is the field that is used to index the underlying data source. `last_updated_field` is the timestamp of when that document was last updated.
Another challenge is dealing with all the different types of databases out there. Maggma was originally built on MongoDB, so its interface looks a lot like `PyMongo`. Still, there are a number of useful new `object` databases that can be used to store large quantities of data you don't need to search in, such as Amazon S3 and Google Cloud. It would be nice to have a single interface to all of these so you could write your data pipeline only once.

Stores are databases containing organized document-based data. They represent either a data source or a data sink. They are modeled around the MongoDB collection, although they can represent more complex data sources that auto-alias keys without the user knowing, or even provide concatenation or joining of Stores. Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. Stores also implement a number of critical fields for Maggma that help in efficient document processing: the `key` and the `last_updated_field`. The `key` is the field that is used to uniquely index the underlying data source. The `last_updated_field` is the timestamp of when that document was last modified.
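Because every Store exposes this same interface, the code that talks to a Store does not need to change when the backend does. A minimal sketch (the database, collection, and field names are illustrative):

``` python
from maggma.stores import MongoStore

store = MongoStore(database="inventory_db", collection_name="items", key="name")

store.connect()
fruit_docs = list(store.query(criteria={"type": "fruit"}))
item_types = store.distinct("type")
store.close()
```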

## Builder

Builders represent a data transformation step. Builders break down each transformation into 3 key steps: `get_items`, `process_item`, and `update_targets`. Both `get_items` and `update_targets` can perform IO to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system.
Builders represent a data processing step. Builders break down each transformation into 3 phases: `get_items`, `process_item`, and `update_targets`:

1. `get_items`: Retrieve items from the source Store(s) for processing in the next phase.
2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage.
3. `update_targets`: Add the processed items to the target Store(s).

Both `get_items` and `update_targets` can perform IO (input/output) to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system.
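A minimal sketch of this three-phase structure (the class, the field names, and the use of `self.sources`/`self.targets` from the base `Builder` are illustrative assumptions, not a prescribed pattern):

``` python
from maggma.core import Builder


class ExampleBuilder(Builder):
    def get_items(self):
        # Phase 1: IO allowed -- stream documents out of the source Store
        for doc in self.sources[0].query():
            yield doc

    def process_item(self, item):
        # Phase 2: no IO -- a pure transformation so Maggma can parallelize it
        return {"name": item.get("name"), "processed": True}

    def update_targets(self, items):
        # Phase 3: IO allowed -- write the processed documents to the target Store
        self.targets[0].update(items)
```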
11 changes: 6 additions & 5 deletions src/docs/getting_started/advanced_builder.md
@@ -22,9 +22,9 @@ One of the most important features in a builder is incremental building which al
new_ids = self.target.newer_in(self.source)
```
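A sketch of how this might be wired into `get_items` (assuming the keys returned by `newer_in` can be queried directly against the source's `key` field):

``` python
def get_items(self):
    # Only pull documents that are newer in the source than in the target
    new_ids = self.target.newer_in(self.source)
    for doc in self.source.query(criteria={self.source.key: {"$in": new_ids}}):
        yield doc
```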

## Speeding up IO
## Speeding up Data Transfers

Since `maggma` is designed around Mongo style data sources and sinks, building indexes or in-memory copies of fields you want to search on is critical to get the fastest possible IO. Since this is very builder and document style dependent, `maggma` provides a direct interface to `ensure_indexes` on a Store. A common paradigm is to do this in the beginning of `get_items`:
Since `maggma` is designed around Mongo style data sources and sinks, building indexes or in-memory copies of fields you want to search on is critical to get the fastest possible data input/output (IO). Since this is very builder and document style dependent, `maggma` provides a direct interface to `ensure_indexes` on a Store. A common paradigm is to do this in the beginning of `get_items`:

``` python
def ensure_indexes(self):
@@ -37,8 +37,9 @@ Since `maggma` is designed around Mongo style data sources and sinks, building i
```


## Getting Advanced Features for Free
## Built in Templates for Advanced Builders

`maggma` implements standard builders that implement many of these advanced features:
`maggma` implements templates for builders that have many of the advanced features listed above:

- [MapBuilder](map_builder.md)
- [MapBuilder](map_builder.md) Creates a one-to-one document mapping of items in the source Store to the transformed documents in the target Store.
- [GroupBuilder](group_builder.md) Creates a many-to-one document mapping of items in the source Store to transformed documents in the target Store.
84 changes: 84 additions & 0 deletions src/docs/getting_started/group_builder.md
@@ -0,0 +1,84 @@
# Group Builder

Another advanced template in `maggma` is the `GroupBuilder`, which groups documents together before applying your function to the group of items. Just like `MapBuilder`, `GroupBuilder` also handles incremental building, keeping track of errors, getting only the data you need, and managing timeouts. `GroupBuilder` won't delete orphaned documents since that reverse relationship isn't valid.

Let's create a simple `ResupplyBuilder`, which will look at the inventory of items and determine what items need resupply. The source document will look something like this:

``` JSON
{
  "name": "Banana",
  "type": "fruit",
  "quantity": 20,
  "minimum": 10,
  "last_updated": "2019-11-03T19:09:45"
}
```

Our builder should give us documents that look like this:

``` JSON
{
  "names": ["Grapes", "Apples", "Bananas"],
  "type": "fruit",
  "resupply": {
    "Apples": 10,
    "Bananas": 0,
    "Grapes": 5
  },
  "last_updated": "2019-11-03T19:09:45"
}
```

To begin, we define our `GroupBuilder`:

``` python
from typing import Dict, List

from maggma.builders import GroupBuilder
from maggma.core import Store


class ResupplyBuilder(GroupBuilder):
    """
    Simple builder that determines which items to resupply
    """

    def __init__(self, inventory: Store, resupply: Store, resupply_percent: int = 100, **kwargs):
        """
        Arguments:
            inventory: current inventory information
            resupply: target resupply information
            resupply_percent: the percent of the minimum to include in the resupply
        """
        self.inventory = inventory
        self.resupply = resupply
        self.resupply_percent = resupply_percent
        self.kwargs = kwargs

        super().__init__(source=inventory, target=resupply, grouping_properties=["type"], **kwargs)
```

Note that unlike the previous `MapBuilder` example, we didn't name the source and target stores `source` and `target`. Providing more useful names is a good idea when writing builders, as it makes it clearer what the underlying data should look like.

`GroupBuilder` inherits from `MapBuilder`, so it has the same configuration parameters:

- projection: list of the fields you want to project. This can reduce the data transfer load if you only need certain fields or sub-documents from the source documents
- timeout: optional timeout on the process function
- store_process_timeout: adds the process time into the target document for profiling
- retry_failed: retries running the process function on previously failed documents

One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the many-to-one relationship makes determining orphaned documents very difficult.
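As a rough sketch of wiring these options up (assuming the option names above map directly onto keyword arguments, and using hypothetical store settings):

``` python
from maggma.stores import MongoStore

inventory = MongoStore("supply_db", "inventory", key="name", last_updated_field="last_updated")
resupply = MongoStore("supply_db", "resupply", last_updated_field="last_updated")

builder = ResupplyBuilder(
    inventory,
    resupply,
    resupply_percent=50,
    projection=["name", "type", "quantity", "minimum"],  # only pull the fields we actually use
    timeout=60,         # give up on a slow group after 60 seconds
    retry_failed=True,  # re-attempt documents that errored on a previous run
)
```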

Finally, let's get to the hard part, which is running our function. We do this by defining `unary_function`:

``` python
def unary_function(self, items: List[Dict]) -> Dict:
    resupply = {}

    for item in items:
        if item["quantity"] < item["minimum"]:
            # Stock has fallen below the minimum, so order resupply_percent of the minimum
            resupply[item["name"]] = int(item["minimum"] * self.resupply_percent / 100)
        else:
            resupply[item["name"]] = 0
    return {"resupply": resupply}
```

Just as in `MapBuilder`, we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`; the same goes for the `last_updated_field`. `GroupBuilder` takes care of this, while also recording errors, processing time, and the Builder version. `GroupBuilder` also keeps a plural version of the `source.key` field, so in this example, all the `name` values will be put together and kept in `names`.
6 changes: 4 additions & 2 deletions src/docs/getting_started/map_builder.md
@@ -2,7 +2,7 @@

`maggma` has a built-in builder called `MapBuilder` which handles a number of tedious tasks in writing a builder. This class is designed to be used like a map operator in other frameworks, or even the built-in `map` function in Python. `MapBuilder` will take each document in the source store, apply the function you give it, and then store the result in the target store. It handles incremental building, keeping track of errors, getting only the data you need, managing timeouts, and deleting orphaned documents through configuration options.

Let's rebuild the `MultiplierBuilder` we wrote earlier using `MapBuilder`:
Let's create the same `MultiplierBuilder` we wrote earlier using `MapBuilder`:

``` python
from maggma.builders import MapBuilder
@@ -41,7 +41,7 @@ Just like before we define a new class, but this time it should inherit from `Ma

MapBuilder has a number of configuration options that you can hardcode as above or expose as properties for the user through `**kwargs`:

- projection: list of the fields you want to project. This can reduce the IO load if you only need certain keys from the source documents
- projection: list of the fields you want to project. This can reduce the data transfer load if you only need certain fields or sub-documents from the source documents
- delete_orphans: this will delete documents in the target which don't have a corresponding document in the source
- timeout: optional timeout on the process function
- store_process_timeout: adds the process time into the target document for profiling
@@ -55,3 +55,5 @@ Finally let's get to the hard part which is running our function. We do this by
    return {"a": item["a"] * self.multiplier}

```

Note that we're not returning all the extra information typically kept in the original item. Normally, we would have to write code that copies over the source `key` and converts it to the target `key`; the same goes for the `last_updated_field`. `MapBuilder` takes care of this, while also recording errors, processing time, and the Builder version.
2 changes: 2 additions & 0 deletions src/docs/getting_started/simple_builder.md
@@ -106,6 +106,8 @@ We could have also returned a list of items:
docs = list(self.source.query())
```

One advantage of using the generator approach is that it is less memory intensive than returning a list of items. For large datasets, building a list of all items for processing may be prohibitive due to memory constraints.
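A small sketch of the two styles side by side (assuming the builder's source Store is available as `self.source`):

``` python
def get_items(self):
    # Generator approach: documents are yielded one at a time,
    # so only the current item needs to be held in memory
    for doc in self.source.query():
        yield doc


def get_items_returning_list(self):
    # List approach: every document is loaded into memory up front
    return list(self.source.query())
```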

## `process_item`

`process_item` just has to do the parallelizable work on each item. Since the item is whatever comes out of `get_items`, you know exactly what it should be. It may be a single document, a list of documents, a mapping, a set, etc.
8 changes: 4 additions & 4 deletions src/docs/getting_started/stores.md
@@ -1,6 +1,6 @@
# Using `Store`

A `Store` is just a wrapper to access data from somewhere. That somewhere is typically a MongoDB collection, but it could also be GridFS which lets you keep large binary objects. `maggma` makes GridFS and MongoDB collections feel the same. Beyond that it adds in something that looks like GridFS but is actually using AWS S3 as the storage space. Finally, `Store` can actually perform logic, concatenating two or more `Stores` together to make them look like one data source for instance. This means you only have to write a `Builder` for one scenario and the choice of `Store` lets you control how and where the data goes.
A `Store` is just a wrapper to access data from somewhere. That somewhere is typically a MongoDB collection, but it could also be GridFS which lets you keep large binary objects. `maggma` makes GridFS and MongoDB collections feel the same. Beyond that it adds in something that looks like GridFS but is actually using AWS S3 as the storage space. Finally, `Store` can actually perform logic, concatenating two or more `Stores` together to make them look like one data source for instance. This means you only have to write a `Builder` once for a given data transformation, and the choice of `Store` lets you control where the data comes from and where it goes.

## List of Stores

@@ -22,15 +22,15 @@ Current working and tested Stores include:

### Initializing a Store

All `Store`s have a few basic arguments that are critical to understand. Any `Store` has two special fields: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field. `last_updated_field` tells `Store` how to order the documents by a date field. `Store`s can also take a `Validator` object to make sure the data going in obeys some schema.
All `Store`s have a few basic arguments that are critical to understand. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-formatted string (ex: `2009-05-28T16:15:00`). `Store`s can also take a `Validator` object to make sure the data going in obeys some schema.
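A short sketch of initializing a Store with these arguments (the database, collection, and field names are illustrative):

``` python
from maggma.stores import MongoStore

store = MongoStore(
    database="inventory_db",
    collection_name="items",
    key="name",                         # field that uniquely identifies each document
    last_updated_field="last_updated",  # field used to order documents by date
)
```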

### Using a Store

Stores provide a number of basic methods that make them easy to use:

- query: Standard mongo style `find` method that lets you search the store.
- query_one: Same as above but limits to the first document.
- update: Update the documents into the collection. This will override documents if the key field matches. You can temporarily provide extra fields to key these documents on if you don't want to maintain duplicates, for instance keying on both `key` and `last_updated_field`.
- query_one: Same as above but limits returned results to just the first document that matches your query.
- update: Update the documents into the collection. This will override documents if the key field matches.
- ensure_index: This creates an index on the underlying data source for fast querying.
- distinct: Gets distinct values of a field.
- groupby: Similar to query but performs a grouping operation and returns sets of documents.
2 changes: 1 addition & 1 deletion src/docs/index.md
@@ -17,7 +17,7 @@ versions of Python.
Open your terminal and run the following command.

``` shell
pip install -U maggma
pip install --upgrade maggma
```

## Installation from source
2 changes: 2 additions & 0 deletions src/maggma/builders/__init__.py
@@ -0,0 +1,2 @@
from maggma.builders.map_builder import MapBuilder, CopyBuilder
from maggma.builders.group_builder import GroupBuilder