diff --git a/docs/getting_started/group_builder.md b/docs/getting_started/group_builder.md
index daf862dd4..b088085d0 100644
--- a/docs/getting_started/group_builder.md
+++ b/docs/getting_started/group_builder.md
@@ -1,6 +1,6 @@
 # Group Builder
 
-Another advanced template in `maggma` is the `GroupBuilder`, which groups documents together before applying your function on the group of items. Just like `MapBuilder`, `GroupBuilder` also handles incremental building, keeping track of errors, getting only the data you need, and managing timeouts. GroupBuilder won't delete orphaned documents since that reverse relationshop isn't valid.
+Another advanced template in `maggma` is the `GroupBuilder`, which groups documents together before applying your function on the group of items. Just like `MapBuilder`, `GroupBuilder` also handles incremental building, keeping track of errors, getting only the data you need, and managing timeouts. GroupBuilder won't delete orphaned documents since that reverse relationship isn't valid.
 
 Let's create a simple `ResupplyBuilder`, which will look at the inventory of items and determine what items need resupply. The source document will look something like this:
 
@@ -65,7 +65,7 @@ Note that unlike the previous `MapBuilder` example, we didn't call the source an
 - store_process_timeout: adds the process time into the target document for profiling
 - retry_failed: retries running the process function on previously failed documents
 
-One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationshop makes determining orphaned documents very difficult.
+One parameter that doesn't work in `GroupBuilder` is `delete_orphans`, since the Many-to-One relationship makes determining orphaned documents very difficult.
 
 Finally let's get to the hard part which is running our function. We do this by defining `unary_function`
 
diff --git a/docs/getting_started/running_builders.md b/docs/getting_started/running_builders.md
index 14ce3cae9..1483ff4e2 100644
--- a/docs/getting_started/running_builders.md
+++ b/docs/getting_started/running_builders.md
@@ -64,7 +64,7 @@ There are progress bars for each of the three steps, which lets you understand w
 
 `maggma` can distribute work across multiple computers. There are two steps to this:
 
-1. Run a `mrun` manager by providing it with a `--url` to listen for workers on and `--num-chunks`(`-N`) which tells `mrun` how many sub-pieces to break up the work into. You can can run fewer workers then chunks. This will cause `mrun` to call the builder's `prechunk` to get the distribution of work and run distributd work on all workers
+1. Run a `mrun` manager by providing it with a `--url` to listen for workers on and `--num-chunks`(`-N`) which tells `mrun` how many sub-pieces to break up the work into. You can can run fewer workers then chunks. This will cause `mrun` to call the builder's `prechunk` to get the distribution of work and run distributed work on all workers
 2. Run `mrun` workers b y providing it with a `--url` to listen for a manager and `--num-workers` (`-n`) to tell it how many processes to run in this worker.
 
 The `url` argument takes a fully qualified url including protocol. `tcp` is recommended:
@@ -112,7 +112,7 @@ mrun -n 32 -vv my_first_builder.json builder_2_and_3.py last_builder.ipynb
 
 ## Reporting Build State
 
-`mrun` has the ability to report the status of the build pipeline to a user-provided `Store`. To do this, you first have to save the `Store` as a JSON or YAML file. Then you can use the `-r` option to give this to `mrun`. It will then periodicially add documents to the `Store` for one of 3 different events:
+`mrun` has the ability to report the status of the build pipeline to a user-provided `Store`. To do this, you first have to save the `Store` as a JSON or YAML file. Then you can use the `-r` option to give this to `mrun`. It will then periodically add documents to the `Store` for one of 3 different events:
 
 * `BUILD_STARTED` - This event tells us that a new builder started, the names of the `sources` and `targets` as well as the `total` number of items the builder expects to process
 * `UPDATE` - This event tells us that a batch of items was processed and is going to `update_targets`. The number of items is stored in `items`.
diff --git a/docs/getting_started/simple_builder.md b/docs/getting_started/simple_builder.md
index 62394a0cd..199218c96 100644
--- a/docs/getting_started/simple_builder.md
+++ b/docs/getting_started/simple_builder.md
@@ -71,7 +71,7 @@ Calling the parent class `__init__` is a good practice as sub-classing builders
 
 ## `get_items`
 
-`get_items` is conceptually a simple method to implement, but in practice can easily be more code than the rest of the builder. All of the logic for getting data from the sources has to happen here, which requires some planning. `get_items` should also sort all of the data into induvidual **items** to process. This simple builder has a very easy `get_items`:
+`get_items` is conceptually a simple method to implement, but in practice can easily be more code than the rest of the builder. All of the logic for getting data from the sources has to happen here, which requires some planning. `get_items` should also sort all of the data into individual **items** to process. This simple builder has a very easy `get_items`:
 
 ``` python
 
diff --git a/setup.py b/setup.py
index 24bf4b0f7..ada6f3356 100644
--- a/setup.py
+++ b/setup.py
@@ -13,7 +13,7 @@
     name="maggma",
     use_scm_version=True,
     setup_requires=["setuptools_scm"],
-    description="Framework to develop datapipelines from files on disk to full dissemenation API",
+    description="Framework to develop datapipelines from files on disk to full dissemination API",
     long_description=long_desc,
     long_description_content_type="text/markdown",
     url="https://github.com/materialsproject/maggma",
diff --git a/src/maggma/api/query_operator/pagination.py b/src/maggma/api/query_operator/pagination.py
index d6b2151e6..cbfc49dbf 100644
--- a/src/maggma/api/query_operator/pagination.py
+++ b/src/maggma/api/query_operator/pagination.py
@@ -7,7 +7,7 @@
 
 
 class PaginationQuery(QueryOperator):
-    """Query opertators to provides Pagination"""
+    """Query operators to provides Pagination"""
 
     def __init__(self, default_limit: int = 100, max_limit: int = 1000):
         """
diff --git a/src/maggma/api/utils.py b/src/maggma/api/utils.py
index 6c8cbebd8..18acb3096 100644
--- a/src/maggma/api/utils.py
+++ b/src/maggma/api/utils.py
@@ -64,7 +64,7 @@ def attach_signature(function: Callable, defaults: Dict, annotations: Dict):
     Args:
         function: callable function to attach the signature to
         defaults: dictionary of parameters -> default values
-        annotations: dictionary of type annoations for the parameters
+        annotations: dictionary of type annotations for the parameters
     """
 
     required_params = [
@@ -167,7 +167,7 @@ def validate_monty(cls, v, _):
 
             if len(errors) > 0:
                 raise ValueError(
-                    "Missing Monty seriailzation fields in dictionary: {errors}"
+                    "Missing Monty serialization fields in dictionary: {errors}"
                 )
 
             return v
diff --git a/src/maggma/builders/group_builder.py b/src/maggma/builders/group_builder.py
index 01a6f1b25..0feed4cf3 100644
--- a/src/maggma/builders/group_builder.py
+++ b/src/maggma/builders/group_builder.py
@@ -91,7 +91,7 @@ def ensure_indexes(self):
 
     def prechunk(self, number_splits: int) -> Iterator[Dict]:
         """
-        Generic prechunk for group builder to perform domain-decompostion
+        Generic prechunk for group builder to perform domain-decomposition
         by the grouping keys
         """
         self.ensure_indexes()
diff --git a/src/maggma/builders/map_builder.py b/src/maggma/builders/map_builder.py
index 0ea1fd8d6..0561cd62e 100644
--- a/src/maggma/builders/map_builder.py
+++ b/src/maggma/builders/map_builder.py
@@ -86,7 +86,7 @@ def ensure_indexes(self):
 
     def prechunk(self, number_splits: int) -> Iterator[Dict]:
         """
-        Generic prechunk for map builder to perform domain-decompostion
+        Generic prechunk for map builder to perform domain-decomposition
         by the key field
         """
         self.ensure_indexes()
diff --git a/src/maggma/core/builder.py b/src/maggma/core/builder.py
index 051ca8135..e7a8c823b 100644
--- a/src/maggma/core/builder.py
+++ b/src/maggma/core/builder.py
@@ -54,7 +54,7 @@ def connect(self):
     def prechunk(self, number_splits: int) -> Iterable[Dict]:
         """
         Part of a domain-decomposition paradigm to allow the builder to operate on
-        multiple nodes by divinding up the IO as well as the compute
+        multiple nodes by dividing up the IO as well as the compute
         This function should return an iterator of dictionaries that can be distributed
         to multiple instances of the builder to get/process/update on
 
@@ -62,11 +62,11 @@ def prechunk(self, number_splits: int) -> Iterable[Dict]:
             number_splits: The number of groups to split the documents to work on
         """
         self.logger.info(
-            f"{self.__class__.__name__} doesn't have distributed processing capabillities."
+            f"{self.__class__.__name__} doesn't have distributed processing capabilities."
             " Instead this builder will run on just one worker for all processing"
         )
         raise NotImplementedError(
-            f"{self.__class__.__name__} doesn't have distributed processing capabillities."
+            f"{self.__class__.__name__} doesn't have distributed processing capabilities."
             " Instead this builder will run on just one worker for all processing"
         )
 
diff --git a/src/maggma/stores/aws.py b/src/maggma/stores/aws.py
index 0b5be5a27..385683883 100644
--- a/src/maggma/stores/aws.py
+++ b/src/maggma/stores/aws.py
@@ -179,7 +179,7 @@ def query(
                 yield {p: doc[p] for p in properties if p in doc}
             else:
                 try:
-                    # TODO: THis is ugly and unsafe, do some real checking before pulling data
+                    # TODO: This is ugly and unsafe, do some real checking before pulling data
                     data = self.s3_bucket.Object(self.sub_dir + str(doc[self.key])).get()["Body"].read()
                 except botocore.exceptions.ClientError as e:
                     # If a client error is thrown, then check that it was a NoSuchKey or NoSuchBucket error.
diff --git a/src/maggma/stores/gridfs.py b/src/maggma/stores/gridfs.py
index 9e907c834..69764fb17 100644
--- a/src/maggma/stores/gridfs.py
+++ b/src/maggma/stores/gridfs.py
@@ -38,7 +38,7 @@
 
 class GridFSStore(Store):
     """
-    A Store for GrdiFS backend. Provides a common access method consistent with other stores
+    A Store for GridFS backend. Provides a common access method consistent with other stores
     """
 
     def __init__(
@@ -58,7 +58,7 @@ def __init__(
         **kwargs,
     ):
         """
-        Initializes a GrdiFS Store for binary data
+        Initializes a GridFS Store for binary data
         Args:
             database: database name
             collection_name: The name of the collection.
@@ -447,7 +447,7 @@ def __init__(
         **kwargs,
     ):
         """
-        Initializes a GrdiFS Store for binary data
+        Initializes a GridFS Store for binary data
         Args:
             uri: MongoDB+SRV URI
             database: database to connect to