chunking for json normalization #914

Merged: 1 commit, merged on Feb 5, 2024
src/maggma/stores/open_data.py: 9 additions, 1 deletion
@@ -13,6 +13,7 @@
 
 from maggma.core.store import Sort, Store
 from maggma.stores.aws import S3Store
+from maggma.utils import grouper
 
 
 class PandasMemoryStore(Store):
@@ -602,6 +603,13 @@ def update(
             raise NotImplementedError("updating store with additional metadata is not supported")
         super().update(docs=docs, key=key)
 
+    @staticmethod
+    def json_normalize_and_filter(docs: List[Dict], object_grouping: List[str]) -> pd.DataFrame:
+        dfs = []
+        for chunk in grouper(iterable=docs, n=1000):
+            dfs.append(pd.json_normalize(chunk, sep="_")[object_grouping])
+        return pd.concat(dfs)
+
     def _write_to_s3_and_index(self, docs: List[Dict], search_keys: List[str]):
        """Implements updating of the provided documents in S3 and the index.
 
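For context, `pd.json_normalize(..., sep="_")` flattens nested documents into underscore-joined column names, which is why the grouping and search keys can be selected directly from the normalized frame. A minimal illustration with made-up documents (not data from this PR):

```python
import pandas as pd

# Illustrative nested documents; the field names are hypothetical.
docs = [
    {"material_id": "mp-1", "props": {"band_gap": 1.1}},
    {"material_id": "mp-2", "props": {"band_gap": 0.0}},
]

# Nested keys become underscore-joined columns, e.g. props.band_gap -> props_band_gap.
df = pd.json_normalize(docs, sep="_")
print(list(df.columns))  # ['material_id', 'props_band_gap']

# As in _write_to_s3_and_index, a union of grouping and search keys is selected.
og = sorted({"material_id"} | {"props_band_gap"})
print(df[og].shape)  # (2, 2)
```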
@@ -611,7 +619,7 @@ def _write_to_s3_and_index(self, docs: List[Dict], search_keys: List[str]):
         """
         # group docs to update by object grouping
         og = list(set(self.object_grouping) | set(search_keys))
-        df = pd.json_normalize(docs, sep="_")[og]
+        df = OpenDataStore.json_normalize_and_filter(docs=docs, object_grouping=og)
         df_grouped = df.groupby(self.object_grouping)
         existing = self.index._data
         docs_df = pd.DataFrame(docs)
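The chunked normalization above can be sketched as a standalone function. `chunked` below is a hypothetical stand-in for `maggma.utils.grouper`, assumed to yield successive lists of up to `n` items; normalizing 1000 documents at a time bounds the size of each intermediate DataFrame before the final concat:

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

import pandas as pd


def chunked(iterable: Iterable, n: int) -> Iterator[List]:
    """Yield successive lists of up to n items (stand-in for maggma.utils.grouper)."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch


def json_normalize_and_filter(
    docs: List[Dict], object_grouping: List[str], chunk_size: int = 1000
) -> pd.DataFrame:
    """Normalize docs chunk by chunk, keeping only the grouping columns."""
    dfs = [
        pd.json_normalize(chunk, sep="_")[object_grouping]
        for chunk in chunked(docs, chunk_size)
    ]
    return pd.concat(dfs)


# Hypothetical documents: 2500 docs are normalized in three chunks (1000/1000/500).
docs = [{"task_id": i, "meta": {"group": i % 2}} for i in range(2500)]
df = json_normalize_and_filter(docs, ["task_id", "meta_group"])
print(df.shape)  # (2500, 2)
```

Selecting `[object_grouping]` per chunk also drops any columns that only appear in some documents, so the concatenated frame has a uniform schema regardless of which chunk a document landed in.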