Sorry for the verbosity, just throwing more spaghetti at the wall
Issue:
I enjoy the composition and dataset chaining that we offer job writers in this repo but lately I've been thinking about the tradeoffs (probably b/c I'm in the unlucky position of looking at memory graphs for Flink 😆). My gut tells me that fsspec openers should be transient and therefore as close as possible to the source of loading.
I've noticed in some workflows (namely beam.Create() | OpenURLWithFSSpec | OpenWithXarray | WhateverTransform) that we open fsspec file-like handlers outside of context managers and pass them to downstream transforms via xr.Dataset(s). This makes sense given that memory is cheap and that we don't know if downstream transforms (WhateverTransform for example) will be operating on them lazily or eagerly.
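To make the lifetime difference concrete, here's a stdlib-only sketch — `FakeHandle`, `open_url_eagerly`, and `open_url_transiently` are made-up stand-ins for the fsspec handle and the two transform styles, not real APIs:

```python
from contextlib import contextmanager

# FakeHandle stands in for an fsspec file-like handle so we can count
# how many stay open under each pattern.
OPEN_HANDLES = []

class FakeHandle:
    def __init__(self, url):
        self.url = url
        OPEN_HANDLES.append(self)

    def close(self):
        OPEN_HANDLES.remove(self)

def open_url_eagerly(url):
    # The OpenURLWithFSSpec-style pattern: the handle escapes the function
    # and nothing downstream is obliged to close it.
    return FakeHandle(url)

@contextmanager
def open_url_transiently(url):
    # The alternative: the handle lives only for the duration of the work.
    handle = FakeHandle(url)
    try:
        yield handle
    finally:
        handle.close()

# Eager pattern: one handle per input is still alive "downstream".
eager = [open_url_eagerly(u) for u in ("1980.nc", "1981.nc", "1982.nc")]
n_open_eager = len(OPEN_HANDLES)  # 3

for h in eager:
    h.close()

# Transient pattern: closed as soon as the managed block exits.
with open_url_transiently("1980.nc"):
    n_open_inside = len(OPEN_HANDLES)  # 1
n_open_after = len(OPEN_HANDLES)  # 0
```

The point isn't that the eager pattern is wrong — it's what makes chaining possible — just that every handle opened this way stays resident until someone explicitly closes it.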
But the memory bloat can be gross for workflows that only need to work on metadata (see the example below). Personally I'd call this type of interaction a memory leak — not in the strong sense of "we lost a reference and cannot deallocate memory" but in the mild sense that for the duration of the job we are holding onto "memory which is no longer needed but not released". I see how it depends on what the next downstream transform is doing. I realize this might also be the only way to build chaining of xr.Dataset(s) between steps across apache.beam, and I'm wondering out loud here about possible alternatives.
For example, here is a simple recipe that doesn't do much but beam.Map over some lazily loaded xr.Dataset(s) (a no-context-manager version and a with-context-manager version). Yet even with a small number of yearly inputs it hogs memory when the file-like handlers are left open compared to when they're closed, as the image below shows:
Ideas:
If downstream transforms stop expecting xr.Dataset(s) as input and start expecting something else that can force them to choose how to set the fsspec context duration then that "might" help. It definitely will force other tradeoffs and might break the chaining of datasets. But I don't know xarray well. I keep thinking about lazy ORMs and how they'd handle this situation comparatively. I don't see much that allows me to manipulate the opening/closing of xarray's internal fsspec backend while maintaining the previous modifications to the dataset and probably for very good reasons that I am naive about. Below is an example allowing the transform to make the decision about the fsspec context duration. But, yes, this pattern would get tedious fast if we can't chain:
@dataclass
class FSXRWrapper:
    """i suck at naming things"""

    filepath: Optional[str] = None
    fsspec_default_kwargs: Dict[Any, Any] = field(default_factory=dict)
    xarray_default_kwargs: Dict[Any, Any] = field(default_factory=dict)

    @contextmanager
    def open_with_xarray(
        self, fsspec_override_kwargs=None, xarray_override_kwargs=None
    ) -> Iterator[xr.Dataset]:
        fsspec_kwargs = fsspec_override_kwargs or self.fsspec_default_kwargs
        xarray_kwargs = xarray_override_kwargs or self.xarray_default_kwargs
        # yes, two .open() calls :facepalm:
        of = fsspec.open(self.filepath, mode="rb", **fsspec_kwargs).open()
        ds = xr.open_dataset(of, **xarray_kwargs)
        try:
            yield ds
        finally:
            of.close()
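Worth guarding: the context manager should close the handle even when the with-body raises, which means wrapping the `yield` in `try/finally`. A stdlib-only sketch of that shape (`DummyFile` and `open_like_fsxr` are hypothetical names standing in for the fsspec handle and the wrapper's context manager):

```python
from contextlib import contextmanager

# Track whether close() ran, even across an exception in the with-body.
closed = []

class DummyFile:
    def close(self):
        closed.append(True)

@contextmanager
def open_like_fsxr():
    of = DummyFile()
    try:
        yield of
    finally:
        of.close()  # runs even if the with-body raises

raised = False
try:
    with open_like_fsxr():
        raise RuntimeError("boom")
except RuntimeError:
    raised = True
```

Without the `try/finally`, a bare `yield` followed by `of.close()` would skip the close on any exception and leak the handle for the rest of the worker's life.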
@dataclass
class FSXRFactory(beam.PTransform):
    """i suck at naming things"""

    fsspec_kwargs: Dict[Any, Any] = field(default_factory=dict)
    xarray_kwargs: Dict[Any, Any] = field(default_factory=dict)

    @staticmethod
    def create(item, fsspec_kwargs, xarray_kwargs) -> Tuple[Index, FSXRWrapper]:
        index, url = item
        return index, FSXRWrapper(url, fsspec_kwargs, xarray_kwargs)

    def expand(
        self,
        urls: beam.PCollection[Tuple[Index, str]],
    ) -> beam.PCollection[Tuple[Index, FSXRWrapper]]:
        return urls | beam.Map(self.create, self.fsspec_kwargs, self.xarray_kwargs)
@dataclass
class WhateverOnMetadata(beam.PTransform):
    @staticmethod
    def par_meta(item) -> Tuple[Index, Tuple[xr.Dataset, FSXRWrapper]]:
        """warned you i was bad at naming"""
        index, fsxr = item
        new_values = []
        with fsxr.open_with_xarray() as ds:
            # do stuff on metadata, but also pass along fsxr if needed
            ds_meta = ds  # placeholder for whatever metadata extraction happens here
            new_values.append([ds_meta, fsxr])
        return index, new_values

    def expand(
        self,
        urls: beam.PCollection[Tuple[Index, FSXRWrapper]],
    ) -> beam.PCollection[Tuple[Index, Tuple[xr.Dataset, FSXRWrapper]]]:
        return urls | beam.Map(self.par_meta)

beam.Create(pattern.items())
    | FSXRFactory(fsspec_kwargs={}, xarray_kwargs={})
    | WhateverOnMetadata()
Even if this is just me talking to myself this was useful to think about b/c I see how the tradeoff here is about chaining vs memory and I'm looking for the middle way. Let me know where I'm not thinking clearly here or what I've missed please 🙇♂️