Caching a polars dataframe into parquet fails #1240

poldpold · 2024-11-23T22:34:16Z

Current behavior

When trying to cache a node whose output is a polars DataFrame, an exception is raised.

Stack Traces

********************************************************************************
>[post-node-execute] hello [test_module.hello()] encountered an error          <
> Node inputs:
{}
********************************************************************************
Traceback (most recent call last):
  File "/home/eisler/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/execution/graph_functions.py", line 318, in execute_lifecycle_for_node
    __adapter.call_all_lifecycle_hooks_sync(
  File "/home/eisler/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/lifecycle/base.py", line 915, in call_all_lifecycle_hooks_sync
    getattr(adapter, hook_name)(**kwargs)
  File "/home/eisler/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/caching/adapter.py", line 1446, in post_node_execute
    self.result_store.set(
  File "/home/eisler/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/caching/stores/file.py", line 71, in set
    saver = saver_cls(path=str(materialized_path.absolute()))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: PolarsParquetWriter.__init__() got an unexpected keyword argument 'path'
-------------------------------------------------------------------

{
	"name": "TypeError",
	"message": "PolarsParquetWriter.__init__() got an unexpected keyword argument 'path'",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 dr.execute(final_vars=[\"hello\"])

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/driver.py:637, in Driver.execute(self, final_vars, overrides, display_graph, inputs)
    635     error_execution = e
    636     error_telemetry = telemetry.sanitize_error(*sys.exc_info())
--> 637     raise e
    638 finally:
    639     if self.adapter.does_hook(\"post_graph_execute\", is_async=False):

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/driver.py:623, in Driver.execute(self, final_vars, overrides, display_graph, inputs)
    614     self.adapter.call_all_lifecycle_hooks_sync(
    615         \"pre_graph_execute\",
    616         run_id=run_id,
   (...)
    620         overrides=overrides,
    621     )
    622 try:
--> 623     outputs = self.__raw_execute(
    624         _final_vars, overrides, display_graph, inputs=inputs, _run_id=run_id
    625     )
    626     if self.adapter.does_method(\"do_build_result\", is_async=False):
    627         # Build the result if we have a result builder
    628         outputs = self.adapter.call_lifecycle_method_sync(
    629             \"do_build_result\", outputs=outputs
    630         )

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/driver.py:834, in Driver.__raw_execute(self, final_vars, overrides, display_graph, inputs, _fn_graph, _run_id)
    832     return results
    833 except Exception as e:
--> 834     raise e

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/driver.py:825, in Driver.__raw_execute(self, final_vars, overrides, display_graph, inputs, _fn_graph, _run_id)
    823 results = None
    824 try:
--> 825     results = self.graph_executor.execute(
    826         function_graph,
    827         final_vars,
    828         overrides if overrides is not None else {},
    829         inputs if inputs is not None else {},
    830         run_id,
    831     )
    832     return results
    833 except Exception as e:

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/driver.py:175, in DefaultGraphExecutor.execute(self, fg, final_vars, overrides, inputs, run_id)
    173 memoized_computation = dict()  # memoized storage
    174 nodes = [fg.nodes[node_name] for node_name in final_vars if node_name in fg.nodes]
--> 175 fg.execute(nodes, memoized_computation, overrides, inputs, run_id=run_id)
    176 outputs = {
    177     # we do this here to enable inputs to also be used as outputs
    178     # putting inputs into memoized before execution doesn't work due to some graphadapter assumptions.
    179     final_var: memoized_computation.get(final_var, inputs.get(final_var))
    180     for final_var in final_vars
    181 }  # only want request variables in df.
    182 del memoized_computation  # trying to cleanup some memory

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/graph.py:1099, in FunctionGraph.execute(self, nodes, computed, overrides, inputs, run_id)
   1097     run_id = str(uuid.uuid4())
   1098 inputs = graph_functions.combine_config_and_inputs(self.config, inputs)
-> 1099 return graph_functions.execute_subdag(
   1100     nodes=nodes,
   1101     inputs=inputs,
   1102     adapter=self.adapter,
   1103     computed=computed,
   1104     overrides=overrides,
   1105     run_id=run_id,
   1106 )

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/execution/graph_functions.py:250, in execute_subdag(nodes, inputs, adapter, computed, overrides, run_id, task_id)
    247     if final_var_node.user_defined:
    248         # from the top level, we don't know if this UserInput is required. So mark as optional.
    249         dep_type = node.DependencyType.OPTIONAL
--> 250     dfs_traverse(final_var_node, dep_type)
    251 return computed

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/execution/graph_functions.py:222, in execute_subdag.<locals>.dfs_traverse(node_, dependency_type)
    215         result = adapter.call_lifecycle_method_sync(
    216             \"do_remote_execute\",
    217             node=node_,
    218             execute_lifecycle_for_node=execute_lifecycle_for_node_partial,
    219             **kwargs,
    220         )
    221     else:
--> 222         result = execute_lifecycle_for_node_partial(**kwargs)
    224 computed[node_.name] = result
    225 # > pruning the graph
    226 # This doesn't narrow it down to the entire space of the graph
    227 # E.G. if something is not needed by this current execution due to
    228 # the selection of nodes to run it might not prune everything.
    229 # to do this we'd need to first determine all nodes on the path, then prune
    230 # We may also want to use a reference counter for slightly cleaner/more efficient memory management

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/execution/graph_functions.py:318, in execute_lifecycle_for_node(__node_, __adapter, __run_id, __task_id, **__kwargs)
    314 if not pre_node_execute_errored and __adapter.does_hook(
    315     \"post_node_execute\", is_async=False
    316 ):
    317     try:
--> 318         __adapter.call_all_lifecycle_hooks_sync(
    319             \"post_node_execute\",
    320             run_id=__run_id,
    321             node_=__node_,
    322             kwargs=__kwargs,
    323             success=success,
    324             error=error,
    325             result=result,
    326             task_id=__task_id,
    327         )
    328     except Exception:
    329         message = create_error_message(__kwargs, __node_, \"[post-node-execute]\")

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/lifecycle/base.py:915, in LifecycleAdapterSet.call_all_lifecycle_hooks_sync(self, hook_name, **kwargs)
    909 \"\"\"Calls all the lifecycle hooks in this group, by hook name (stage)
    910 
    911 :param hook_name: Name of the hooks to call
    912 :param kwargs: Keyword arguments to pass into the hook
    913 \"\"\"
    914 for adapter in self.sync_hooks.get(hook_name, []):
--> 915     getattr(adapter, hook_name)(**kwargs)

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/caching/adapter.py:1446, in HamiltonCacheAdapter.post_node_execute(self, run_id, node_, result, success, error, task_id, **future_kwargs)
   1444 result_missing = not self.result_store.exists(data_version)
   1445 if result_missing or materialized_path_missing:
-> 1446     self.result_store.set(
   1447         data_version=data_version,
   1448         result=result,
   1449         saver_cls=saver_cls,
   1450         loader_cls=loader_cls,
   1451     )
   1452     self._log_event(
   1453         run_id=run_id,
   1454         node_name=node_name,
   (...)
   1458         value=data_version,
   1459     )

File ~/miniconda3/envs/test/lib/python3.11/site-packages/hamilton/caching/stores/file.py:71, in FileResultStore.set(self, data_version, result, saver_cls, loader_cls)
     68 if saver_cls is not None:
     69     # materialized_path
     70     materialized_path = self._materialized_path(data_version, saver_cls)
---> 71     saver = saver_cls(path=str(materialized_path.absolute()))
     72     loader = loader_cls(path=str(materialized_path.absolute()))
     73 else:

TypeError: PolarsParquetWriter.__init__() got an unexpected keyword argument 'path'"
}

Steps to replicate behavior

Write and run a jupyter notebook with the following cells:

%load_ext hamilton.plugins.jupyter_magic

%%cell_to_module -m test_module --display --rebuild-drivers

import polars as pl
from hamilton.function_modifiers import cache

@cache(format="parquet")
def hello() -> pl.DataFrame:
    return pl.DataFrame({"a": [1,2]})

from hamilton import driver
import test_module

dr = (
    driver
    .Builder()
    .with_config({})
    .with_modules(test_module)
    .with_cache(path=".")
    .build()
)

dr.execute(final_vars=["hello"])

Library & System Information

python=3.11.8, sf-hamilton=1.81.0 and 1.83.2, polars=1.10.0

Expected behavior

I expected the node output to be persisted to disk in a parquet format.

The text was updated successfully, but these errors were encountered:

skrawcz · 2024-11-24T00:44:06Z

@poldpold thanks for the issue! Will take a look.

The cache store assumed that every persister took a `path` argument. That is not the case because the savers / loaders wrap external APIs and we decided to not try to create our own abstraction layer around them, and instead mirror them. E.g. polars takes `file`, but pandas takes `path`.

skrawcz · 2024-11-24T01:04:46Z

@poldpold if you could try installing my fix and giving it a go please:
Assuming you use SSH: pip install "git+ssh://[email protected]/dagworks-inc/hamilton.git@fix_1240"

poldpold · 2024-11-24T21:14:46Z

@skrawcz, this works great, thank you! Looking forward to seeing it in an upcoming release!

skrawcz · 2024-11-25T05:40:26Z

@skrawcz, this works great, thank you! Looking forward to seeing it in an upcoming release!

it will be out this week.

poldpold added the triage label for issues that need to be triaged. label Nov 23, 2024

skrawcz linked a pull request Nov 24, 2024 that will close this issue

Fixes #1240 #1241

Open

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Caching a polars dataframe into parquet fails #1240

Caching a polars dataframe into parquet fails #1240

poldpold commented Nov 23, 2024

skrawcz commented Nov 24, 2024

skrawcz commented Nov 24, 2024 •

edited

Loading

poldpold commented Nov 24, 2024

skrawcz commented Nov 25, 2024

Caching a polars dataframe into parquet fails #1240

Caching a polars dataframe into parquet fails #1240

Comments

poldpold commented Nov 23, 2024

Current behavior

Stack Traces

Steps to replicate behavior

Library & System Information

Expected behavior

skrawcz commented Nov 24, 2024

skrawcz commented Nov 24, 2024 • edited Loading

poldpold commented Nov 24, 2024

skrawcz commented Nov 25, 2024

skrawcz commented Nov 24, 2024 •

edited

Loading