Wrong interaction of DefinePerSample with multiple executions #12043
A surprisingly related problem appeared sporadically in Jenkins CI builds after the patch to avoid re-jitting distributed RDataFrame tasks. See for example https://lcgapp-services.cern.ch/root-jenkins/job/root-pullrequests-build/186294/testReport/projectroot.roottest.python.distrdf/dask/roottest_python_distrdf_dask_test_all/ . Here is a copy-paste of the failure, in case the CI log gets deleted:

=================================== FAILURES ===================================
_______________ TestDefinePerSample.test_definepersample_simple ________________
self = <check_definepersample.TestDefinePerSample object at 0x139017700>
connection = <Client: 'tcp://127.0.0.1:58532' processes=2 threads=2, memory=4.00 GiB>
def test_definepersample_simple(self, connection):
    """
    Test DefinePerSample operation on three samples using a predefined
    string of operations.
    """
    df = Dask.RDataFrame(self.maintreename, self.filenames, daskclient=connection)
    # Associate a number to each sample
    definepersample_code = """
    if(rdfsampleinfo_.Contains(\"{}\")) return 1;
    else if (rdfsampleinfo_.Contains(\"{}\")) return 2;
    else if (rdfsampleinfo_.Contains(\"{}\")) return 3;
    else return 0;
    """.format(*self.samples)
    df1 = df.DefinePerSample("sampleid", definepersample_code)
    # Filter by the sample number. Each filtered dataframe should contain
    # 10 entries, equal to the number of entries per sample
    samplescounts = [df1.Filter("sampleid == {}".format(id)).Count() for id in [1, 2, 3]]
    for count in samplescounts:
>       assert count.GetValue() == 10
E       AssertionError

../../../../../roottest/python/distrdf/dask/check_definepersample.py:62: AssertionError
---------------------------- Captured stderr setup -----------------------------
RDataFrame::Run: event loop was interrupted
2023-09-30 20:12:08,054 - distributed.worker - WARNING - Compute Failed
Key: dask_mapper-2d1d1d8c-3a72-43e4-9753-d94b58f79b62
Function: execute_task
args: ((<function DaskBackend.dask_mapper at 0x13277bb80>, EmptySourceRange(exec_id=ExecutionIdentifier(rdf_uuid=UUID('3fb6f445-a73d-47db-9f12-af184ca535cd'), graph_uuid=UUID('3edfdf66-5f8c-428b-8862-6e21ac68f9b5')), id=0, start=0, end=50), (<class 'set'>, []), (<class 'set'>, []), functools.partial(<function distrdf_mapper at 0x12a9c4160>, build_rdf_from_range=<function EmptySourceHeadNode._generate_rdf_creator.<locals>.build_rdf_from_range at 0x134902e50>, computation_graph_callable=functools.partial(<function trigger_computation_graph at 0x12a9b6a60>, {0: <DistRDF.HeadNode.EmptySourceHeadNode object at 0x13490b1c0>, 1: <DistRDF.Node.Node object at 0x13490b220>, 2: <DistRDF.Node.Node object at 0x13490b340>, 3: <DistRDF.Node.Node object at 0x13490b430>}), initialization_fn=functools.partial(<function TestInitialization.test_initialization_method.<locals>.init at 0x134902ee0>, 123))))
kwargs: {}
Exception: "RuntimeError('C++ exception thrown:\\n\\truntime_error: Graph was applied to a mix of scalar values and collections. This is not supported.') That type of failures in fact boils down to this same issue. It can happen that two tasks get assigned to the same Dask worker process in the CI. The first task runs normally the DefinePerSample operation (after having jitted the computation graph). The second task clones the actions of the first in order to re-use the already jitted nodes. This also means that the root/tree/dataframe/src/RLoopManager.cxx Line 764 in fa46203
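As an illustration of that scenario, here is a minimal sketch (not taken from the issue) that forces both mapper tasks onto a single Dask worker process; the backend import path is the experimental distributed RDataFrame API, and the entry count, partition count and single-worker cluster are illustrative assumptions, not details from the report.

```python
# Sketch: one Dask worker process, two partitions. Both mapper tasks run in the
# same process, so the second task re-uses the computation graph jitted by the
# first one, which is the situation described above.
from dask.distributed import Client, LocalCluster

import ROOT

DaskRDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=1, threads_per_worker=1, processes=True)
    with Client(cluster) as client:
        # 100 entries split over 2 partitions: two tasks, one worker process.
        df = DaskRDataFrame(100, daskclient=client, npartitions=2)
        df = df.DefinePerSample("sampleid", "1")
        # Triggering the distributed event loop runs both tasks on the same
        # worker, the layout in which the CI failure showed up.
        print(df.Filter("sampleid == 1").Count().GetValue())
```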
From this simple reproducer: the `DefinePerSample` operation defines a column of 20 entries, 10 of which should have value `2` and the following 10 value `0.5`. The first set of `Sum` and `Display` operations shows the correct behaviour, then the second set of operations reports a wrong result: all 20 entries of the column are `0.5`.
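The reproducer code itself is not shown above; below is a hedged sketch of what such a reproducer could look like. The file names (`sample1.root`, `sample2.root`), tree name (`Events`) and column names (`x`, `weight`) are hypothetical, not taken from the issue; only the expected values (10 entries at 2, 10 entries at 0.5) come from the description.

```python
# Sketch of a possible reproducer along the lines described above.
import ROOT

# Create two small samples of 10 entries each (hypothetical helper files).
for fname in ("sample1.root", "sample2.root"):
    ROOT.RDataFrame(10).Define("x", "1.").Snapshot("Events", fname)

df = ROOT.RDataFrame("Events", ["sample1.root", "sample2.root"])

# Per-sample weight: 2 for the first sample, 0.5 for the second.
df = df.DefinePerSample(
    "weight", 'rdfsampleinfo_.Contains("sample1") ? 2. : 0.5')

# First set of operations: evaluated together in a first event loop.
sum1 = df.Sum("weight")
display1 = df.Display("weight", 20)
print(sum1.GetValue())            # expected 10*2 + 10*0.5 = 25
display1.GetValue().Print()       # expected: ten entries at 2, ten at 0.5

# Second set of operations: evaluated in a second event loop on the same
# dataframe. This is where the report sees the wrong result, with all 20
# entries of "weight" equal to 0.5.
sum2 = df.Sum("weight")
display2 = df.Display("weight", 20)
print(sum2.GetValue())
display2.GetValue().Print()
```

With the behaviour described in the report, the second `Sum` would come out as 10 (20 entries at 0.5) instead of the expected 25.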