Fix interactions in RDF machinery with the DefinePerSample operation #13787
Conversation
Sample callbacks can be registered by an RAction or an RDefinePerSample instance. In both cases, the lifetime of the callback is tied to the lifetime of the object itself. Avoid eager clearing of the callbacks so as not to interfere with their normal functioning.
As completely anecdotal evidence, I have been running on the
Build failed on ROOT-ubuntu2204/nortcxxmod. Warnings:
Build failed on windows10/default. Errors:
Force-pushed from 20ddc0b to 4314008
Thanks a lot for this change, which is minimal but results from a series of long debugging sessions! I added 2 minimal comments. Besides those, LGTM.
Force-pushed from 4314008 to 5f33910
Thanks @vepadulano !
My only comment is that the tests seem to be too complicated compared to the root cause of the failure. The regression test for #12043, for example, could be simply:
```cpp
bool flag = false;
auto df = ROOT::RDataFrame(1)
             .DefinePerSample("x", [&](unsigned int, const ROOT::RDF::RSampleInfo &) { flag = true; return 0; });
df.Count().GetValue();
EXPECT_TRUE(flag);
flag = false;
df.Count().GetValue();
EXPECT_TRUE(flag);
```
Simpler tests that go straight to the point are easier to debug when they break.
That's a good point, yes. I will modify the reproducer for the linked issue. I prefer to keep the reproducer of the cloning issue because it also mimics the extra machinery involved in creating different tasks, changing the RDF spec, and cloning the actions in a specific way.
This is a reproducer test for some sporadic CI failures, e.g.

```python
========================================================================== FAILURES ===========================================================================
_______________________________________________________ TestDefinePerSample.test_definepersample_simple _______________________________________________________

self = <check_definepersample.TestDefinePerSample object at 0x13e0c6190>, connection = <Client: 'tcp://127.0.0.1:55253' processes=2 threads=2, memory=4.00 GiB>

    def test_definepersample_simple(self, connection):
        """
        Test DefinePerSample operation on three samples using a predefined
        string of operations.
        """
        df = Dask.RDataFrame(self.maintreename, self.filenames, daskclient=connection)
        # Associate a number to each sample
        definepersample_code = """
        if(rdfsampleinfo_.Contains(\"{}\")) return 1;
        else if (rdfsampleinfo_.Contains(\"{}\")) return 2;
        else if (rdfsampleinfo_.Contains(\"{}\")) return 3;
        else return 0;
        """.format(*self.samples)
        df1 = df.DefinePerSample("sampleid", definepersample_code)
        # Filter by the sample number. Each filtered dataframe should contain
        # 10 entries, equal to the number of entries per sample
        samplescounts = [df1.Filter("sampleid == {}".format(id)).Count() for id in [1, 2, 3]]
        for count in samplescounts:
>           assert count.GetValue() == 10
E           AssertionError

check_definepersample.py:62: AssertionError
-------------------------------------------------------------------- Captured stderr setup --------------------------------------------------------------------
RDataFrame::Run: event loop was interrupted
2023-09-08 14:51:57,002 - distributed.worker - WARNING - Compute Failed
Key:       dask_mapper-a92ac090-9407-4849-921a-d187ceffd3ed
Function:  dask_mapper
args:      (EmptySourceRange(exec_id=ExecutionIdentifier(rdf_uuid=UUID('5d67c0a7-58f4-488d-8e44-bb5aa0fac480'), graph_uuid=UUID('69353465-0a90-4eef-b101-a1eb93f0c13a')), id=0, start=0, end=50))
kwargs:    {}
Exception: "RuntimeError('C++ exception thrown:\\n\\truntime_error: Graph was applied to a mix of scalar values and collections. This is not supported.')"
```

This is due to Dask assigning two tasks to the same worker for the test with the DefinePerSample calls. The Count operation failed to report the correct number of entries because the DefinePerSample callback was previously deleted at the end of every event loop, specifically at the end of the first task's event loop. Consequently, when the second task started and picked up the same RDataFrame to clone the action, the DefinePerSample callback was never actually called.
Thanks a lot Vincenzo! For me it's ready to be merged.
Force-pushed from 5f33910 to 1748a80
This PR fixes #12043. It should also address sporadic failures seen in our Jenkins CI: Dask sometimes assigns two tasks to the same worker process, so the second task reuses the same DefinePerSample node as the first task and runs into the same situation as the linked issue.
The first commit contains the actual fix, followed by the tests.