Fix interactions in RDF machinery with the DefinePerSample operation #13787

Merged


vepadulano
Member

@vepadulano vepadulano commented Oct 3, 2023

This PR fixes #12043. It should also address sporadic failures seen in our Jenkins CI: Dask sometimes assigns two tasks to the same worker process, so the second task reuses the DefinePerSample node of the first task and ends up in the same situation as the linked issue.

The first commit contains the actual fix, then tests.

Sample callbacks can be registered by an RAction or an RDefinePerSample
instance. In both cases, the lifetime of the callback is tied to the lifetime of
the object itself. Avoid eagerly clearing the callbacks so as not to interfere
with normal operation.
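
The lifetime rule in the commit message above can be illustrated with a small toy model. This is only a sketch in plain Python, not ROOT's implementation: `ToyLoopManager`, `ToyOwner` and `demo` are hypothetical names, and a `weakref.WeakKeyDictionary` stands in for "the callback lives as long as the object that registered it". The event loop never clears the registry itself, so a second run still triggers the callback, and the entry only disappears once its owner is gone.

```python
import gc
import weakref


class ToyLoopManager:
    """Hypothetical stand-in for the object that drives the event loop."""

    def __init__(self):
        # Maps owner -> callback. An entry vanishes once its owner is collected.
        self._sample_callbacks = weakref.WeakKeyDictionary()

    def register(self, owner, callback):
        self._sample_callbacks[owner] = callback

    def run(self, sample_names):
        for name in sample_names:
            # Copy the values so the registry may shrink while we iterate.
            for callback in list(self._sample_callbacks.values()):
                callback(name)
        # Deliberately no clearing here: callbacks survive the event loop.


class ToyOwner:
    """Hypothetical stand-in for an action or a per-sample define node."""


def demo():
    loop = ToyLoopManager()
    seen = []
    owner = ToyOwner()
    loop.register(owner, seen.append)

    loop.run(["a", "b"])
    loop.run(["a", "b"])  # the second run still triggers the callback

    del owner      # drop the only strong reference to the owner
    gc.collect()   # make collection explicit on non-refcounting interpreters
    loop.run(["c"])  # nothing is registered any more, so "c" is not recorded
    return seen
```

Calling `demo()` records the samples of both runs and nothing from the run that happens after the owner is destroyed.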
@vepadulano vepadulano requested a review from eguiraud as a code owner October 3, 2023 12:12
@vepadulano vepadulano self-assigned this Oct 3, 2023
@phsft-bot
Collaborator

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac11/noimt, mac12arm/cxx20, windows10/default

@vepadulano
Member Author

vepadulano commented Oct 3, 2023

As purely anecdotal evidence: on the root-ubuntu-2004-1 machine I have been running the distributed RDF test that used to fail, which includes calls to DefinePerSample. With this patch applied, ~1800 iterations of the test have passed so far without failures.

test_all.py::TestPropagateExceptions::test_runtime_error_is_propagated <- check_backend.py PASSED
test_all.py::TestDefinePerSample::test_definepersample_simple <- check_definepersample.py PASSED
test_all.py::TestDefinePerSample::test_definepersample_withinitialization <- check_definepersample.py PASSED
=============================== warnings summary ===============================
test_all.py::TestPropagateExceptions::test_runtime_error_is_propagated
  /home/sftnight/vpadulan/rootproject/rootbuild/master-like-jenkins/lib/ROOT/_facade.py:154: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    return _orig_ihook(name, *args, **kwds)

test_all.py::TestPropagateExceptions::test_runtime_error_is_propagated
  /usr/local/lib/python3.8/dist-packages/dask_jobqueue/core.py:20: FutureWarning: tmpfile is deprecated and will be removed in a future release. Please use dask.utils.tmpfile instead.
    from distributed.utils import tmpfile

-- Docs: https://docs.pytest.org/en/latest/warnings.html
==================== 3 passed, 2 warnings in 32.72 seconds =====================

Running test 1864
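
A repeated-run check like the one above can be scripted in a few lines. The sketch below is an assumption about how such a loop might look, not the script that was actually used: `run_until_failure` is a hypothetical helper that reruns a command and reports how many consecutive iterations passed before the first failure.

```python
import subprocess
import sys


def run_until_failure(command, max_iterations):
    """Run `command` repeatedly; return the number of consecutive passes."""
    for iteration in range(1, max_iterations + 1):
        print(f"Running test {iteration}")
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Show the output of the failing iteration only, then stop.
            print(result.stdout)
            print(result.stderr, file=sys.stderr)
            return iteration - 1
    return max_iterations
```

For instance, `run_until_failure([sys.executable, "-m", "pytest", "test_all.py", "-k", "TestDefinePerSample"], 2000)` would rerun only the DefinePerSample tests. The exact pytest invocation is an assumption based on the test names in the log.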

@phsft-bot
Collaborator

Build failed on ROOT-ubuntu2204/nortcxxmod.
Running on root-ubuntu-2204-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Warnings:

  • [2023-10-03T12:24:07.908Z] /home/sftnight/build/workspace/root-pullrequests-build/root/tree/dataframe/test/dataframe_definepersample.cxx:195:29: warning: comparison of integer expressions of different signedness: ‘long int’ and ‘std::vector<std::__cxx11::basic_string<char> >::size_type’ {aka ‘long unsigned int’} [-Wsign-compare]

@phsft-bot
Collaborator

Build failed on windows10/default.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2023-10-03T12:48:49.974Z] C:\build\workspace\root-pullrequests-build\root\core\imt\src\TTaskGroup.cxx(18,10): fatal error C1083: Cannot open include file: 'tbb/task_group.h': No such file or directory [C:\build\workspace\root-pullrequests-build\build\core\imt\Imt.vcxproj]
  • [2023-10-03T12:48:50.252Z] C:\build\workspace\root-pullrequests-build\root\core\imt\src\ROpaqueTaskArena.hxx(1,10): fatal error C1083: Cannot open include file: 'tbb/task_arena.h': No such file or directory [C:\build\workspace\root-pullrequests-build\build\core\imt\Imt.vcxproj]
  • [2023-10-03T12:48:51.398Z] C:\build\workspace\root-pullrequests-build\root\core\imt\src\ROpaqueTaskArena.hxx(1,10): fatal error C1083: Cannot open include file: 'tbb/task_arena.h': No such file or directory [C:\build\workspace\root-pullrequests-build\build\core\imt\Imt.vcxproj]

@vepadulano vepadulano force-pushed the rdf-definepersample-fix-callbacks branch from 20ddc0b to 4314008 Compare October 3, 2023 13:16
Member

@dpiparo dpiparo left a comment

Thanks a lot for this change, which is minimal but results from a series of long debugging sessions! I added 2 minimal comments. Besides those, LGTM.

Review comment on tree/dataframe/test/dataframe_cloning.cxx (outdated, resolved)
Review comment on tree/dataframe/test/dataframe_definepersample.cxx (outdated, resolved)
@vepadulano vepadulano force-pushed the rdf-definepersample-fix-callbacks branch from 4314008 to 5f33910 Compare October 3, 2023 14:01
Member

@eguiraud eguiraud left a comment

Thanks @vepadulano !

My only comment is that the tests seem to be too complicated compared to the root cause of the failure. The regression test for #12043, for example, could be simply:

bool flag = false;
auto df = ROOT::RDataFrame(1)
    .DefinePerSample("x", [&](unsigned int, const ROOT::RDF::RSampleInfo &) { flag = true; return 0; });
df.Count().GetValue();
EXPECT_TRUE(flag);
flag = false;
df.Count().GetValue();
EXPECT_TRUE(flag);

Simpler tests that go straight to the point are easier to debug when they break.

@vepadulano
Member Author

> Simpler tests that go straight to the point are easier to debug when they break.

That's a good point, yes. I will simplify the reproducer for the linked issue. I prefer to keep the reproducer of the cloning issue because it also mimics the extra machinery involved in creating different tasks, changing the RDF spec and cloning the actions in a specific way.

This is a reproducer test for some sporadic CI failures, e.g.

```python
========================================================================== FAILURES ===========================================================================
_______________________________________________________ TestDefinePerSample.test_definepersample_simple _______________________________________________________

self = <check_definepersample.TestDefinePerSample object at 0x13e0c6190>, connection = <Client: 'tcp://127.0.0.1:55253' processes=2 threads=2, memory=4.00 GiB>

    def test_definepersample_simple(self, connection):
        """
        Test DefinePerSample operation on three samples using a predefined
        string of operations.
        """

        df = Dask.RDataFrame(self.maintreename, self.filenames, daskclient=connection)

        # Associate a number to each sample
        definepersample_code = """
        if(rdfsampleinfo_.Contains(\"{}\")) return 1;
        else if (rdfsampleinfo_.Contains(\"{}\")) return 2;
        else if (rdfsampleinfo_.Contains(\"{}\")) return 3;
        else return 0;
        """.format(*self.samples)

        df1 = df.DefinePerSample("sampleid", definepersample_code)

        # Filter by the sample number. Each filtered dataframe should contain
        # 10 entries, equal to the number of entries per sample
        samplescounts = [df1.Filter("sampleid == {}".format(id)).Count() for id in [1, 2, 3]]

        for count in samplescounts:
>           assert count.GetValue() == 10
E           AssertionError

check_definepersample.py:62: AssertionError
-------------------------------------------------------------------- Captured stderr setup --------------------------------------------------------------------
RDataFrame::Run: event loop was interrupted
2023-09-08 14:51:57,002 - distributed.worker - WARNING - Compute Failed
Key:       dask_mapper-a92ac090-9407-4849-921a-d187ceffd3ed
Function:  dask_mapper
args:      (EmptySourceRange(exec_id=ExecutionIdentifier(rdf_uuid=UUID('5d67c0a7-58f4-488d-8e44-bb5aa0fac480'), graph_uuid=UUID('69353465-0a90-4eef-b101-a1eb93f0c13a')), id=0, start=0, end=50))
kwargs:    {}
Exception: "RuntimeError('C++ exception thrown:\\n\\truntime_error: Graph was applied to a mix of scalar values and collections. This is not supported.')"
```

This is due to Dask assigning two tasks to the same worker in the test with
the DefinePerSample calls. The Count operation would fail to report the correct
number of entries because the DefinePerSample callback was previously deleted
at the end of every event loop, in this case at the end of the first task's
event loop. Consequently, when the second task starts and picks up the same
RDataFrame to clone the action, the DefinePerSample callback would never
actually be called.
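
The failure mode described above can be reproduced with a toy model that needs neither ROOT nor Dask. Everything in this sketch is hypothetical (`ToyGraph`, `SAMPLE_IDS`, `run_two_tasks_on_one_worker`), and it only mimics the described behaviour: one graph object is reused by two tasks on the same worker, and a flag selects whether the per-sample callback is dropped at the end of each event loop.

```python
from collections import Counter

# Hypothetical mapping from sample name to the id a per-sample define returns.
SAMPLE_IDS = {"sample1": 1, "sample2": 2, "sample3": 3}


class ToyGraph:
    """Toy computation graph cached on a worker and reused across tasks."""

    def __init__(self, clear_callbacks_after_run):
        self.clear_callbacks_after_run = clear_callbacks_after_run
        self.sample_id = 0
        self.callbacks = [self._on_new_sample]

    def _on_new_sample(self, sample_name):
        # Plays the role of the per-sample callback: refresh the column value.
        self.sample_id = SAMPLE_IDS.get(sample_name, 0)

    def run_task(self, samples):
        """Process (sample_name, n_entries) pairs; count entries per sample id."""
        counts = Counter()
        for sample_name, n_entries in samples:
            for callback in self.callbacks:
                callback(sample_name)
            counts[self.sample_id] += n_entries
        if self.clear_callbacks_after_run:
            # Old behaviour in this model: callbacks dropped after every loop.
            self.callbacks.clear()
        return counts


def run_two_tasks_on_one_worker(clear_callbacks_after_run):
    graph = ToyGraph(clear_callbacks_after_run)
    first = graph.run_task([("sample1", 10), ("sample2", 10)])
    second = graph.run_task([("sample3", 10)])  # same graph, second task
    return dict(first + second)
```

With eager clearing, the second task never refreshes the sample id, so its 10 entries are attributed to sample 2 and the per-sample counts are no longer 10 each. Without clearing, every sample gets exactly 10 entries.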
Member

@dpiparo dpiparo left a comment

Thanks a lot Vincenzo! For me it's ready to be merged.

@vepadulano vepadulano force-pushed the rdf-definepersample-fix-callbacks branch from 5f33910 to 1748a80 Compare October 3, 2023 15:15
@vepadulano vepadulano merged commit 07872d9 into root-project:master Oct 3, 2023
Successfully merging this pull request may close these issues.

Wrong interaction of DefinePerSample with multiple executions
4 participants