Add streaming concurrency tests #5437

svotaw · 2022-08-23T22:37:17Z

This PR adds concurrency testing to new streaming APIs added in PR #5299.

It also attempts to fix a concurrency issue in the SparseBin PushRow API. Without this fix, concurrent calls to the LGBM_DatasetPushRowsByCSRWithMetadata API often fail.

svotaw · 2022-08-24T22:31:06Z

This PR is 99% new C++ tests, which pass on the pipeline. I'm not sure why there are so many other pipeline failures (the first commit showed no such failures). The 1-line change in production code seems to be innocuous for Python tests, but feel free to tell me otherwise.

Although this helps with concurrency failures locally, we'll need to test this change further once checked in. We have a concurrency failure that we only see in our CI Linux tests.

jameslamb

Thanks for your continued work on LightGBM, @svotaw !

we'll need to test this change further once checked in. We have a concurrency failure that we only see in our CI Linux tests

I'm not sure who "we" / "our" refers to in this statement or what you mean by "once checked in", so I want to be sure to set the right expectation...we will not merge pull requests that do not pass all CI jobs into master in this repository.

This PR will not be merged until all of the existing CI jobs are passing.

The 1-line change ... seems to be innocuous for python tests, but feel free to tell me otherwise

When working on LightGBM, it's important to understand that the Python package's tests are the most comprehensive tests this project has for its C/C++ code. Those tests cover more of the project's C/C++ code than the dedicated tests in https://github.com/microsoft/LightGBM/tree/master/tests/cpp_tests.

The R package tests also cover a significant portion of the C/C++ code, and as of #5312 at least one code path not covered by the Python tests (LGBM_BoosterPredictForMatSingleRowFast() and related code).

So when you push a change to C/C++ code in this project and see most of the R and Python tests failing, that is a strong indication that the change you've pushed is a breaking change to LightGBM.

Maybe you've uncovered a correctness bug while trying to fix a source of runtime exceptions (which would be great!) or maybe this change has a bigger impact on model training than just fixing a runtime exception related to concurrency issues. I'm not experienced enough with C++ or the implementation of SparseBin to say for sure. @guolinke or @shiyu1994 should be able to help.

Please let me know if you need help with how to run only the failing Python test cases locally (so you can work on this with a faster feedback cycle than waiting for all of this project's CI jobs to run).

src/io/sparse_bin.hpp

svotaw · 2022-08-25T21:30:21Z

I'm not sure who "we" / "our" refers to in this statement or what you mean by "once checked in", so I want to be sure to set the right expectation...we will not merge pull requests that do not pass all CI jobs into master in this repository.

@jameslamb By "we", I mean our SynapseML wrapper. And yes, of course I would never expect anything to be checked in to LightGBM without passing all tests. :) I just thought I'd check and see if there were any current CI problems that might have caused those other things to fail. They were passing on an earlier iteration of this that included the only production code change that is left. But if there are no other issues, I will investigate further. Probably this change then.

guolinke · 2022-08-25T23:01:48Z

@svotaw I am quite confused about this PR, what does "concurrency" mean?
The PushRow is already designed for multi-threading, and LightGBM also uses a multi-threading solution to construct Dataset.

svotaw · 2022-08-26T04:39:57Z

@svotaw I am quite confused about this PR, what does "concurrency" mean? The PushRow is already designed for multi-threading, and LightGBM also uses a multi-threading solution to construct Dataset.

You can also say multi-threading instead of concurrency. I mean the same thing.

The multi-threaded tests in this PR were just added since someone asked for them in the previous PR. I said that I would add them, and this is the followup.

And by the way, the external LGBM_DatasetPushRowsByCSR is NOT thread safe, which is why I modified LGBM_DatasetPushRowsByCSRWithMetadata to be able to handle being called by multiple threads. The old LGBM_DatasetPushRowsByCSR API will crash if you call it from multiple threads. We found that out empirically. Each external thread will generate the same set of OpenMP thread IDs, which means they will attempt to push to the same sparse buffers and eventually fail. LGBM_DatasetPushRowsByCSRWithMetadata is supposed to be completely thread safe (although see next).
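The collision described above can be illustrated with a minimal sketch. This is not LightGBM's code: `SharedSlot`, `DisjointSlot`, and `CountDistinctSlots` are hypothetical names, and the offset formula is only an assumption about how per-caller buffers could be kept apart.

```cpp
#include <cassert>
#include <set>

// Hypothetical: the buffer slot chosen when only the OpenMP thread id is used.
// Every external caller maps OpenMP thread `omp_tid` to the same slot.
int SharedSlot(int /*external_tid*/, int omp_tid) {
  return omp_tid;
}

// Assumed remedy: give each external thread its own block of
// `max_omp_threads` slots, so no two callers share a buffer.
int DisjointSlot(int external_tid, int omp_tid, int max_omp_threads) {
  assert(omp_tid >= 0 && omp_tid < max_omp_threads);
  return external_tid * max_omp_threads + omp_tid;
}

// Counts the distinct slots touched by `n_external` callers that each
// run `max_omp_threads` OpenMP threads.
int CountDistinctSlots(int n_external, int max_omp_threads, bool disjoint) {
  std::set<int> slots;
  for (int ext = 0; ext < n_external; ++ext) {
    for (int omp = 0; omp < max_omp_threads; ++omp) {
      slots.insert(disjoint ? DisjointSlot(ext, omp, max_omp_threads)
                            : SharedSlot(ext, omp));
    }
  }
  return static_cast<int>(slots.size());
}
```

With 4 external callers and 8 OpenMP threads each, the shared scheme touches only 8 slots (so callers overlap), while the offset scheme touches 32.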

As far as changing that one production code line in sparse_bin.hpp, that is an experiment. Even with the thread-safe design, we still see some crash failures in SparseBin.PushRow() in our own tests using the top of LightGBM main (we made a Maven package of it for SynapseML). They have been hard to debug, because they aren't reproducible locally and have only occurred in our CI pipeline runs (making it hard to get crash dumps). We only know the method that fails. I was using this PR to help find the issue, and making that change seemed to make the tests pass reliably. But admittedly, I'm not convinced yet that it is needed, so I will experiment some more before asking to check this in.

The changes to tests (99% of this PR) should have no production impact, as long as they pass reliably and don't cause CI issues. They are just to have the extra multi-threaded coverage. Note that these tests only work for LGBM_DatasetPushRowsByCSRWithMetadata. LGBM_DatasetPushRowsByCSR would fail if tested the same way.

Note that I manually made threads in the test since OpenMP does not seem to work for tests. I tried, but settled on manual thread creation.
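For readers unfamiliar with the pattern, manual thread creation in a test might look roughly like the sketch below. It is not the code in test_stream.cpp: `PushBatch` and `PushConcurrently` are hypothetical stand-ins, and the "push" is just a copy into a destination sized up front.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a thread-safe push call: copies rows
// [start, start + count) into a destination that was sized up front,
// so no two writers touch the same element.
void PushBatch(const std::vector<double>& src, std::vector<double>* dst,
               int32_t start, int32_t count) {
  for (int32_t i = start; i < start + count; ++i) {
    (*dst)[i] = src[i];
  }
}

// Splits the rows into batches of `batch_size` and pushes each batch
// from its own std::thread, then joins all threads.
std::vector<double> PushConcurrently(const std::vector<double>& src,
                                     int32_t batch_size) {
  assert(batch_size > 0);
  const int32_t nrows = static_cast<int32_t>(src.size());
  std::vector<double> dst(src.size(), 0.0);
  std::vector<std::thread> threads;
  threads.reserve((nrows + batch_size - 1) / batch_size);
  for (int32_t start = 0; start < nrows; start += batch_size) {
    const int32_t count = std::min(batch_size, nrows - start);
    threads.emplace_back(PushBatch, std::cref(src), &dst, start, count);
  }
  for (auto& t : threads) {
    t.join();
  }
  return dst;
}
```

A test can then compare the result against the source data for several batch sizes, in the same spirit as the `batch_counts` parametrization in the PR.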

svotaw · 2022-08-26T04:51:56Z

@guolinke here is the error we see with multi-threaded sparse data in our SynapseML tests:
JRE version: OpenJDK Runtime Environment (8.0_345-b01) (build 1.8.0_345-b01)
Java VM: OpenJDK 64-Bit Server VM (25.345-b01 mixed mode linux-amd64 compressed oops)
Problematic frame:
C [lib_lightgbm.so+0x18a72b] LightGBM::SparseBin::Push(int, int, unsigned int)+0x3b

It might be something other than a threading issue but given the debugging I've done it seems highly likely.

jameslamb · 2022-08-31T22:09:10Z

@shiyu1994 or @guolinke can you please re-review this, now that it's only testing changes and CI is passing?

passing on an earlier iteration of this that included the only code change that is left

I don't think that's true.

The first commit in this PR, 62a97c2, which was passing all CI jobs except lint, contained only this change + testing changes:

Then cc3f431 moved that code to a different location and introduced another change.

From that commit until you removed the SparseBin changes in e796f5e, most of the Python and R jobs were failing.

Just pointing it out in case it helps you in your investigation.

svotaw · 2022-08-31T22:30:32Z

Yes, apologies, I was working on something else for the last week.

Yes, indeed it looks like you were right and the production code change was responsible for the failures. I have removed it and this PR can be reviewed as just test changes.

The SparseBin::Push method still crashes on our SynapseML CI Linux machines when run multithreaded, and I'm not sure why yet. But at least now we have tests here in this repo that prove it's not a general LightGBM problem and is something else.

jameslamb

Thanks for the changes!

In the future, please don't push PRs with whitespace-only changes like this unless those changes are directly related to the goal of the PR.

We rely on the git blame view as a source of history for changes to files, and whitespace-only changes that are not related to functionality add friction to the use of that view.

I'm approving this one since these tests are new and under such active development, and to avoid delaying you further.

svotaw · 2022-09-02T20:33:29Z

@jameslamb Ah, my apologies. The teams I've been on have always had the policy to do simple style fixes whenever they saw them, but I will avoid doing that here from now on.
Do you just make separate PRs for this kind of change?

jameslamb · 2022-09-02T20:37:00Z

Do you just make separate PRs for this kind of change?

Thanks very much! Yes, separate PRs are preferred for unrelated whitespace changes, typo/grammar fixes, etc. I personally have found that having all the changes in a PR's diff relate only to the stated purpose of the PR (in this example, "add more tests") makes PRs much easier to review.

And I personally use this project's git blame regularly to trace back to the relevant PRs that changed a specific section of the code.

StrikerRUS · 2022-09-04T17:22:37Z

tests/cpp_tests/test_stream.cpp

@@ -320,7 +320,7 @@ TEST(Stream, PushSparseRowsWithMetadata) {
  TestUtils::CreateRandomSparseData(nrows, ncols, nclasses, sparse_percent, &indptr, &indices, &vals, &labels, &weights, &init_scores, &groups);

  const std::vector<int32_t> batch_counts = { 1, nrows / 100, nrows / 10, nrows };
-  const std::vector<int8_t> creation_types = { 0, 1 };
+  const std::vector<int8_t> creation_types = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };


@svotaw
Why are there only zeros here now and no ones?
Previously creation_types = { 0, 1 }; made sense to me, but now I'm very confused by such test parametrization.

@StrikerRUS oops, yes need to revert this. It was a quick and dirty experiment just to force many repeats. Some of the early testing failures were only sporadic, so I added this to give me better confidence on no failures. I will fix.

I had planned to do a last pass over the PR to look for things like this, but it got checked in. Not used to someone else being in control of the checkin. :)

No problem, thanks a lot for your awesome work on LightGBM!

Luckily, we need to revert just one line of the code. 🙂
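As an aside on the repetition trick discussed in this thread: one way to repeat a scenario without distorting the parameter list is an explicit repeat count, sketched below. `RunRepeated` is a hypothetical helper, not part of the PR or of LightGBM's test utilities.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: runs `scenario` once per (repeat, creation_type)
// pair and returns how many runs reported failure (returned false).
template <typename Scenario>
int RunRepeated(const std::vector<int8_t>& creation_types, int repeats,
                Scenario scenario) {
  assert(repeats > 0);
  int failures = 0;
  for (int rep = 0; rep < repeats; ++rep) {
    for (int8_t creation_type : creation_types) {
      if (!scenario(creation_type)) {
        ++failures;
      }
    }
  }
  return failures;
}
```

This keeps `creation_types = { 0, 1 }` readable while still exercising sporadic failures many times. GoogleTest's `--gtest_repeat` command-line flag is another option for repeating whole tests without changing the test code.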

github-actions · 2023-08-19T03:23:04Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

Add streaming concurrency tests

62a97c2

svotaw requested review from guolinke and shiyu1994 as code owners August 23, 2022 22:37

svotaw added 3 commits August 23, 2022 16:52

linting fixes

892f4cc

testing fix

cc3f431

added some comments

76a917f

svotaw marked this pull request as draft August 24, 2022 20:49

removing thread vector dynamic resizing

43555dd

svotaw marked this pull request as ready for review August 24, 2022 22:24

jameslamb self-requested a review August 25, 2022 04:44

jameslamb requested changes Aug 25, 2022

jameslamb mentioned this pull request Aug 25, 2022

[R-package] [ci] unit test failures don't cause MSVC CI jobs to fail #5439

Closed

jameslamb added the in progress label Aug 25, 2022

guolinke reviewed Aug 25, 2022

src/io/sparse_bin.hpp (outdated, resolved)

reverting SparseBin change

e796f5e

StrikerRUS added the maintenance label Aug 28, 2022

jameslamb added awaiting review and removed in progress labels Aug 31, 2022

guolinke approved these changes Sep 2, 2022

svotaw requested review from jameslamb and removed request for shiyu1994 September 2, 2022 19:56

jameslamb approved these changes Sep 2, 2022

jameslamb removed the awaiting review label Sep 3, 2022

jameslamb changed the title from "fix: Add streaming concurrency tests" to "Add streaming concurrency tests" Sep 3, 2022

jameslamb merged commit d0ea321 into microsoft:master Sep 3, 2022

StrikerRUS reviewed Sep 4, 2022

svotaw deleted the streaming-tests branch September 4, 2022 20:22

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
