-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add streaming concurrency tests #5437
Conversation
This PR is 99% new CPP tests, which pass on the pipeline. I'm not sure why there are so many other pipeline failures (first commit showed no such failures). The 1-line change in production code seems to be innocuous for python tests, but feel free to tell me otherwise. Although this helps with concurrency failures locally, we'll need to test this change further once checked in. We have a concurrency failure that we only see in our CI Linux tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your continued work on LightGBM, @svotaw !
we'll need to test this change further once checked in. We have a concurrency failure that we only see in our CI Linux tests
I'm not sure who "we" / "our" refers to in this statement or what you mean by "once checked in", so I want to be sure to set the right expectation...we will not merge pull requests that do not pass all CI jobs into master
in this repository.
This PR will not be merged until all of the existing CI jobs are passing.
The 1-line change ... seems to be innocuous for python tests, but feel free to tell me otherwise
When working on LightGBM, it's important to understand that the Python package's tests are the most comprehensive tests this project has for its C/C++ code. Those tests cover more of the project's C/C++ code than the dedicated tests in https://github.com/microsoft/LightGBM/tree/master/tests/cpp_tests.
The R package tests also cover a significant portion of the C/C++ code, and as of #5312 at least one code path not covered by the Python tests (LGBM_BoosterPredictForMatSingleRowFast()
and related code).
So when you push a change to C/C++ code in this project and see most of the R and Python tests failing, that is a strong indication that the change you've pushed is a breaking change to LightGBM.
Maybe you've uncovered a correctness bug while trying to fix a source of runtime exceptions (which would be great!) or maybe this change has a bigger impact on model training than just fixing a runtime exception related to concurrency issues. I'm not experienced enough with C++ or the implementation of SparseBin
to say for sure. @guolinke or @shiyu1994 should be able to help.
Please let me know if you need help with how to run only the failing Python test cases locally (so you can work on this with a faster feedback cycle than waiting for all of this project's CI jobs to run).
@jameslamb By "we", I mean our SynapseML wrapper. And yes, of course I would never expect anything to be checked in to LightGBM without passing all tests. :) I just thought I'd check and see if there were any current CI problems that might have caused those other things to fail. They were passing on an earlier iteration of this that included the only production code change that is left. But if there are no other issues, I will investigate further. Probably this change then. |
@svotaw I am quite confused about this PR, what does "concurrency" mean? |
You can also say multi-threading instead of concurrency. I mean the same thing. The multi-threaded tests in this PR were just added since someone asked for them in the previous PR. I said that I would add them, and this is the followup. And by the way, the external LGBM_DatasetPushRowsByCSR is NOT thread safe, which is why I modified LGBM_DatasetPushRowsByCSRWithMetadata to be able to handle being called by multiple threads. The old LGBM_DatasetPushRowsByCSR API will crash if you call it from multiple threads. We found that out empirically. Each external thread will generate the same set of OpenMP tid's, which means they will attempt to push to the same sparse buffers and eventually fail. LGBM_DatasetPushRowsByCSRWithMetadata is supposed to be completely thread safe (although see next). As far as changing that one production code line in SparseBin.hpp, that is an experiment. Even with the thread-safe design, we still see some crash failures in SparseBin.PushRow() in our own tests using the top of LightGBM main (we made a maven package of it for SynapseML). They have been hard to debug, because they aren't reproducible locally and have only occurred in our CI pipeline runs (making it hard to get crash dumps). We only know the method that fails. I was using this PR to help find the issue, and making that change seemed to make the tests pass reliably. But admittedly, I'm not convinced yet that it is needed, so will experiment some more before asking to check this in. The changes to tests (99% of this PR) should have no production impact, as long as they pass reliably and don't cause CI issues. They are just to have the extra multi-threaded coverage. Note that these tests only work for LGBM_DatasetPushRowsByCSRWithMetadata. LGBM_DatasetPushRowsByCSR would fail if tested the same way. Note that I manually made threads in the test since OpenMP does not seem to work for tests. I tried, but settled on manual thread creation. |
@guolinke here is the error we see with multi-threaded sparse data in our SynapseML tests: It might be something other than a threading issue but given the debugging I've done it seems highly likely. |
@shiyu1994 or @guolinke can you please re-review this, now that it's only testing changes and CI is passing?
I don't think that's true. The first commit in this PR, 62a97c2, which was passing all CI jobs except Then cc3f431 moved that code to a different location and introduced another change. From that commit until you removed the Just pointing it out in case it helps you in your investigation. |
Yes, apologies, was working on something else for last week. Yes, indeed it looks like you were right and the production code change was responsible for the failures. I have removed it and this PR can be reviewed as just test changes. The SparseBin:Push method still crashes on our Synapse CI linux machines when run multithreaded, and not sure why yet. But at least now we have tests here in this repo that prove it's not a general LightGBM problem and is something else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes!
In the future, please don't push PRs with whitespace-only changes like this
unless those changes are directly related to the goal of the PR.
We rely on the git blame view as a source of history for changes to files, and whitespace-only changes that are not related to functionality add friction to the use of that view.
I'm approving this one since these tests are new and under such active development, and to avoid delaying you further.
@jameslamb Ah, my apologies. The teams I've been on have always had the policy to do simple style fixes whenever they saw them, but I will avoid doing that here from now on. |
thanks very much! Yes, separate PRs are preferred for unrelated whitespace changes, typo/grammar fixes, etc. I personally have found that having all the changes in a PR's diff relate only to the stated purpose of the PR (in this example, "add more tests") makes PRs much easier to review. And I personally use this project's git blame regularly to trace back to the relevant PRs that changed a specific section of the code. |
@@ -320,7 +320,7 @@ TEST(Stream, PushSparseRowsWithMetadata) { | |||
TestUtils::CreateRandomSparseData(nrows, ncols, nclasses, sparse_percent, &indptr, &indices, &vals, &labels, &weights, &init_scores, &groups); | |||
|
|||
const std::vector<int32_t> batch_counts = { 1, nrows / 100, nrows / 10, nrows }; | |||
const std::vector<int8_t> creation_types = { 0, 1 }; | |||
const std::vector<int8_t> creation_types = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svotaw
Why there are only zeros here now and no any ones?
Previously creation_types = { 0, 1 };
made sense to me, but now I'm very confused by such test parametrization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StrikerRUS oops, yes need to revert this. It was a quick and dirty experiment just to force many repeats. Some of the early testing failures were only sporadic, so I added this to give me better confidence on no failures. I will fix.
I had planned to do a last pass over the PR to look for things like this, but it got checked in. Not used to someone else being in control of the checkin. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, thanks a lot for your awesome work on LightGBM!
Luckily, we need to revert just one line of the code. 🙂
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This PR adds concurrency testing to new streaming APIs added in PR #5299.
It also attempts to fix a concurrency issue in
SparseBin
PushRow API. WIthout this fix, concurrent calls often fail to theLGBM_DatasetPushRowsByCSRWithMetadata
API.