WARN JsonImpl: Error while sending REST request. Temporary disabling JSON requests #563
@edsonaoki this debug line indicates you are running everything on just one node:
You should see multiple comma-separated nodes here. Maybe we could debug over Skype (or some other conference-call system where you can share your desktop)? I'm sure it's a bug that we should fix, but with different cluster configurations it's difficult to reproduce some issues.
I also see this in the logs:
However, you requested a max of 3 executors. I wonder why you are only getting one.
Hi @imatiach-msft, the reason it only gets one executor is that the reduce operation has only 1 partition. However, when the training starts, it uses the prescribed number of partitions (800). Is there some step in LightGBM that involves shuffling / coalescing / repartitioning / collecting that explains the number of partitions changing from 800 to 1? Meanwhile I will try to extract some additional logs / screenshots. Unfortunately I cannot share the full logs, as they contain confidential information.
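As a quick sanity check (a minimal sketch; `train_df` is a placeholder name for the training dataframe), the partition count can be inspected right before calling fit:

```python
# Placeholder dataframe name; inspect partitioning just before calling fit()
num_parts = train_df.rdd.getNumPartitions()
print("partitions going into fit():", num_parts)

# If the count is unexpectedly low, an explicit repartition forces a shuffle
# back to a known partition count (800 is the value used in this thread)
train_df = train_df.repartition(800)
```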
By the way, my data is extremely unbalanced (roughly 1 positive label for every few hundred negative labels); does this have any impact on how LightGBM generates the partitions?
Another piece of information: if I repartition the training dataset immediately before calling the fit function, LightGBM will use more than 1 partition (but still far fewer than the repartition number).
@edsonaoki yes, there is a known issue where LightGBM gets stuck in training if some partition has only a single label:
@edsonaoki ah, I should have noticed this earlier; I know this issue, you are using the limit function from the code above. I just saw this a couple of weeks ago from another dev: "yeah... so the limit seems to pull everything into a single partition"
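To illustrate the limit behavior being described (a sketch with assumed names; the exact partition count after a limit can vary by Spark version and query plan):

```python
# limit() typically collapses the result into a single partition,
# which then starves the distributed training of parallelism
subset = train_df.limit(100000)
print(subset.rdd.getNumPartitions())   # frequently 1

# One possible workaround sketch: repartition the subset afterwards,
# or take a random sample instead of a hard limit
subset = subset.repartition(240)
sampled = train_df.sample(withReplacement=False, fraction=0.1, seed=42)
```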
@edsonaoki it sounds like the real problem we need to resolve is the NullPointerException you are getting.
@imatiach-msft you are right, I will take out the limit and do the label redistribution you suggested. However, note that even if I repartition just before the training, the number of partitions used by LightGBM is much smaller than the number of partitions of the dataframe. Is there any explanation for that?
@edsonaoki yes, that is by design, since LightGBM has to run all workers at the same time:
That was useful, I will try again, thanks!
@imatiach-msft I removed the "limit" and split the dataset into 240 partitions (say, partition set A), making sure all partitions have the same fraction (about 1/600) of positive to negative labels. When LightGBM starts the training, it coalesces to 224 partitions (say, partition set B). Assuming it's an actual coalesce and not a repartition, each partition in set B should contain one or more partitions from set A, and thus positive and negative labels in every partition. Could you confirm this understanding? Unfortunately, the executors still get "frozen" after the following debug messages:
From the driver's side, nothing interesting either:
Configuration is as follows:
You mentioned that LightGBM would create 1 partition per executor, but in this case it created 224 partitions when there are only 15 executors. Is that abnormal?
@edsonaoki it coalesces to the number of executor cores, so if you have 15 executors then there must be about 15 or 16 cores per executor. I'm not exactly sure how the 224 is calculated in your case. You may also be running into a special case where you have some empty partitions, which was fixed in this PR and should be in v0.17, so it shouldn't be a problem:
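A rough back-of-the-envelope check of this (assumed values; the exact coalesce logic is internal to MMLSpark and, as noted above, does not obviously produce 224 here):

```python
# If LightGBM runs roughly one worker per executor core, the coalesced
# partition count should be close to num_executors * cores_per_executor
num_executors = 15        # observed in the Spark UI (assumed value)
cores_per_executor = 15   # spark.executor.cores (assumed value)
print(num_executors * cores_per_executor)  # 225 -- close to, but not exactly, the observed 224
```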
hi @imatiach-msft, sorry for the delay. I ran a test today with verbosity=4; here is what I get from the executor's side:
And from the driver's side:
Spark configuration (Cloudera stack):
I disabled the connection to the Unravel server and the corresponding error messages disappeared:
Unfortunately, the worker nodes are still stuck at "19/05/27 09:12:40 INFO LightGBMClassifier: send current worker info to driver: ..." Note that I also set verbosity to 4. |
@imatiach-msft I noticed something else: the log shows "19/05/31 14:17:25 INFO LightGBMClassifier: driver expecting 36 connections...". Do I understand correctly that until the driver gets all 36 connections, it will not pass "driver writing back to all connections: ..." and the task will hang?
It's strange because I set the number of executors to 25 and the number of cores per executor to 1, so it creates only 25 executors. However, when I look at the number of tasks in the Spark UI, it shows 36, and the driver also expects 36 connections...
I found that the problem is related to YARN dynamic allocation. When dynamic allocation is turned on, LightGBM for some reason always creates more tasks than the actual number of executors (by default, YARN uses 1 core per executor). As a consequence, the driver hangs forever, since if there are X executors and X+Y tasks:
When I turn off dynamic allocation, the number of created tasks is not greater than the number of executors, so the training starts.
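For reference, a minimal sketch of the kind of Spark session settings that avoided the hang described here (values are illustrative, taken from this thread, not prescriptive):

```python
from pyspark.sql import SparkSession

# Dynamic allocation is not supported by the LightGBM integration, so the
# executor count is pinned explicitly and allocation is kept static
spark = (SparkSession.builder
         .appName("lightgbm-training")
         .config("spark.dynamicAllocation.enabled", "false")
         .config("spark.executor.instances", "25")
         .config("spark.executor.cores", "1")
         .getOrCreate())
```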
@edsonaoki sorry, yes, dynamic allocation is not supported; please see this issue:
@imatiach-msft is there a way to not have all the tasks start at the same time, i.e. have some tasks queued while others run, as in Spark MLlib GB? Currently I can't allocate more than a few dozen cores and a few GB of memory per core, which makes it pretty much impossible to run LightGBM training over the amount of data that I have if all the tasks start at the same time...
Finally, I managed to run LightGBM on a reasonably large dataset. These are my guidelines (for a Cloudera Spark 2.2 + YARN + CentOS 7 environment):
hi @edsonaoki, really sorry for all the trouble you had to go through to make this work. For issue 4: I recently fixed a problem where, if there are multiple cores per executor, the driver may expect more workers than there actually are and will wait until a timeout error, which surfaces as a java.lang.NullPointerException. The fix is not yet released; there is a private build with it here: #586. For this issue:
so there seems to be something wrong with that logic on your cluster. LightGBM needs to know all of the workers ahead of time for the NetworkInit call (each worker's host and port is needed to form the network ring where LightGBM's distributed processing logic takes over): https://github.com/microsoft/LightGBM/blob/master/include/LightGBM/c_api.h#L1000 I'm not quite sure how to fix the issue on your cluster, though. I need to make this initialization logic much more robust across different cluster configurations.
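As a conceptual illustration of the point above (not MMLSpark's actual code): the driver has to learn every worker's host and port up front, so each worker can be handed the full, comma-separated machine list that LightGBM's network initialization expects. The hosts and ports below are hypothetical.

```python
# Conceptual sketch only: hypothetical (host, port) pairs collected from executors
workers = [("10.0.0.1", 12400), ("10.0.0.2", 12401), ("10.0.0.3", 12402)]

# LightGBM's network init takes a comma-separated "host:port" machine list;
# every worker must receive the same complete list to form the ring
machine_list = ",".join("%s:%d" % (host, port) for host, port in workers)
print(machine_list)  # 10.0.0.1:12400,10.0.0.2:12401,10.0.0.3:12402

# If any expected worker never reports in, the list stays incomplete and the
# driver keeps waiting -- which matches the hang observed in this thread
```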
hi @imatiach-msft, thanks, I understand. I will close this issue so that we can focus on the other problems.
Again, a CentOS 7 problem, this time with the mmlspark_2.11:0.17.dev1+1.g5e0b2a0 version that @imatiach-msft posted on #514. Running Pyspark 3 in yarn-cluster mode over Cloudera (LightGBM does not work at all in yarn-client mode).
Code:
The dataframe size is about 3 GB.
Spark configuration:
A lot of weird stuff happens, so let me go through it one by one:
The code works normally when applied to a small set of data (i.e. much smaller than 3 GB).
The piece of code
looks completely useless. However, if I don't use it, LightGBM will start training the model with a very small number of partitions, eventually leading to the same NullPointerException error described in #514 (see the sketch after this list for the general pattern).
The training seems to start with a "reduce" operation with 0.8 x 1000 = 800 partitions, as prescribed. However, afterwards it will invariably start another "reduce" operation with a single partition and with the entire 3 GB of data as input to a single executor. This was the reason I used a small number of executors with a large executor memory; otherwise I get an OutOfMemory error.
The last lines of the log in the executor that runs on the single partition are:
Afterwards, it gets into a connectivity timeout error of some sort (that's why I increased "spark.network.timeout" and "spark.executor.heartbeatInterval"). When I increase both timeouts, it finally fails with:
Note that from the driver side, the final lines of the log are:
Any help would be appreciated.
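The code block referenced above is not shown in the thread. Based on the description (it prevents LightGBM from collapsing to a handful of partitions), it may have been an explicit repartition before fit; the sketch below is purely a hypothetical reconstruction of that general pattern. The import path, file path, column names, and parameters are all assumptions and may differ by mmlspark version.

```python
from pyspark.sql import SparkSession
# Import path varies across mmlspark releases; this form is an assumption
from mmlspark import LightGBMClassifier

spark = SparkSession.builder.appName("lightgbm-example").getOrCreate()

df = spark.read.parquet("/path/to/training_data")     # placeholder path
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# The seemingly "useless" step discussed above: forcing a known partition
# count right before fit() so LightGBM does not end up training on a
# single (or very few) partition(s)
train_df = train_df.repartition(800)

model = LightGBMClassifier(labelCol="label",
                           featuresCol="features",
                           objective="binary").fit(train_df)
```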