Infinite connection retry when starting two training processes #3839

Closed
wjsi opened this issue Jan 24, 2021 · 2 comments · Fixed by #3840

@wjsi
Contributor

wjsi commented Jan 24, 2021

How you are using LightGBM?

We are using LightGBM in our project Mars to train models in a distributed fashion. When we test our modules with separate processes to mock what happens in a distributed environment, sometimes two processes cannot connect to each other and LightGBM retries the connection in an infinite loop.

We discovered that all ports are opened successfully. The cause of the connection failure is that in

TcpSocket cur_socket;
int connect_fail_delay_time = connect_fail_retry_first_delay_interval;
for (int i = 0; i < connect_fail_retry_cnt; ++i) {
  if (cur_socket.Connect(client_ips_[out_rank].c_str(), client_ports_[out_rank])) {
    break;
  } else {
    Log::Warning("Connecting to rank %d failed, waiting for %d milliseconds", out_rank, connect_fail_delay_time);
    std::this_thread::sleep_for(std::chrono::milliseconds(connect_fail_delay_time));
    connect_fail_delay_time = static_cast<int>(connect_fail_delay_time * connect_fail_retry_delay_factor);
  }
}

when a connection attempt fails, the same socket handle is reused for the next attempt, the OS reports a bad file descriptor, and no subsequent connection attempt can succeed.

We created a PR that recreates the socket handle on every connection attempt.
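
To illustrate the idea behind that fix outside of LightGBM, here is a minimal Python sketch of a retry loop that opens a brand-new socket for every attempt. It is not the code from the PR: the function name `connect_with_retry` and its parameters are hypothetical, and it uses Python's standard `socket` module rather than LightGBM's `TcpSocket` class.

```python
import socket
import time


def connect_with_retry(host, port, retries=5, first_delay_ms=200, delay_factor=1.3):
    """Hypothetical helper: try to connect, using a fresh socket per attempt.

    Returns the connected socket, or None if every attempt failed.
    """
    delay_ms = first_delay_ms
    for _ in range(retries):
        # Create a new socket each time. After a failed connect() the state
        # of the old socket is unspecified, so it is closed, not reused.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock
        except OSError:
            sock.close()
            time.sleep(delay_ms / 1000.0)
            # Grow the wait between attempts, truncating to whole milliseconds.
            delay_ms = int(delay_ms * delay_factor)
    return None
```

The only difference from the loop quoted above is where the socket is created: inside the loop instead of once before it, so a failed attempt never leaves a stale handle behind for the next one.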

Choose one of the following components

  • Python package

Environment info

Operating System: MacOS 10.15.7

CPU/GPU model: Intel Core i7

Python version: 3.8.5

LightGBM version or commit hash: ac706e1

Error message and / or logs

[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 36838...
[LightGBM] [Info] Binding port 36838 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 36839...
[LightGBM] [Info] Binding port 36839 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2113 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2746 milliseconds
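
The growing waits in the log follow from the retry loop quoted earlier: each delay is the previous one multiplied by a factor and truncated to an integer. The short Python sketch below reproduces that arithmetic. The starting value of 200 ms is read off the first warning, and the factor of 1.3 is inferred from the ratio between consecutive waits in the log, not taken from the LightGBM source; `backoff_schedule` is a hypothetical name.

```python
def backoff_schedule(first_delay_ms=200, delay_factor=1.3, attempts=11):
    """Hypothetical helper: list the wait (in ms) before each retry."""
    delays = []
    delay = first_delay_ms
    for _ in range(attempts):
        delays.append(delay)
        # Mirrors static_cast<int>(delay * factor): multiply, then truncate.
        delay = int(delay * delay_factor)
    return delays


print(backoff_schedule())
```

With these assumed values the sequence starts 200, 260, 338, 439, matching the waits printed above.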

Reproducible example(s)

We made a minimal example to demonstrate the issue.

import random
from concurrent.futures import ProcessPoolExecutor

import lightgbm
import numpy as np

N_ROWS = 10000
N_COLS = 10


def fit_part(x, y, ports, idx):
    params = dict(
        machines=','.join([f'127.0.0.1:{port}' for port in ports]),
        time_out=3600,
        num_machines=len(ports),
        local_listen_port=ports[idx],
        tree_learner='data',
    )
    model = lightgbm.LGBMRegressor(**params)
    model.fit(x, y)
    return model


def main():
    rs = np.random.RandomState(0)
    start_port = random.randint(10000, 60000)
    ports = [start_port, start_port + 1]

    X = rs.rand(N_ROWS, N_COLS)
    y = rs.rand(N_ROWS)

    proc_pool = ProcessPoolExecutor(2)

    try:
        f1 = proc_pool.submit(fit_part, X[:N_ROWS // 2, :], y[:N_ROWS // 2], ports, 0)
        f2 = proc_pool.submit(fit_part, X[N_ROWS // 2:N_ROWS, :], y[N_ROWS // 2:N_ROWS], ports, 1)

        f1.result()
        f2.result()
    except KeyboardInterrupt:
        pass


if __name__ == '__main__':
    main()
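
When running an example like the one above, a randomly chosen port may occasionally already be in use, which can be confused with the retry problem reported here. As a hedged variation, the ports could be picked by asking the OS for free ones. `find_free_ports` is a hypothetical helper, not part of LightGBM or Mars, and there is still a small window in which another process could take a port after it is released.

```python
import socket


def find_free_ports(n, host="127.0.0.1"):
    """Hypothetical helper: ask the OS for n currently unused TCP ports."""
    socks = []
    try:
        for _ in range(n):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            # Binding to port 0 lets the OS choose an unused port.
            s.bind((host, 0))
            socks.append(s)
        # Keep all sockets open until every port is chosen, so the
        # returned ports are distinct.
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()
```

In the example above, `ports = find_free_ports(2)` could replace the `random.randint` based selection in `main()`.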
@jameslamb
Collaborator

jameslamb commented Jan 24, 2021

Thank you for this report! I have seen similar issues on Mac (but not in Linux environments), but hadn't been able to reproduce them cleanly or describe them this clearly.

@StrikerRUS the fix for this MIGHT help with #3782

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023