[Python][Flight] pyarrow.flight.FlightServerBase use threading instead of multiprocessing? #236

Open
tianjiqx opened this issue Jul 29, 2022 · 12 comments

Comments

@tianjiqx

I tried Arrow Flight, but when testing performance (do_exchange), I found that CPU usage tops out at around 100%. This may be the GIL problem with Python multi-threading.

I didn't find any way to set the thread pool for FlightServerBase, so it seems that the only way to improve performance is to use multiprocessing.Pool internally to handle the working logic.

Do you have any other suggestions? Thanks!

@tianjiqx
Author

I've searched for similar questions and maybe learned something, but anyone with more suggestions is welcome.

@tianjiqx tianjiqx changed the title [Python] pyarrow.flight.FlightServerBase use threading instead of multiprocessing? [Python][Flight] pyarrow.flight.FlightServerBase use threading instead of multiprocessing? Jul 29, 2022
@lidavidm
Member

Yes, if you have heavy internal processing, your only real choice is to use multiprocessing. Depending on what you want to do, you could also set the SO_REUSEPORT gRPC option and spawn many processes listening to the same port.

There's no option to set the thread pool because 1) it's provided by gRPC and 2) it wouldn't help anyways, you're still limited by the GIL.
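As a rough illustration of the multiprocessing route mentioned above, here is a minimal sketch that pushes CPU-bound work into worker processes using only the standard library. It is not Flight-specific: `cpu_heavy` and `process_all` are hypothetical names standing in for whatever heavy processing a handler would do, and arguments and results are pickled between processes, so large payloads pay a copy cost.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n: int) -> int:
    """Hypothetical stand-in for CPU-bound, pure-Python work (holds the GIL)."""
    return sum(i * i for i in range(n))


def process_all(inputs, max_workers=None):
    """Run cpu_heavy on each input in separate worker processes.

    Each worker process has its own interpreter and its own GIL, so the
    calls can use several cores. Inputs and results are pickled to cross
    the process boundary.
    """
    workers = max_workers or os.cpu_count()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cpu_heavy, inputs))


if __name__ == "__main__":
    # Four independent chunks of work, spread over the worker processes.
    print(process_all([100_000] * 4))
```

Whether this pays off depends on how the per-request compute compares with the cost of moving data between processes.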

@tianjiqx
Author

@lidavidm Thank you for your reply. My program still has some I/O. Considering the copy cost of inter-process communication, it may be better to start more FlightServer instances.

@lidavidm
Member

Yeah, it does make it hard to build a CPU-intensive service in Python.

Maybe the SO_REUSEPORT option can be an Arrow Cookbook recipe - is that a good solution here?

@tianjiqx
Author

I probably don't understand how to use SO_REUSEPORT. Sometimes the processes can bind to the same port, but the clients I create always connect to the same server.

import contextlib
import socket as py_socket
from multiprocessing import Process

# `server` (the object providing start()) is defined elsewhere and not shown.


@contextlib.contextmanager
def _init_grpc_port(grpc_port):
    """Initialize grpc port for multiprocessing."""
    # Reference code.
    # https://github.com/grpc/grpc/issues/17659

    sock = py_socket.socket(py_socket.AF_INET6, py_socket.SOCK_STREAM)
    # In fact, py_socket.SOL_SOCKET = 65535 here.
    # print(py_socket.SOL_SOCKET)

    sock.setsockopt(py_socket.SOL_SOCKET, py_socket.SO_REUSEADDR, 1)
    if sock.getsockopt(py_socket.SOL_SOCKET, py_socket.SO_REUSEADDR) == 0:
        raise RuntimeError("Failed to set SO_REUSEADDR.")
    sock.setsockopt(py_socket.SOL_SOCKET, py_socket.SO_REUSEPORT, 1)
    if sock.getsockopt(py_socket.SOL_SOCKET, py_socket.SO_REUSEPORT) == 0:
        raise RuntimeError("Failed to set SO_REUSEPORT.")
    sock.bind(('', grpc_port))
    try:
        yield sock.getsockname()[1]
    finally:
        sock.close()


def main():
    # The socket is bound, then closed as soon as the with-block exits,
    # before any server process is started.
    with _init_grpc_port(5005) as port:
        pass

    for i in range(2):
        p = Process(target=server.start, args=(i, ))
        # If time.sleep(1) is added here, it always throws
        # "os_error":"Address already in use","syscall":"bind"
        p.start()

@lidavidm
Member

lidavidm commented Aug 1, 2022

The option has to be passed through gRPC (the code above effectively does nothing). See generic_options and the gRPC documentation.

I'll see about adding a code snippet in the cookbook.
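To make the point about the earlier snippet concrete: a socket option only affects the socket it is set on, and that helper socket was closed before the server processes created their own. The sketch below uses plain standard-library sockets, not Flight or gRPC, to show two sockets sharing one port when each sets SO_REUSEPORT before bind(). The helper name `bind_reuseport` is hypothetical, and the code assumes a platform where `socket.SO_REUSEPORT` exists (for example Linux or macOS).

```python
import socket


def bind_reuseport(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Return a TCP socket bound to (host, port) with SO_REUSEPORT enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # The option must be set on this very socket, and before bind().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
    except OSError:
        sock.close()
        raise
    return sock


if __name__ == "__main__":
    first = bind_reuseport(0)      # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    second = bind_reuseport(port)  # same port, while `first` is still open
    print("both sockets bound to port", port)
    first.close()
    second.close()
```

This says nothing about how to configure the option for a Flight server; it only shows why setting it on an unrelated, already-closed socket cannot change how the server's own sockets behave.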

@lidavidm lidavidm transferred this issue from apache/arrow Aug 1, 2022
@lidavidm lidavidm reopened this Aug 1, 2022
@abjidge

abjidge commented Apr 19, 2023

@lidavidm Can you please share or document the code snippet for this?

@lidavidm
Member

Ah, hmm, this isn't overridable from Python. Do you want to file an issue on the main repo?

@abjidge

abjidge commented Apr 19, 2023

@lidavidm, thank you for the quick response.

Yes, I will file an issue.
Is there any known workaround to use multiprocessing for a pyarrow Flight server?

@lidavidm
Member

Workaround? What is the issue?

@abjidge

abjidge commented Apr 20, 2023

I mean, is there any other way to use a pyarrow Flight server with multiprocessing until the above issue gets resolved?

@lidavidm
Member

Sorry, I don't understand. You can just use the multiprocessing module. Is there a problem when using it?
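One possible reading of "just use the multiprocessing module", sketched under assumptions: start several independent Flight servers, one per process, each on its own port. The names `locations_for`, `run_server`, `start_servers` and `ExampleServer` are hypothetical, and the port numbers are placeholders. Nothing here balances load; clients have to be told which port to connect to.

```python
import multiprocessing


def locations_for(base_port: int, count: int, host: str = "0.0.0.0") -> list[str]:
    """Return one gRPC location URI per server process, on consecutive ports."""
    return [f"grpc://{host}:{base_port + i}" for i in range(count)]


def run_server(location: str) -> None:
    """Entry point for one worker process: build and serve a Flight server."""
    # Imported here so that each child process loads pyarrow itself.
    import pyarrow.flight as flight

    class ExampleServer(flight.FlightServerBase):
        """Hypothetical server; override do_get, do_exchange, etc. as needed."""

    ExampleServer(location).serve()  # blocks until the server shuts down


def start_servers(base_port: int = 5005, count: int = 2) -> list[multiprocessing.Process]:
    """Start `count` server processes and return their Process handles."""
    procs = []
    for location in locations_for(base_port, count):
        proc = multiprocessing.Process(target=run_server, args=(location,))
        proc.start()
        procs.append(proc)
    return procs
```

To try it, call `start_servers()` from under an `if __name__ == "__main__":` guard and `join()` the returned processes. Each process has its own interpreter and GIL, so CPU-heavy handlers in different servers do not block one another.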
