No response from ElasticSearch when searching for a vector #92431
Comments
PS: I've refactored the query to the new (non-deprecated) form, but it makes no difference. Included for reference nevertheless:
Pinging @elastic/es-search (Team:Search)
Hi @christian-schlichtherle, thanks for reporting this. Could you post the exact query you are sending to Elasticsearch? Do your clients connect directly to Elasticsearch?
Here we go:
Here's a transcript - note that I had to type Ctrl-C to interrupt:
As you can see, after the
For context: We are using ES on K8s. In our development cluster, we are running ES clients in four pods from three different LANs. It's an IoT project - in production the clients are running in customer LANs. Therefore we are using OpenVPN to encrypt all K8s traffic between the IoT devices and the cloud. There is no network issue with other K8s workloads, and this query works fine in two of our three test LANs. I have checked the network configuration of all LANs and it's consistent. This makes me think that something about the network configuration somehow triggers a bug in the ES server. Otherwise, why would the request hang consistently in exactly one of three LANs?
Also, I have defined an ingress using the AWS Load Balancer Controller and then used the ingress address in the ELASTIC_URL instead of the K8s service address. This works consistently in all three LANs. The obvious reason is that now all requests from all client locations are routed through an AWS ALB, so the ES server sees the requests coming from the AWS ALB. This workaround also suggests that ES is somehow biased against some client network configurations. Of course, I want to get rid of this workaround, as routing all requests through the AWS ALB adds unnecessary latency and costs.
The problem with this type of issue is reproducing it. It could be a bug in the code, like a listener that never gets called, but that would not depend on network configuration. @elastic/es-distributed have you seen similar things in the past? Any idea how to further diagnose?
Understood! I'm also trying to isolate the problem further.
Some more context: The above query works fine on all pods in all LANs if I set
More context: I have been following the instructions on this page to enable request/response tracing. The result is that I cannot find any trace of the request when it actually hangs. I do get traces of requests and responses when the request is small enough, e.g. when
I suspect the explanation is that when the request hangs for some reason, it never arrives completely, so it's never logged, never processed and therefore never answered.
Thanks for the effort in reproducing the issue and understanding the cause. Based on your last comment around HTTP request tracing, I would first verify that the request does make it to Elasticsearch. The idea of a missing listener callback sounds less likely, because we would then see the request make it in. We log HTTP requests first thing, before they are dispatched internally, hence it is very surprising not to see the trace log.
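For readers who want to repeat the tracing step, here is a minimal sketch of the cluster settings update that turns on HTTP request tracing. It assumes the page referred to above is the Elasticsearch HTTP tracer documentation, which uses the `logger.org.elasticsearch.http.HttpTracer` logger; the base URL is a placeholder, and authentication and TLS are left out.

```python
import json
import urllib.request


def build_tracer_request(base_url: str, level: str = "TRACE") -> urllib.request.Request:
    """Build (but do not send) a PUT _cluster/settings request that sets
    the HTTP tracer log level. base_url is a placeholder for your cluster."""
    payload = {"persistent": {"logger.org.elasticsearch.http.HttpTracer": level}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical local address; replace with the real cluster endpoint.
req = build_tracer_request("http://localhost:9200")
print(req.get_method(), req.full_url)
```

Sending it is then a matter of `urllib.request.urlopen(req, timeout=10)` plus whatever credentials the cluster requires. If a hanging request never shows up in the resulting log, that points to the request not reaching Elasticsearch in full.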
Well, what can I do? As you can see from the transcript above, I can connect and send the request, so I guess this rules out any issues with firewalls, security groups, packet filters etc. Even if there were such an issue, I still do not understand why it works in two of three LANs.
Maybe a crazy idea, but could you send that same request to an endpoint other than _search (even an unsupported one) and see if you get anything back?
Yes, I can send requests to other endpoints from anywhere and they work just fine. I can even search for vectors from anywhere and it works just fine, for example the above request is working fine if I set
I am not sure I explained well what I meant. Could you send the same request that hangs to any endpoint other than search? Do you get a timely response (hopefully an error) back from Elasticsearch or does the request hang?
Good point! So I sent the same request to the endpoint
So now we know there is an issue with larger request bodies when sending requests from specific networks. The endpoint doesn't matter, and there is no TRACE of the request in the ES logs when it hangs. I can also rule out K8s services as the culprit because the behavior is the same when I address the IP and port of the
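Since the finding is that the body size matters and the endpoint does not, one way to narrow it down is to search for the largest body that still gets any answer. The following is only a sketch: `http_probe` and `find_largest_answered_size` are hypothetical helpers, the padding body is arbitrary, and the search assumes that every size below the threshold works and every size above it hangs, which the thread does not prove.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, size: int, timeout: float = 5.0) -> bool:
    """Return True if the server answers a POST carrying `size` padding bytes.
    An HTTP error status still counts as an answer; a timeout does not."""
    body = b'{"pad":"' + b"x" * size + b'"}'
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server responded, even if with an error
    except OSError:
        return False  # timeout, connection reset, unreachable, ...


def find_largest_answered_size(send: Callable[[int], bool], low: int, high: int) -> int:
    """Binary search for the largest size in [low, high] where send(size) is True.
    Assumes send(low) is True and that success is monotonic in size."""
    if not send(low):
        raise ValueError("even the smallest size got no answer")
    while low < high:
        mid = (low + high + 1) // 2
        if send(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Offline demonstration with a fake sender; the 1400-byte cutoff is made up.
fake_send = lambda size: size <= 1400
threshold = find_largest_answered_size(fake_send, 1, 4096)
print(threshold)
```

Against a real cluster, the sender would be something like `lambda n: http_probe(url, n)`, run once from a working LAN and once from the failing one. A threshold that exists only in the failing LAN would support the conclusion that the problem sits on the network path rather than in the endpoint.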
That seems to exclude software bugs in the search endpoint itself. And again, if you don't see the request trace in the logs, I would look outside of Elasticsearch and make sure that the request does make it to Elasticsearch. I am closing this issue as we have no evidence that this is an Elasticsearch issue. Let us know if we can help any further.
Elasticsearch Version
8.5.3
Installed Plugins
No response
Java Version
openjdk version "19.0.1" 2022-10-18 OpenJDK Runtime Environment (build 19.0.1+10-21) OpenJDK 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)
OS Version
Linux test-es-default-0 5.15.0-1026-aws #30-Ubuntu SMP Wed Nov 23 17:01:09 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux
Problem Description
I'm trying to search for a vector with 128 scalars. To reproduce this issue, you can use the following test request where the scalars all have the same value, which is 0.1234:
I'm running this test on four different clients, which are located in different networks. In the first two networks, I get a response (not included here for brevity). In the third network, I get no response from the server at all. Here's a transcript before I have to interrupt using ^C:
So the connection is established, the request is sent, just there is no response ever.
This looks like a network issue, so I checked the network configuration: DNS, address ranges, MTU, packet filters etc. They all look very similar, nothing suspicious.
Next, I've set SCALAR=0.123 - one digit less. Now guess what: It works on all four clients in all three networks! I get a complete response (not included here for brevity).
Is it possible that ElasticSearch is allergic to some specific combination of request and network?
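To make the effect of dropping one digit concrete, here is a minimal sketch that builds two kNN search bodies with 128 identical scalars and compares their sizes. The original test request is not reproduced here, so the field name `embedding` and the `k` and `num_candidates` values are assumptions for illustration only.

```python
import json


def build_knn_body(scalar: str, dims: int = 128, field: str = "embedding") -> bytes:
    """Serialize a kNN search body whose query vector repeats one scalar.
    The field name, k and num_candidates are hypothetical values."""
    body = {
        "knn": {
            "field": field,
            "query_vector": [float(scalar)] * dims,
            "k": 10,
            "num_candidates": 100,
        }
    }
    # Compact separators so the size depends only on the content.
    return json.dumps(body, separators=(",", ":")).encode("utf-8")


long_body = build_knn_body("0.1234")
short_body = build_knn_body("0.123")
size_difference = len(long_body) - len(short_body)
print(len(long_body), len(short_body), size_difference)
```

One digit less per scalar shrinks the body by exactly 128 bytes, so the two requests differ only in length, not in structure. That makes a size limit somewhere on the path a more natural suspect than the vector values themselves.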
Steps to Reproduce
See above.
Logs (if relevant)
See above.