
No response from ElasticSearch when searching for a vector #92431

Closed
christian-schlichtherle opened this issue Dec 18, 2022 · 16 comments
Labels
>bug, :Search Relevance/Vectors, Team:Search Relevance

Comments

@christian-schlichtherle

Elasticsearch Version

8.5.3

Installed Plugins

No response

Java Version

openjdk version "19.0.1" 2022-10-18
OpenJDK Runtime Environment (build 19.0.1+10-21)
OpenJDK 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)

OS Version

Linux test-es-default-0 5.15.0-1026-aws #30-Ubuntu SMP Wed Nov 23 17:01:09 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Problem Description

I'm trying to search for a vector with 128 scalars. To reproduce this issue, you can use the following test request, where all scalars have the same value, 0.1234:

SCALAR=0.1234
curl \
  --data-raw "
{
  \"knn\": {
    \"field\": \"$ELASTIC_FIELD\",
    \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
    \"k\": 10,
    \"num_candidates\": 100
  }
}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --silent \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --verbose \
  http://$ELASTIC_HOST:9200/$ELASTIC_INDEX/_knn_search?pretty
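
For reference, the inline loop just expands to a JSON array containing 128 copies of $SCALAR; here is an equivalent way to build the vector (a sketch, assuming bash):

# Build the 128-element query vector as a comma-separated string.
# printf repeats the format once per argument; %.0s consumes an argument without printing it.
SCALAR=0.1234
QUERY_VECTOR=$(printf "$SCALAR,%.0s" $(seq 128) | sed 's/,$//')
echo "[$QUERY_VECTOR]"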

I'm running this test on four different clients, which are located in different networks. In the first two networks, I get a response (not included here for brevity). In the third network, I get no response from the server at all. Here's a transcript before I have to interrupt using ^C:

$ curl <parameters>
*   Trying 10.42.1.207:9200...
* Connected to 10.42.1.207 (10.42.1.207) port 9200 (#0)
* Server auth using Basic with user 'elastic'
> POST /<index>/_knn_search?pretty HTTP/1.1
> Host: 10.42.1.207:9200
> Authorization: Basic <oops>
> User-Agent: curl/7.81.0
> Accept: application/json
> Content-Type: application/json
> Content-Length: 1132
> 
^C

So the connection is established and the request is sent, but no response ever arrives.

This looks like a network issue, so I checked the network configuration: DNS, address ranges, MTU, packet filters etc. The networks all look very similar, nothing suspicious.
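
For the MTU part, the check was along these lines (a sketch; the host variable and probe size are just placeholders):

# Forbid fragmentation (-M do) and probe with a fixed ICMP payload size;
# lowering -s until the ping succeeds gives an estimate of the path MTU.
ping -c 3 -M do -s 1400 $ELASTIC_HOST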

Next, I set SCALAR=0.123, one digit less. Now guess what: it works on all four clients in all three networks! I get a complete response (not included here for brevity).

Is it possible that Elasticsearch is allergic to some specific combination of request and network?

Steps to Reproduce

See above.

Logs (if relevant)

See above.

@christian-schlichtherle added the >bug and needs:triage labels on Dec 18, 2022
@christian-schlichtherle
Author

PS: I've refactored the query to the new (non-deprecated) form, but it makes no difference. Included for reference nevertheless:

curl \
  --data-raw "
{
  \"knn\": {
    \"field\": \"$ELASTIC_FIELD\",
    \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
    \"k\": 10,
    \"num_candidates\": 100
  }
}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --silent \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --verbose \
  http://$ELASTIC_HOST:9200/$ELASTIC_INDEX/_search?pretty

@DJRickyB added the :Search Relevance/Vectors label and removed the needs:triage label on Dec 19, 2022
@elasticsearchmachine added the Team:Search label on Dec 19, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@javanna
Member

javanna commented Dec 20, 2022

hi @christian-schlichtherle thanks for reporting this. Could you post the exact query you are sending to Elasticsearch? Do your clients connect directly to Elasticsearch?

@christian-schlichtherle
Author

Here we go:

ELASTIC_FIELD=my_vector
ELASTIC_INDEX=my_vectors
ELASTIC_PASSWORD=super-secret
ELASTIC_URL=http://test-es-http:9200
ELASTIC_USERNAME=elastic
SCALAR=0.123456789012345678

curl \
  --data "{
  \"knn\": {
    \"field\": \"$ELASTIC_FIELD\",
    \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
    \"k\": 10,
    \"num_candidates\": 100
  }
}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --silent \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --verbose \
  $ELASTIC_URL/$ELASTIC_INDEX/_search?pretty

Here's a transcript - note that I had to type Ctrl-C to interrupt:

$ curl \
>   --data "{
>   \"knn\": {
>     \"field\": \"$ELASTIC_FIELD\",
>     \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
>     \"k\": 10,
>     \"num_candidates\": 100
>   }
> }" \
>   --header 'Accept: application/json' \
>   --header 'Content-Type: application/json' \
>   --silent \
>   --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
>   --verbose \
>   $ELASTIC_URL/$ELASTIC_INDEX/_search?pretty
*   Trying 10.43.14.45:9200...
* TCP_NODELAY set
* Connected to test-es-http (10.43.14.45) port 9200 (#0)
* Server auth using Basic with user 'elastic'
> POST /my_vectors/_search?pretty HTTP/1.1
> Host: test-es-http:9200
> Authorization: Basic <oops>
> User-Agent: curl/7.68.0
> Accept: application/json
> Content-Type: application/json
> Content-Length: 2923
> Expect: 100-continue
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
^C

As you can see, after the HTTP/1.1 100 Continue, nothing more is received.

For context: we are running ES on K8s. In our development cluster, we run ES clients in four pods across three different LANs. It's an IoT project; in production the clients run in customer LANs, so we use OpenVPN to encrypt all K8s traffic between the IoT devices and the cloud. There are no network issues with other K8s workloads, and this query works fine in two of our three test LANs. I have checked the network configuration of all LANs and it's consistent.

This makes me think that there is something about the network configuration that somehow triggers a bug in the ES server. Otherwise, why would the request consistently hang in only one of the three LANs?

@christian-schlichtherle
Author

Also, I have defined an ingress using the AWS Load Balancer Controller and then used the ingress address in the ELASTIC_URL instead of the K8s service address. This works consistently in all three LANs. The obvious reason is that now all requests from all client locations are routed through an AWS ALB, so the ES server sees the requests coming from the AWS ALB. This workaround also suggests that ES is somehow biased against some client network configurations.

Of course, I want to get rid of this workaround as routing all requests through the AWS ALB adds unnecessary latency and costs.

@javanna
Member

javanna commented Dec 20, 2022

The problem with this type of issue is reproducing it. It could be a bug in the code, like a listener that never gets called, but that would not depend on the network configuration.

@elastic/es-distributed have you seen similar things in the past? Any idea how to further diagnose?

@christian-schlichtherle
Author

Understood! I'm also trying to isolate the problem further.

@christian-schlichtherle
Author

Some more context: the above query works fine on all pods in all LANs if I set SCALAR=0.12345, which results in a request header of Content-Length: 1258. If I add one more digit, it stops working (hangs) and the Content-Length increases to 1386. This suggests that the issue may be related to the size of the request body.
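
A sketch of how the threshold can be narrowed down (the --max-time option just keeps a hanging request from blocking the loop forever):

# Vary the number of digits in SCALAR and report the resulting body size and HTTP status.
for SCALAR in 0.1 0.12 0.123 0.1234 0.12345 0.123456; do
  BODY="{\"knn\":{\"field\":\"$ELASTIC_FIELD\",\"query_vector\":[$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],\"k\":10,\"num_candidates\":100}}"
  echo -n "SCALAR=$SCALAR body=${#BODY} bytes -> "
  curl --silent --max-time 10 --output /dev/null --write-out '%{http_code}\n' \
    --header 'Content-Type: application/json' \
    --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
    --data "$BODY" \
    "$ELASTIC_URL/$ELASTIC_INDEX/_search"
done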

@christian-schlichtherle
Author

christian-schlichtherle commented Dec 20, 2022

More context: I have been following the instructions on this page to enable request/response tracing. The result is that I cannot find any trace of the request when it actually hangs. I do get traces of requests and responses when the request is small enough, e.g. when SCALAR=0.12345, or when I use a pod in a LAN where it always works.

I suspect the explanation is that when the request hangs for some reason, it never arrives completely, so it is never logged, never processed and therefore never answered.
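
For completeness, the tracing I enabled amounts to turning on the HTTP tracer logger via the cluster settings API (a sketch based on my reading of the logging docs; the logger name may need adjusting for your version):

# Enable TRACE logging for incoming REST requests and responses.
curl -X PUT \
  --header 'Content-Type: application/json' \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --data '{"persistent": {"logger.org.elasticsearch.http.HttpTracer": "TRACE"}}' \
  "$ELASTIC_URL/_cluster/settings"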

@javanna
Member

javanna commented Dec 21, 2022

Thanks for the effort in reproducing the issue and understanding the cause. Based on your last comment about HTTP request tracing, I would first verify that the request actually makes it to Elasticsearch. The idea of a missing listener callback sounds less likely, because then we would see the request come in. We log HTTP requests as the very first step, before they are dispatched internally, hence it is very surprising not to see the trace log.

@christian-schlichtherle
Author

Well, what can I do? As you can see from the transcript above, I can connect and send the request, so I guess this rules out any issues with firewalls, security groups, packet filters etc. Even if it didn't, I still do not understand why it works in two of the three LANs.

@javanna
Member

javanna commented Dec 21, 2022

Maybe a crazy idea, but could you send that same request to any endpoint other than _search (even an unsupported one) and see if you get anything back?

@christian-schlichtherle
Author

Yes, I can send requests to other endpoints from anywhere and they work just fine. I can even search for vectors from anywhere and it works, for example the above request works fine if I set SCALAR=0.12345. This suggests that the issue depends on the combination of a specific network and a specific request body size.

@javanna
Member

javanna commented Dec 22, 2022

I am not sure I explained well what I meant. Could you send the same request that hangs to any endpoint other than search? Do you get a timely response (hopefully an error) back from Elasticsearch or does the request hang?

@christian-schlichtherle
Author

christian-schlichtherle commented Dec 23, 2022

Good point! So I sent the same request to the endpoint /_cluster/health and... it hangs (on the same pod where the /_search request hangs)! If I make the request smaller using SCALAR=0.12345 then I get a 405 Method Not Allowed (because I used POST, not GET).
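
Concretely, the test was something like this (same inline loop as above, just pointed at a different endpoint):

# Send the identical (large) knn body to a non-search endpoint; a small body gets a
# prompt 405 Method Not Allowed, while the large body hangs on the affected pod.
curl --verbose \
  --header 'Content-Type: application/json' \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --data "{\"knn\":{\"field\":\"$ELASTIC_FIELD\",\"query_vector\":[$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],\"k\":10,\"num_candidates\":100}}" \
  "$ELASTIC_URL/_cluster/health?pretty"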

So now we know there is an issue with larger request body sizes when sending requests from specific networks. The endpoint doesn't matter and there is no TRACE of the request in the ES logs when it hangs.

I can also rule out K8s services as the culprit because the behavior is the same when I address the IP and port of the test-es-default-0 or test-es-default-1 pods directly (I have two nodes here) instead of just addressing the test-es-http service.

@javanna
Member

javanna commented Jan 11, 2023

That seems to exclude software bugs in the search endpoint itself. And again, if you don't see the request trace in the logs, I would look outside of Elasticsearch and make sure that the request does make it to Elasticsearch. I am closing this issue as we have no evidence that this is an Elasticsearch issue. Let us know if we can help any further.
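
For example, one way to confirm whether the full request body actually reaches the node is to capture traffic on the Elasticsearch host while re-sending the hanging request (a sketch; run it on the node or in its network namespace):

# Capture port 9200 traffic; if the large request body never shows up here, the problem
# is upstream of Elasticsearch (VPN, MTU/MSS, packet filtering, etc.).
tcpdump -i any -nn -A 'tcp port 9200'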

@javanna closed this as not planned on Jan 11, 2023
@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024