
No response from ElasticSearch when searching for a vector #92431

Closed
christian-schlichtherle opened this issue Dec 18, 2022 · 16 comments
Labels
>bug, :Search Relevance/Vectors, Team:Search Relevance

Comments

@christian-schlichtherle

Elasticsearch Version

8.5.3

Installed Plugins

No response

Java Version

openjdk version "19.0.1" 2022-10-18
OpenJDK Runtime Environment (build 19.0.1+10-21)
OpenJDK 64-Bit Server VM (build 19.0.1+10-21, mixed mode, sharing)

OS Version

Linux test-es-default-0 5.15.0-1026-aws #30-Ubuntu SMP Wed Nov 23 17:01:09 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

Problem Description

I'm trying to search for a vector with 128 scalars. To reproduce this issue, you can use the following test request, where all scalars have the same value, 0.1234:

SCALAR=0.1234
curl \
  --data-raw "
{
  \"knn\": {
    \"field\": \"$ELASTIC_FIELD\",
    \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
    \"k\": 10,
    \"num_candidates\": 100
  }
}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --silent \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --verbose \
  http://$ELASTIC_HOST:9200/$ELASTIC_INDEX/_knn_search?pretty
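
For reference, the inline loop just expands to a JSON array containing 128 copies of $SCALAR; here is an equivalent way to build the vector (a sketch, assuming bash):

# Build the 128-element query vector as a comma-separated string.
# printf repeats the format once per argument; %.0s consumes an argument without printing it.
SCALAR=0.1234
QUERY_VECTOR=$(printf "$SCALAR,%.0s" $(seq 128) | sed 's/,$//')
echo "[$QUERY_VECTOR]"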

I'm running this test on four different clients, which are located in different networks. In the first two networks, I get a response (not included here for brevity). In the third network, I get no response from the server at all. Here's a transcript before I have to interrupt using ^C:

$ curl <parameters>
*   Trying 10.42.1.207:9200...
* Connected to 10.42.1.207 (10.42.1.207) port 9200 (#0)
* Server auth using Basic with user 'elastic'
> POST /<index>/_knn_search?pretty HTTP/1.1
> Host: 10.42.1.207:9200
> Authorization: Basic <oops>
> User-Agent: curl/7.81.0
> Accept: application/json
> Content-Type: application/json
> Content-Length: 1132
> 
^C

So the connection is established and the request is sent, but no response ever arrives.

This looks like a network issue, so I checked the network configuration: DNS, address ranges, MTU, packet filters etc. The networks all look very similar, nothing suspicious.
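
For the MTU part, the check was along these lines (a sketch; the host variable and probe size are just placeholders):

# Forbid fragmentation (-M do) and probe with a fixed ICMP payload size;
# lowering -s until the ping succeeds gives an estimate of the path MTU.
ping -c 3 -M do -s 1400 $ELASTIC_HOST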

Next, I set SCALAR=0.123, one digit less. Now guess what: it works on all four clients in all three networks! I get a complete response (not included here for brevity).

Is it possible that Elasticsearch is allergic to some specific combination of request and network?

Steps to Reproduce

See above.

Logs (if relevant)

See above.

@christian-schlichtherle added the >bug and needs:triage labels on Dec 18, 2022
@christian-schlichtherle
Author

PS: I've refactored the query to the new (non-deprecated) form, but it makes no difference. Included for reference nevertheless:

curl \
  --data-raw "
{
  \"knn\": {
    \"field\": \"$ELASTIC_FIELD\",
    \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
    \"k\": 10,
    \"num_candidates\": 100
  }
}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --silent \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --verbose \
  http://$ELASTIC_HOST:9200/$ELASTIC_INDEX/_search?pretty

@DJRickyB added the :Search Relevance/Vectors label and removed the needs:triage label on Dec 19, 2022
@elasticsearchmachine added the Team:Search label on Dec 19, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@javanna
Member

javanna commented Dec 20, 2022

hi @christian-schlichtherle thanks for reporting this. Could you post the exact query you are sending to Elasticsearch? Do your clients connect directly to Elasticsearch?

@christian-schlichtherle
Author

Here we go:

ELASTIC_FIELD=my_vector
ELASTIC_INDEX=my_vectors
ELASTIC_PASSWORD=super-secret
ELASTIC_URL=http://test-es-http:9200
ELASTIC_USERNAME=elastic
SCALAR=0.123456789012345678

curl \
  --data "{
  \"knn\": {
    \"field\": \"$ELASTIC_FIELD\",
    \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
    \"k\": 10,
    \"num_candidates\": 100
  }
}" \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --silent \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --verbose \
  $ELASTIC_URL/$ELASTIC_INDEX/_search?pretty

Here's a transcript - note that I had to type Ctrl-C to interrupt:

$ curl \
>   --data "{
>   \"knn\": {
>     \"field\": \"$ELASTIC_FIELD\",
>     \"query_vector\": [$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],
>     \"k\": 10,
>     \"num_candidates\": 100
>   }
> }" \
>   --header 'Accept: application/json' \
>   --header 'Content-Type: application/json' \
>   --silent \
>   --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
>   --verbose \
>   $ELASTIC_URL/$ELASTIC_INDEX/_search?pretty
*   Trying 10.43.14.45:9200...
* TCP_NODELAY set
* Connected to test-es-http (10.43.14.45) port 9200 (#0)
* Server auth using Basic with user 'elastic'
> POST /my_vectors/_search?pretty HTTP/1.1
> Host: test-es-http:9200
> Authorization: Basic <oops>
> User-Agent: curl/7.68.0
> Accept: application/json
> Content-Type: application/json
> Content-Length: 2923
> Expect: 100-continue
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
^C

As you can see, after the HTTP/1.1 100 Continue, nothing more is received.

For context: we are running ES on K8s. In our development cluster, we run ES clients in four pods across three different LANs. It's an IoT project; in production the clients run in customer LANs, so we use OpenVPN to encrypt all K8s traffic between the IoT devices and the cloud. There are no network issues with other K8s workloads, and this query works fine in two of our three test LANs. I have checked the network configuration of all LANs and it's consistent.

This makes me think that there is something about the network configuration that somehow triggers a bug in the ES server. Otherwise, why would the request consistently hang in only one of the three LANs?

@christian-schlichtherle
Author

Also, I have defined an ingress using the AWS Load Balancer Controller and then used the ingress address in the ELASTIC_URL instead of the K8s service address. This works consistently in all three LANs. The obvious reason is that now all requests from all client locations are routed through an AWS ALB, so the ES server sees the requests coming from the AWS ALB. This workaround also suggests that ES is somehow biased against some client network configurations.

Of course, I want to get rid of this workaround as routing all requests through the AWS ALB adds unnecessary latency and costs.

@javanna
Member

javanna commented Dec 20, 2022

The problem with this type of issue is reproducing it. It could be a bug in the code, like a listener that never gets called, but that would not depend on the network configuration.

@elastic/es-distributed have you seen similar things in the past? Any idea how to further diagnose?

@christian-schlichtherle
Author

Understood! I'm also trying to isolate the problem further.

@christian-schlichtherle
Author

Some more context: the above query works fine on all pods in all LANs if I set SCALAR=0.12345, which results in a request header of Content-Length: 1258. If I add one more digit, it stops working (hangs) and the Content-Length increases to 1386. This suggests that the issue may be related to the size of the request body.
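
A sketch of how the threshold can be narrowed down (the --max-time option just keeps a hanging request from blocking the loop forever):

# Vary the number of digits in SCALAR and report the resulting body size and HTTP status.
for SCALAR in 0.1 0.12 0.123 0.1234 0.12345 0.123456; do
  BODY="{\"knn\":{\"field\":\"$ELASTIC_FIELD\",\"query_vector\":[$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],\"k\":10,\"num_candidates\":100}}"
  echo -n "SCALAR=$SCALAR body=${#BODY} bytes -> "
  curl --silent --max-time 10 --output /dev/null --write-out '%{http_code}\n' \
    --header 'Content-Type: application/json' \
    --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
    --data "$BODY" \
    "$ELASTIC_URL/$ELASTIC_INDEX/_search"
done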

@christian-schlichtherle
Author

christian-schlichtherle commented Dec 20, 2022

More context: I have been following the instructions on this page to enable request/response tracing. The result is that I cannot find any trace of the request when it actually hangs. I do get traces of requests and responses when the request is small enough, e.g. when SCALAR=0.12345, or when I use a pod in a LAN where it always works.

I suspect the explanation is that when the request hangs for some reason, it never arrives completely, so it is never logged, never processed and therefore never answered.
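
For completeness, the tracing I enabled amounts to turning on the HTTP tracer logger via the cluster settings API (a sketch based on my reading of the logging docs; the logger name may need adjusting for your version):

# Enable TRACE logging for incoming REST requests and responses.
curl -X PUT \
  --header 'Content-Type: application/json' \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --data '{"persistent": {"logger.org.elasticsearch.http.HttpTracer": "TRACE"}}' \
  "$ELASTIC_URL/_cluster/settings"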

@javanna
Member

javanna commented Dec 21, 2022

Thanks for the effort in reproducing the issue and understanding the cause. Based on your last comment about HTTP request tracing, I would first verify that the request actually makes it to Elasticsearch. The idea of a missing listener callback sounds less likely, because then we would see the request come in. We log HTTP requests as the very first step, before they are dispatched internally, hence it is very surprising not to see the trace log.

@christian-schlichtherle
Author

Well, what can I do? As you can see from the transcript above, I can connect and send the request, so I guess this rules out any issues with firewalls, security groups, packet filters etc. Even if it didn't, I still do not understand why it works in two of the three LANs.

@javanna
Member

javanna commented Dec 21, 2022

Maybe a crazy idea, but could you send that same request to any endpoint other than _search (even an unsupported one) and see if you get anything back?

@christian-schlichtherle
Author

Yes, I can send requests to other endpoints from anywhere and they work just fine. I can even search for vectors from anywhere and it works, for example the above request works fine if I set SCALAR=0.12345. This suggests that the issue depends on the combination of a specific network and a specific request body size.

@javanna
Member

javanna commented Dec 22, 2022

I am not sure I explained well what I meant. Could you send the same request that hangs to any endpoint other than search? Do you get a timely response (hopefully an error) back from Elasticsearch or does the request hang?

@christian-schlichtherle
Author

christian-schlichtherle commented Dec 23, 2022

Good point! So I sent the same request to the endpoint /_cluster/health and... it hangs (on the same pod where the /_search request hangs)! If I make the request smaller using SCALAR=0.12345 then I get a 405 Method Not Allowed (because I used POST, not GET).
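
Concretely, the test was something like this (same inline loop as above, just pointed at a different endpoint):

# Send the identical (large) knn body to a non-search endpoint; a small body gets a
# prompt 405 Method Not Allowed, while the large body hangs on the affected pod.
curl --verbose \
  --header 'Content-Type: application/json' \
  --user $ELASTIC_USERNAME:$ELASTIC_PASSWORD \
  --data "{\"knn\":{\"field\":\"$ELASTIC_FIELD\",\"query_vector\":[$(for i in $(seq 1 128); do echo -n $SCALAR; [ $i -lt 128 ] && echo -n ", "; done)],\"k\":10,\"num_candidates\":100}}" \
  "$ELASTIC_URL/_cluster/health?pretty"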

So now we know there is an issue with larger request body sizes when sending requests from specific networks. The endpoint doesn't matter and there is no TRACE of the request in the ES logs when it hangs.

I can also rule out K8s services as the culprit because the behavior is the same when I address the IP and port of the test-es-default-0 or test-es-default-1 pods directly (I have two nodes here) instead of just addressing the test-es-http service.

@javanna
Member

javanna commented Jan 11, 2023

That seems to exclude software bugs in the search endpoint itself. And again, if you don't see the request trace in the logs, I would look outside of Elasticsearch and make sure that the request does make it to Elasticsearch. I am closing this issue as we have no evidence that this is an Elasticsearch issue. Let us know if we can help any further.
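
For example, one way to confirm whether the full request body actually reaches the node is to capture traffic on the Elasticsearch host while re-sending the hanging request (a sketch; run it on the node or in its network namespace):

# Capture port 9200 traffic; if the large request body never shows up here, the problem
# is upstream of Elasticsearch (VPN, MTU/MSS, packet filtering, etc.).
tcpdump -i any -nn -A 'tcp port 9200'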

@javanna closed this as not planned on Jan 11, 2023
@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024