[Bug]: Elastic/Opensearch backend does not return all trace spans #4330
Comments
If max-doc-count is not set then the default in the code is 10'000, so I think this behavior is expected. |
The max doc count is set to 10000 because the Opensearch max search size is 10000, so by default Opensearch can't return more than 10000 hits in a single request. That is why the msearch runs in a loop, so that it gets all spans. But this is broken: right now you will get at most 40000 spans due to terminate_after. |
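To make the looping idea concrete, here is a minimal sketch of cursor-based paging in plain Python. It is not Jaeger's code and does not call Elasticsearch/Opensearch; `search_page` and `fetch_all` are hypothetical names, and the sketch assumes every document has a unique sort key.

```python
from typing import Iterator, List, Optional


def search_page(docs: List[int], size: int, search_after: Optional[int]) -> List[int]:
    """Hypothetical stand-in for one search request.

    Returns up to `size` sort keys strictly greater than `search_after`,
    in ascending order. Assumes sort keys are unique.
    """
    remaining = sorted(d for d in docs if search_after is None or d > search_after)
    return remaining[:size]


def fetch_all(docs: List[int], size: int = 10_000) -> Iterator[List[int]]:
    """Client-side loop: repeat the request, moving the cursor to the last
    sort key of the previous page, until a page comes back empty."""
    cursor: Optional[int] = None
    while True:
        page = search_page(docs, size, cursor)
        if not page:
            return
        yield page
        cursor = page[-1]


# 25,000 spans fetched in pages of at most 10,000 hits each.
pages = list(fetch_all(list(range(25_000))))
page_sizes = [len(p) for p in pages]
```

Each request stays within the 10,000-hit limit, yet the loop as a whole walks through every document, as long as the server keeps returning hits past the cursor.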
After further debugging I noticed that The log message of the last msearch with no returned hits
After more research I found why I am getting 40'000 Hits Total with My theory with the issue above is following. Therefore I would suggest to remove the |
But earlier you said:
So wouldn't you still have the exact same limitation? |
No. Jaeger does some sort of pagination by moving its search start point with the search_after parameter. So it loops through all results. The issue with Example: 1 Index with 4 Shards, each shard has 20000 documents. So a total of 80000 documents.
Without |
not sure I follow about step 5. If the looping is done on the client side, why wouldn't it continue in 2500 chunks? How does ES know that 10k docs from each shard were already loaded? Isn't every request in a loop independent? Or is this some kind of session in play? |
Imagine you have 4 arrays of dicts/hashes with 20000 entries each. Those are our shards. When searching with terminate_after, it takes the first 10000 entries of each array, puts them into a tmp array, and sorts them by a time field in the dict. This tmp array is the maximum number of search results you can get with pagination/search_after. Without terminate_after it would copy all 20000 entries of each array into the tmp array and sort it. You can then use pagination/search_after to get them. This is just an example to explain it. The client's loop ends because the server stops returning results due to the terminate_after limit. Does this explain it? Otherwise I could try giving you a real example tomorrow. |
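The analogy above can be written down as a toy model. This is only a sketch of the commenter's mental picture, not how Elasticsearch/Opensearch is implemented internally. `reachable_hits` is a hypothetical helper, and the only fact it relies on is that terminate_after is a per-shard cap on collected documents.

```python
from typing import List, Optional


def reachable_hits(shards: List[List[int]], terminate_after: Optional[int]) -> List[int]:
    """Toy model: each shard contributes at most `terminate_after` documents
    (all of them when the limit is None); the result is sorted by time."""
    collected: List[int] = []
    for shard in shards:
        limit = len(shard) if terminate_after is None else terminate_after
        collected.extend(shard[:limit])
    return sorted(collected)


# 1 index with 4 shards of 20,000 documents each; the values stand in for a time field.
shards = [list(range(start, 80_000, 4)) for start in range(4)]

with_limit = reachable_hits(shards, terminate_after=10_000)
without_limit = reachable_hits(shards, terminate_after=None)
```

In this model the capped search can never expose more than 4 x 10,000 = 40,000 documents, no matter how many pages the client requests, while the uncapped search exposes all 80,000.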
What are you basing this assumption on? That sounds like a pretty naive / stupid implementation to have in the database. ES builds inverted indices at insert time, so you don't have to sort at query time. So when query4 is sent with condition |
It was not an exact implementation example. I just wanted to explain in a simple way. I don't know the exact way how terminate_after works. All I know is that it is broken and prevents jaeger query from handling traces with a large amount of spans. I just noticed that the search results with
You can test it yourself or maybe change line 52 in plugin/storage/integration/elasticsearch_test.go to Do I need to paste each search Jaeger currently does to show you the results of it, or is the evidence from above enough? |
…ing in incomplete span count/list (#4336)
## Which problem is this PR solving?
Closes #4330
## Short description of the changes
- Remove TerminateAfter from Elasticsearch/Opensearch query resulting in incomplete span count/list.
---------
Signed-off-by: Jakob Hahn <[email protected]>
Co-authored-by: Albert <[email protected]>
What happened?
The Jaeger query is not resolving all spans with Elasticsearch/Opensearch Backend.
This happens because the _msearch request uses the terminate_after option in the body, which results in a wrong hits.total value. Due to the wrong hits.total value, the search loop stops before getting all spans.
Steps to reproduce
--es.max-doc-count untouched
Expected behavior
I would expect jaeger to resolve all spans from the trace.
Relevant log output
No response
Screenshot
No response
Additional context
terminate_after is set to max doc count.
https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/es/spanstore/reader.go#L208
The total hits is compared to the length of all saved trace span hits.
https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/es/spanstore/reader.go#L432
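A minimal sketch of why a wrong total ends the loop early, assuming the reader stops once the number of collected hits reaches the reported total. This is a simplified stand-in and not the code in reader.go: `read_spans` and `make_search` are hypothetical names, the fake backend pages by offset rather than by search_after, and the numbers are illustrative.

```python
from typing import Callable, List, Tuple

Page = Tuple[List[int], int]  # (hits in this page, reported hits.total)


def read_spans(search: Callable[[int], Page]) -> List[int]:
    """Collect pages until the collected count reaches the reported total
    or a page comes back empty."""
    collected: List[int] = []
    while True:
        hits, total = search(len(collected))
        collected.extend(hits)
        if not hits or len(collected) >= total:
            return collected


def make_search(docs: List[int], reported_total: int, size: int = 10_000) -> Callable[[int], Page]:
    """Hypothetical backend that returns `size` hits per call and always
    reports `reported_total` as hits.total."""
    def search(offset: int) -> Page:
        return docs[offset:offset + size], reported_total
    return search


spans = list(range(80_000))
truncated = read_spans(make_search(spans, reported_total=40_000))
complete = read_spans(make_search(spans, reported_total=80_000))
```

With an under-reported total the loop returns after 40,000 spans even though the trace holds 80,000; with the correct total it reads them all.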
Example of how terminate_after behaves with hits total:
Possible solutions I can think of:
_search and track_total_hits once before and match against the correct total hits value - additional time spent due to the extra request.
Jaeger backend version
No response
SDK
No response
Pipeline
No response
Storage backend
Opensearch 2.6.0
Operating system
Linux
Deployment model
Kubernetes
Deployment configs
No response