Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mango queries should not use other index when use_index is specified #4511

Closed
paultranvan opened this issue Apr 4, 2023 · 6 comments · Fixed by #4792
Closed

Mango queries should not use other index when use_index is specified #4511

paultranvan opened this issue Apr 4, 2023 · 6 comments · Fixed by #4792

Comments

@paultranvan
Copy link

paultranvan commented Apr 4, 2023

Description

When a mango query includes a use_index option and the specified index does not exist, CouchDB might fallback on an existing index, which can cause unexpected behavior.

Steps to Reproduce

Consider the following mango query:

{
   "selector": {
      "worker": "service",
      "$or": [
         {
            "state": "queued"
         },
         {
            "state": "running"
         }
      ]
   },
   "sort": [
      {
         "worker": "asc"
      }
   ],
   "use_index": "_design/by-worker"
}

Now suppose the _design/by-worker index does not exist, but there is an existing one on the same fields:

{
    "index": {
        "fields": ["worker", "state"]
    },
    "ddoc": "_design/by-worker-and-state",
    "type" : "json"
}

The _design/by-worker-and-state index will actually be used. However, this can be problematic if the database is quite huge: the $or will be evaluated in-memory, which could result in timeouts as the whole database will be scanned.

In our client logic, we first run the mango query and expect an error, if the index does not exist. Then, we create the index, and run the query again. This avoids to check if the index exists for every mango query, as it will be the case most of the time.
See the logic here.

In our example, if the index does not exist, we would create one with a partial filter:

{
    "index": {
        "fields": ["worker"],
        "partial_filter_selector": {
          "$or": [
             {
                "state": "queued"
             },
             {
                "state": "running"
             }
           ]
        }
    },
    "ddoc": "_design/by-worker-and-partial-state",
    "type" : "json"
}

This avoids to evaluate the $or at query time, and potential timeouts. But because of the existing index being used, we never have the opportunity to create the actual and efficient index.

Expected Behaviour

We expect any mango query including a use_index to immediately fails if the index does not exist. This seems reasonable to force the use of a particular index without any fallback, as it might lead to sub-optimal queries.

Your Environment

  • CouchDB version used: 3.2.1
  • Browser name and version: Not relevant
  • Operating system and version: Not relevant

Additional Context

This issue can be convoluted by specifying a sort on fields that are not used on the existing index, but this is not really a solution for us.

taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 4, 2023
  The in-progress jobs query for the export feature relies on the
  presence of a very specific index to be fast on cozies with a lot of
  `io.cozy.jobs` documents.

  To make sure the index presence detection goes well, we forced a sort
  on the `worker` attribute.
  However, since `cozy-stack` uses an index on `worker` and `state`
  only, CouchDB fallbacks to this index instead of detecting the index
  we want does not exist and runs the slow `$or` query in-memory.
  See apache/couchdb#4511 for more details on
  this issue.

  To avoid this fallback we will sort on a second attribute not present
  in the existing `cozy-stack` index: `message.slug`.
taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 4, 2023
  The in-progress jobs query for the export feature relies on the
  presence of a very specific index to be fast on cozies with a lot of
  `io.cozy.jobs` documents.

  To make sure the index presence detection goes well, we forced a sort
  on the `worker` attribute.
  However, since `cozy-stack` uses an index on `worker` and `state`
  only, CouchDB fallbacks to this index instead of detecting the index
  we want does not exist and runs the slow `$or` query in-memory.
  See apache/couchdb#4511 for more details on
  this issue.

  To prevent this fallback we will sort on a second attribute not present
  in the existing `cozy-stack` index: `message.slug`.
@pgj
Copy link
Contributor

pgj commented Apr 4, 2023

I would like to work on this ticket. Committers, please assign it to me.

taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 5, 2023
  The in-progress jobs query for the export feature relies on the
  presence of a very specific index to be fast on cozies with a lot of
  `io.cozy.jobs` documents.

  To make sure the index presence detection goes well, we forced a sort
  on the `worker` attribute.
  However, since `cozy-stack` uses an index on `worker` and `state`
  only, CouchDB fallbacks to this index instead of detecting the index
  we want does not exist and runs the slow `$or` query in-memory.
  See apache/couchdb#4511 for more details on
  this issue.

  To prevent this fallback we will sort on a second attribute not present
  in the existing `cozy-stack` index: `message.slug`.
taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 5, 2023
The in-progress jobs query for the export feature relies on the
presence of a very specific index to be fast on cozies with a lot of
`io.cozy.jobs` documents.

To make sure the index presence detection goes well, we forced a sort
on the `worker` attribute.
However, since `cozy-stack` uses an index on `worker` and `state`
only, CouchDB fallbacks to this index instead of detecting the index
we want does not exist and runs the slow `$or` query in-memory.
See apache/couchdb#4511 for more details on
this issue.

To prevent this fallback we will sort on a second attribute not present
in the existing `cozy-stack` index: `message.slug`.

```
### 🐛 Bug Fixes

* Prevent fallback on sub-optimal index for in-progress jobs query.
```
taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 5, 2023
  The in-progress jobs query for the export feature relies on the
  presence of a very specific index to be fast on cozies with a lot of
  `io.cozy.jobs` documents.

  To make sure the index presence detection goes well, we forced a sort
  on the `worker` attribute.
  However, since `cozy-stack` uses an index on `worker` and `state`
  only, CouchDB fallbacks to this index instead of detecting the index
  we want does not exist and runs the slow `$or` query in-memory.
  See apache/couchdb#4511 for more details on
  this issue.

  To prevent this fallback we will sort on a second attribute not present
  in the existing `cozy-stack` index: `message.slug`.
taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 5, 2023
Backport of #2644 

The in-progress jobs query for the export feature relies on the
presence of a very specific index to be fast on cozies with a lot of
`io.cozy.jobs` documents.

To make sure the index presence detection goes well, we forced a sort
on the `worker` attribute.
However, since `cozy-stack` uses an index on `worker` and `state`
only, CouchDB fallbacks to this index instead of detecting the index
we want does not exist and runs the slow `$or` query in-memory.
See apache/couchdb#4511 for more details on
this issue.

To prevent this fallback we will sort on a second attribute not present
in the existing `cozy-stack` index: `message.slug`.

```
### 🐛 Bug Fixes

* Prevent fallback on sub-optimal index for in-progress jobs query.
```
taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 5, 2023
  The in-progress jobs query for the export feature relies on the
  presence of a very specific index to be fast on cozies with a lot of
  `io.cozy.jobs` documents.

  To make sure the index presence detection goes well, we forced a sort
  on the `worker` attribute.
  However, since `cozy-stack` uses an index on `worker` and `state`
  only, CouchDB fallbacks to this index instead of detecting the index
  we want does not exist and runs the slow `$or` query in-memory.
  See apache/couchdb#4511 for more details on
  this issue.

  To prevent this fallback we will sort on a second attribute not present
  in the existing `cozy-stack` index: `message.slug`.
taratatach added a commit to cozy/cozy-banks that referenced this issue Apr 5, 2023
Backport of #2644 

The in-progress jobs query for the export feature relies on the
presence of a very specific index to be fast on cozies with a lot of
`io.cozy.jobs` documents.

To make sure the index presence detection goes well, we forced a sort
on the `worker` attribute.
However, since `cozy-stack` uses an index on `worker` and `state`
only, CouchDB fallbacks to this index instead of detecting the index
we want does not exist and runs the slow `$or` query in-memory.
See apache/couchdb#4511 for more details on
this issue.

To prevent this fallback we will sort on a second attribute not present
in the existing `cozy-stack` index: `message.slug`.

```
### 🐛 Bug Fixes

* Prevent fallback on sub-optimal index for in-progress jobs query.
```
@pgj
Copy link
Contributor

pgj commented Apr 6, 2023

Hi @paultranvan, I have studied the source code and it seems this behavior is intended (see #962). However, I see what you want to achieve here and why this behavior causes a problem.

On the side of CouchDB, a possible solution would be to adjust the API so that the desired semantics could be requested. That is, by passing, say use_index={"candidates": ["_design/by_worker"], "strict": true} will result HTTP 400 (Bad Parameters) when the requested index is missing.

But I think the whole issue could be managed on the client side. Why not to store the list of known indexes there and use it to decide on each query if an index has been in use? This might be even faster as you would not have to look out for errors constantly but get the desired result by a single query. When needed, the list of known indexes could be initialized / synced from a CouchDB /{db}/_index GET request.

@paultranvan
Copy link
Author

Hi @pgj, thank you for your answer.

Yes it could be handled by the client, but it is quite cumbersome because you have to deal with this list state and update it when there are add/edit/remove on indexes. Typically, if an index is not in the list, you don't know if it's because it does not exist yet or simply because the list is not up-to-date, which could happen if there are several clients.

That's why I prefer the idea of adjusting the API e.g. with a strict: true param :)

@willholley
Copy link
Member

I like the idea of an optional behaviour to require strict index usage, and probably should have a toggle for the default behaviour in default.ini as well as an API option. I worry a bit about overloading the type in use_index as I know this causes problems with some clients as well as being difficult to specify in OpenAPI, though I know it is already a bit confused in allowing either a string or array. A distinct strict_index_selection or similar field would likely be simpler in that regard, and I wonder whether we might want to use it to toggle index fall-back behaviour more generally (i.e. error instead of using all_docs if no index is available to service a query)?

@pgj
Copy link
Contributor

pgj commented Apr 11, 2023

I proposed the API option because that can be controlled per query not per deployment. My impression is that users are mostly fine with the original behavior but there are use cases (like the one featured in this issue) where "more strictness" would be preferred. Nevertheless I can be fine with controlling the default from default.ini and let it be overridden per-query, which could then work in both directions.

I fully accept the criticism about overloading use_index even more. Adding a separate flag is indeed a cleaner approach. Extending the approach to the generic case can also be a good move -- similarly to this use case, there may be others where it is not preferred to execute suboptimal queries but return an error.

pgj added a commit to pgj/couchdb that referenced this issue Jul 30, 2023
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, automated overrides can still happen.

Introduce the concept of "strict index selection" which lets the
user to request the exclusive use of a specific index.  When it is
not possible, give up on planning and return an HTTP 400 response
right away.  This way the user has the chance to learn about the
missing index, request its creation and try again later.

The feature comes with a configuration toggle.  By default, the
feature is disabled to maintain backward compatibility, but the
user may ask for this behavior via the new `use_index_strict`
query parameter for a specific query.  Note that, similarly,
`use_index_strict` could be used to disable strict index selection
temporarily when it is enabled on the server.

Fixes apache#4511
pgj added a commit to pgj/couchdb that referenced this issue Jul 30, 2023
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, automated overrides can still happen.

Introduce the concept of "strict index selection" which lets the
user to request the exclusive use of a specific index.  When it is
not possible, give up on planning and return an HTTP 400 response
right away.  This way the user has the chance to learn about the
missing index, request its creation and try again later.

The feature comes with a configuration toggle.  By default, the
feature is disabled to maintain backward compatibility, but the
user may ask for this behavior via the new `use_index_strict`
query parameter for a specific query.  Note that, similarly,
`use_index_strict` could be used to disable strict index selection
temporarily when it is enabled on the server.

Fixes apache#4511
pgj added a commit to pgj/couchdb that referenced this issue Aug 1, 2023
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, automated overrides can still happen.

Introduce the concept of "strict index selection" which lets the
user to request the exclusive use of a specific index.  When it is
not possible, give up on planning and return an HTTP 400 response
right away.  This way the user has the chance to learn about the
missing index, request its creation and try again later.

The feature comes with a configuration toggle.  By default, the
feature is disabled to maintain backward compatibility, but the
user may ask for this behavior via the new `use_index_strict`
query parameter for a specific query.  Note that, similarly,
`use_index_strict` could be used to disable strict index selection
temporarily when it is enabled on the server.

Fixes apache#4511
pgj added a commit to pgj/couchdb that referenced this issue Oct 4, 2023
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, fallbacks may still happen.

Introduce the `allow_fallback` flag which can be used to tell if
falling back to other indexes is acceptable when an index is
explicitly specified by the user.  When set to `false`, give up
on planning and return an HTTP 400 response right away.  This way
the user has the chance to learn about the requested but missing
index, optionally create it and try again.

By default, fallbacks are allowed to maintain backwards
compatibility.  It is possible to set `allow_fallback` to `true`
but currently it coincides with the default behavior hence becomes
a no-op in practice.

Fixes apache#4511
pgj added a commit to pgj/couchdb that referenced this issue Oct 4, 2023
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, fallbacks may still happen.

Introduce the `allow_fallback` flag which can be used to tell if
falling back to other indexes is acceptable when an index is
explicitly specified by the user.  When set to `false`, give up
on planning and return an HTTP 400 response right away.  This way
the user has the chance to learn about the requested but missing
index, optionally create it and try again.

By default, fallbacks are allowed to maintain backwards
compatibility.  It is possible to set `allow_fallback` to `true`
but currently it coincides with the default behavior hence becomes
a no-op in practice.

Fixes apache#4511
pgj added a commit to pgj/couchdb that referenced this issue Oct 5, 2023
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, fallbacks may still happen.

Introduce the `allow_fallback` flag which can be used to tell if
falling back to other indexes is acceptable when an index is
explicitly specified by the user.  When set to `false`, give up
on planning and return an HTTP 400 response right away.  This way
the user has the chance to learn about the requested but missing
index, optionally create it and try again.

By default, fallbacks are allowed to maintain backwards
compatibility.  It is possible to set `allow_fallback` to `true`
but currently it coincides with the default behavior hence becomes
a no-op in practice.

Fixes apache#4511
pgj added a commit to pgj/couchdb that referenced this issue Sep 30, 2024
It is not always beneficial for the performance if the Mango query
planner tries to assign an index to the selector.  User-specified
indexes may save the day, but since they are only hints for the
planner, fallbacks may still happen.

Introduce the `allow_fallback` flag which can be used to tell if
falling back to other indexes is acceptable when an index is
explicitly specified by the user.  When set to `false`, give up
on planning and return an HTTP 400 response right away.  This way
the user has the chance to learn about the requested but missing
index, optionally create it and try again.

By default, fallbacks are allowed to maintain backwards
compatibility.  It is possible to set `allow_fallback` to `true`
but currently it coincides with the default behavior hence becomes
a no-op in practice.

Fixes apache#4511
@paultranvan
Copy link
Author

Thanks a lot @pgj !
This will definitely be useful

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants