Commit

Provide users with warning about special characters in query DSL and API query fields (#1255)

* fix#1196_Spesh-Chars

Signed-off-by: cwillum <[email protected]>

cwillum authored Sep 27, 2022
1 parent bbd0f11 commit 1b69f70
Showing 3 changed files with 78 additions and 0 deletions.
35 changes: 35 additions & 0 deletions _opensearch/query-dsl/index.md
@@ -120,3 +120,38 @@ With query DSL, however, you can include an HTTP request body to look for result
}
```
The OpenSearch query DSL comes in three varieties: term-level queries, full-text queries, and boolean queries. You can even perform more complicated searches by using different elements from each variety to find whatever data you need.

## A note on Unicode special characters in text fields

The standard analyzer splits text at the word boundaries defined by Unicode text segmentation, so it cannot index a [text field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/text/) value as a whole value when that value includes a special character such as a hyphen. Instead, the analyzer parses the value as multiple tokens, one for each element on either side of the special character. This can lead to unintentional filtering of documents and potentially compromise control over their access.

The following examples show values containing a special character that the standard analyzer splits apart. Because of the hyphen/minus sign, the analyzer indexes `User-1` as the tokens `user` and `1` and `User-2` as the tokens `user` and `2`. Both values share the token `user`, so a `match` query for either value can match both users' documents, and the two different users for `user.id` are effectively treated as one and the same:

```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-1"
      }
    }
  }
}
```

```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-2"
      }
    }
  }
}
```
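
To check how the standard analyzer handles such a value, you can send it to the `_analyze` API. The following request is a minimal sketch rather than part of the examples above; it assumes a running cluster and uses only the built-in `standard` analyzer:

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "User-1"
}
```

The response should list two tokens, `user` and `1`, rather than the single value `User-1`.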

To avoid this problem when using either query DSL or the REST API, you can use a custom analyzer or map the field as `keyword`, which is indexed as a single unanalyzed value and supports exact-match searches. For the latter option, see [Keyword field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/keyword/).
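
As a sketch of the `keyword` option, the following request creates an index in which `user.id` is mapped as `keyword`. The index name `users` is hypothetical and chosen only for illustration:

```json
PUT /users
{
  "mappings": {
    "properties": {
      "user": {
        "properties": {
          "id": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
```

With this mapping, a `term` query on `user.id` for `User-1` is expected to match only documents whose value is exactly `User-1`:

```json
GET /users/_search
{
  "query": {
    "term": {
      "user.id": "User-1"
    }
  }
}
```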

For a list of characters that should be avoided when the field type is `text`, see [Word Boundaries](https://unicode.org/reports/tr29/#Word_Boundaries).

9 changes: 9 additions & 0 deletions _security-plugin/access-control/api.md
@@ -678,6 +678,15 @@ PUT _plugins/_security/api/roles/<role>
}
```

>The standard analyzer splits text at the word boundaries defined by Unicode text segmentation, so it cannot index a [text field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/text/) value as a whole value when that value includes a special character such as a hyphen. Instead, the analyzer parses the value as multiple tokens, one for each element on either side of the special character.
>
>For example, the values in `"user.id": "User-1"` and `"user.id": "User-2"` both contain the hyphen/minus sign. The analyzer splits each value at this character, and because both values then share the token `user`, a query on `user.id` can fail to distinguish between the two different users and treat them as one and the same. This can lead to unintentional filtering of documents and potentially compromise control over their access.
>
>To avoid this problem, you can use a custom analyzer or map the field as `keyword`, which is indexed as a single unanalyzed value and supports exact-match searches. For the latter option, see [Keyword field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/keyword/).
>
>For a list of characters that should be avoided when the field type is `text`, see [Word Boundaries](https://unicode.org/reports/tr29/#Word_Boundaries).
{: .warning}


### Patch role
Introduced 1.0
34 changes: 34 additions & 0 deletions _security-plugin/access-control/document-level-security.md
@@ -55,6 +55,40 @@ PUT _plugins/_security/api/roles/public_data
These queries can be as complex as you want, but we recommend keeping them simple to minimize the performance impact that the document-level security feature has on the cluster.
{: .warning }

### A note on Unicode special characters in text fields

The standard analyzer splits text at the word boundaries defined by Unicode text segmentation, so it cannot index a [text field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/text/) value as a whole value when that value includes a special character such as a hyphen. Instead, the analyzer parses the value as multiple tokens, one for each element on either side of the special character. This can lead to unintentional filtering of documents and potentially compromise control over their access.

The following examples show values containing a special character that the standard analyzer splits apart. Because of the hyphen/minus sign, the analyzer indexes `User-1` as the tokens `user` and `1` and `User-2` as the tokens `user` and `2`. Both values share the token `user`, so a `match` query for either value can match both users' documents, and the two different users for `user.id` are effectively treated as one and the same:

```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-1"
      }
    }
  }
}
```

```json
{
  "bool": {
    "must": {
      "match": {
        "user.id": "User-2"
      }
    }
  }
}
```

To avoid this problem when using either query DSL or the REST API, you can use a custom analyzer or map the field as `keyword`, which is indexed as a single unanalyzed value and supports exact-match searches. For the latter option, see [Keyword field type](https://opensearch.org/docs/2.2/opensearch/supported-field-types/keyword/).
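
For document-level security, the same idea can be sketched as a role whose DLS query is a `term` query on a `keyword` field. The role name `user_1_data` and the index pattern `users*` are hypothetical, and the example assumes that `user.id` is mapped as `keyword` in the matching indexes:

```json
PUT _plugins/_security/api/roles/user_1_data
{
  "index_permissions": [
    {
      "index_patterns": ["users*"],
      "dls": "{\"term\": {\"user.id\": \"User-1\"}}",
      "allowed_actions": ["read"]
    }
  ]
}
```

Because a `keyword` field is not analyzed, the hyphen stays part of the indexed value, so the query is expected to match only documents whose `user.id` is exactly `User-1`.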

For a list of characters that should be avoided when the field type is `text`, see [Word Boundaries](https://unicode.org/reports/tr29/#Word_Boundaries).


## Parameter substitution
