Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Unified search query syntax using the full-text search capabilities of the underlying DB #11635

Merged
merged 59 commits into from
Oct 25, 2022

Conversation

novocaine
Copy link
Contributor

@novocaine novocaine commented Dec 23, 2021

Support a unified search query syntax which leverages more of the full-text search of each DB we support.

We now support, with the same syntax across Postgresql and Sqlite:

  • quoted "search terms"
  • AND, OR, - (negation) set operators and parentheses
  • Matching words based on their stem, e.g. searches for "dog" matches documents containing "dogs".

This is achieved by

  • If on pgsql 11+, pass the user inputer to websearch_to_tsquery
  • If on an earlier version of pgsql or sqlite, manually parse the query and transform it into the underlying DB syntax supporting the same query dialect

Multiple terms separated by a space are implicitly ANDed.

Possible Issues:

  1. There is no escaping of full-text syntax that might be supported by the DB; e.g. NOT, NEAR, * in sqlite, &, | in pgsql. This runs the risk that people might discover this as accidental functionality and depend on something we don't guarantee.
  2. On both pgsql and sqllite, english text is assumed for stemming. To support other languages, it's work: either the target language needs to be known at the time of indexing the message (via room metadata, or otherwise), or it'll be necessary to create a separate index for each language supported. Alternatively, we could just disable stemming for now (but still use the other improvements in this PR).

Before

Prior to this PR, we transformed all full-text queries into a DB full-text query by

  1. splitting the query on all non-word characters (i.e. not in the python regex class \w)
  2. suffixing all the words with "*" to use prefix matching
  3. re-joining the words using '&' as the separator.

For example, the user input foo? bar is transformed into foo* & bar*. This happens for both sqlite and postgres.

This had some drawbacks -

  • On sqlite, & isn't actually an operator, its actually just being stripped and then the remaining spaces are interpreted to mean AND.
  • Stripping the query of non-word characters means quote search and NOT aren't available to users, despite both engines supporting these operators. Note that client-side seshat search does support these operators as its based on on tantivy
  • Prefix search probably isn't what users expect as that's not how search engines usually work; it probably interferes with the expected stemming behaviour; and it might have significant performance impact

Sqlite docs
Postgres docs

And for comparison with how Element clients support search, see the Tantivy docs

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
  • Pull request includes a sign off
  • Code style is correct
    (run the linters)

Sign-off

Signed-off-by: James Salter [email protected]

@@ -437,7 +440,7 @@ async def search_msgs(self, room_ids, search_term, keys):
"SELECT room_id, count(*) as count FROM event_search"
" WHERE value MATCH ?"
)
count_args = [search_term] + count_args
Copy link
Contributor Author

@novocaine novocaine Dec 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was a bug to pass search_term here rather than search_query..?

that can be passed to database.
We use this so that we can add prefix matching, which isn't something
that is supported by default.
that can be passed to sqllite's matchinfo().
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently a no-op, but we will probably want to add stuff here to handle certain cases e.g. #3024

@novocaine
Copy link
Contributor Author

novocaine commented Dec 23, 2021

There's a test failing on sqlite because it relies on some sort of prefixing (searches for labels to match label); this test is passing on postgres because there the database is doing stemming during parsing.

sqlite does actually support stemming, so I'm going to try to enable that to fix (this relies on a db migration, so might get tricky)

@MadLittleMods MadLittleMods added A-Message-Search Searching messages T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Feb 7, 2022
@clokep clokep removed request for a team and erikjohnston October 18, 2022 16:28
@clokep
Copy link
Member

clokep commented Oct 19, 2022

@erikjohnston any thoughts on the currently open comment threads?

@clokep clokep removed the request for review from erikjohnston October 24, 2022 13:55
Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise I think its good

" room_id, event_id"
" FROM event_search"
" WHERE vector @@ to_tsquery('english', ?)"
f" WHERE vector @@ {tsquery_func}('english', ?)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move these to multiline strings while we're here 😇

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning to do a follow-up PR to update the entire module to multi-line strings. Would that be acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

synapse/storage/databases/main/search.py Outdated Show resolved Hide resolved
@clokep clokep merged commit d902181 into matrix-org:develop Oct 25, 2022
DMRobertson pushed a commit that referenced this pull request Nov 2, 2022
…lities of the underlying DB. (#11635)"

This reverts commit d902181.
squahtx pushed a commit that referenced this pull request Nov 4, 2022
…h capabilities of the underlying DB. (#11635)""

This reverts commit 7e0dd52.
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Nov 14, 2022
Synapse 1.71.0 (2022-11-08)
===========================

Please note that, as announced in the release notes for Synapse
1.69.0, legacy Prometheus metric names are now disabled by default.
They will be removed altogether in Synapse 1.73.0.  If not already
done, server administrators should update their dashboards and
alerting rules to avoid using the deprecated metric names.  See the
[upgrade
notes](https://matrix-org.github.io/synapse/v1.71/upgrade.html#upgrading-to-v1710)
for more details.

**Note:** in line with our [deprecation
  policy](https://matrix-org.github.io/synapse/latest/deprecation_policy.html)
  for platform dependencies, this will be the last release to support
  PostgreSQL 10, which reaches upstream end-of-life on November 10th,
  2022. Future releases of Synapse will require PostgreSQL 11+.

Features
--------

- Support back-channel logouts from OpenID Connect
  providers. ([\#11414](matrix-org/synapse#11414))

- Allow use of Postgres and SQLlite full-text search operators in
  search
  queries. ([\#11635](matrix-org/synapse#11635),
  [\#14310](matrix-org/synapse#14310),
  [\#14311](matrix-org/synapse#14311))

- Implement
  [MSC3664](matrix-org/matrix-spec-proposals#3664),
  Pushrules for relations. Contributed by
  Nico. ([\#11804](matrix-org/synapse#11804))

- Improve aesthetics of HTML templates. Note that these changes do not
  retroactively apply to templates which have been
  [customised](https://matrix-org.github.io/synapse/latest/templates.html#templates)
  by server
  admins. ([\#13652](matrix-org/synapse#13652))

- Enable write-ahead logging for SQLite installations. Contributed by
  [@asymmetric](https://github.com/asymmetric). ([\#13897](matrix-org/synapse#13897))

- Show erasure status when [listing
  users](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html#query-user-account)
  in the Admin
  API. ([\#14205](matrix-org/synapse#14205))

- Provide a specific error code when a `/sync` request provides a
  filter which doesn't represent a JSON
  object. ([\#14262](matrix-org/synapse#14262))
bradtgmurray added a commit to beeper/synapse-legacy-fork that referenced this pull request Nov 15, 2022
Synapse 1.71.0 (2022-11-08)
===========================

Please note that, as announced in the release notes for Synapse 1.69.0, legacy Prometheus metric names are now disabled by default.
They will be removed altogether in Synapse 1.73.0.
If not already done, server administrators should update their dashboards and alerting rules to avoid using the deprecated metric names.
See the [upgrade notes](https://matrix-org.github.io/synapse/v1.71/upgrade.html#upgrading-to-v1710) for more details.

**Note:** in line with our [deprecation policy](https://matrix-org.github.io/synapse/latest/deprecation_policy.html) for platform dependencies, this will be the last release to support PostgreSQL 10, which reaches upstream end-of-life on November 10th, 2022. Future releases of Synapse will require PostgreSQL 11+.

No significant changes since 1.71.0rc2.

Synapse 1.71.0rc2 (2022-11-04)
==============================

Improved Documentation
----------------------

- Document the changes to monthly active user metrics due to deprecation of legacy Prometheus metric names. ([\matrix-org#14358](matrix-org#14358), [\matrix-org#14360](matrix-org#14360))

Deprecations and Removals
-------------------------

- Disable legacy Prometheus metric names by default. They can still be re-enabled for now, but they will be removed altogether in Synapse 1.73.0. ([\matrix-org#14353](matrix-org#14353))

Internal Changes
----------------

- Run unit tests against Python 3.11. ([\matrix-org#13812](matrix-org#13812))

Synapse 1.71.0rc1 (2022-11-01)
==============================

Features
--------

- Support back-channel logouts from OpenID Connect providers. ([\matrix-org#11414](matrix-org#11414))
- Allow use of Postgres and SQLlite full-text search operators in search queries. ([\matrix-org#11635](matrix-org#11635), [\matrix-org#14310](matrix-org#14310), [\matrix-org#14311](matrix-org#14311))
- Implement [MSC3664](matrix-org/matrix-spec-proposals#3664), Pushrules for relations. Contributed by Nico. ([\matrix-org#11804](matrix-org#11804))
- Improve aesthetics of HTML templates. Note that these changes do not retroactively apply to templates which have been [customised](https://matrix-org.github.io/synapse/latest/templates.html#templates) by server admins. ([\matrix-org#13652](matrix-org#13652))
- Enable write-ahead logging for SQLite installations. Contributed by [@asymmetric](https://github.com/asymmetric). ([\matrix-org#13897](matrix-org#13897))
- Show erasure status when [listing users](https://matrix-org.github.io/synapse/latest/admin_api/user_admin_api.html#query-user-account) in the Admin API. ([\matrix-org#14205](matrix-org#14205))
- Provide a specific error code when a `/sync` request provides a filter which doesn't represent a JSON object. ([\matrix-org#14262](matrix-org#14262))

Bugfixes
--------

- Fix a long-standing bug where the `update_synapse_database` script could not be run with multiple databases. Contributed by @thefinn93 @ Beeper. ([\matrix-org#13422](matrix-org#13422))
- Fix a bug which prevented setting an avatar on homeservers which have an explicit port in their `server_name` and have `max_avatar_size` and/or `allowed_avatar_mimetypes` configuration. Contributed by @ashfame. ([\matrix-org#13927](matrix-org#13927))
- Check appservice user interest against the local users instead of all users in the room to align with [MSC3905](matrix-org/matrix-spec-proposals#3905). ([\matrix-org#13958](matrix-org#13958))
- Fix a long-standing bug where Synapse would accidentally include extra information in the response to [`PUT /_matrix/federation/v2/invite/{roomId}/{eventId}`](https://spec.matrix.org/v1.4/server-server-api/#put_matrixfederationv2inviteroomideventid). ([\matrix-org#14064](matrix-org#14064))
- Fix a bug introduced in Synapse 1.64.0 where presence updates could be missing from `/sync` responses. ([\matrix-org#14243](matrix-org#14243))
- Fix a bug introduced in Synapse 1.60.0 which caused an error to be logged when Synapse received a SIGHUP signal if debug logging was enabled. ([\matrix-org#14258](matrix-org#14258))
- Prevent history insertion ([MSC2716](matrix-org/matrix-spec-proposals#2716)) during an partial join ([MSC3706](matrix-org/matrix-spec-proposals#3706)). ([\matrix-org#14291](matrix-org#14291))
- Fix a bug introduced in Synapse 1.34.0 where device names would be returned via a federation user key query request when `allow_device_name_lookup_over_federation` was set to `false`. ([\matrix-org#14304](matrix-org#14304))
- Fix a bug introduced in Synapse 0.34.0 where logs could include error spam when background processes are measured as taking a negative amount of time. ([\matrix-org#14323](matrix-org#14323))
- Fix a bug introduced in Synapse 1.70.0 where clients were unable to PUT new [dehydrated devices](matrix-org/matrix-spec-proposals#2697). ([\matrix-org#14336](matrix-org#14336))

Improved Documentation
----------------------

- Explain how to disable the use of [`trusted_key_servers`](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#trusted_key_servers). ([\matrix-org#13999](matrix-org#13999))
- Add workers settings to [configuration manual](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#individual-worker-configuration). ([\matrix-org#14086](matrix-org#14086))
- Correct the name of the config option [`encryption_enabled_by_default_for_room_type`](https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html#encryption_enabled_by_default_for_room_type). ([\matrix-org#14110](matrix-org#14110))
- Update docstrings of `SynapseError` and `FederationError` to bettter describe what they are used for and the effects of using them are. ([\matrix-org#14191](matrix-org#14191))

Internal Changes
----------------

- Remove unused `@lru_cache` decorator. ([\matrix-org#13595](matrix-org#13595))
- Save login tokens in database and prevent login token reuse. ([\matrix-org#13844](matrix-org#13844))
- Refactor OIDC tests to better mimic an actual OIDC provider. ([\matrix-org#13910](matrix-org#13910))
- Fix type annotation causing import time error in the Complement forking launcher. ([\matrix-org#14084](matrix-org#14084))
- Refactor [MSC3030](matrix-org/matrix-spec-proposals#3030) `/timestamp_to_event` endpoint to loop over federation destinations with standard pattern and error handling. ([\matrix-org#14096](matrix-org#14096))
- Add initial power level event to batch of bulk persisted events when creating a new room. ([\matrix-org#14228](matrix-org#14228))
- Refactor `/key/` endpoints to use `RestServlet` classes. ([\matrix-org#14229](matrix-org#14229))
- Switch to using the `matrix-org/backend-meta` version of `triage-incoming` for new issues in CI. ([\matrix-org#14230](matrix-org#14230))
- Build wheels on macos 11, not 10.15. ([\matrix-org#14249](matrix-org#14249))
- Add debugging to help diagnose lost device list updates. ([\matrix-org#14268](matrix-org#14268))
- Add Rust cache to CI for `trial` runs. ([\matrix-org#14287](matrix-org#14287))
- Improve type hinting of `RawHeaders`. ([\matrix-org#14303](matrix-org#14303))
- Use Poetry 1.2.0 in the Twisted Trunk CI job. ([\matrix-org#14305](matrix-org#14305))

<details>
<summary>Dependency updates</summary>

Runtime:

- Bump anyhow from 1.0.65 to 1.0.66. ([\matrix-org#14278](matrix-org#14278))
- Bump jinja2 from 3.0.3 to 3.1.2. ([\matrix-org#14271](matrix-org#14271))
- Bump prometheus-client from 0.14.0 to 0.15.0. ([\matrix-org#14274](matrix-org#14274))
- Bump psycopg2 from 2.9.4 to 2.9.5. ([\matrix-org#14331](matrix-org#14331))
- Bump pysaml2 from 7.1.2 to 7.2.1. ([\matrix-org#14270](matrix-org#14270))
- Bump sentry-sdk from 1.5.11 to 1.10.1. ([\matrix-org#14330](matrix-org#14330))
- Bump serde from 1.0.145 to 1.0.147. ([\matrix-org#14277](matrix-org#14277))
- Bump serde_json from 1.0.86 to 1.0.87. ([\matrix-org#14279](matrix-org#14279))

Tooling and CI:

- Bump black from 22.3.0 to 22.10.0. ([\matrix-org#14328](matrix-org#14328))
- Bump flake8-bugbear from 21.3.2 to 22.9.23. ([\matrix-org#14042](matrix-org#14042))
- Bump peaceiris/actions-gh-pages from 3.8.0 to 3.9.0. ([\matrix-org#14276](matrix-org#14276))
- Bump peaceiris/actions-mdbook from 1.1.14 to 1.2.0. ([\matrix-org#14275](matrix-org#14275))
- Bump setuptools-rust from 1.5.1 to 1.5.2. ([\matrix-org#14273](matrix-org#14273))
- Bump twine from 3.8.0 to 4.0.1. ([\matrix-org#14332](matrix-org#14332))
- Bump types-opentracing from 2.4.7 to 2.4.10. ([\matrix-org#14133](matrix-org#14133))
- Bump types-requests from 2.28.11 to 2.28.11.2. ([\matrix-org#14272](matrix-org#14272))
</details>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Message-Search Searching messages T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants