-
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revamp search API #286
Comments
After continually jumping back and forth between performance issues and little bugs in our search implementation, it has become clear that we need to make progress on our driver-based search infrastructure (#509). This would then allow for swapping out a default, naive implementation (hand-maintained index, similar to FluxBB; MySQL fulltext search; etc.) for more advanced drivers based on e.g. Elasticsearch or Algolia. I was asked to describe my thoughts on the initial architecture in a new ticket. But we already have this one, so I'll add my thoughts here. Laravel ScoutMultiple people brought up using Laravel Scout, as it incorporates a driver infrastructure for searches on Eloquent models. This might be possible to use, but I am somewhat skeptical for several reasons:
What might be possible is using Scout's engines (i.e. drivers) so that we have less work in building out the adapters to the various backends. We would still have to take care of the integration into Eloquent / our search infrastructure manually, though. Definitely worth a shot when we get to the implementation phase. RequirementsThese are the filters / gambits supported by our current search layer...
Initial thoughts:
We really have to come up with a better conceptual separation between what is search and what is filtering. Driver capabilitiesOnce this list is complete, we should be able to derive an interface for driver implementations:
Open questions
Naive implementationAs we don't want to require something like Elasticsearch by default (this would complicate the setup far too much), we have to provide a simple implementation by default. As an alternative to the current MySQL-based fulltext search, we could implement a search based on a hand-made index, similar to what FluxBB and other forum software do / have done. Now that the application layer has to notify the index when it should be changed, this is possible. The main benefit would be compatibility with other database systems, should we ever get there. Quick primer on how FluxBB's search works:
Possible future changesWe want to keep these things in mind when designing the interface, even if we don't implement them right away...
This is very likely incomplete right now. I will try my best to keep this post updated, as we discuss things below. |
Currently, we have a DiscussionSearcher and a UserSearcher. I think we should abstract this system down to an AbstractSearcher, which would support extending ANY model, flarum-defined or not, with a searcher. When applied to Posts, this should also remove necessity for the (frankly redundant) ConfigurePostsQuery event. Within that searcher, filtering is currently supported by gambits. I think this is a good system, and the new AbstractSearcher system should support extending ANY searcher with a gambit (if it exists). Now, the searching itself it currently handled by a 'FullTextGambit'. This, in my opinion, is the part that we could refactor out into a system that supports various search drivers. A few possible ideas:
In my opinion, the refactor I described above should be included by stable. Implementing an extender for custom search drivers could be delayed by 0.2. |
Franz and I had a very productive discussion today, here are my takeaways: Searching and FilteringOne point of uncertainty is where to draw the line between searching and filtering, and how these 2 concepts should fit in/work together. I am in favor of a tightly coupled approach, that is, having filtering and searching together under one system. Here's why:
DriversOur goal in designing the new system should be to eventually support search drivers. This should be extendable, so that extension developers can add new drivers. The issue emerges of how to generalize our implementation, so that filters (currently presented by gambits) will work across various extenders. What I'm Proposing
Extenders
Interfaces
|
This sounds like a great plan @askvortsov1 ; I'm extremely curious about the technical implications once we move gambits to the SimpleSearch. Well done! |
After a bit more thought, I realized I'm missing one method in the SearchDriverInterface:
This should work similarly to implementedFilters, but I think it's important to have the 2 separately, because we might have filters but not a searcher, or vice versa. However, there might be a bit of overlap, which might be unideal. One other thing to consider: should filter specs go beyond a regex, and provide, say, a description? or non-regex filter types (equals, contains, between, gt, gte, etc)? In essence, I think that regex might be better as an implementation detail than as the actual interface, but I might be wrong |
I feel for the API instead of: filter[q]=testquery tag:testtag author:testauthor this would be a lot clearer: filter[q]=testquery&filter[tag]=testtag&filter[author]=testauthor That would make it behave like the page parameter But since it's not an easy change, the better alternative could be to rename filter=testquery tag:testtag author:testauthor (And maybe even simplify the page parameters to just |
I don't know that we'd want to rename That being said, I feel like we should separate discussion about the format of REST API requests, and the underlying search architecture (which is more focused on establishing a driver-based system) |
I've spent a bit more time playing around with the system (using elasticsearch as an example, but only as that), and here's a reiteration of the main issues (shocklingly, pretty much exactly what Franz identified above):
So, possible, non-mutually-exclusive approaches to fighting these issues (some of these are quite rough) : a. Run 2 search systems (Flarum's basic one and elasticsearch/algolia/whatever) at the same time. If a search includes an unsupportable gambit like subscription status or unread, default to the native Flarum driver I'm planning to take a better look through the Laravel Scout system and see how they accomplish this (and whether they actually do). Somewhat concrete proposal: If we consider typical user behavior, people are unlikely to look beyond the nth page of search results, where n is typically in single digits. Results become less relevant, and people have typically found something that works by then. (cue "The best place to hide a body is the 2nd page of google results" jokes). Criteria that filter out significant portions of the search space (tags, dates, authors/participants, etc) can all be handled by Elasticsearch As long as we run queries that contain subscription and unread filters through the SQL search, searches we send to ElasticSearch should mostly contain results the user can view. We could further improve that by telling elasticsearch not to include anything from tags/categories the user can't view discussions in, which shouldn't be too performance-intensive, and would further cut down on inaccessible (permission-wise) results returned by ElasticSearch. With these optimizations in place, if we request It goes without saying that this is an extension-space issue, although we might want to make it bundled (at least at the start) because of how critical an issue it is. |
Re-reading this discussion, I feel like we do need to separate filters and full text search into different API endpoints and implementations. The filter part, like proposed in flarum/framework#2104 , makes a lot of sense to me for retrieving posts with filters, discussions with a particular flag (bookmarked, followed, ...) or users based on filters (enabled, group) for an admin UI. The current approach works well and apart from creating proper extenders and re-usable classes, I don't think we need to change it too much. Changing to query-string based over string-based would be nice. The full text search really stands on its own. Very few filters matter. You just want relevant results, and fast. Currently there are only two situations where this endpoint would be used: the quick search dropdown that includes discussions and users, and the full discussion search results page. We don't really need to make the search work with anything else by default. Here's there's also potentially the use case of directly hitting a search server like Algolia without going through Flarum backend at all. My preference/proposal would be as follows:
So the API could look something like this: PHP class FilteredQuery implements ExtenderInterface {
public function __construct(string $filteredQueryClass);
// Equivalent to FlarumSearch::addGambit in the PR
public function addFilter(string $filterClass);
// Equivalent to FlarumSearch::addSearchMutator in the PR
public function addQueryMutator(callable $callback);
} // This extender's only purpose would be to make the drivers and resources known to the admin panel to create a configuration screen. Everything else will be on the client-side
class Search implements ExtenderInterface {
// Registers a new driver for full text search
public function driver(string $driverClass);
// Registers a new full text search resource that this extension provides (for example, pages or albums)
public function resource(string $name);
} // This is the extender's for the base search driver provided by Flarum
class SimpleTextSearch implements ExtenderInterface {
// Add a gambit for the default search driver. Those could work very similarly to what we have now, but would be separate from the filtered query filters
public function addGambit(string $gambitClass);
// Allows altering the SQL query of the default search driver, just like we can with filtered queries
public function addQueryMutator(callable $callback);
} // The server-side search driver class would only serve to build the admin configuration panel. Flarum needs to get a name/description, and know which resources are compatible with this driver. The list of resources isn't hard-coded as an extension could use this search driver's own extenders to add functionality
class AbstractSearchDriver {
abstract function name();
abstract function compatibleResources(): array;
} JS: // The extension would be responsible to use this extender to register one class for each resource they are compatible with
class SearchExtender {
addDriver(resource: string, class implements AbstractSearchDriver);
} // The search driver would work similarly to the current SearchSource objects, but it would also be used for paginated results in the discussions search results
class AbstractSearchDriver {
// This method lets Flarum know which gambits it can offer to this driver
// Flarum would take the list of suggested gambits for the context and compare it against this before suggesting it
abstract supportedGambits(resource: string);
// Performs the search asynchronously. We could wrap the result in another object that gives pagination hints
// The data returned does not need to come from Flarum's API, but must contain models with some minimal attributes/relationships that we could spec out
async abstract search(resource: string, query: string, page: number): Model[];
} // We could add a new method to the global search state that can be extended by extensions to provide gambit suggestions
// This would replace the current GlobalSearchState.prototype.params used by extensions like Tags
app.search.suggestedGambits(): string[]; In the UI, I suggest making it so gambits get offered under the search field when focused. So for example you click the search bar, and the following appears: Clicking a gambit would add it visually to the search, either in text form or ideally in a fashion similar to the tag picker: |
2 completely separated systems works even better! I like pretty much everything about this proposal. We'll probably have some duplicate code between the filters and SimpleTextSearch gambits, especially towards the start. I think we should just keep it duplicate without any forced abstractions to unite the two: as you mentioned, these are more complex systems and unnecessary complexity doesn't help. The biggest challenge here will probably need to be implementation, as the PRs would be bulky. Possible split of work: PR 1A
PR 1B
PR 2
PR 3
PR 4
PR 5
So, that begs the question: what do we need when? PR 1A and 1B I think needs to be included for stable so we can deprecate the old events (preferably in beta 15, instead of flarum/framework#2104, which I think is feasible). PR 2 would also be really nice to have sooner than later to enable foreign language search. I would also like to have this for stable, but maybe as a beta 16 thing. PRs 3, 4, and 5 can be done when developers are available, but are not THAT urgent. PR 3 is the most important IMO. |
I love the idea of separating fulltext vs filters. This would also allow retrieving private discussions using the fulltext search and then running filtering afterwards. We only need to tackle some basic issues like retrieving more entries than we need etc. This would also mean we could look at laravel scout again, potentially.. Well established and easy to replace. I personally don't think that we need to allow search drivers per resource. I think that would over-complicate core as most (if not all) implementation will use one driver. Changing search, adding an advanced search page and introducing better ux for gambits/filters but I think it doesn't make sense to add that to the scope before stable. For the issues with search that UTF-8 needs fixed, we can possibly go for a simpler, temporary solution? |
I think it would be simpler to register drivers to resources, not even from the perspective of ALLOWING search per resources as for registering and accessing those drivers in the first place. Since each resource would have different gambits, trying to shoehorn search for ALL resources under one block of logic would be messy IMO. Most of what I bring up in Phase 1A about custom pages isn't for an "advanced search UI". It's just that we need the index page to hit the filtering API endpoint, and the search page to hit the search API endpoint. This might be possible to integrate into the index page, but I need to play around with it first. |
I can't see a solution where there isn't one driver per resource and where extensions can define their own search endpoints. Unless we standardize a way for extensions to provide the data that must be indexed, but this kind of goes against the flexibility that could be offered by having radically different drivers. My use case for drivers per resources is as follows:
If we force the forum to use basic for everything since it's the only driver that supports all search types, we will create lots of incompatibilities or just discourage extensions from implementing search in their own data. We could force selecting the same driver for both user and discussion however since the two will always be used together and can't be disabled. |
This actually wouldn't happen. Retrieving more entries than we need is the exact issue I brought up above. I'm envisioning these as 2 parallel, but maximally detached systems, |
I made another attempt at flarum/framework#2453, and came across some concerns: Shared EndpointsAfter some more thinking, I don't see why search and filter couldn't share one endpoint. My main reasoning behind this is that regardless of whether a query is searched or filtered, the frontend expects a consistent format of results. The distinction between which method is used is essentially:
Either method should return an Eloquent collection, which can then be marked as paginated (the actual application of sort, offset, and size happens inside the filtering / searching logic), have related objects / includes added to it, and so on. IMO, https://github.com/flarum/core/blob/master/src/Search/SearchResults.php#L14-L14, which is currently returned by Gambits vs Filters code duplicationWith the filtering / search split, everything that is currently a gambit for the
This means we'd have 2 copies of code for all filters / SimpleFlarumSearch gambits. A few ideas:
Another concern is preserving backwards compatibility. With the approach I'm taking in flarum/framework#2453, it's completely preserved for the Search logic (since that is completely unchanged), but it seems like it'll be a breaking change for the filter endpoints, as we can't really apply the gambits from the |
I've gone ahead and refactored flarum/framework#2454 so that gamits and filters are combined into
A few thoughts on how we could (potentially) make it better:
That brings me to my other design concern (quoted from the PR):
With how it's looking now, I think having a class per-resource might be the better solution. I wonder if we could use |
How would it be confusing if we're renaming it though ?
we could deprecate the pattern property, and then add a I don't know yet about the Query namespace, might be better to look into that after we're done with the current PRs as they are. |
I didn't explain this well, sorry. I like the |
I see now, yeah.. in that case, I'd leave them as is for now. |
The PR has been refactored to use
Since it's useful to have this kind of wrapper around user and query (it allows us to pass in additional stuff to gambits/filters in the future without breaking anything), I think we should keep this class in some. After some thinking, my preference would be the following:
That leaves only the question of where the default sort should come from. I see 2 options:
|
The remainder of this is building in an extender + manager class, and adding the frontend integration. Neither of these will be particularly breaking, so they can be delayed to the next release. |
I don't think I found the SQL injection vulnerability of the framework. I just don't know why when I search for some special characters, these characters will be filtered out by the PHP program, resulting in SQL syntax errors. When I modify the packaging mode of MySQL engine InnoDB to mroonga engine, I don't think MySQL storage engine has any impact on your program. But your code does filter out special characters and cause syntax errors in full-text search SQL. in general:
This logic puzzles me! Can't you use MySQL precompiling? |
Plan for new search system:
Search driver (eg. Algolia) is called upon to update its index each time a post/discussion is changed
(Each post is a separate object in the index which contains not only post content but also its discussion's title/tags/other metadata)
q = foo "bar baz" (author:toby OR author:franz) is:following
q
into an abstracted query object . eg. https://github.com/pimcore/search-query-parser or https://github.com/netgen/query-translatorTerm
in the query object, allow gambits to match and convert term into aWhere
object to be processed by search driverWhere
to restrict results to only tags that the user can seeTerms
/Wheres
, returns list of discussion IDsHowever, this is complex and will take a while to implement, so it would be wise to keep it slated for 0.2 (as per current roadmap). In the meantime, we are doing flarum/framework#1339
The text was updated successfully, but these errors were encountered: