-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignoring MySQL stop words for full-text search #43
Comments
Hi, thanks for your feedback! There is no stop word handling implemented within search_cop currently and no DMBS related code regarding stopwords either. So, if MySQL isn't handling the stopwords properly, you first need to double-check the MySQL docs to fix it (if possible). I'd guess the Personally, i don't want the DMBS/Searchengine/etc. to handle synonyms/stopwords/etc, as this often fails short when i18n or multi-word synonyms/stopwords are desired. Thus, i would do synonym/stopword mapping before passing the query to search cop as well. However, search_cop maybe could provide a synoym/stopword mapping engine: search_scope :search do
attributes :title, :description
stopwords do |phrase|
redis.sismember("stopwords:en", phrase)
end
end such that |
That's correct. The search query entered is in this example is mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.01 sec)
Agreed. Some sort of an optional pull in makes sense as different users may have different needs. It would certainly be very helpful if the users could simply set it up. At the moment I have fixed it as below in the application controller code as shown below.
@posts = Post.search(remove_stopwords(params[:search][:query]))
# Private method much later ....
# SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
STOP_WORDS = %w{ a about an are as at be by com de en for from how i
in is it la of on or that the this to was what when where who will
with und the www }.freeze
def remove_stopwords(query)
words = query.split(/\s+/).select do |w|
(w.length >= 3) && (w.length <= 84) && !STOP_WORDS.include?(w)
end
words.join(' ')
end Regarding the length checks above, The MySQL manual here has a note for InnoDB stating:
So, if the user were to put in mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('+stranger in a +x +strange' IN BOOLEAN MODE)));
Empty set (0.00 sec)
The stop words for InnoDB are pull using the query mentioned on the page. The page also mentions MyISAM stopwords. |
Just another thought from a solution perspective. Is it possible to disable the Without the plus, the query works just fine. mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec) This would make the query more flexible and the user could add It may be OK to give results without mysql> SELECT artifact_name, title FROM `posts` WHERE ((MATCH(`posts`.`artifact_name`, `posts`.`title`, `posts`.`body`) AGAINST('stranger in a strange foo' IN BOOLEAN MODE)));
+----------------------------+----------------------------------------------------------------------+
| artifact_name | title |
+----------------------------+----------------------------------------------------------------------+
| Stranger in a Strange Land | You can't copy the driver without indexing the back-end SSL circuit! |
| Stranger in a Strange Land | Digitized incremental challenge. |
+----------------------------+----------------------------------------------------------------------+
2 rows in set (0.00 sec)
|
i probably need to evaluate the mysql fulltext syntax/behaviour once again. |
Thank you for considering this and for search_cop 😃 . Just wanted to mention that the above behavior is for MySQL 8.0. I don't have access to MySQL 5.7 or another version to check if that's any different. |
Is there a way to ignore stop-words for MySQL full-text search? At the moment if we enter a search string that contains stop words, the stop words also get added to the SQL query with + signand the search result is empty.
Note that the above query returns an empty set.
The below query works fine as the stop words are removed.
The stop gap can be that at the application level we could filter stop words before submitting the query. However, I would think this should be a common issue and it would be great if search_cop could help with this.
The text was updated successfully, but these errors were encountered: