SOLR-17346: Synchronise stopwords from snowball with those in lucene #2533

Merged

alastair
Contributor

https://issues.apache.org/jira/browse/SOLR-17346

Description

Solr's default configset comes with a collection of sample stopwords from the Snowball project. There is a similar set of stopword files in the Lucene repository, but those have been updated to a more recent version of the Snowball lists.
Specifically, the most recent French stopword list drops a number of words that are homonyms of other useful words, which shouldn't be skipped.

Solution

Copy the Snowball stopword files from Lucene to Solr.
I only copied files that are present in https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball, and only where Solr's version of the file was also from the Snowball project (e.g. the English and Indonesian stopword files in Solr aren't from Snowball, so I didn't copy them from Lucene even though they exist there).
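
For anyone who wants to see what changes between two versions of a stopword file, here is a minimal sketch. It is not part of this PR. It assumes the Snowball file format, where `|` starts a comment and the remaining words on a line are separated by whitespace. The function names are made up for illustration.

```python
from __future__ import annotations


def parse_snowball_stopwords(text: str) -> set[str]:
    """Collect stopwords from Snowball-format text.

    Assumed format: everything after the first '|' on a line is a comment,
    and the remaining text holds zero or more whitespace-separated words.
    """
    words: set[str] = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]  # drop the trailing comment, if any
        words.update(content.split())    # blank lines contribute nothing
    return words


def diff_stopwords(old_text: str, new_text: str) -> tuple[set[str], set[str]]:
    """Return (removed, added): words only in the old list, words only in the new one."""
    old = parse_snowball_stopwords(old_text)
    new = parse_snowball_stopwords(new_text)
    return old - new, new - old
```

Reading the old Solr file and the corresponding Lucene file as UTF-8 and passing both strings to `diff_stopwords` would list the words this change removes from or adds to each language.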

Tests

Build Solr with ./gradlew dev
Start Solr and create a new core
Verify that the expected files were copied to the new core
Verify that the core starts up
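
The "files were copied" step can also be checked mechanically. The sketch below is a hypothetical helper, not something this PR ships. It compares a list of file names between a source directory and a target directory byte for byte. The directories and file names are assumptions you supply yourself, for example the configset's `lang` directory in the checkout and the `conf/lang` directory of the newly created core.

```python
from __future__ import annotations

import filecmp
from pathlib import Path


def find_mismatched_files(source_dir, target_dir, names) -> dict[str, str]:
    """Return {file name: problem} for files that are missing or differ in target_dir.

    Every name is expected to exist in source_dir; a missing source file
    raises FileNotFoundError from filecmp.cmp.
    """
    problems: dict[str, str] = {}
    for name in names:
        src = Path(source_dir) / name
        dst = Path(target_dir) / name
        if not dst.is_file():
            problems[name] = "missing"
        elif not filecmp.cmp(src, dst, shallow=False):  # compare contents, not just stat info
            problems[name] = "differs"
    return problems
```

An empty result means every listed file reached the core unchanged.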

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@epugh
Contributor

epugh commented Jun 26, 2024

This all looks very straightforward to me... One concern, is this something that can go on a 9.x release, or needs to ship as part of 10? The reason I ask is that if we are changing the way we apply stopwords, well, that might be NOT backwards compatible from a relevancy perspective. I don't know how we have handled other data sets like this. I could see this being something that only goes on 10.x....

@hossman
Member

hossman commented Jun 27, 2024

One concern, is this something that can go on a 9.x release, or needs to ship as part of 10? The reason I ask is that if we are changing the way we apply stopwords, well, that might be NOT backwards compatible from a relevancy perspective.

Nothing in this PR changes the "way" we apply stopwords; it only changes the list of stopwords in the _default configset, which we are free to do in any release (even a bugfix release if we feel it's warranted). People who expect backcompat when upgrading should not be overwriting their configsets on upgrade.

@epugh epugh self-assigned this Jun 27, 2024
@epugh
Contributor

epugh commented Jun 28, 2024

@alastair how would you like to be credited in CHANGES.txt?

@alastair
Contributor Author

Thanks @epugh. I'm happy to be credited as I am on my commit: "Alastair Porter"

@HoustonPutman HoustonPutman merged commit 991e761 into apache:main Jul 11, 2024
2 of 3 checks passed
HoustonPutman pushed a commit that referenced this pull request Jul 11, 2024
4 participants