From 991e76171e489e5f655d2dda7b0cab40177e5e57 Mon Sep 17 00:00:00 2001
From: Alastair Porter
Date: Thu, 11 Jul 2024 20:52:41 +0200
Subject: [PATCH] SOLR-17346: Synchronise stopwords from snowball with those in lucene (#2533)

---
 solr/CHANGES.txt | 2 ++
 .../_default/conf/lang/stopwords_da.txt | 8 ++---
 .../_default/conf/lang/stopwords_de.txt | 6 ++--
 .../_default/conf/lang/stopwords_es.txt | 6 ++--
 .../_default/conf/lang/stopwords_fi.txt | 13 ++++----
 .../_default/conf/lang/stopwords_fr.txt | 30 +++++++++----------
 .../_default/conf/lang/stopwords_hu.txt | 8 ++---
 .../_default/conf/lang/stopwords_it.txt | 6 ++--
 .../_default/conf/lang/stopwords_nl.txt | 8 +++--
 .../_default/conf/lang/stopwords_no.txt | 12 +++-----
 .../_default/conf/lang/stopwords_pt.txt | 6 ++--
 .../_default/conf/lang/stopwords_ru.txt | 7 +++--
 .../_default/conf/lang/stopwords_sv.txt | 8 ++---
 13 files changed, 60 insertions(+), 60 deletions(-)

diff --git a/solr/CHANGES.txt b/solr/CHANGES.txt
index 61af4795afa..8216966d3df 100644
--- a/solr/CHANGES.txt
+++ b/solr/CHANGES.txt
@@ -146,6 +146,8 @@ Improvements
 such as -zkHost continue to be supported in the 9.x line of Solr. -u is now used to specify user credentials everywhere, this only impacts the bin/solr assert commands "same user" check which has -u as the short form of --same-user.
 (Eric Pugh, janhoy, Jason Gerlowski)
+* SOLR-17346: Synchronise stopwords from snowball with those in Lucene (Alastair Porter via Houston Putman)
+
 Optimizations
 ---------------------
 * SOLR-17257: Both Minimize Cores and the Affinity replica placement strategies would over-gather
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_da.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_da.txt
index 42e6145b98e..6e90e8f1aae 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_da.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_da.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt
+ | From https://snowballstem.org/algorithms/danish/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
@@ -60,7 +60,7 @@ hvor | where
 eller | or
 hvad | what
 skal | must/shall etc.
-selv | myself/youself/herself/ourselves etc., even
+selv | myself/yourself/herself/ourselves etc., even
 her | here
 alle | all/everyone/everybody etc.
 vil | will (verb)
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_de.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_de.txt
index 86525e7ae08..804bbbdb010 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_de.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_de.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt
+ | From https://snowballstem.org/algorithms/german/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_es.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_es.txt
index 487d78c8d56..48bd65ef867 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_es.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_es.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt
+ | From https://snowballstem.org/algorithms/spanish/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_fi.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_fi.txt
index 4372c9a055b..c9ee2f16dc5 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_fi.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_fi.txt
@@ -1,12 +1,12 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/finnish/stop.txt
+ | From https://snowballstem.org/algorithms/finnish/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
  | NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
-
+
 | forms of BE
 olla
@@ -48,8 +48,8 @@ me meidän meidät meitä meissä meistä meihin meillä meiltä meille
 te teidän teidät teitä teissä teistä teihin teillä teiltä teille | you
 he heidän heidät heitä heissä heistä heihin heillä heiltä heille | they
-tämä tämän tätä tässä tästä tähän tallä tältä tälle tänä täksi | this
-tuo tuon tuotä tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
+tämä tämän tätä tässä tästä tähän tällä tältä tälle tänä täksi | this
+tuo tuon tuota tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
 se sen sitä siinä siitä siihen sillä siltä sille sinä siksi | it
 nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi | these
 nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi | those
@@ -91,7 +91,6 @@ yli | over, across
 | other
 kun | when
-niin | so
 nyt | now
 itse | self
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_fr.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_fr.txt
index 749abae6846..658ae9c91ac 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_fr.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_fr.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt
+ | From https://snowballstem.org/algorithms/french/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
@@ -51,7 +51,7 @@ qui | who
 sa | his, her (fem)
 se | oneself
 ses | his (pl)
-son | his, her (masc)
+ | son | his, her (masc). Omitted because it is homonym of "sound"
 sur | on
 ta | thy (fem)
 te | thee
@@ -79,15 +79,15 @@ t | t'
 y | there
 | forms of être (not including the infinitive):
-été
+ | été - Omitted because it is homonym of "summer"
 étée
 étées
-étés
+ | étés - Omitted because it is homonym of "summers"
 étant
 suis
 es
-est
-sommes
+ | est - Omitted because it is homonym of "east"
+ | sommes - Omitted because it is homonym of "sums"
 êtes
 sont
 serai
@@ -118,7 +118,7 @@ soyez
 soient
 fusse
 fusses
-fût
+ | fût - Omitted because it is homonym of "tap", like in "beer on tap"
 fussions
 fussiez
 fussent
@@ -130,13 +130,13 @@ eue
 eues
 eus
 ai
-as
+ | as - Omitted because it is homonym of "ace"
 avons
 avez
 ont
 aurai
-auras
-aura
+ | auras - Omitted because it is also the name of a kind of wind
+ | aura - Omitted because it is also the name of a kind of wind and homonym of "aura"
 aurons
 aurez
 auront
@@ -147,7 +147,7 @@ auriez
 auraient
 avais
 avait
-avions
+ | avions - Omitted because it is homonym of "planes"
 aviez
 avaient
 eut
@@ -169,8 +169,8 @@ eussent
 | Later additions (from Jean-Christophe Deschamps)
 ceci | this
-cela | that
-celà | that
+cela | that (added 11 Apr 2012. Omission reported by Adrien Grand)
+celà | that (incorrect, though common)
 cet | this
 cette | this
 ici | here
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_hu.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_hu.txt
index 37526da8aa9..3fa279eac91 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_hu.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_hu.txt
@@ -1,12 +1,12 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/hungarian/stop.txt
+ | From https://snowballstem.org/algorithms/hungarian/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
  | NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
-
+
 | Hungarian stop word list
 | prepared by Anna Tordai
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_it.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_it.txt
index 1219cc773ab..c74160e28ca 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_it.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_it.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/italian/stop.txt
+ | From https://snowballstem.org/algorithms/italian/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_nl.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_nl.txt
index 47a2aeacf6f..48c5515123a 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_nl.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_nl.txt
@@ -1,12 +1,13 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/dutch/stop.txt
+ | From https://snowballstem.org/algorithms/dutch/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
  | NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
+
 | A Dutch stop word list. Comments begin with vertical bar. Each stop
 | word is at the start of a line.
@@ -117,3 +118,4 @@ uw | your
 iemand | somebody
 geweest | been; past participle of 'be'
 andere | other
+
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_no.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_no.txt
index a7a2c28ba54..f427609484f 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_no.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_no.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/norwegian/stop.txt
+ | From https://snowballstem.org/algorithms/norwegian/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
@@ -25,7 +25,7 @@ et | a/an
 den | it/this/that
 til | to
 er | is/am/are
-som | who/that
+som | who/which/that
 på | on
 de | they / you(formal)
 med | with
@@ -84,7 +84,6 @@ noen | some
 noe | some
 ville | would
 dere | you
-som | who/which/that
 deres | their/theirs
 kun | only/just
 ja | yes
@@ -129,7 +128,6 @@ mange | many
 også | also
 slik | just
 vært | been
-være | to be
 båe | both *
 begge | both
 siden | since
@@ -155,7 +153,6 @@ hennar | her/hers
 hennes | hers
 hoss | how *
 hossen | how *
-ikkje | not *
 ingi | noone *
 inkje | noone *
 korleis | how *
@@ -177,7 +174,6 @@ noka | some (fem.) *
 nokor | some *
 noko | some *
 nokre | some *
-si | his/hers *
 sia | since *
 sidan | since *
 so | so *
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_pt.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_pt.txt
index acfeb01af6b..d03d7f234d5 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_pt.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_pt.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/portuguese/stop.txt
+ | From https://snowballstem.org/algorithms/portuguese/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_ru.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_ru.txt
index 55271400c64..65512d49dbd 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_ru.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_ru.txt
@@ -1,12 +1,13 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/russian/stop.txt
+ | From https://snowballstem.org/algorithms/russian/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
  | NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
+
 | a russian stop word list. comments begin with vertical bar. each stop
 | word is at the start of a line.
diff --git a/solr/server/solr/configsets/_default/conf/lang/stopwords_sv.txt b/solr/server/solr/configsets/_default/conf/lang/stopwords_sv.txt
index 096f87f6766..d1d0d100880 100644
--- a/solr/server/solr/configsets/_default/conf/lang/stopwords_sv.txt
+++ b/solr/server/solr/configsets/_default/conf/lang/stopwords_sv.txt
@@ -1,7 +1,7 @@
- | From svn.tartarus.org/snowball/trunk/website/algorithms/swedish/stop.txt
+ | From https://snowballstem.org/algorithms/swedish/stop.txt
  | This file is distributed under the BSD License.
- | See http://snowball.tartarus.org/license.php
- | Also see http://www.opensource.org/licenses/bsd-license.html
+ | See https://snowballstem.org/license.html
+ | Also see https://opensource.org/licenses/bsd-license.html
  | - Encoding was converted to UTF-8.
  | - This notice was added.
  |
@@ -120,7 +120,7 @@ vilka | who, that
 ditt | thy
 vem | who
 vilket | who, that
-sitta | his
+sitt | his
 sådana | such a
 vart | each
 dina | thy