SOLR-17346: Synchronise stopwords from snowball with those in lucene (#2533)

(cherry picked from commit 991e761)
alastair authored and HoustonPutman committed Jul 11, 2024
1 parent e50df5b commit 006e3a8
Showing 13 changed files with 60 additions and 60 deletions.
2 changes: 2 additions & 0 deletions solr/CHANGES.txt
@@ -44,6 +44,8 @@ Improvements

* SOLR-15591: Make using debugger in Solr easier by avoiding NPE in ExternalPaths.determineSourceHome. (@charlygrappa via Eric Pugh)

* SOLR-17346: Synchronise stopwords from snowball with those in Lucene (Alastair Porter via Houston Putman)

Optimizations
---------------------
* SOLR-17257: Both Minimize Cores and the Affinity replica placement strategies would over-gather
@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/danish/stop.txt
| From https://snowballstem.org/algorithms/danish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -60,7 +60,7 @@ hvor | where
eller | or
hvad | what
skal | must/shall etc.
selv | myself/youself/herself/ourselves etc., even
selv | myself/yourself/herself/ourselves etc., even
her | here
alle | all/everyone/everybody etc.
vil | will (verb)
@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/german/stop.txt
| From https://snowballstem.org/algorithms/german/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/spanish/stop.txt
| From https://snowballstem.org/algorithms/spanish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
13 changes: 6 additions & 7 deletions solr/server/solr/configsets/_default/conf/lang/stopwords_fi.txt
@@ -1,12 +1,12 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/finnish/stop.txt
| From https://snowballstem.org/algorithms/finnish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| forms of BE

olla
@@ -48,8 +48,8 @@ me meidän meidät meitä meissä meistä meihin meillä meiltä meille
te teidän teidät teitä teissä teistä teihin teillä teiltä teille | you
he heidän heidät heitä heissä heistä heihin heillä heiltä heille | they

tämä tämän tätä tässä tästä tähän tallä tältä tälle tänä täksi | this
tuo tuon tuotä tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
tämä tämän tätä tässä tästä tähän tällä tältä tälle tänä täksi | this
tuo tuon tuota tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi | that
se sen sitä siinä siitä siihen sillä siltä sille sinä siksi | it
nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi | these
nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi | those
@@ -91,7 +91,6 @@ yli | over, across
| other

kun | when
niin | so
nyt | now
itse | self

30 changes: 15 additions & 15 deletions solr/server/solr/configsets/_default/conf/lang/stopwords_fr.txt
@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/french/stop.txt
| From https://snowballstem.org/algorithms/french/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -51,7 +51,7 @@ qui | who
sa | his, her (fem)
se | oneself
ses | his (pl)
son | his, her (masc)
| son | his, her (masc). Omitted because it is homonym of "sound"
sur | on
ta | thy (fem)
te | thee
@@ -79,15 +79,15 @@ t | t'
y | there

| forms of être (not including the infinitive):
été
| été - Omitted because it is homonym of "summer"
étée
étées
étés
| étés - Omitted because it is homonym of "summers"
étant
suis
es
est
sommes
| est - Omitted because it is homonym of "east"
| sommes - Omitted because it is homonym of "sums"
êtes
sont
serai
@@ -118,7 +118,7 @@ soyez
soient
fusse
fusses
fût
| fût - Omitted because it is homonym of "tap", like in "beer on tap"
fussions
fussiez
fussent
@@ -130,13 +130,13 @@ eue
eues
eus
ai
as
| as - Omitted because it is homonym of "ace"
avons
avez
ont
aurai
auras
aura
| auras - Omitted because it is also the name of a kind of wind
| aura - Omitted because it is also the name of a kind of wind and homonym of "aura"
aurons
aurez
auront
@@ -147,7 +147,7 @@ auriez
auraient
avais
avait
avions
| avions - Omitted because it is homonym of "planes"
aviez
avaient
eut
@@ -169,8 +169,8 @@ eussent

| Later additions (from Jean-Christophe Deschamps)
ceci | this
cela | that
celà | that
cela | that (added 11 Apr 2012. Omission reported by Adrien Grand)
celà | that (incorrect, though common)
cet | this
cette | this
ici | here
@@ -1,12 +1,12 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/hungarian/stop.txt
| From https://snowballstem.org/algorithms/hungarian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| Hungarian stop word list
| prepared by Anna Tordai

@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/italian/stop.txt
| From https://snowballstem.org/algorithms/italian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -1,12 +1,13 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/dutch/stop.txt
| From https://snowballstem.org/algorithms/dutch/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"


| A Dutch stop word list. Comments begin with vertical bar. Each stop
| word is at the start of a line.

@@ -117,3 +118,4 @@ uw | your
iemand | somebody
geweest | been; past participle of 'be'
andere | other

12 changes: 4 additions & 8 deletions solr/server/solr/configsets/_default/conf/lang/stopwords_no.txt
@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/norwegian/stop.txt
| From https://snowballstem.org/algorithms/norwegian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -25,7 +25,7 @@ et | a/an
den | it/this/that
til | to
er | is/am/are
som | who/that
som | who/which/that
på | on
de | they / you(formal)
med | with
@@ -84,7 +84,6 @@ noen | some
noe | some
ville | would
dere | you
som | who/which/that
deres | their/theirs
kun | only/just
ja | yes
@@ -129,7 +128,6 @@ mange | many
også | also
slik | just
vært | been
være | to be
båe | both *
begge | both
siden | since
@@ -155,7 +153,6 @@ hennar | her/hers
hennes | hers
hoss | how *
hossen | how *
ikkje | not *
ingi | noone *
inkje | noone *
korleis | how *
@@ -177,7 +174,6 @@ noka | some (fem.) *
nokor | some *
noko | some *
nokre | some *
si | his/hers *
sia | since *
sidan | since *
so | so *
@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/portuguese/stop.txt
| From https://snowballstem.org/algorithms/portuguese/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -1,12 +1,13 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/russian/stop.txt
| From https://snowballstem.org/algorithms/russian/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"


| a russian stop word list. comments begin with vertical bar. each stop
| word is at the start of a line.

@@ -1,7 +1,7 @@
| From svn.tartarus.org/snowball/trunk/website/algorithms/swedish/stop.txt
| From https://snowballstem.org/algorithms/swedish/stop.txt
| This file is distributed under the BSD License.
| See http://snowball.tartarus.org/license.php
| Also see http://www.opensource.org/licenses/bsd-license.html
| See https://snowballstem.org/license.html
| Also see https://opensource.org/licenses/bsd-license.html
| - Encoding was converted to UTF-8.
| - This notice was added.
|
@@ -120,7 +120,7 @@ vilka | who, that
ditt | thy
vem | who
vilket | who, that
sitta | his
sitt | his
sådana | such a
vart | each
dina | thy
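
The headers in this diff say that these files need format="snowball" with StopFilterFactory, and that comments begin with a vertical bar. A minimal sketch of reading that layout follows, assuming (from the Finnish rows above) that one line may carry several whitespace-separated words. It is an illustration in Python, not Lucene's loader; `parse_snowball_stopwords` and the `sample` text are hypothetical names made up for this example.

```python
def parse_snowball_stopwords(text):
    """Return the set of stop words found in snowball-format text.

    Assumed layout (taken from the file headers and rows in this diff):
    - everything from the first vertical bar to the end of the line is a comment
    - the remaining text may hold zero or more whitespace-separated words
    """
    words = set()
    for line in text.splitlines():
        # Drop the comment part, if any.
        content = line.split("|", 1)[0]
        # Blank or comment-only lines contribute nothing.
        words.update(content.split())
    return words


# Hypothetical sample assembled from lines that appear in the diff above.
sample = """\
| forms of être (not including the infinitive):
| été - Omitted because it is homonym of "summer"
étée
sa             | his, her (fem)
tuo tuon tuota | that
"""

stopwords = parse_snowball_stopwords(sample)
print(sorted(stopwords))
```

Under this reading, prefixing an entry with a vertical bar, as the French file now does for `été`, drops it from the resulting set without deleting the line.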