Should fix and extend the pluralization rules (WAS: Consider using built in .NET for underlying pluralization) #142

ghost · 2014-04-09T16:09:41Z

The built-in .NET PluralizationService does a more thorough job of pluralizing English words correctly. Please consider using it as the service behind Humanizer's Pluralize feature.

http://referencesource.microsoft.com/#System.Data.Entity.Design/Entity/Design/PluralizationService/EnglishPluralizationService.cs

I realize it is in an assembly you probably do not want to require as a dependency, but perhaps some notes can be taken from it.

MehdiK · 2014-04-09T17:25:45Z

Thanks. I seriously considered using that for singularization and pluralization, but, as you said, I really didn't want to depend on a different package, particularly one so commonly disliked by the community.

One of the reasons I don't want to depend on another package is that Humanizer has an aggressive release cycle: new features and patches ship every week on average. So if a bug is found in anything, including pluralization (which obviously is a bit unlikely), I want to be able to turn it around very quickly and not have to wait for the release of a third-party library. This way I can also add new features more freely.

FWIW this feature has been there for a few months now and I haven't received any issues on it. So I think it should be good.

Do you have an example of this working incorrectly?

ghost · 2014-04-09T17:57:27Z

Sure, it works for most words, but words which pluralize irregularly are hit and miss. I don't blame you for avoiding Entity of course.

Here are a few I discovered while checking some irregular words.

singular	plural returned	correct plural
atlas	atlas	atlases
cod	cods	cod
domino	dominos	dominoes
echo	echos	echoes
hero	heros	heroes
hoof	hoofs	hooves
iris	iris	irises
leaf	leafs	leaves
loaf	loafs	loaves
motto	mottos	mottoes
reflex	reflices	reflexes
sheaf	sheafs	sheaves
syllabus	syllabuses	syllabi
thief	thiefs	thieves
waltz	waltzs	waltzes
gas	gas	gases
cactus	cactus	cacti
focus	focus	foci/focuses
nucleus	nucleus	nuclei
radius	radius	radii
stimulus	stimulus	stimuli
appendix	appendixes	appendices
beau	beaus	beaux
corpus	corpus	corpora
criterion	criterions	criteria
curriculum	curriculums	curricula
genus	genus	genera
memorandum	memorandums	memoranda
offspring	offsprings	offspring
foot	foots	feet
tooth	tooths	teeth
nebula	nebulas	nebulae
vertebra	vertebras	vertebrae

These aren't the most common words in application development, but I like to be sure I don't have to double-check whether a word will work when using a function.

Thanks for your attention.

MehdiK · 2014-04-09T18:37:26Z

Oops! That's quite a few for a quick check! D'oh. Thanks for your effort.

Perhaps an unfair question, but do you think the issues may be mostly around irregular nouns? Because that should be relatively easy to fix.

ghost · 2014-04-09T19:57:08Z

No worries. I went directly to irregular nouns to test, so I'm not sure if the issues are specific to these or if there are deeper problems.

Even if there are other issues, fixing the irregular ones is as simple as adding them to the irregular word dictionary in the code... it's a step in the right direction.
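
As a rough illustration of that approach (a Python sketch, not Humanizer's actual C# code; `IRREGULAR`, `UNCHANGED` and `pluralize` are hypothetical names, and the fallback rules are deliberately naive), an exceptions dictionary can be consulted before any suffix rule runs:

```python
# Hypothetical sketch: irregular nouns are looked up before generic suffix rules.
IRREGULAR = {
    "foot": "feet",
    "tooth": "teeth",
    "criterion": "criteria",
    "thief": "thieves",
}

# Words whose plural is identical to the singular.
UNCHANGED = {"cod", "offspring"}


def pluralize(word: str) -> str:
    lower = word.lower()
    if lower in UNCHANGED:
        return word
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # Naive fallback rules, for illustration only.
    if lower.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"
```

With this ordering, adding a missing irregular word is a one-line change to the table and cannot disturb the suffix rules.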

I wish there was a place to get dictionaries of nicely formatted data to test against. (Is there?)

MehdiK · 2014-04-09T20:06:05Z

Thanks. Yeah, I think we should at least do that. Wanna send me a PR for it? :p

hehehe, getting a dictionary and iterating over it with Humanizer is a good idea :) If you find that solution we could also complete an abandoned kick-arse feature in Humanizer.

hazzik · 2014-04-14T21:54:44Z

There are some errors in the table you provided.

"memorandum" => correct plurals are "memorandums" and "memoranda"

nemec · 2014-04-27T22:57:45Z

Here's a CSV that I put together that merges the existing tests, the above table, and other cases that I came across when implementing this pluralization algorithm: https://gist.github.com/nemec/201f6e2b2af3a4390f0b

Humanizer won't open in MonoDevelop (too lazy to boot into Windows), so I haven't yet tried running Humanizer against the dataset.

Note that a couple of the existing plural tests were totally wrong (virus, for example) and have been corrected. When there was more than one valid plural option, I picked the one that made my matching code simpler -- I don't think I changed any of the existing tests, but the new ones I've added may not be picked up by Humanizer's current ruleset.
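
For anyone who wants to run a pluralizer against a dataset like this, here is a minimal Python sketch. It assumes a two-column layout (singular, expected plural), which may not match the gist exactly; `find_mismatches`, `SAMPLE` and `naive` are names invented for the example.

```python
import csv
import io
from typing import Callable, List, Tuple


def find_mismatches(
    csv_text: str, pluralize: Callable[[str], str]
) -> List[Tuple[str, str, str]]:
    """Return (singular, expected, actual) for every row the pluralizer gets wrong."""
    mismatches = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        singular, expected = row[0].strip(), row[1].strip()
        actual = pluralize(singular)
        if actual != expected:
            mismatches.append((singular, expected, actual))
    return mismatches


# Toy data and a naive pluralizer, both made up for this example.
SAMPLE = "cat,cats\nhero,heroes\ntooth,teeth\n"


def naive(word: str) -> str:
    return word + "s"


for singular, expected, actual in find_mismatches(SAMPLE, naive):
    print(f"{singular}: expected {expected}, got {actual}")
```

Passing the pluralizer in as a function keeps the check independent of any particular library.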

The regex dictionary is also really awesome for double checking your rules. I wanted to check my rule for "hoof" and other double-vowel-f's, so I plugged in ([aeiou])\1f$ and it worked perfectly.
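
To show what that pattern does and does not catch, here is a small Python check (the word list is just an example). The backreference `\1` requires the same vowel twice, so "hoof" matches but "leaf" and "loaf" do not:

```python
import re

# The same vowel twice, followed by "f" at the end of the word.
DOUBLE_VOWEL_F = re.compile(r"([aeiou])\1f$")

words = ["hoof", "leaf", "loaf", "thief", "reef", "roof"]
matches = [w for w in words if DOUBLE_VOWEL_F.search(w)]
print(matches)
```

Note that the pattern also matches "reef" and "roof", whose usual plurals are "reefs" and "roofs", so a rule built on it would still need an exception list.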

MehdiK · 2014-04-28T05:22:01Z

Thanks @nemec.

I cannot be sure about these, to be honest. I see different things all over the place; e.g. for octopus (and there are some discussions around virus in there too), from Wikipedia:

> There are three plural forms of octopus: octopuses [ˈɒktəpəsɪz], octopi [ˈɒktəpaɪ], and octopodes [ˌɒkˈtəʊpədiːz]. Currently, octopuses is the most common form in the UK as well as the US; octopodes is rare, and octopi is often objectionable.

I guess what we could do is provide a solid baseline for the rules by adding some of the missing ones and fixing the existing ones, and then open up the API so users can plug in new entries or override the behavior. Thoughts?

nemec · 2014-04-28T05:36:27Z

You could add in support for returning a list of matches in the case where multiple alternatives are equally viable, or maybe rank the forms based on Google N-gram hits, but what we really need to do is ban most existing languages and forbid people from speaking languages that don't fit a context-free grammar ;)

For the opposite direction, plural -> singular, it would be cool if it was comprehensive enough to accept all forms, even disputed ones, but that may not be feasible.
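
A sketch of what I mean, in Python rather than Humanizer's C# (the table contents and the `ACCEPTED_PLURALS`, `SINGULAR_OF` and `singularize` names are made up for illustration): every accepted plural, disputed or not, maps back to a single singular.

```python
# Hypothetical table: each singular lists every plural form we choose to accept.
ACCEPTED_PLURALS = {
    "octopus": ["octopuses", "octopi", "octopodes"],
    "memorandum": ["memorandums", "memoranda"],
    "focus": ["foci", "focuses"],
}

# Invert the table once so each lookup is a single dictionary access.
SINGULAR_OF = {
    plural: singular
    for singular, plurals in ACCEPTED_PLURALS.items()
    for plural in plurals
}


def singularize(word: str) -> str:
    """Return the singular for any accepted plural; unknown words come back unchanged."""
    return SINGULAR_OF.get(word.lower(), word)
```

The many-to-one direction is the easy part; the table only gets awkward if the same plural could belong to two different singulars.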

Supporting plugins for new entries is definitely a great idea. Giving users the ability to extend it with regional dialects (you -> y'all, for example) or things like pop culture could make the library feel more intelligent, even if those additions aren't strictly necessary.
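
Roughly, and only as a hypothetical Python sketch (the `Pluralizer` class and `add_irregular` method are invented here and are not Humanizer's API), user-supplied entries could sit in a layer that is checked before the built-in ones:

```python
class Pluralizer:
    """Hypothetical extensible pluralizer: user entries win over built-in ones."""

    def __init__(self) -> None:
        self._builtin = {"foot": "feet", "tooth": "teeth"}
        self._custom = {}

    def add_irregular(self, singular: str, plural: str) -> None:
        # Registering a word that is already built in overrides the built-in form.
        self._custom[singular.lower()] = plural

    def pluralize(self, word: str) -> str:
        key = word.lower()
        if key in self._custom:
            return self._custom[key]
        if key in self._builtin:
            return self._builtin[key]
        return word + "s"  # naive fallback, for illustration only


p = Pluralizer()
p.add_irregular("you", "y'all")          # regional dialect entry
p.add_irregular("octopus", "octopodes")  # a user's preferred disputed form
```

Keeping custom entries in their own layer means the built-in table never has to take a side on disputed forms.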

MehdiK · 2014-04-28T18:48:20Z

> You could add in support for returning a list of matches in the case where multiple alternatives are equally viable, or maybe rank the forms based on Google N-gram hits

This is not possible, as it would be a huge breaking change that would impact not only these methods but also things like ToQuantity, which calls them under the hood.

> For the opposite direction, plural -> singular, it would be cool if it was comprehensive enough to accept all forms, even disputed ones, but that may not be feasible.

Good idea; although the mapping might get a bit complicated. Definitely worth considering.

Show me the PR :)

SimonCropp · 2024-03-03T10:10:59Z

seems nothing to action here

MehdiK mentioned this issue Apr 12, 2014

Localize Pluralize/Singularize (WAS: Localizable InflectorExtensions) #197

Open

MehdiK changed the title ~~Consider using built in .NET for underlying pluralization~~ Should fix and extend the pluralization rules (WAS: Consider using built in .NET for underlying pluralization) Apr 28, 2014

SimonCropp closed this as completed Mar 3, 2024
