Should fix and extend the pluralization rules (WAS: Consider using built in .NET for underlying pluralization) #142

Closed
ghost opened this issue Apr 9, 2014 · 11 comments

Comments

@ghost

ghost commented Apr 9, 2014

The built-in .NET PluralizationService has had more work put into correctly pluralizing most English words. Please consider using it as the service behind the Pluralize feature of Humanizer.

http://referencesource.microsoft.com/#System.Data.Entity.Design/Entity/Design/PluralizationService/EnglishPluralizationService.cs

I realize it is in an assembly you probably do not want to require as a dependency, but perhaps some notes can be taken from it.

@MehdiK
Member

MehdiK commented Apr 9, 2014

Thanks. I seriously considered using that for singularization and pluralization, but as you said, I really didn't want to depend on a different package, particularly one so commonly disliked by the community.

One of the reasons I don't want to depend on another package is that Humanizer has an aggressive release cycle: releasing new features and patches on average every week. So if a bug is found in anything, including the pluralization (which is obviously a bit unlikely), I want to be able to turn it around very quickly and not have to wait for the release of a third-party library. Also, this way I can add new features more freely.

FWIW this feature has been there for a few months now and I haven't received any issues on it. So I think it should be good.

Do you have an example of this working incorrectly?

@ghost
Author

ghost commented Apr 9, 2014

Sure, it works for most words, but words which pluralize irregularly are hit and miss. I don't blame you for avoiding Entity of course.

Here are a few I discovered while checking some irregular words.

| singular | plural returned | correct plural |
| --- | --- | --- |
| atlas | atlas | atlases |
| cod | cods | cod |
| domino | dominos | dominoes |
| echo | echos | echoes |
| hero | heros | heroes |
| hoof | hoofs | hooves |
| iris | iris | irises |
| leaf | leafs | leaves |
| loaf | loafs | loaves |
| motto | mottos | mottoes |
| reflex | reflices | reflexes |
| sheaf | sheafs | sheaves |
| syllabus | syllabuses | syllabi |
| thief | thiefs | thieves |
| waltz | waltzs | waltzes |
| gas | gas | gases |
| cactus | cactus | cacti |
| focus | focus | foci/focuses |
| nucleus | nucleus | nuclei |
| radius | radius | radii |
| stimulus | stimulus | stimuli |
| appendix | appendixes | appendices |
| beau | beaus | beaux |
| corpus | corpus | corpora |
| criterion | criterions | criteria |
| curriculum | curriculums | curricula |
| genus | genus | genera |
| memorandum | memorandums | memoranda |
| offspring | offsprings | offspring |
| foot | foots | feet |
| tooth | tooths | teeth |
| nebula | nebulas | nebulae |
| vertebra | vertebras | vertebrae |

These aren't the most common words in application development, but I like to be sure I don't have to double-check whether a word will work when using a function.

Thanks for your attention.

@MehdiK
Member

MehdiK commented Apr 9, 2014

Oops! That's quite a few for a quick check! D'oh. Thanks for your effort.

Perhaps an unfair question, but do you think the issues may be mostly around irregular nouns? Because that should be relatively easy to fix.

@ghost
Author

ghost commented Apr 9, 2014

No worries. I went directly to irregular nouns to test, so I'm not sure if the issues are specific to these or if there are deeper problems.

Even if there are other issues, fixing the irregular ones is as simple as adding them to the irregular word dictionary in the code... it's a step in the right direction.

I wish there were a place to get dictionaries of nicely formatted data to test against. (Is there?)
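The "irregular word dictionary" idea can be sketched as a lookup that runs before any suffix rule. This is a minimal illustration in Python, not Humanizer's actual C# code: the names `IRREGULAR_PLURALS`, `SUFFIX_RULES`, and `pluralize` are hypothetical, and the rule list is deliberately tiny.

```python
import re

# Hypothetical irregular table, consulted before any suffix rule.
IRREGULAR_PLURALS = {
    "foot": "feet",
    "tooth": "teeth",
    "criterion": "criteria",
    "offspring": "offspring",
    "cod": "cod",
}

# Deliberately small rule list, tried in order; the first match wins.
SUFFIX_RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),   # atlas -> atlases, waltz -> waltzes
    (re.compile(r"([^aeiou])y$"), r"\1ies"),    # city -> cities
    (re.compile(r"$"), "s"),                    # fallback: append "s"
]


def pluralize(word):
    """Return a plural for `word` (lowercased; case handling is out of scope)."""
    lowered = word.lower()
    if lowered in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[lowered]
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(lowered):
            return pattern.sub(replacement, lowered, count=1)
    return lowered
```

With this ordering, adding a newly discovered irregular noun is a one-line change to the table and cannot be shadowed by a suffix rule.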

@MehdiK
Member

MehdiK commented Apr 9, 2014

Thanks. Yeah, I think we should at least do that. Wanna send me a PR for it? :p

hehehe, getting a dictionary and iterating over it with Humanizer is a good idea :) If you find that solution we could also complete an abandoned kick-arse feature in Humanizer.
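Iterating over a word list could look roughly like the following. It is a sketch only, assuming a two-column `singular,plural` CSV; `find_mismatches`, `naive_pluralize`, and `SAMPLE` are made-up names, and `naive_pluralize` is a stand-in for whatever function is under test.

```python
import csv
import io


def find_mismatches(pluralize, csv_text):
    """Return (singular, expected, actual) for every row the function gets wrong."""
    mismatches = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        singular, expected = (cell.strip() for cell in row)
        actual = pluralize(singular)
        if actual != expected:
            mismatches.append((singular, expected, actual))
    return mismatches


def naive_pluralize(word):
    # Stand-in for the function under test: just appends "s".
    return word + "s"


SAMPLE = """\
hero,heroes
leaf,leaves
cat,cats
tooth,teeth
"""

for singular, expected, actual in find_mismatches(naive_pluralize, SAMPLE):
    print(f"{singular}: expected {expected}, got {actual}")
```

Against this sample the stand-in fails on hero, leaf, and tooth and passes on cat, which is the kind of report the table earlier in the thread was assembled from by hand.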

@hazzik
Member

hazzik commented Apr 14, 2014

There are some errors in the table you provided.

"memorandum" => correct plurals are "memorandums" and "memoranda"

@nemec

nemec commented Apr 27, 2014

Here's a csv that I put together that merges the existing tests, the above table, and other cases that I came across when implementing this pluralization algorithm: https://gist.github.com/nemec/201f6e2b2af3a4390f0b

Humanizer won't open in MonoDevelop (too lazy to boot into Windows), so I haven't yet tried running Humanizer against the dataset.

Note that there are a couple of existing plural tests that are totally wrong, like virus, which have been corrected. When there was more than one valid plural option, I picked the one that made my matching code simpler -- I don't think I changed any of the existing tests, but the new ones I've added may not be picked up by Humanizer's current ruleset.

The regex dictionary is also really awesome for double checking your rules. I wanted to check my rule for "hoof" and other double-vowel-f's, so I plugged in ([aeiou])\1f$ and it worked perfectly.
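The same kind of check can be reproduced locally against any word list. A small sketch using the pattern quoted above; the word list and the `matching_words` helper are made up for illustration.

```python
import re

# Pattern from the comment: a doubled vowel followed by a final "f".
DOUBLE_VOWEL_F = re.compile(r"([aeiou])\1f$")


def matching_words(words, pattern=DOUBLE_VOWEL_F):
    """Return the words that the pattern matches, in input order."""
    return [word for word in words if pattern.search(word)]


CANDIDATES = ["hoof", "roof", "beef", "leaf", "thief", "loaf", "reef", "cliff"]
print(matching_words(CANDIDATES))  # ['hoof', 'roof', 'beef', 'reef']
```

Note that the matches include both hoof (hooves) and roof (roofs), so a rule keyed on this pattern would still need exceptions.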

@MehdiK
Member

MehdiK commented Apr 28, 2014

Thanks @nemec.

I can't be sure about these, to be honest. I see different things all over the place; e.g. for octopus (and there is some discussion around virus in there too), from Wikipedia:

> There are three plural forms of octopus: octopuses [ˈɒktəpəsɪz], octopi [ˈɒktəpaɪ], and octopodes [ˌɒkˈtəʊpədiːz]. Currently, octopuses is the most common form in the UK as well as the US; octopodes is rare, and octopi is often objectionable.

I guess what we could do is provide a solid baseline by adding some of the missing rules and fixing the existing ones, and then open up the API so users can plug in new entries or override the behavior. Thoughts?
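One possible shape for such an API, sketched in Python purely for illustration: the `Vocabulary` class and its `add_irregular` method are hypothetical names, not an existing Humanizer interface, and the fallback rule is a placeholder.

```python
class Vocabulary:
    """Hypothetical registry: baseline entries plus user-supplied overrides."""

    def __init__(self, baseline=None):
        self._plurals = dict(baseline or {})

    def add_irregular(self, singular, plural):
        # A later registration replaces an earlier one,
        # so callers can override the baseline.
        self._plurals[singular.lower()] = plural

    def pluralize(self, word):
        # Placeholder fallback; a real rule set would go here.
        return self._plurals.get(word.lower(), word + "s")


vocab = Vocabulary({"octopus": "octopuses", "virus": "viruses"})
vocab.add_irregular("octopus", "octopodes")  # user overrides a baseline entry
vocab.add_irregular("person", "people")      # user adds a missing entry
```

Keeping the baseline and the overrides in one table means disputed plurals such as octopus can ship with a default while still leaving the final choice to the caller.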

MehdiK changed the title from "Consider using built in .NET for underlying pluralization" to "Should fix and extend the pluralization rules (WAS: Consider using built in .NET for underlying pluralization)" on Apr 28, 2014
@nemec

nemec commented Apr 28, 2014

You could add in support for returning a list of matches in the case where multiple alternatives are equally viable, or maybe rank the forms based on Google N-gram hits, but what we really need to do is ban most existing languages and forbid people from speaking languages that don't fit a context-free grammar ;)

For the opposite direction, plural -> singular, it would be cool if it was comprehensive enough to accept all forms, even disputed ones, but that may not be feasible.

Plugins for new entries are definitely a great idea. Giving users the ability to extend it with regional dialects (you -> y'all, for example) or things like pop culture could make the library feel more intelligent, even if those additions aren't strictly necessary.

@MehdiK
Member

MehdiK commented Apr 28, 2014

> You could add in support for returning a list of matches in the case where multiple alternatives are equally viable, or maybe rank the forms based on Google N-gram hits

This is not possible, as it would be a huge breaking change that impacts not only these methods but also things like ToQuantity, which calls this under the hood.

> For the opposite direction, plural -> singular, it would be cool if it was comprehensive enough to accept all forms, even disputed ones, but that may not be feasible.

Good idea; although the mapping might get a bit complicated. Definitely worth considering.

Show me the PR :)
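For what it's worth, the two constraints above can coexist: pluralization keeps returning a single string, while singularization accepts every listed form. A rough Python sketch, with hypothetical data and function names and a placeholder fallback rule:

```python
# Hypothetical data: each singular lists every plural the singularizer
# should accept, with the preferred form first.
PLURAL_FORMS = {
    "octopus": ["octopuses", "octopi", "octopodes"],
    "memorandum": ["memorandums", "memoranda"],
    "focus": ["focuses", "foci"],
}

# Invert the table into a plural -> singular lookup.
SINGULAR_OF = {
    plural: singular
    for singular, plurals in PLURAL_FORMS.items()
    for plural in plurals
}


def pluralize(word):
    """Still returns exactly one string, so existing callers are unaffected."""
    forms = PLURAL_FORMS.get(word.lower())
    return forms[0] if forms else word + "s"  # placeholder fallback rule


def singularize(word):
    """Accepts any listed plural, disputed or not; unknown words pass through."""
    return SINGULAR_OF.get(word.lower(), word)
```

The mapping only gets complicated when one plural belongs to two singulars (for example, axes for both axis and axe); this sketch does not handle that case, and the last entry written to `SINGULAR_OF` would win.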

@SimonCropp
Collaborator

Seems there's nothing to action here.
