
To lemmatize or not to lemmatize? #1

Open
atsyplenkov opened this issue Oct 23, 2024 · 2 comments

Comments

@atsyplenkov

Hi, guys. Thank you for your research. It is extremely interesting and valuable for the community, and I mean it! I am curious why you didn't use lemmatization or stemming of the words prior to analysis. Is it only due to the increased computational cost, or is there another reason I am missing?

From my perspective, your current approach may potentially underestimate the frequency ratio of some words. For example, from Figure 2, it is clear that the frequency of the word "delve" should be higher, as both "delves" and "delved" are presented in the figure.
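To make the underestimation concern concrete, here is a minimal sketch (not the paper's actual pipeline) that counts surface forms separately versus after collapsing them to a shared lemma. The `lemma_map` is a hand-written toy mapping used only for this illustration; a real analysis would use a proper lemmatizer such as NLTK's WordNetLemmatizer.

```python
from collections import Counter

# Toy token stream containing several inflected forms of "delve"
tokens = ["delve", "delves", "delved", "delve", "delves"]

# Hypothetical lemma map for this example only; a real pipeline would
# derive lemmas with a lemmatizer rather than a hard-coded dictionary.
lemma_map = {"delves": "delve", "delved": "delve"}

# Counting surface forms keeps "delve", "delves", "delved" separate
surface_counts = Counter(tokens)

# Counting lemmas pools all inflected forms under "delve"
lemma_counts = Counter(lemma_map.get(t, t) for t in tokens)

print(surface_counts["delve"])  # 2
print(lemma_counts["delve"])    # 5
```

The lemma-based count (5) exceeds the surface-form count (2), which is exactly the frequency-ratio underestimation described above.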

I am asking because I am planning to conduct similar research with Earth Science manuscripts, finding excess words specific to my domain.

@dkobak
Contributor

dkobak commented Oct 23, 2024

To be honest, the main reason was "for simplicity", but one secondary reason was that we thought it might actually be interesting to look at all forms separately -- e.g. "delves", "delved", and "delve" may increase their usage by different amounts (because ChatGPT may prefer a specific form particularly often).

In retrospect I think it would actually be more sensible to lemmatize everything. We may change the analysis in future revisions, or possibly add a supplementary analysis with/without lemmatization. Depends also on how the peer review process will go.

It should be relatively straightforward, something like the code given here: https://scikit-learn.org/stable/modules/feature_extraction.html#tips-and-tricks:

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # Tokenize the document and map each token to its lemma
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())

Let me know if you try it out!

@atsyplenkov
Author

Thanks for that, I will let you know! Good luck with the peer review.
