
To lemmatize or not to lemmatize? #1

Open
atsyplenkov opened this issue Oct 23, 2024 · 2 comments

Comments

@atsyplenkov

Hi, guys. Thank you for your research. It is extremely interesting and valuable for the community, and I mean it! I am curious why you didn't use lemmatization or stemming of the words prior to analysis. Is it only due to the increased computational cost, or is there another reason I am missing?

From my perspective, your current approach may potentially underestimate the frequency ratio of some words. For example, from Figure 2, it is clear that the frequency of the word "delve" should be higher, as both "delves" and "delved" are presented in the figure.
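To make the underestimation concern concrete, here is a minimal sketch (not the paper's actual pipeline) that counts surface forms separately versus after collapsing them to a shared lemma. The `lemma_map` is a hand-written toy mapping used only for this illustration; a real analysis would use a proper lemmatizer such as NLTK's WordNetLemmatizer.

```python
from collections import Counter

# Toy token stream containing several inflected forms of "delve"
tokens = ["delve", "delves", "delved", "delve", "delves"]

# Hypothetical lemma map for this example only; a real pipeline would
# derive lemmas with a lemmatizer rather than a hard-coded dictionary.
lemma_map = {"delves": "delve", "delved": "delve"}

# Counting surface forms keeps "delve", "delves", "delved" separate
surface_counts = Counter(tokens)

# Counting lemmas pools all inflected forms under "delve"
lemma_counts = Counter(lemma_map.get(t, t) for t in tokens)

print(surface_counts["delve"])  # 2
print(lemma_counts["delve"])    # 5
```

The lemma-based count (5) exceeds the surface-form count (2), which is exactly the frequency-ratio underestimation described above.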

I am asking because I am planning to conduct similar research with Earth Science manuscripts, finding excess words specific to my domain.

@dkobak
Contributor

dkobak commented Oct 23, 2024

To be honest, the main reason was "for simplicity", but one secondary reason was that we thought it might actually be interesting to look at all forms separately -- e.g. "delves", "delved", and "delve" may increase their usage by different amounts (because ChatGPT may prefer a specific form particularly often).

In retrospect I think it would actually be more sensible to lemmatize everything. We may change the analysis in future revisions, or possibly add a supplementary analysis with/without lemmatization. Depends also on how the peer review process will go.

It should be relatively straightforward, something like the code given here: https://scikit-learn.org/stable/modules/feature_extraction.html#tips-and-tricks:

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        # Tokenize the document and map each token to its lemma
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())

Let me know if you try it out!

@atsyplenkov
Author

Thanks for that, I will let you know! Good luck with the peer review.
