Error calculating coherence score for BERTopic model trained on Indic language #120

Open
sanketshinde0707 opened this issue Jan 10, 2024 · 1 comment


@sanketshinde0707

  • OCTIS version: 1.13.1
  • Python version: 3.10.12
  • Operating System: Google Colab

Description

I am working with BERTopic and trying to evaluate topic models trained on Marathi (an Indic language) using some metrics. I found evaluation code written by MaartenGr (the author of BERTopic), but I was unable to install the dependencies for the setup he describes here (https://github.com/MaartenGr/BERTopic_evaluation/tree/main). The author recommended using OCTIS, as it provides more metrics. I tried calculating the topic diversity and NPMI scores. Topic diversity is calculated successfully, but I keep getting an error when calculating the NPMI score.

Here is my code

from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity

# This is what the sentence array looks like
sentence_array = ['तीन दिवस झाले, पण गाडी अजून सापडली नाही. पोलिसांचा कडक तपास सुरु आहे.', 'डाळी भारतीय थाळीमध्ये सामील असलेले मुख्य भोजन आहेत.']

# This is what the topics look like
topics_list = [
['ठाकरे', 'एक', 'भारतीय', 'दिवस', 'शिंदे', 'सांगितले', 'दोन', 'माहिती', 'देण्यात', 'जात'],
['भारतीय', 'शिंदे', 'ठाकरे', 'मुख्यमंत्री', 'उद्धव', 'एक', 'पोलीस', 'धावा', 'दोन', 'सरकार'],
['देण्यात', 'फोन', 'डेटा', 'कॅमेरा', 'स्मार्टफोन', 'सादर', 'डिस्प्ले', 'सेन्सर', 'सपोर्ट', 'बॅटरी']
]

octis_texts = [sentence_array]
npmi = Coherence(texts = octis_texts, topk = 10, measure = 'c_npmi')
octis_output = {"topics": topics_list}
topic_diversity = TopicDiversity(topk=10)

topic_diversity_score = topic_diversity.score(octis_output)
print("Topic diversity: "+str(topic_diversity_score))

npmi_score = npmi.score(octis_output)
print("Coherence: "+str(npmi_score))

Error

This is the error I get.

Topic diversity: 0.8857142857142857
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-c000efdb667a> in <cell line: 5>()
      3 print("Topic diversity: "+str(topic_diversity_score))
      4 
----> 5 npmi_score = npmi.score(octis_output)
      6 print("Coherence: "+str(npmi_score))

3 frames
/usr/local/lib/python3.10/dist-packages/gensim/models/coherencemodel.py in _ensure_elements_are_ids(self, topic)
    452             return np.array(ids_from_ids)
    453         else:
--> 454             raise ValueError('unable to interpret topic as either a list of tokens or a list of ids')
    455 
    456     def _update_accumulator(self, new_topics):

ValueError: unable to interpret topic as either a list of tokens or a list of ids

Can anyone point out what exactly is wrong here, and how I can evaluate BERTopic models trained on Indic languages?

Thanks.

@jiezhao2002

Hey, I've encountered the same issue. Have you resolved it yet?
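
One possible cause, offered as a sketch rather than a confirmed fix: `octis_texts = [sentence_array]` gives `Coherence` a single document whose "tokens" are whole sentences. gensim's `CoherenceModel` (which the traceback shows being reached from `npmi.score`) appears to expect a list of tokenized documents, and seems to raise this `ValueError` when it cannot match a topic's words against the dictionary built from `texts`. The snippet below uses only the standard library to tokenize the sample sentences and report which topics have no words in the resulting vocabulary. The `tokenize` helper and its punctuation list are hypothetical; a proper Marathi tokenizer, ideally the same preprocessing used when fitting BERTopic, would likely be a better choice.

```python
# Sample documents from the issue: raw sentences, not yet tokenized.
sentence_array = [
    'तीन दिवस झाले, पण गाडी अजून सापडली नाही. पोलिसांचा कडक तपास सुरु आहे.',
    'डाळी भारतीय थाळीमध्ये सामील असलेले मुख्य भोजन आहेत.',
]

topics_list = [
    ['ठाकरे', 'एक', 'भारतीय', 'दिवस', 'शिंदे', 'सांगितले', 'दोन', 'माहिती', 'देण्यात', 'जात'],
    ['भारतीय', 'शिंदे', 'ठाकरे', 'मुख्यमंत्री', 'उद्धव', 'एक', 'पोलीस', 'धावा', 'दोन', 'सरकार'],
    ['देण्यात', 'फोन', 'डेटा', 'कॅमेरा', 'स्मार्टफोन', 'सादर', 'डिस्प्ले', 'सेन्सर', 'सपोर्ट', 'बॅटरी'],
]

# Assumption: this punctuation set (including the Devanagari danda) is enough
# for the sample text. It is illustrative, not a complete Marathi tokenizer.
PUNCTUATION = ".,!?;:।\"'()[]"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def tokenize(text):
    """Hypothetical helper: replace punctuation with spaces, then split on whitespace.

    Splitting on whitespace (rather than a regex such as \\w+) keeps Devanagari
    combining marks attached to their base characters.
    """
    return text.translate(_PUNCT_TABLE).split()


# One list of tokens per document, instead of one list holding whole sentences.
tokenized_texts = [tokenize(doc) for doc in sentence_array]
vocabulary = {token for doc in tokenized_texts for token in doc}

# Indices of topics that share no word with the corpus vocabulary.
uncovered_topics = [
    index
    for index, topic in enumerate(topics_list)
    if not any(word in vocabulary for word in topic)
]

print("Tokens per document:", [len(doc) for doc in tokenized_texts])
print("Topics with no words in the vocabulary:", uncovered_topics)

# Assumed follow-up, not run here:
# npmi = Coherence(texts=tokenized_texts, topk=10, measure='c_npmi')
```

If every topic has at least one word in the vocabulary built from the full corpus, passing `tokenized_texts` as `texts` to `Coherence` may resolve the error. With only the two sample sentences above, the third topic would still have no matching words, so the check needs to be run against the complete set of documents.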
