- We are trying to identify near-duplicate questions (Qs) in the CMS.
- This will improve the student experience by ensuring students do not repeatedly face identical or nearly identical Qs.
- To do this, we use a similarity measure (cosine similarity) to identify potential pairs of duplicate Qs.
- In the future, we can use a machine learning model trained on labelled data to identify duplicates.
- The analysis is performed using NLP packages in R (text2vec and tm); the R Markdown script is included in this repository, along with the text repository fetched from the CMS.
- The identified duplicate Qs have been added to, and verified in, this Google Sheet: https://docs.google.com/spreadsheets/d/1j5gmb8l2U2974yS0WF2wY6w7ZTU-Iz7E0lOXQqbSe6c/edit?usp=sharing
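
The cosine-similarity step described above can be illustrated with a minimal sketch. This is not the repository's R Markdown script (which uses text2vec and tm); it is a standard-library Python illustration under stated assumptions: questions are represented as TF-IDF-weighted bags of words (the weighting scheme is an assumption, since only cosine similarity is named above), and the function names and the `0.8` threshold are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicate_pairs(questions, threshold=0.8):
    """Return (i, j, score) for every pair scoring at or above the threshold.

    The default threshold is a hypothetical value, not one taken from the analysis.
    """
    vectors = tfidf_vectors(questions)
    pairs = []
    for i, j in combinations(range(len(questions)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    # Highest-scoring candidate pairs first, for manual verification.
    return sorted(pairs, key=lambda p: -p[2])


# Made-up example questions, for illustration only.
questions = [
    "What is the capital of France?",
    "What is the capital city of France?",
    "Solve for x: 2x + 3 = 7",
]
candidate_pairs = near_duplicate_pairs(questions, threshold=0.8)
```

Comparing every pair is quadratic in the number of questions, which is acceptable for a sketch; candidate pairs returned this way would still need manual verification, as was done in the Google Sheet above.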
peerlearning/de-duplication_CMS