Pivoted normalization for tfidf model #220
Comments
Sure, sounds interesting. As long as we make it optional (and the default stays the "standard tfidf"), no harm in trying other methods. Send a PR, I'll try to comment & suggest improvements and optimizations. |
@jotaefea Are you still interested in adding this feature? |
Hi Lev, it's been a while. But I'll be happy to send some code for
On Saturday, January 23, 2016, Lev Konstantinovskiy <
|
@jotaefea Will be happy to accept a pivoted normalization PR |
Thank you Lev. I'll try to get something out to you soon.
On Sun, Jan 24, 2016 at 12:07 AM, Lev Konstantinovskiy <
|
I use sklearn's TfidfTransformer and cosine_similarity with Euclidean normalization, and I am running into problems with similarity calculation for long documents, which probably share many of the possible tokens with other documents just because they are long. I am looking for an implementation of pivoted normalization. |
Add to wiki ideas page |
@tmylk is that a note to self, or a suggestion for somebody else to add it? |
@javier-artiles Hello, right now @markroxor is working on this task and we are stuck on the evaluation step. Have you ever worked with the TREC dataset? |
Hi @menshikh-iv, I'm afraid I have not worked on TREC tasks. During my years as a researcher I mostly organized my own competitive evaluation (WePS) for document clustering. |
@javier-artiles thanks, probably we'll try to evaluate pivot normalization using your WePS. |
I would suggest using it on more standard information retrieval tasks (ad hoc retrieval, for instance). It may be easier to interpret the results that way. Of course, it all depends on your end goal for this evaluation. |
Also, I apologize that I never got around to sending a PR for this feature. To be completely sincere, I have not been a contributor to open source. Personal and private projects have always kept me too busy. |
@javier-artiles If you could recommend concrete datasets & tasks, it would be really helpful for us! |
It would depend on what your end goal is. Are you just trying to test an implementation of this normalization scheme? If that is the case, you can write simple unit tests with minimal data. As for the tasks themselves, any standard ad hoc IR task would make sense IMHO. |
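As a rough illustration of the "simple unit tests with minimal data" idea, here is a minimal sketch. It assumes the usual pivoted formula, `(1 - slope) * pivot + slope * old_norm`; `pivoted_norm` is a hypothetical helper written for this example, not a gensim function, and the numbers are made up.

```python
import unittest


def pivoted_norm(old_norm, pivot, slope):
    """Hypothetical helper: pivoted normalization factor for one document.

    Assumes the formula (1 - slope) * pivot + slope * old_norm.
    """
    return (1.0 - slope) * pivot + slope * old_norm


class PivotedNormTest(unittest.TestCase):
    def test_unchanged_at_pivot(self):
        # A document whose old norm equals the pivot keeps the same factor.
        self.assertAlmostEqual(pivoted_norm(3.0, pivot=3.0, slope=0.2), 3.0)

    def test_slope_one_keeps_old_norm(self):
        # slope == 1 falls back to the original normalization.
        self.assertAlmostEqual(pivoted_norm(7.5, pivot=3.0, slope=1.0), 7.5)

    def test_long_document_gets_smaller_factor(self):
        # Above the pivot, the factor is smaller than the old norm.
        self.assertLess(pivoted_norm(10.0, pivot=3.0, slope=0.2), 10.0)

    def test_short_document_gets_larger_factor(self):
        # Below the pivot, the factor is larger than the old norm.
        self.assertGreater(pivoted_norm(1.0, pivot=3.0, slope=0.2), 1.0)
```

Saved as a module, these can be run with `python -m unittest`. They only check properties of the formula itself, so they say nothing about retrieval quality.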
* pivot normalization
* verify weights
* verify weights
* smartirs ready
* change old tests
* remove lambdas
* address suggestions
* minor fix
* pep8 fix
* pep8 fix
* fix pickle problem
* flake8 fix
* fix bug in docstring
* added few tests
* fix normalize issue for pickling
* fix normalize issue for pickling
* test without sklearn api
* hanging idents and new tests
* add docstring
* add docstring
* pivotized normalization
* better way cmparing floats
* merge develop
* added benchmarks
* address comments
* benchmarking
* testing pipeline
* pivoted normalisation
* taking overall norm
* Update tfidfmodel.py
* Update sklearn_api.ipynb
* tests for pivoted normalization
* results
* adding visualizations
* minor nb changes
* minor nb changes
* removed self.pivoted_normalisation
* Update test_tfidfmodel.py
* minor suggestions
* added description
* added description
* last commit
* cleanup
* cosmetic fixes
* changed pivot
* changed pivot
I was wondering if this is of interest to other gensim users. When dealing with a wide range of document sizes, pivoted normalization allows finer tuning of the weight given to terms relative to document size. This is crucial in some document similarity scenarios.
I'm currently testing a modification to the tfidf model code to add this feature. Unfortunately my numpy/scipy knowledge is limited, and right now I have a very inefficient implementation.
A couple of references on the topic:
http://singhal.info/pivoted-dln.pdf
http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
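For readers unfamiliar with the scheme, here is a minimal pure-Python sketch of the idea from the references above. It is not the modification I am testing, and not gensim code. It assumes sparse vectors stored as `{term: weight}` dicts, Euclidean length as the "old" normalization, and the mean old norm as the default pivot; the function names and `slope=0.75` are illustrative choices only.

```python
import math


def l2_norm(weights):
    """Euclidean length of a sparse vector given as {term: weight}."""
    return math.sqrt(sum(w * w for w in weights.values()))


def pivoted_normalize(vectors, slope=0.75, pivot=None):
    """Divide each vector by (1 - slope) * pivot + slope * old_norm.

    `vectors` is a list of {term: weight} dicts. If `pivot` is None, the
    mean of the old norms is used (an assumption made for this sketch).
    """
    norms = [l2_norm(v) for v in vectors]
    if pivot is None:
        pivot = sum(norms) / len(norms)
    result = []
    for vec, norm in zip(vectors, norms):
        denom = (1.0 - slope) * pivot + slope * norm
        if denom == 0.0:
            # Nothing sensible to divide by; leave the vector unscaled.
            result.append(dict(vec))
        else:
            result.append({t: w / denom for t, w in vec.items()})
    return result


# Toy data: one short and one long document (made-up weights).
docs = [{"a": 1.0}, {"a": 3.0, "b": 4.0}]
normalized = pivoted_normalize(docs, slope=0.5)
```

With `slope=1.0` this reduces to plain cosine (unit-length) normalization. With a slope below 1, documents whose old norm is above the pivot are divided by a smaller factor than before, and documents below the pivot by a larger one, which is the tuning knob described above.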