
Pivoted normalization for tfidf model #220

Closed
javier-artiles opened this issue Jul 15, 2014 · 15 comments · Fixed by #1780
Labels: difficulty medium, feature, wishlist

Comments

@javier-artiles

I was wondering if this is of interest to other gensim users. When dealing with a wide range of document sizes pivoted normalization allows a finer tuning of the weight given to terms relative to document size. This is crucial in some document similarity scenarios.

I'm currently testing a modification of the tfidf model code to add this feature. Unfortunately, my NumPy/SciPy knowledge is limited, and right now I have a very inefficient implementation.

A couple of references on the topic:
http://singhal.info/pivoted-dln.pdf
http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
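For context, the core idea in the Singhal paper above is to replace the cosine norm ||d|| (the usual tf-idf divisor) with the pivoted value (1 - slope) * pivot + slope * ||d||, where the pivot is typically the average document norm over the collection. A minimal NumPy sketch of that scheme (function name and slope default are illustrative, not any particular library's API):

```python
import numpy as np

def pivoted_normalize(tfidf_rows, slope=0.65):
    """Divide each document's tf-idf vector by the pivoted norm
    (1 - slope) * pivot + slope * ||d||, where the pivot is the
    average document norm over the collection."""
    norms = np.array([np.linalg.norm(row) for row in tfidf_rows])
    pivot = norms.mean()
    divisors = (1.0 - slope) * pivot + slope * norms
    return [row / d for row, d in zip(tfidf_rows, divisors)]
```

With slope = 1.0 this reduces to plain cosine normalization; values below 1 shrink the divisor for documents longer than the pivot, which is the correction Singhal et al. found useful for retrieval.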


@piskvorky
Owner

Sure, sounds interesting. As long as we make it optional (and the default stays the "standard tfidf"), no harm in trying other methods.

Send a PR, I'll try to comment & suggest improvements and optimizations.
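One way to honor the "optional, default unchanged" constraint is to gate the pivoted divisor behind a parameter that defaults to off; a small illustrative sketch (names hypothetical, not the eventual gensim API):

```python
import numpy as np

def normalize(vec, pivot=None, slope=0.65):
    """Unit-normalize `vec`; when `pivot` is given, use the pivoted
    divisor instead. pivot=None keeps standard tf-idf behavior."""
    norm = float(np.linalg.norm(vec))
    if norm == 0.0:
        return vec
    if pivot is not None:
        norm = (1.0 - slope) * pivot + slope * norm
    return vec / norm
```

Keeping `pivot=None` as the default means existing models behave identically unless a user opts in.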

@tmylk
Contributor

tmylk commented Jan 23, 2016

@jotaefea Are you still interested in adding this feature?

@javier-artiles
Author

Hi Lev, it's been a while. But I'll be happy to send some code for review :-)


@tmylk
Contributor

tmylk commented Jan 24, 2016

@jotaefea Will be happy to accept a pivoted normalization PR

@javier-artiles
Author

Thank you Lev. I'll try to get something out to you soon.


@gabrielspmoreira

I use sklearn's TfidfTransformer and cosine_similarity (with Euclidean normalization), and I am running into problems with similarity calculation for long documents, which probably share many of the possible tokens with other documents just because they are long. I am looking for an implementation of pivoted normalization.
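For what it's worth, the slope parameter is exactly the knob for this length bias. A tiny sketch of the pivoted divisor with toy numbers (pure Python; sklearn's TfidfTransformer has no such option, so it would have to be applied by hand):

```python
def pivoted_divisor(norm, pivot, slope):
    # the divisor that replaces the plain document norm `norm`;
    # slope = 1.0 recovers standard cosine-style normalization
    return (1.0 - slope) * pivot + slope * norm

# toy numbers: collection pivot (average document norm) of 4.0;
# slope < 1 damps short documents and boosts long ones...
short = pivoted_divisor(2.0, pivot=4.0, slope=0.5)   # 3.0 > 2.0
long_ = pivoted_divisor(8.0, pivot=4.0, slope=0.5)   # 6.0 < 8.0
# ...while slope > 1 penalizes long documents harder than cosine does,
# which is the direction to try when long documents match too easily
harsher = pivoted_divisor(8.0, pivot=4.0, slope=1.5)  # 10.0 > 8.0
```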

@tmylk added the difficulty medium, feature, and wishlist labels Oct 6, 2016
@tmylk
Contributor

tmylk commented Oct 6, 2016

Add to wiki ideas page

@piskvorky
Owner

@tmylk is that a note to self, or a suggestion for somebody else to add it?

@menshikh-iv
Contributor

@javier-artiles hello, right now @markroxor is working on this task and we are stuck on the evaluation step. Have you ever worked with the TREC dataset?

@javier-artiles
Author

Hi @menshikh-iv, I'm afraid I have not worked on TREC tasks. During my researcher years I mostly organized my own competitive evaluation (WePS) for document clustering.

@menshikh-iv
Contributor

@javier-artiles thanks, we'll probably try to evaluate pivoted normalization using your WePS.

@javier-artiles
Author

I would suggest using it on more standard information retrieval tasks (ad hoc retrieval, for instance). It may be easier to interpret the results that way. Of course, it all depends on your end goal for this evaluation.

@javier-artiles
Author

Also, I apologize that I never got around to sending a PR for this feature. To be completely sincere, I have not been a contributor to open source; personal and private projects have always kept me too busy.

@menshikh-iv
Contributor

@javier-artiles If you could recommend concrete datasets & tasks, that would be really helpful for us!

@javier-artiles
Author

It would depend on what your end goal is. Are you just trying to test an implementation of this normalization scheme? If so, you can write simple unit tests with minimal data.
If you are looking for a model quality benchmark, then traditional dev/test datasets make sense.
TREC would be the way to go if you want to show public benchmark results that are easy to compare. Unfortunately, most of the TREC data I know of is not freely available; it is distributed through the Linguistic Data Consortium site at pretty steep fees (unless your research institution is footing the bill :-) ).
As an alternative, you can look into the CLEF collections. I can't tell for sure, but their data seems to be accessible after registration.
Finally, the datasets section of this page may give you some useful pointers: https://github.com/harpribot/awesome-information-retrieval#datasets

As for the tasks themselves, any standard ad hoc IR task would make sense IMHO.
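Following the "simple unit tests with minimal data" suggestion, a sketch of what such tests could look like (the `pivoted_norm` helper is hypothetical, standing in for whatever the PR implements):

```python
import unittest

def pivoted_norm(norm, pivot, slope):
    # hypothetical helper computing the pivoted divisor
    return (1.0 - slope) * pivot + slope * norm

class TestPivotedNormalization(unittest.TestCase):
    def test_reduces_to_plain_norm_at_pivot(self):
        # a document whose norm equals the pivot is unaffected
        self.assertAlmostEqual(pivoted_norm(4.0, pivot=4.0, slope=0.25), 4.0)

    def test_slope_one_is_plain_cosine(self):
        self.assertAlmostEqual(pivoted_norm(7.0, pivot=3.0, slope=1.0), 7.0)

    def test_short_docs_get_larger_divisor(self):
        # slope < 1: documents shorter than the pivot are damped
        self.assertGreater(pivoted_norm(2.0, pivot=4.0, slope=0.5), 2.0)

# run with: python -m unittest <this_module>
```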

menshikh-iv pushed a commit that referenced this issue Mar 13, 2018
* pivot normalization

* verify weights

* verify weights

* smartirs ready

* change old tests

* remove lambdas

* address suggestions

* minor fix

* pep8 fix

* pep8 fix

* fix pickle problem

* flake8 fix

* fix bug in docstring

* added few tests

* fix normalize issue for pickling

* fix normalize issue for pickling

* test without sklearn api

* hanging indents and new tests

* add docstring

* add docstring

* pivoted normalization

* better way of comparing floats

* merge develop

* added benchmarks

* address comments

* benchmarking

* testing pipeline

* pivoted normalisation

* taking overall norm

* Update tfidfmodel.py

* Update sklearn_api.ipynb

* tests for pivoted normalization

* results

* adding visualizations

* minor nb changes

* minor nb changes

* removed self.pivoted_normalisation

* Update test_tfidfmodel.py

* minor suggestions

* added description

* added description

* last commit

* cleanup

* cosmetic fixes

* changed pivot

* changed pivot