
Pivoted normalization for tfidf model #220

Closed
javier-artiles opened this issue Jul 15, 2014 · 15 comments · Fixed by #1780
Labels: difficulty medium, feature, wishlist

Comments

@javier-artiles

I was wondering if this is of interest to other gensim users. When dealing with a wide range of document sizes pivoted normalization allows a finer tuning of the weight given to terms relative to document size. This is crucial in some document similarity scenarios.

I'm currently testing a modification of the tfidf model code to add this feature. Unfortunately, my NumPy/SciPy knowledge is limited, and right now I have a very inefficient implementation.

A couple of references on the topic:
http://singhal.info/pivoted-dln.pdf
http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html
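For context, the core idea in the Singhal paper above is to replace the cosine norm ||d|| (the usual tf-idf divisor) with the pivoted value (1 - slope) * pivot + slope * ||d||, where the pivot is typically the average document norm over the collection. A minimal NumPy sketch of that scheme (function name and slope default are illustrative, not any particular library's API):

```python
import numpy as np

def pivoted_normalize(tfidf_rows, slope=0.65):
    """Divide each document's tf-idf vector by the pivoted norm
    (1 - slope) * pivot + slope * ||d||, where the pivot is the
    average document norm over the collection."""
    norms = np.array([np.linalg.norm(row) for row in tfidf_rows])
    pivot = norms.mean()
    divisors = (1.0 - slope) * pivot + slope * norms
    return [row / d for row, d in zip(tfidf_rows, divisors)]
```

With slope = 1.0 this reduces to plain cosine normalization; values below 1 shrink the divisor for documents longer than the pivot, which is the correction Singhal et al. found useful for retrieval.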


@piskvorky
Owner

Sure, sounds interesting. As long as we make it optional (and the default stays the "standard tfidf"), no harm in trying other methods.

Send a PR, I'll try to comment & suggest improvements and optimizations.
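One way to honor the "optional, default unchanged" constraint is to gate the pivoted divisor behind a parameter that defaults to off; a small illustrative sketch (names hypothetical, not the eventual gensim API):

```python
import numpy as np

def normalize(vec, pivot=None, slope=0.65):
    """Unit-normalize `vec`; when `pivot` is given, use the pivoted
    divisor instead. pivot=None keeps standard tf-idf behavior."""
    norm = float(np.linalg.norm(vec))
    if norm == 0.0:
        return vec
    if pivot is not None:
        norm = (1.0 - slope) * pivot + slope * norm
    return vec / norm
```

Keeping `pivot=None` as the default means existing models behave identically unless a user opts in.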

@tmylk
Contributor

tmylk commented Jan 23, 2016

@jotaefea Are you still interested in adding this feature?

@javier-artiles
Author

Hi Lev, it's been a while. But I'll be happy to send some code for review :-)


@tmylk
Contributor

tmylk commented Jan 24, 2016

@jotaefea Will be happy to accept a pivoted normalization PR

@javier-artiles
Author

Thank you Lev. I'll try to get something out to you soon.


@gabrielspmoreira

I use sklearn's TfidfTransformer and cosine_similarity (with Euclidean normalization), and I am running into problems with similarity calculation for long documents, which probably share many of the possible tokens with other documents just because they are long. I am looking for an implementation of pivoted normalization.
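For what it's worth, the slope parameter is exactly the knob for this length bias. A tiny sketch of the pivoted divisor with toy numbers (pure Python; sklearn's TfidfTransformer has no such option, so it would have to be applied by hand):

```python
def pivoted_divisor(norm, pivot, slope):
    # the divisor that replaces the plain document norm `norm`;
    # slope = 1.0 recovers standard cosine-style normalization
    return (1.0 - slope) * pivot + slope * norm

# toy numbers: collection pivot (average document norm) of 4.0;
# slope < 1 damps short documents and boosts long ones...
short = pivoted_divisor(2.0, pivot=4.0, slope=0.5)   # 3.0 > 2.0
long_ = pivoted_divisor(8.0, pivot=4.0, slope=0.5)   # 6.0 < 8.0
# ...while slope > 1 penalizes long documents harder than cosine does,
# which is the direction to try when long documents match too easily
harsher = pivoted_divisor(8.0, pivot=4.0, slope=1.5)  # 10.0 > 8.0
```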

@tmylk added the difficulty medium, feature, and wishlist labels Oct 6, 2016
@tmylk
Contributor

tmylk commented Oct 6, 2016

Add to wiki ideas page

@piskvorky
Owner

@tmylk is that a note to self, or a suggestion for somebody else to add it?

@menshikh-iv
Contributor

@javier-artiles hello, right now @markroxor is working on this task and we are stuck on the evaluation step. Have you ever worked with the TREC dataset?

@javier-artiles
Author

Hi @menshikh-iv, I'm afraid I have not worked on TREC tasks. During my researcher years I mostly organized my own competitive evaluation (WePS) for document clustering.

@menshikh-iv
Contributor

@javier-artiles thanks, we'll probably try to evaluate pivoted normalization using your WePS.

@javier-artiles
Author

I would suggest using it on more standard information retrieval tasks (ad hoc retrieval, for instance). It may be easier to interpret the results that way. Of course, it all depends on your end goal for this evaluation.

@javier-artiles
Author

Also, I apologize that I never got around to sending a PR for this feature. To be completely sincere, I have not been a contributor to open source; personal and private projects have always kept me too busy.

@menshikh-iv
Contributor

@javier-artiles If you could recommend concrete datasets & tasks, that would be really helpful for us!

@javier-artiles
Author

It would depend on what your end goal is. Are you just trying to test an implementation of this normalization scheme? If so, you can write simple unit tests with minimal data.
If you are looking for a model quality benchmark, then traditional dev/test datasets make sense.
TREC would be the way to go if you want to show public benchmark results that are easy to compare. Unfortunately, most of the TREC data I know of is not freely available; it is distributed through the Linguistic Data Consortium site at pretty steep fees (unless your research institution is footing the bill :-) ).
As an alternative, you can look into the CLEF collections. I can't tell for sure, but their data seems to be accessible after registration.
Finally, the datasets section of this page may give you some useful pointers: https://github.com/harpribot/awesome-information-retrieval#datasets

As for the tasks themselves, any standard ad hoc IR task would make sense IMHO.
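Following the "simple unit tests with minimal data" suggestion, a sketch of what such tests could look like (the `pivoted_norm` helper is hypothetical, standing in for whatever the PR implements):

```python
import unittest

def pivoted_norm(norm, pivot, slope):
    # hypothetical helper computing the pivoted divisor
    return (1.0 - slope) * pivot + slope * norm

class TestPivotedNormalization(unittest.TestCase):
    def test_reduces_to_plain_norm_at_pivot(self):
        # a document whose norm equals the pivot is unaffected
        self.assertAlmostEqual(pivoted_norm(4.0, pivot=4.0, slope=0.25), 4.0)

    def test_slope_one_is_plain_cosine(self):
        self.assertAlmostEqual(pivoted_norm(7.0, pivot=3.0, slope=1.0), 7.0)

    def test_short_docs_get_larger_divisor(self):
        # slope < 1: documents shorter than the pivot are damped
        self.assertGreater(pivoted_norm(2.0, pivot=4.0, slope=0.5), 2.0)

# run with: python -m unittest <this_module>
```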

menshikh-iv pushed a commit that referenced this issue Mar 13, 2018
* pivot normalization

* verify weights

* verify weights

* smartirs ready

* change old tests

* remove lambdas

* address suggestions

* minor fix

* pep8 fix

* pep8 fix

* fix pickle problem

* flake8 fix

* fix bug in docstring

* added few tests

* fix normalize issue for pickling

* fix normalize issue for pickling

* test without sklearn api

* hanging indents and new tests

* add docstring

* add docstring

* pivoted normalization

* better way of comparing floats

* merge develop

* added benchmarks

* address comments

* benchmarking

* testing pipeline

* pivoted normalisation

* taking overall norm

* Update tfidfmodel.py

* Update sklearn_api.ipynb

* tests for pivoted normalization

* results

* adding visualizations

* minor nb changes

* minor nb changes

* removed self.pivoted_normalisation

* Update test_tfidfmodel.py

* minor suggestions

* added description

* added description

* last commit

* cleanup

* cosmetic fixes

* changed pivot

* changed pivot