Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gensim.matutils.hellinger d(x,y) != d(y,x) if len(x) = len(y) #1854

Closed
elituan opened this issue Jan 24, 2018 · 3 comments
Closed

gensim.matutils.hellinger d(x,y) != d(y,x) if len(x) = len(y) #1854

elituan opened this issue Jan 24, 2018 · 3 comments
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)

Comments

@elituan
Copy link

elituan commented Jan 24, 2018

Description

Compute the distance between 2 distribution with gensim.matutils.hellinger. The d(x,y) do not equal to d(y,x) if len(x) = len(y)

Steps/Code/Corpus to Reproduce

from gensim.matutils import hellinger
vec_1 = [(2, 0.1), (3, 0.4), (4, 0.1), (5, 0.1), (1, 0.1), (7, 0.2)]
vec_2 = [(1, 0.1), (3, 0.8), (4, 0.1), (8, 0.1), (10, 0.8), (9, 0.1)]
hellinger(vec_1,vec_2) == hellinger(vec_2,vec_1)

Expected Results

True

Actual Results

False

Versions

Linux-4.13.0-26-generic-x86_64-with-debian-stretch-sid
('Python', '2.7.11 |Anaconda custom (64-bit)| (default, Dec 6 2015, 18:08:32) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.13.3')
('SciPy', '1.0.0')
('gensim', '3.1.0')
('FAST_VERSION', 1)

@menshikh-iv
Copy link
Contributor

Thanks for report @elituan, reproduced on 3.2.0.

Problem here - https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/matutils.py#L900 (and lower)

@parulsethi can you fix this?

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Jan 24, 2018
@caiyulun
Copy link
Contributor

hi, I am willing to work on this. It seems that d(x,y) do not equal to d(y,x) becasue that x and y has different index sets, and d(x,y) is computed based on the indexs in x while d(y,x) is computed based on indexs in y.

So I am wording if it should have same result in this situation, and if it should, maybe one approach is that we collect all the indexs in x and y into a set firstly, then we can iter the index set and get values in x and y.

@menshikh-iv
Copy link
Contributor

@caiyulun sounds like a good idea, feel free to contribute! For all indexes that defined for one vector and undefined for other - I think we should use 0.0 as a default value (because of BoW representation for this case -> all undefined stuff is zero).

@menshikh-iv menshikh-iv added the good first issue Issue for new contributors (not required gensim understanding + very simple) label Jan 26, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
…vorky#1860)

* fix: fix bugs in hellinger distance computing

This changes fix the bugs that return different distance when we
call hellinger(x, y) and hellinger(y, x).
The cause of this bug is that we compute the distance based on one
distribution's index previously, but we should iterate all the
index appears in two probability distributions.

* rename variable + fix union indices error in python3

* fix and add tests for hellinger distance

* fix the currrent test for different length BOW inputs
* add a test for symmetrical inputs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty easy Easy issue: required small fix good first issue Issue for new contributors (not required gensim understanding + very simple)
Projects
None yet
Development

No branches or pull requests

3 participants