Word2Vec 3.2.0 performance regression for corpus on s3 with smart-open 1.5.6 #1836

Closed
yjk21 opened this issue Jan 11, 2018 · 2 comments

yjk21 commented Jan 11, 2018

Description

Upgrading to gensim 3.2.0 also upgrades smart-open to 1.5.6, which seems to have changed the S3 code.
After the upgrade, Word2Vec shows a performance regression: streaming a gzipped corpus from S3 is more than 2x slower (> 250K words/sec before, < 100K words/sec after).
Downgrading smart-open to 1.5.3 fixes the issue.

The release notes of smart-open 1.5.6 from Dec 28 state:

Steps/Code/Corpus to Reproduce

We use a private corpus of about 4M documents with about 150M words, chunked up into 2-3 MB sized gzipped files that we stream from s3 using smart-open.
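
Since the corpus is private, the setup described above can only be approximated. The following is a minimal sketch, assuming one whitespace-tokenized document per line in each gzipped chunk. `ChunkedCorpus`, `paths`, `opener`, and `open_gzip_local` are hypothetical names, not part of gensim or smart-open; the default opener reads local files with the standard library, so an opener built on smart-open would have to be passed in for S3 paths.

```python
import gzip


def open_gzip_local(path):
    """Default opener: read a local .gz file as text (stdlib only)."""
    return gzip.open(path, "rt", encoding="utf-8")


class ChunkedCorpus:
    """Restartable iterable that yields one tokenized document per line.

    `paths` and `opener` are illustrative names for this sketch. To stream
    from S3, pass an opener that wraps smart-open instead of the default.
    """

    def __init__(self, paths, opener=open_gzip_local):
        self.paths = list(paths)
        self.opener = opener

    def __iter__(self):
        # Re-open every chunk on each pass, so the corpus can be
        # iterated more than once (vocabulary scan, then training epochs).
        for path in self.paths:
            with self.opener(path) as fin:
                for line in fin:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens
```

An instance of this class could then be passed as the `sentences` argument of `Word2Vec`, which iterates over it several times during training.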

Expected Results

Performance should return to the level seen with smart-open 1.5.3.

Actual Results

See above

Versions

gensim with smart-open 1.5.6

@menshikh-iv
Contributor

Hello @yjk21, this is a smart_open problem (not a gensim one), so I'm closing this issue.

Please create an issue in the smart_open repository, https://github.com/RaRe-Technologies/smart_open/issues, with a simple code example that shows this regression. (This performance problem should already be fixed in 1.5.6; if it is not, we will fix it again.)

CC: @mpenkov
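
One way to build such a simple example is to time the raw read path without gensim involved. The following is only a sketch: `words_per_second` is a hypothetical helper, not part of smart_open or gensim, and it accepts any iterable of lines (text or bytes), such as a file object opened through smart_open.

```python
import time


def words_per_second(lines):
    """Consume an iterable of lines and return (word_count, words_per_sec)."""
    start = time.perf_counter()
    n_words = 0
    for line in lines:
        # str.split() and bytes.split() both split on whitespace.
        n_words += len(line.split())
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    rate = n_words / elapsed if elapsed > 0 else float("inf")
    return n_words, rate
```

Running this over the same S3 object once with smart_open 1.5.3 installed and once with 1.5.6 would show whether the slowdown comes from the reader alone.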

@piskvorky
Owner

piskvorky commented Apr 9, 2018

A potential cause of this smart_open regression has been identified in piskvorky/smart_open#184. Fix coming up. CC @mpenkov.
