Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MmCorpus file-like object support bug #1869

Closed
menshikh-iv opened this issue Feb 1, 2018 · 6 comments
Closed

MmCorpus file-like object support bug #1869

menshikh-iv opened this issue Feb 1, 2018 · 6 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 1, 2018

Into
We have some "weird" behavior if a user passes a file-like object to MmCorpus, based on this mailing list thread

Demonstration

from gensim.corpora import MmCorpus
import bz2

f = bz2.BZ2File("testcorpus.mm.bz2")
print(f.closed)  # 0
corpus = MmCorpus(f)
print(f.closed)  # 1 ???

What happens
File-like object was closed when we call MmReader, problem located here

https://github.com/RaRe-Technologies/gensim/blob/5342153eb4f4b02bb45bfa3951eef8250ac9f6b6/gensim/matutils.py#L1274

with automatically close file-like when we out of scope, this is OK if we open this file, but we shouldn't close file-like passed from user.

Related PR #1867

UPD: another problem here - call IndexCopus.__init__, that didn't support file-like object at all.

@menshikh-iv menshikh-iv added bug Issue described a bug difficulty easy Easy issue: required small fix labels Feb 1, 2018
@sj29-innovate
Copy link
Contributor

sj29-innovate commented Feb 3, 2018

@menshikh-iv I am trying to fix this by replacing the "with" statement with the following code ,

lines = utils.file_or_filename(self.input)
''' Further code '''
if isinstance(self.input, string_types):
            lines.close()

Do you have any other suggestions for this ?
And for "IndexedCorpus" I am trying to do something like this ,

try 
       is_filename = isinstance(fname, string_typses)
       if index_fname is None and is_filename:
                     index_fname = utils.smart_extension(fname, '.index')
       if not is_filename:
                     self.index = fname
       else:
                    self.index = utils.unpickle(index_fname)

Current code looks like this

 try:
            if index_fname is None:
                index_fname = utils.smart_extension(fname, '.index')
            self.index = utils.unpickle(index_fname)

@menshikh-iv
Copy link
Contributor Author

@sj29-innovate

  1. About with statement, need to use more complicated (but more similar with with variant), look at SO-thread
  2. About IndexedCorpus, is this possible to use file-like instead of the filename? What're pros/cons here?

@sj29-innovate
Copy link
Contributor

sj29-innovate commented Feb 10, 2018

@menshikh-iv Using with statement will close the file object anyways so do you plan to use the nested try and finally blocks as mentioned in that thread ?

For indexed corpus , we need to support both file-object and filename right ? So i guess for the file object support we can directly convert it into a numpy array.

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Feb 12, 2018

@sj29-innovate about with - yes, this is a problem in this case (because we shouldn't close the file if we didn't open it).

For indexed-corpus - yes, we need both (but I'm not sure about possibility)

@menshikh-iv menshikh-iv added difficulty medium Medium issue: required good gensim understanding & python skills and removed difficulty easy Easy issue: required small fix labels Feb 12, 2018
@sj29-innovate
Copy link
Contributor

sj29-innovate commented Feb 14, 2018

@menshikh-iv I believe we can support both filename and file-like object with a difference that we wont be able to save it to an index file in case of the file object. This can create performance issues if in any subclass we need to call "super" so the "init" function of "IndexedCorpus" will again need to generate a new indexed corpus form the object instead of using a previously saved indexed corpus.

Also please have a look at this , I tried to fix the file closing problem using nested try and finally blocks as mentioned in that thread.
Original Behaviour :

with utils.file_or_filename(self.input) as lines:
    BLOCK

Changes :

mgr = (utils.file_or_filename(self.input))
exit = type(mgr).__exit__
value = type(mgr).__enter__(mgr)
exc = True
try:
     try: 
          lines = value
          BLOCK
    except:
          exc = False
          if not exit(mgr, *sys.exc_info()):
		   raise
finally:
     if isinstance(self.input, string_types):
          exit(mgr, None, None, None)

I have tested the changes as per your demonstration and they work fine.(Not closing the file)

@menshikh-iv
Copy link
Contributor Author

Aha, so, probably this issue impossible to fix (by IndexCorpus reason), but you can fix the incorrect behavior of MmCorpus anyway.

the current code looks scary (especially formatting), try to publish PR & write tests (I'll look into again on concrete example).

sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 16, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 16, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 16, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 16, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 16, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 16, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 19, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 19, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 19, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 20, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
sj29-innovate added a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills
Projects
None yet
Development

No branches or pull requests

2 participants