extract text from doc files(windows10 64) #96

Closed
SHocker-Yu opened this issue Aug 23, 2016 · 10 comments

Comments

@SHocker-Yu

"DOC extraction requires antiword be installed, link, unless on OSX in which case textutil (installed by default) is used."

OS: Windows 10 (64-bit)
I tried to install antiword.exe but it failed, and I don't know how to solve this problem...

@zzzwx

zzzwx commented Oct 30, 2016

Have you added the directory containing your antiword.exe file to the PATH environment variable?
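One way to check this from Node is sketched below. It is only an illustration: `findOnPath` is a hypothetical helper, not part of textract or antiword, and it assumes that a plain file named `antiword.exe` (or `antiword` on other systems) sitting in one of the PATH directories is what the shell would run.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the full path of the first file named
// exeName found in the directories of pathVar, or null if there is none.
function findOnPath(exeName, pathVar = process.env.PATH || '', delimiter = path.delimiter) {
  for (const dir of pathVar.split(delimiter)) {
    if (!dir) continue; // skip empty entries such as a trailing delimiter
    const candidate = path.join(dir, exeName);
    try {
      if (fs.statSync(candidate).isFile()) return candidate;
    } catch (err) {
      // Nothing by that name in this directory; keep looking.
    }
  }
  return null;
}

const exe = process.platform === 'win32' ? 'antiword.exe' : 'antiword';
const found = findOnPath(exe);
console.log(found ? 'antiword found at ' + found : 'antiword is not on PATH');
```

If this reports that antiword is not on PATH, a command prompt opened before the variable was changed may still hold the old value, so it is worth retrying from a fresh one.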

@SHocker-Yu
Author

@zzzwx Thanks for your reply. antiword does not support Windows.

@zzzwx

zzzwx commented Oct 31, 2016

@SHocker-Yu I am using it on Windows (7 and 10).
Some good fellow actually compiled it for Windows; get it here: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

@SHocker-Yu
Author

@zzzwx I appreciate your kind reply.
I downloaded it last time, but when I try to run antiword.exe, it "flashes back".
OS: Windows 10
Have you come across this situation?
Could you tell me how to get it running successfully?

@zzzwx

zzzwx commented Nov 1, 2016

@SHocker-Yu What do you mean by "flash back"?

Here are the steps I followed to make it work on Windows:

0/ Modify textract/lib/extractors/doc.js to fix a bug reported in a GitHub issue

-        if ( error.toString().indexOf( 'is not a Word Document' ) ) {
+        if ( error.toString().indexOf( 'is not a Word Document' ) > 0 ) {

1/ Download the Windows binary

2/ Add the antiword directory to Windows' PATH environment variable

=> At this point it worked, but only when the path to the .doc file contained no spaces

3/ Modify textract/lib/extractors/doc.js again to add quotes so that it reads the input path as is

-    var escapedPath = filePath.replace( /\s/g, '\\ ' );
+    var escapedPath = filePath/*.replace( /\s/g, '\\ ' )*/;

-    exec( 'antiword ' + escapedPath,
+    exec( 'antiword "' + escapedPath + '"',

=> At this point it worked for every path

4/ Modify textract/lib/extractors/doc.js one last time to handle UTF-8 encoding of the output text

-    exec( 'antiword "' + escapedPath + '"',
+    exec( 'antiword -m UTF-8.txt "' + escapedPath + '"',

=> After that it worked well every time :)
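Taken together, steps 0, 3 and 4 can be sketched as two small functions. This is an illustration only, not the code in textract: `isNotWordDocument` and `buildAntiwordCommand` are hypothetical names, and the check uses `!== -1` where the diff in step 0 uses `> 0`, so that a match at index 0 would also count. The original test was always truthy on a miss because `indexOf` returns -1 when the substring is absent.

```javascript
// Step 0: indexOf returns -1 (truthy) on a miss and 0 (falsy) on a match at
// the very start, so the result must be compared explicitly.
function isNotWordDocument(error) {
  return String(error).indexOf('is not a Word Document') !== -1;
}

// Steps 3 and 4: wrap the path in double quotes instead of backslash-escaping
// spaces, and pass the UTF-8 mapping file with -m.
// Assumption: filePath contains no double quote, which holds for Windows
// file names but not necessarily elsewhere.
function buildAntiwordCommand(filePath) {
  return 'antiword -m UTF-8.txt "' + filePath + '"';
}

console.log(isNotWordDocument(new Error('report.txt is not a Word Document.'))); // true
console.log(isNotWordDocument(new Error('some other failure')));                 // false
console.log(buildAntiwordCommand('C:\\My Documents\\report.doc'));
// antiword -m UTF-8.txt "C:\My Documents\report.doc"
```

The resulting string is what would be handed to `exec`. Passing the arguments as an array to `child_process.execFile` is another option that sidesteps shell quoting altogether, though that is not what was done in the steps above.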

Hope this helps.

@SHocker-Yu
Author

@zzzwx I really appreciate your kindness. Sorry about my poor English; "flash back" means "crash". I have had to work all day recently, so I am sorry for replying so late. I have read your reply and will try it, then tell you the result.
Best wishes.

@SHocker-Yu
Author

@zzzwx It works! Thank you so much!!!

@dbashford
Owner

FYI, I've implemented the changes from above across a few different commits over the last few months (sorry for being so slow!).

@dbashford
Owner

Published as 2.1, thanks!

@zzzwx

zzzwx commented Dec 23, 2016

Hi @dbashford, thank you for your work.
