extract text from doc files(windows10 64) #96

SHocker-Yu · 2016-08-23T03:52:18Z

"DOC extraction requires antiword be installed, link, unless on OSX in which case textutil (installed by default) is used."

OS: Windows 10 (64-bit)
I tried to install antiword.exe but it failed, and I don't know how to solve this problem...

zzzwx · 2016-10-30T01:23:45Z

Have you added the directory containing your antiword.exe file to the PATH environment variable?

SHocker-Yu · 2016-10-31T02:20:47Z

@zzzwx Thanks for your reply. antiword does not support Windows.

zzzwx · 2016-10-31T18:51:26Z

@SHocker-Yu I am using it on Windows (7 and 10).
Someone has compiled it for Windows; you can get it here: http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

SHocker-Yu · 2016-11-01T06:19:56Z

@zzzwx I appreciate your kind reply.
I downloaded it earlier, but when I try to run antiword.exe, it flashes back.
OS: Windows 10
Have you come across this situation?
Could you tell me how to get it to run successfully?

zzzwx · 2016-11-01T16:48:30Z

@SHocker-Yu What do you mean by "flash back"?

Here are the steps I followed to make it work on Windows:

0/ Modify textract/lib/extractors/doc.js to fix a bug reported in a GitHub issue:

-        if ( error.toString().indexOf( 'is not a Word Document' ) ) {
+        if ( error.toString().indexOf( 'is not a Word Document' ) > 0 ) {

1/ Download the Windows binary.

2/ Add the antiword directory to the Windows PATH environment variable.

=> At this point it worked, but only when the path to the .doc file contained no spaces.

3/ Modify textract/lib/extractors/doc.js again to wrap the input path in quotes so that it is read as is:

-    var escapedPath = filePath.replace( /\s/g, '\\ ' );
+    var escapedPath = filePath/*.replace( /\s/g, '\\ ' )*/;

-    exec( 'antiword ' + escapedPath,
+    exec( 'antiword "' + escapedPath + '"',

=> At this point it worked for every path.

4/ Modify textract/lib/extractors/doc.js one last time to handle UTF-8 encoding of the output text:

-    exec( 'antiword "' + escapedPath + '"',
+    exec( 'antiword -m UTF-8.txt "' + escapedPath + '"',

=> And after that it worked well all the time :)

Hope this helps.
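
A note on why step 0 matters, as a minimal sketch: `indexOf` returns -1 when the substring is absent, and -1 is truthy, so the original bare `if` fired for almost every error. The helper name `isNotWordDocumentError` is hypothetical and not part of textract; it only illustrates the check.

```javascript
// Hypothetical helper (not from textract): reports whether an antiword
// error says the input file is not a Word document.
function isNotWordDocumentError(error) {
  // indexOf returns -1 when the phrase is absent (truthy in a bare `if`)
  // and 0 when it sits at the very start (falsy), so compare against -1.
  return String(error).indexOf('is not a Word Document') !== -1;
}
```

The `> 0` form from step 0 behaves the same way for a standard `Error` object, because its `toString()` output starts with `Error: ` and the phrase can never be at index 0. Comparing against -1 simply removes that assumption.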
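
For reference, steps 3 and 4 combined could be sketched as below. The function name `buildAntiwordCommand` is hypothetical and not part of textract; only the `antiword -m UTF-8.txt "<path>"` command shape comes from the steps above.

```javascript
// Hypothetical helper (not from textract): builds the command line that
// steps 3 and 4 arrive at.
function buildAntiwordCommand(filePath) {
  // Double quotes keep a path with spaces together as one argument;
  // -m UTF-8.txt selects antiword's UTF-8 character mapping file.
  // Assumption: filePath itself contains no double quote characters
  // (Windows does not allow them in file names).
  return 'antiword -m UTF-8.txt "' + filePath + '"';
}
```

The resulting string can be passed to `exec` from Node's `child_process` module, as doc.js does. Another option, not what was done here, is `child_process.execFile('antiword', ['-m', 'UTF-8.txt', filePath], callback)`, which passes arguments as an array without going through a shell, so no manual quoting is needed.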

SHocker-Yu · 2016-11-03T14:46:21Z

@zzzwx I really appreciate your kindness. Sorry about my poor English; "flash back" means "crash". I have had to work all day these past few days, so I am sorry for replying so late. I have read your reply and will try it, then tell you the result.
Best wishes.

SHocker-Yu · 2016-11-05T04:11:59Z

@zzzwx It works! Thank you so much!!!

dbashford · 2016-12-23T16:44:17Z

FYI, I've implemented the changes from above across a few different commits over the last few months (sorry for being so slow!).

dbashford · 2016-12-23T16:48:24Z

Published as 2.1, thanks!

zzzwx · 2016-12-23T17:28:26Z

Hi @dbashford, thank you for your work.

dbashford closed this as completed in 67f792e Dec 23, 2016
