Test failures #149

Closed
DarrenCook opened this issue Mar 21, 2018 · 5 comments

@DarrenCook

On Linux Mint 18, I forked the project, did a git clone, then yarn install. I also did these:

sudo apt install tesseract-ocr tesseract-ocr-jpn tesseract-ocr-chi-sim unrtf

Running yarn test (npm test gives identical results, by the way), I get "5 of 177 tests failed". One is because I haven't installed drawingtotext. Here are the others:

  1. textract for .pdf files will properly handle multiple columns:
    AssertionError: expected false to be true

  2. textract for .pdf files can handle files with spaces in the name:
    AssertionError: expected false to be true

  3. textract for image files will extract text from GIF files:
    AssertionError: expected [Error: Error extracting [[ testphoto.gif ]], exec error: Command failed: tesseract /home/darren/Projects/textract/test/files/testphoto.gif /tmp/textract/testphoto quiet
    Tesseract Open Source OCR Engine v3.04.01 with Leptonica
    Warning in pixReadMemGif: writing to a temp file, not directly to memory
    Error in pixReadStreamGif: Can't use giflib-5.1.2; suggest 5.1.1 or earlier
    Error in pixReadStream: gif: no pix returned
    Error in pixRead: pix not read
    Error in pixReadMemGif: pix not read
    Error in pixReadMem: gif: no pix returned
    Error during processing.
    ] to be null

  4. fromUrl tests will markdown files:

    actual expected

    ""# This is an h1 ## This is an h2 This__This text has been bolded and italicizeditalicized__ "

(The last one is hard to read without the colour-coding! Basically it is saying the # are still there and the underlines are still in there.)

@dbashford
Owner

3 will likely be OS related.

I get 1/2/4 locally (along with the drawingtotext error) myself. Hoping to spend time today churning through some tickets. Will get the tests going first; hopefully that is fairly straightforward.

@dbashford
Owner

Pushed an update that addresses the other breakages. The Tesseract error isn't one I get locally. "Can't use giflib-5.1.2; suggest 5.1.1 or earlier" is the error you are getting. Not sure if there is something that can be tracked down to figure out what that might be? Funnily enough, googling that gets you an issue in the Python textract library.

Going to close this issue out, but can keep discussing. If you figure something out I'm happy to update the docs for other folks that might run into it.
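One possible way to make this failure easier to recognise is to look for the Leptonica message in the error text, for example to skip the GIF test on affected systems. The following is only a sketch under that assumption: `findUnsupportedGiflib` is a hypothetical helper, not part of textract, and the sample text is copied from the error output quoted earlier in this thread.

```javascript
// Hypothetical helper: returns the giflib version named in a Leptonica
// "Can't use giflib-X.Y.Z" message, or null if no such message is present.
function findUnsupportedGiflib(errorText) {
  const match = /Can't use giflib-(\d+(?:\.\d+)*)/.exec(errorText);
  return match ? match[1] : null;
}

// Two lines taken from the tesseract output reported in this issue.
const sample = [
  'Warning in pixReadMemGif: writing to a temp file, not directly to memory',
  "Error in pixReadStreamGif: Can't use giflib-5.1.2; suggest 5.1.1 or earlier",
].join('\n');

const giflibVersion = findUnsupportedGiflib(sample);
console.log(giflibVersion);
```

A test suite could call such a helper on the error message and mark the GIF test as skipped rather than failed when it returns a version string.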

@DarrenCook
Author

Just updated, and 1/2/4 have now gone, but in their place I get three rtf complaints about:

 TypeError: cb is not a function

lib/extractors/html.js:76 called from rtf.js:34.

All three failing tests are in the same describe(), extract_test.js, lines 122 to 152.
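For context, a `TypeError: cb is not a function` in Node.js usually means a callback-style function was called with fewer arguments than it expects, so its callback parameter ends up `undefined`. The sketch below only illustrates that mechanism. `extractHtml`, `extractRtfBroken`, and `extractRtfFixed` are hypothetical stand-ins, not the actual code in lib/extractors/html.js or rtf.js, and the real cause in textract may be different.

```javascript
// Hypothetical extractor with the signature (data, options, cb).
function extractHtml(data, options, cb) {
  // Crude tag stripping, for illustration only.
  const text = data.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
  cb(null, text); // throws a TypeError if cb was not supplied
}

// Hypothetical caller that omits the options argument: its callback lands
// in the `options` slot, and `cb` inside extractHtml is undefined.
function extractRtfBroken(html, cb) {
  extractHtml(html, cb);
}

// The same caller with an argument list that matches the callee.
function extractRtfFixed(html, options, cb) {
  extractHtml(html, options, cb);
}

let brokenMessage = null;
try {
  extractRtfBroken('<p>hello</p>', () => {});
} catch (err) {
  brokenMessage = err.message;
}

let fixedResult = null;
extractRtfFixed('<p>hello <b>world</b></p>', {}, (err, text) => {
  fixedResult = text;
});

console.log(brokenMessage);
console.log(fixedResult);
```

If the failure follows this pattern, it would also explain why it shows up only after a signature change in one extractor that another extractor calls.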

@dbashford
Owner

hmm, no test failures locally, but think I may know why, I'll dig in later today

@dbashford
Owner

PR above fixed this
