can't read file, no event fire #221

Closed
ishowshao opened this issue Nov 20, 2020 · 11 comments

Comments


ishowshao commented Nov 20, 2020

pdf2json cannot read some files. You can reproduce this with this file: download

code sample:

const fs = require('fs');
const PDFParser = require('pdf2json');

let pdfParser = new PDFParser();

pdfParser.on('pdfParser_dataError', (errData) => {
    console.error(errData.parserError);
});

pdfParser.on('pdfParser_dataReady', (pdfData) => {
    console.log('pdfParser_dataReady')
    fs.writeFileSync('1.json', JSON.stringify(pdfData), 'utf8');
});

pdfParser.loadPDF(process.argv[2]);

Neither pdfParser_dataError nor pdfParser_dataReady fires.

However, the pdf.js viewer can render this file.

Please help, thanks!


maelyt commented Nov 30, 2020

I can confirm this (v1.2.0) for the linked file.

I have a similar problem with one of my files, but I get an error message (Invalid XRef stream header) in the console instead of in the events. No timeout, no events, only an eternal ghost process.

Edit: My "solution" is a timeout of x seconds. If the timeout triggers, I call the error handling. If the normal events fire while the timeout is still pending, I clear the timeout and proceed normally. In my case it's OK to have some ghost processes.
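
The timeout approach described in this comment could be sketched roughly as follows. This is only an illustration, not code from the thread: `parseWithTimeout` and its `start` callback are hypothetical names, and the only things taken from pdf2json are the two event names used in the original code sample. As noted above, the timeout only unblocks the caller; it does not stop whatever work is still running in the background.

```javascript
// Hypothetical helper (not part of pdf2json): resolve on pdfParser_dataReady,
// reject on pdfParser_dataError, or reject when neither fires in time.
function parseWithTimeout(parser, start, timeoutMs) {
  return new Promise((resolve, reject) => {
    let settled = false;
    let timer = null;

    const finish = (fn, value) => {
      if (settled) return; // ignore late events after the timeout, and vice versa
      settled = true;
      clearTimeout(timer);
      fn(value);
    };

    timer = setTimeout(
      () => finish(reject, new Error(`No parser event within ${timeoutMs} ms`)),
      timeoutMs
    );

    parser.once('pdfParser_dataReady', (data) => finish(resolve, data));
    parser.once('pdfParser_dataError', (errData) =>
      finish(reject, errData && errData.parserError ? errData.parserError : errData)
    );

    start(parser); // caller decides how to kick off parsing
  });
}

// Possible usage with pdf2json (assumes the package is installed):
// const PDFParser = require('pdf2json');
// parseWithTimeout(new PDFParser(), (p) => p.loadPDF(process.argv[2]), 10000)
//   .then((pdfData) => console.log('ready'))
//   .catch((err) => console.error(err));
```

The `settled` flag keeps a late event from resolving a promise that has already been rejected by the timeout.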

@ishowshao
Author

@maelyt Thank you very much, but I think there is a bug in pdf2json: pdf.js renders this file correctly, and pdf2json is based on pdf.js. So I think this may be a small problem that can be fixed relatively easily.


maelyt commented Dec 2, 2020

I don't think it is that easy. I looked into the code: pdf.js is loaded via eval() into the pdf2json environment. Also, the pdf.js files are not the same as those in the original pdf.js repo; they were extracted from the project. It could be that the version in pdf2json has the bug and the current version of pdf.js has a fix for it. But I stopped there because I really don't want to go down this rabbit hole.

@sologgfun

@ishowshao I get the same problem, and our use case is the same: extracting text from a file to submit an expense account. If you have solved the problem, please let me know.


rascafr commented Dec 8, 2020

Had the same issue as @maelyt (1.2.0).
In my case, I couldn't use a timeout because the ghost process would still be running, so I used the pdf-parse npm library as a test workaround:

If the pdf-parse result includes a non-null text string, I start pdf2json. Otherwise I throw an error and voilà, no more ghost process!

It works with the kind of PDF I'm using. It's quite a dirty workaround, but hopefully it will do the job for people having the same issue.
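
A rough sketch of that pre-check is shown below. It is an assumption-laden illustration rather than the commenter's actual code: `parseIfReadable`, `extractText`, and `runPdf2json` are hypothetical names, and both libraries are passed in as callbacks so the snippet itself does not depend on either package.

```javascript
// Hypothetical gate (not from the thread): only hand the buffer to pdf2json
// when a first extraction pass returns a text string.
async function parseIfReadable(buffer, extractText, runPdf2json) {
  const result = await extractText(buffer);
  if (!result || typeof result.text !== 'string') {
    // Fail fast instead of starting a parse that may never emit an event.
    throw new Error('Pre-check failed: no text extracted, skipping pdf2json');
  }
  return runPdf2json(buffer);
}

// Possible wiring (assumes pdf-parse and pdf2json are installed; pdf-parse
// is called here the way its README shows, resolving to an object with a
// `text` field):
// const fs = require('fs');
// const pdfParse = require('pdf-parse');
// const PDFParser = require('pdf2json');
// parseIfReadable(fs.readFileSync(process.argv[2]), pdfParse, (buf) =>
//   new Promise((resolve, reject) => {
//     const p = new PDFParser();
//     p.on('pdfParser_dataReady', resolve);
//     p.on('pdfParser_dataError', reject);
//     p.parseBuffer(buf);
//   })
// );
```

Whether a successful pre-check really predicts that pdf2json will emit an event depends on the files involved; the comment above only reports that it held for one kind of PDF.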

@sologgfun

@rascafr It works, and I improved on it. While reviewing the code, I found that the files that pdf2json can't parse (and for which no event fires) share one feature: their pdfDocument.pdfInfo.metadata is an Object. So we can trigger pdfParser_dataReady ourselves, and then use pdf-parse to parse these files in the pdfParser_dataReady event handler.

cls.prototype.parsePDFData = function(arrayBuffer, password, parent) {
    this.pdfDocument = null;
    let parameters = {password: password, data: arrayBuffer};
    PDFJS.getDocument(parameters).then(
        pdfDocument => {
            // console.log('modify here')
            if(pdfDocument.pdfInfo.metadata){
                parent.emit("pdfParser_dataReady", pdfDocument.pdfInfo.metadata.split(/[\n]/));
            }
            this.load(pdfDocument, 1);
        },
        error => this.raiseErrorEvent("An error occurred while parsing the PDF: " + error)
    );
};


rascafr commented Dec 21, 2020

@sologgfun did you manually edit the pdf2json/lib/pdf.js file?

I'm still having issues, but without any error being thrown.
It seems that PDFJS.getDocument(parameters).then(...) doesn't have a catch method, so it might not be a Promise.
Second thought: where does your parent parameter come from? It is always undefined for me. (I'm not a PDFJS expert 😅 )

That said, I wrapped your code in a try / catch block, with the parent.emit line removed:

cls.prototype.parsePDFData = function (arrayBuffer, password, parent) {
    this.pdfDocument = null;
    let parameters = {
        password: password,
        data: arrayBuffer
    };
    PDFJS.getDocument(parameters).then(
        pdfDocument => {
            try {
                if (pdfDocument.pdfInfo.metadata) {
                    console.log('Meta ready');
                    //parent.emit("pdfjs_parseDataReady", pdfDocument.pdfInfo.metadata.split(/[\n]/));
                }
                this.load(pdfDocument, 1);
            } catch (e) {
                console.log(e);
            }
        },
        error => this.raiseErrorEvent("An error occurred while parsing the PDF: " + error)
    )
};

It throws the following error, so it might come from the original wrapped PDFJS code. How did you end up making it work in your case?

TypeError: Cannot read property 'nodeName' of null
    at Metadata_parse [as parse] (eval at <anonymous> (.../node_modules/pdf2json/lib/pdf.js:68:1), <anonymous>:42363:15)
    at new Metadata (eval at <anonymous> (.../node_modules/pdf2json/lib/pdf.js:68:1), <anonymous>:42355:10)
    at PDFDocumentProxy_getMetadata [as getMetadata] (eval at <anonymous> (.../node_modules/pdf2json/lib/pdf.js:68:1), <anonymous>:42659:30)
    at cls.loadMetaData (.../node_modules/pdf2json/lib/pdf.js:320:33)
    at cls.load (.../node_modules/pdf2json/lib/pdf.js:313:21)
    at Object.PDFJS.getDocument.then.pdfDocument [as onResolve] (.../node_modules/pdf2json/lib/pdf.js:282:26)
    at Object.runHandlers (eval at <anonymous> (.../node_modules/pdf2json/lib/pdf.js:68:1), <anonymous>:864:35)
    at ontimeout (timers.js:498:11)
    at tryOnTimeout (timers.js:323:5)
    at Timer.listOnTimeout (timers.js:290:5)
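
On the missing catch method mentioned in this comment: if the bundled code really does return a then-only object rather than a native Promise (the thread does not confirm this), one general JavaScript technique is to wrap it with Promise.resolve(), which adopts any object that has a then method. The sketch below uses a made-up stand-in, `makeLegacyThenable`, instead of the real PDFJS object, so it only demonstrates the language behaviour, not pdf2json internals.

```javascript
// Hypothetical stand-in for a legacy promise-like object: it has then()
// but no catch(), and settles asynchronously.
function makeLegacyThenable(shouldFail) {
  return {
    then(onResolve, onReject) {
      setTimeout(() => {
        if (shouldFail) {
          onReject(new Error('parse failed'));
        } else {
          onResolve('doc');
        }
      }, 0);
    },
  };
}

// Promise.resolve() adopts the thenable and returns a native Promise, so
// catch() and async/await become available. Errors thrown inside handlers
// chained on the native Promise are also routed to catch().
function toNativePromise(thenable) {
  return Promise.resolve(thenable);
}

// Example:
// toNativePromise(makeLegacyThenable(false))
//   .then((doc) => { /* work that might throw */ })
//   .catch((err) => console.error('caught:', err));
```

This would replace the manual try / catch inside the success callback with a single catch at the end of the chain, under the stated assumption about the returned object.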

@oebilgen

Any updates? We're having this issue too.

Owner

modesty commented Sep 19, 2021

Fix pushed. Please test with `git pull && rm -rf node_modules/ && npm i && npm run test-misc`.

Owner

modesty commented Sep 25, 2021

Fixed in v1.2.5.

@modesty modesty closed this as completed Sep 25, 2021

jonnypa commented Oct 19, 2021

Hi, using the master version and uploading this document, I still get no event fired.
Can you help me? Thanks in advance.
