Reading files from S3 #41

Closed
rodrigogalindez opened this issue Jul 14, 2015 · 7 comments

@rodrigogalindez

Hi David!

Trying to create an endpoint in an Express server like this:

    app.get('/textract', function(req, res, next) {
      textract("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", function(error, text) {
        console.log(error);
        res.end();
      });
    });

Console returns [Error: File at path [[ https://s3.amazonaws.com/testbucket1a2b3c/test.pdf ]] does not exist.]

What does this mean exactly? Does textract only work with local files? (In this case my file is uploaded to S3.) Thanks!

@dbashford
Owner

Yep, only local files.

@rodrigogalindez
Author

OK, thanks. Any plan to make it work with remote files?

@dbashford
Owner

That seems a bit like scope creep on a singularly focused module. But I can consider adding such a thing. It probably makes more sense as something wrapped around textract instead of embedded within. The files would still need to be written locally.
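
The "wrapped around textract" idea can be sketched roughly as follows. This is only an illustration, not textract's own API: `tempPathFor`, `extractFromStream`, and `extractFromUrl` are hypothetical helper names, and `extractLocal` is a stand-in for whatever function extracts text from a local path (for example `textract` itself, called as in the original post). It also assumes the S3 object is publicly readable over HTTPS.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');
const https = require('https');
const crypto = require('crypto');
const { pipeline } = require('stream');

// Hypothetical helper: build a unique temp path that keeps the URL's file
// extension, so an extractor that looks at the extension still sees ".pdf".
function tempPathFor(fileUrl) {
  const ext = path.extname(new URL(fileUrl).pathname);
  const name = 'remote-' + crypto.randomBytes(8).toString('hex') + ext;
  return path.join(os.tmpdir(), name);
}

// Hypothetical helper: write a readable stream to a temp file, run
// extractLocal(localPath, cb) on it, then delete the temp file before
// passing the result (or error) to callback.
function extractFromStream(stream, fileUrl, extractLocal, callback) {
  const tmp = tempPathFor(fileUrl);
  // Unlink errors are ignored: the file may never have been created.
  const cleanup = (err, text) => fs.unlink(tmp, () => callback(err, text));
  pipeline(stream, fs.createWriteStream(tmp), (err) => {
    if (err) return cleanup(err);
    extractLocal(tmp, cleanup);
  });
}

// Hypothetical helper: download a file over HTTPS, then extract from the
// local copy. Assumes the URL is readable without credentials.
function extractFromUrl(fileUrl, extractLocal, callback) {
  https.get(fileUrl, (res) => {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the socket is released
      return callback(new Error('Download failed with status ' + res.statusCode));
    }
    extractFromStream(res, fileUrl, extractLocal, callback);
  }).on('error', callback);
}
```

With helpers like these, the endpoint from the original post could call something like `extractFromUrl("https://s3.amazonaws.com/testbucket1a2b3c/test.pdf", textract, function(error, text) { ... })`. The file still lands on local disk, as noted above; the wrapper only hides the download and the cleanup.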

@rodrigogalindez
Author

Alright, thanks. Looking forward to your implementation. Apache Tika is very complicated to install, and yours is the only good text extractor that's written in node as far as I know.

@dbashford
Owner

Doing some refactoring and think I'll include this. Will be a few days.

@dbashford reopened this Jul 23, 2015

@rodrigogalindez
Author

Awesome. I've implemented textract in an order form for translation agencies (clients upload documents and the app returns the number of words & pricing) and it works very well. All the files are stored in an AWS instance for now and textract is in the same instance as well. I will refactor the app to work with S3 when it's ready. If it helps, here's how I plan to use textract:

@dbashford
Owner

textract was born out of the contracting work I did that involved uploading resumes, extracting the text from them, loading solr with the resume text for searching, and tossing the resume itself into S3. So that all sounds familiar. =)

Working through a set of enhancements over the next few days. This'll be one of them.

@dbashford modified the milestone: 1.0.0 Jul 26, 2015