✨ Access to Textract #4877
Comments
To be discussed at refinement. The request originally came from BOLD, and the data science team now has a use case for this too.
We are about to start a project in the Probation Data Science team that would really benefit from this functionality. We are working in collaboration with Probation Digital to try to improve the Pre-Sentence Report writing process for operational staff. Essentially, this will involve taking large sets of documents and looking for ways to summarise them or extract data from them. The first step will be extracting the text from those documents, especially where the documents are handwritten. Our research indicates that open source systems perform this task considerably worse than Textract. Overall, the project aims to deliver significant efficiency savings for the probation staff who write these reports. It will kick off at the end of September, when a Turing intern joins to help with this work. Happy to provide any extra information that would be useful.
Just wanted to add support for this. We're currently running a few proofs of concept that use Textract to perform OCR on images and non-machine-readable documents. It's performing much better than our previous iterations using Tesseract, and the implementation was simpler too.
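For anyone wanting to check access once it is granted, here is a minimal sketch of the kind of OCR call described above. It assumes `boto3` is installed and that the calling role is allowed `textract:DetectDocumentText`. The helper names (`extract_lines`, `ocr_image`), the `min_confidence` parameter and the `eu-west-2` default region are illustrative assumptions, not anything taken from an existing project.

```python
from __future__ import annotations

from typing import Any


def extract_lines(response: dict[str, Any], min_confidence: float = 0.0) -> list[str]:
    """Collect the text of LINE blocks from a DetectDocumentText response.

    Textract also returns PAGE and WORD blocks; only LINE blocks are kept
    here so that each string is one detected line of text.
    """
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def ocr_image(image_bytes: bytes, region_name: str = "eu-west-2") -> list[str]:
    """Send one image (for example PNG or JPEG) to Textract and return its lines.

    The region default is an assumption; use whichever region the platform
    grants access in.
    """
    import boto3  # imported lazily so extract_lines can be used without AWS access

    client = boto3.client("textract", region_name=region_name)
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response)
```

Calling `ocr_image(open("scan.png", "rb").read())` (with a hypothetical `scan.png`) should return the detected lines if the role has access, and raise a `botocore` client error if it does not, which makes it a quick smoke test for permissions.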
Adding support for this too. My team are working on a project for the Parole Board that reads in lots of (currently publicly available) PDF files containing tables, images, charts etc. We wanted to start using Textract, as research suggests it would perform better than some of the other standard PDF loaders we've been experimenting with (e.g. pymupdf, pdfplumber, pypdf and pdfminer). These loaders are ok(ish) for our PoC, but we are already seeing limitations with them, so it would be good to be able to experiment with Textract on the AP. Happy to provide more info if needed.
Describe the feature request.
Get access to use Textract in our webapp.
Describe the context.
BOLD and LAA want to build a document summarisation tool. We are currently in our investigative/exploratory stage. We have created a simple dev webapp and want to test access to Bedrock and Textract. Please could we get access to Textract?
Value / Purpose
We tested Textract during a hackathon project, and it gave better results than the alternatives.
User Types
BOLD analysts and LAA analysts