tesseract very slow in R #54

Open
randomgambit opened this issue Feb 13, 2021 · 1 comment

@randomgambit

Hello there,

Thanks for this amazing binding! I am running into some performance issues and I wonder if you have some hints or ideas.

Basically, the R wrapper works fine, but it is very slow. I tried using furrr and multiprocessing, but I have read that it is not easy to run many tesseract processes in parallel. Is that true? Have you been able to run tesseract in parallel?

Thanks~

@morgan-dgk

morgan-dgk commented Mar 9, 2022

Hi Randomgambit, I have run tesseract in parallel on Windows and it seems to perform pretty well. I tested a 47-page PDF both with and without parallel processing, and the function using parallel processing appears to be approximately 70% faster. I've included my code below.

Hope this is helpful!

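library(parallel)

parallel_ocr <- function(x) {
  # Split the PDF into single-page files under ./images/split/
  pdf_pages <- as.list(pdftools::pdf_split(x, "./images/split/"))

  # Start one worker per core; stop the cluster when the function exits
  cl <- makeCluster(detectCores())
  on.exit(stopCluster(cl), add = TRUE)
  clusterEvalQ(cl, {library(pdftools); library(tesseract)})

  # Render each page to a PNG at 150 dpi, load-balanced across workers
  png_files <- parLapplyLB(cl, pdf_pages, pdftools::pdf_convert, dpi = 150)

  # OCR each PNG; the result is a list with one element per page
  text <- parLapplyLB(cl, png_files, tesseract::ocr)

  text
}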
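
To check the speedup on your own machine, something like the following may help. It is only a sketch: `compare_timing` is a hypothetical helper (not part of pdftools, tesseract, or parallel), and it assumes the per-item function can be run independently on each item. It times a plain `lapply` against `parLapplyLB` on the same inputs and checks that both give the same result.

```r
library(parallel)

# Hypothetical helper: time a sequential run against a parallel run
# of the same per-item function `f` over `items`.
compare_timing <- function(items, f, workers = 2) {
  # Sequential baseline
  seq_time <- system.time(seq_res <- lapply(items, f))[["elapsed"]]

  # Parallel run on a local cluster; the cluster is stopped on exit
  cl <- makeCluster(workers)
  on.exit(stopCluster(cl), add = TRUE)
  par_time <- system.time(par_res <- parLapplyLB(cl, items, f))[["elapsed"]]

  list(
    sequential = seq_time,
    parallel   = par_time,
    identical  = identical(seq_res, par_res)
  )
}

# Stand-in workload so the sketch runs without any PDFs or images
slow_square <- function(i) {
  Sys.sleep(0.05)
  i^2
}
res <- compare_timing(as.list(1:8), slow_square, workers = 2)
```

For OCR, the call would look like `compare_timing(png_files, tesseract::ocr, workers = detectCores())`, where `png_files` is assumed to be a list of page images you have already rendered. Starting the cluster has a fixed cost, so the gain will likely be smaller for short documents.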
