RapidOCR Error - Leaked Semaphore Objects & OOM Killer #231

BennisonDevadoss · 2024-10-15T19:10:13Z

Problem Description:

While processing a large number of images (approximately 1000) using RapidOCR, I encountered the following errors midway through the process:

Leaked Semaphore Objects: "There appear to be 1 leaked semaphore object(s) to clean up at shutdown."
Process Killed by OOM Killer: "The process of this unit has been killed by the OOM killer."

System Information:

Operating System: Ubuntu 22.04 LTS
RapidOCR Version: rapidocr-onnxruntime 1.3.24

Reproducible Code:

from typing import Sequence, Union, Iterable
import numpy as np

def extract_from_images_with_rapidocr(
    images: Sequence[Union[Iterable[np.ndarray], bytes]],
) -> str:
    try:
        from rapidocr_onnxruntime import RapidOCR
    except ImportError:
        raise ImportError(
            "`rapidocr-onnxruntime` package not found, please install it with "
            "`pip install rapidocr-onnxruntime`"
        )
    ocr = RapidOCR()
    text = ""
    for img in images:
        result, _ = ocr(img)
        if result:
            # Each entry holds the recognized string at index 1.
            lines = [item[1] for item in result]
            text += "\n".join(lines)
    return text

Research & Findings:

These errors seem to be related to memory leaks during batch image processing. I am uncertain about how to resolve these issues within RapidOCR, especially when handling large numbers of images.

Additional Questions:

Are there any memory management techniques or best practices for handling large image batches in RapidOCR?
How can I optimize memory usage to prevent OOM killer termination?
Is there a way to monitor memory consumption or manage semaphore objects during the process?
Would changing the version of RapidOCR (upgrading/downgrading) help resolve this memory-related issue?

Any guidance or solutions would be greatly appreciated!
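On the monitoring question, one option is to log the process's peak resident memory while the batch runs. The following is a minimal sketch using only the Python standard library on Unix-like systems; `peak_rss_mib` and `process_with_memory_log` are hypothetical helper names, not part of RapidOCR, and `handler` stands in for whatever per-image call is being made.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def process_with_memory_log(items, handler, log_every=50):
    """Apply handler to each item and print peak memory every log_every items."""
    results = []
    for index, item in enumerate(items, start=1):
        results.append(handler(item))
        if index % log_every == 0:
            print(f"{index} items processed, peak RSS {peak_rss_mib():.1f} MiB")
    return results
```

Because the value is a peak, a sudden jump between two log lines points at the items processed in that interval, which can help narrow down which image triggers the growth.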


SWHL · 2024-10-16T01:06:39Z

My guess is that some of the 1000 images are very large, so recognizing them requests more memory than is available.
For now, I recommend checking the input for particularly large images, such as 4000x7000, and resizing them before sending them to OCR.

Later, I will add logic to the code to keep memory usage within the limit.
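The pre-resize step suggested above could start from a size calculation like the one below. This is only a sketch: `fit_within` is a hypothetical helper, and the `max_side=2000` default is an assumed cap, not a value recommended in this thread.

```python
def fit_within(height: int, width: int, max_side: int = 2000) -> tuple:
    """Return (height, width) scaled down so the longest side is at most max_side.

    Images that already fit are returned unchanged; the aspect ratio is preserved.
    """
    longest = max(height, width)
    if longest <= max_side:
        return height, width
    scale = max_side / longest
    # Never let a side collapse to zero pixels.
    return max(1, round(height * scale)), max(1, round(width * scale))


new_h, new_w = fit_within(7000, 4000)
print(new_h, new_w)
```

The resulting size can then be passed to an image library; with OpenCV, for example, `cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)` takes the target size as (width, height).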

BennisonDevadoss · 2024-10-16T07:03:45Z

@SWHL, Thank you for the response! I have a couple of follow-up questions based on your suggestions:

What would be the recommended target resolution for images to prevent memory overload during OCR processing? Is there an optimal balance between image size and OCR accuracy?
Could you share more details about the memory control logic you plan to add? Will this logic automatically resize or manage large images, and will it be included in a future release of RapidOCR?

SWHL · 2024-10-16T07:28:05Z

Both points are already under development; please refer to the develop branch. They will be included in a new version soon.

SWHL · 2024-10-17T14:58:25Z

You can try again with rapidocr_onnxruntime==1.3.25.

BennisonDevadoss · 2024-11-08T09:27:56Z

@SWHL, Thanks for the update. I tried version 1.3.25, but it does not work for me; I am facing the same issue.

SWHL · 2024-11-09T14:38:57Z

Can you confirm whether specific images among the 1000 consistently trigger the OOM issue? If it can be reproduced reliably, please provide the image.

BennisonDevadoss · 2024-11-21T19:48:20Z

@SWHL, I believe this issue might be related to image dimensions. In my experience, the OOM killer was triggered when the image was 1px wide and 602px high. To clarify, I’d like to understand what the minimum and maximum supported dimensions for width and height are.

Additionally, I have a suggestion for improving the plugin: it would be helpful to implement an internal image size check. If an image’s dimensions are outside the required range, the plugin could resize it. If resizing isn’t possible, the image could be skipped during the OCR process.

This approach could be particularly beneficial when the plugin is integrated with others, such as LangChain. For instance, LangChain’s PDF loader (when extract_image is set to true) uses RapidOCR internally for OCR. Since we cannot predict the dimensions of images embedded in a PDF, having a dimension check before processing each image would make the workflow more robust.
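Until such a check exists in the library, the caller can apply one before each OCR call. The sketch below is an assumption-laden illustration: `is_ocr_candidate` is a hypothetical helper, and the `min_side` and `max_aspect` thresholds are placeholder values, not limits documented by RapidOCR.

```python
def is_ocr_candidate(height: int, width: int,
                     min_side: int = 8, max_aspect: float = 20.0) -> bool:
    """Return False for images that are too thin or too elongated to OCR safely.

    min_side and max_aspect are illustrative thresholds (assumptions).
    """
    if height <= 0 or width <= 0:
        return False
    shortest, longest = min(height, width), max(height, width)
    if shortest < min_side:
        return False
    # Reject extreme strips such as a 1px-wide, 602px-high image.
    return longest / shortest <= max_aspect


print(is_ocr_candidate(602, 1))    # the 1px-wide image from this thread
print(is_ocr_candidate(600, 800))  # an ordinary image
```

With NumPy arrays, the height and width come from `img.shape[:2]`, so a loop over images can skip any entry for which the check returns False.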

Additionally, I have attached the sample 1px-wide image here.

SWHL · 2024-11-22T01:43:32Z

Thanks for the suggestion. There is definitely something wrong with the image resizing here.
The current image processing mainly goes through the following functions:

RapidOCR/python/rapidocr_onnxruntime/main.py

Lines 129 to 140 in 62bc487

    
    def preprocess(self, img: np.ndarray) -> Tuple[np.ndarray, float, float]:
        h, w = img.shape[:2]
        max_value = max(h, w)
        ratio_h = ratio_w = 1.0
        if max_value > self.max_side_len:
            img, ratio_h, ratio_w = reduce_max_side(img, self.max_side_len)

        h, w = img.shape[:2]
        min_value = min(h, w)
        if min_value < self.min_side_len:
            img, ratio_h, ratio_w = increase_min_side(img, self.min_side_len)
        return img, ratio_h, ratio_w

The original image width is 1px and the height is 602px. After preprocess, the image shape is height=18048px, width=32px.
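The reported shape can be reproduced with simple arithmetic. The sketch below is not the body of `increase_min_side`; it assumes a `min_side_len` of 30 and rounding of each side to the nearest multiple of 32, two values that are not shown in this thread and were chosen because they yield exactly the numbers reported above. `simulate_min_side_upscale` is a hypothetical name.

```python
def simulate_min_side_upscale(height: int, width: int,
                              min_side_len: int = 30, stride: int = 32) -> tuple:
    """Scale so the shortest side reaches min_side_len, then snap to a stride.

    min_side_len=30 and stride=32 are assumptions, not values from the thread.
    """
    shortest = min(height, width)
    ratio = min_side_len / shortest if shortest < min_side_len else 1.0
    new_h = int(height * ratio)
    new_w = int(width * ratio)
    # Snap each side to the nearest multiple of the stride.
    new_h = int(round(new_h / stride) * stride)
    new_w = int(round(new_w / stride) * stride)
    return new_h, new_w


print(simulate_min_side_upscale(602, 1))  # (18048, 32)
```

The key point is that the scale factor is driven entirely by the shortest side: a 1px width gives a factor of 30, and the same factor is applied to the 602px height.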

Enter the following function:

RapidOCR/python/rapidocr_onnxruntime/main.py

Lines 142 to 159 in 62bc487

    
    def maybe_add_letterbox(
        self, img: np.ndarray, op_record: Dict[str, Any]
    ) -> Tuple[np.ndarray, Dict[str, Any]]:
        h, w = img.shape[:2]

        if self.width_height_ratio == -1:
            use_limit_ratio = False
        else:
            use_limit_ratio = w / h > self.width_height_ratio

        if h <= self.min_height or use_limit_ratio:
            padding_h = self._get_padding_h(h, w)
            block_img = add_round_letterbox(img, (padding_h, padding_h, 0, 0))
            op_record["padding_1"] = {"top": padding_h, "left": 0}
            return block_img, op_record

        op_record["padding_1"] = {"top": 0, "left": 0}
        return img, op_record

Before entering the text detection model, the image is still 32px wide and 18048px high, which triggers the OOM problem.

I'm thinking about how to avoid this problem, or how to filter out this kind of image before it is sent to OCR.
Suggestions are welcome.

BennisonDevadoss · 2024-11-26T18:16:53Z

@SWHL, Any update on it?

SWHL self-assigned this Nov 22, 2024
