Torchserve changes image bytes compared to using local inference #2054

Closed
rbavery opened this issue Jan 4, 2023 · 5 comments
Labels: help wanted, preprocessing, triaged

rbavery commented Jan 4, 2023

🐛 Describe the bug

I am getting slightly but significantly different results when running inference with Torchserve vs. locally, due to the image input being slightly modified somewhere within the Torchserve environment. This is a bit of an involved issue, so apologies for the long explanation; any help is much appreciated.

Below is the image I am using for inference. When running inference locally, I open it with PIL.Image.open().

[image: im]

When I load the image as a bytearray in a custom preprocess handler and open it with PIL inside the Torchserve environment, I get the above image, but with slight differences. I've highlighted these by setting any nonzero difference to 1 or -1:

[image: differences]
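
For reference, the mask above was computed along these lines (a minimal sketch; the local file name is illustrative, the served one matches the np.save call in the handler below):

    import numpy as np

    # np.save appends ".npy" to file names that don't already end in it
    local = np.load("test-before-letterbox-local.arr.npy")  # from a local run
    served = np.load("test-before-letterbox.arr.npy")       # copied out of the container

    # Cast to a signed type so uint8 subtraction can't wrap around, then
    # collapse every nonzero difference to +1 / -1
    diff_mask = np.sign(served.astype(np.int16) - local.astype(np.int16))
    print(np.count_nonzero(diff_mask), "pixels differ")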

These differences occur before any torch-specific image transforms are applied, from what I can tell. I've also made sure that the torchserve and local environments have the same numpy and PIL versions; these are the only non-standard libraries I can tell are being used in the preprocess handler up until the point where the differences appear.

Below is my preprocess handler, where I save out the intermediate preprocessed result that has the differences. As can be seen, the only operations are load_image and io.BytesIO; load_image eventually just calls image = Image.open() after checking that the image is in RGB mode and does not have rotations.

    def preprocess(self, data):
        """Converts input images to float tensors.
        Args:
            data (List): Input data from the request in the form of a list of image tensors.
        Returns:
            Tensor: single Tensor of shape [BATCH_SIZE=1, 3, IMG_SIZE, IMG_SIZE]
        """

        # load images
        # taken from https://github.com/pytorch/serve/blob/master/ts/torch_handler/vision_handler.py
        
        # handle if images are given in base64, etc.
        row = data[0]
        # Compat layer: normally the envelope should just return the data
        # directly, but older versions of Torchserve didn't have envelope.
        image = row.get("data") or row.get("body")
        # if isinstance(image, str):
        #     # if the image is a string of bytesarray.
        #     image = base64.b64decode(image)

        # If the image is sent as bytesarray
        if isinstance(image, (bytearray, bytes)):
            image = load_image(io.BytesIO(image))
        else:
            print("not a bytearray")
            assert False

        # force convert to tensor
        # and resize to [img_size, img_size]
        image = np.asarray(image)
        np.save("/app/test-before-letterbox.arr", image)
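
For the local side of the comparison, I save the equivalent intermediate array outside the container (a sketch; since the input is an RGB JPEG with no EXIF rotation, load_image() reduces to Image.open() here, and the sample image path is the one from the curl example below):

    import numpy as np
    from PIL import Image

    # Mirror the handler's intermediate save for the local run;
    # load_image() reduces to Image.open() for an RGB JPEG with no rotation
    image = Image.open("../../input/sample-img-fox.jpg")
    np.save("test-before-letterbox-local.arr", np.asarray(image))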

load_image() (essentially a thin wrapper around open_image(), shown below):

from io import BytesIO
from typing import Union
import time

import requests
from PIL import Image

# error_names_for_retry, n_retries, retry_sleep_time and IMAGE_ROTATIONS are
# module-level constants defined elsewhere in the handler

def open_image(input_file: Union[str, BytesIO]) -> Image.Image:
    """
    Opens an image in binary format using PIL.Image and converts to RGB mode.

    Supports local files or URLs.
    This operation is lazy; the image will not actually be loaded until the
    first operation that needs to load it (for example, resizing), so file
    opening errors can show up later.
    Args:
        input_file: str or BytesIO, either a path to an image file (anything
            that PIL can open), or an image as a stream of bytes
    Returns:
        a PIL image object in RGB mode
    """
    if (isinstance(input_file, str)
            and input_file.startswith(('http://', 'https://'))):
        try:
            response = requests.get(input_file)
        except Exception as e:
            print(f'Error retrieving image {input_file}: {e}')
            success = False
            if e.__class__.__name__ in error_names_for_retry:
                for i_retry in range(n_retries):
                    try:
                        time.sleep(retry_sleep_time)
                        response = requests.get(input_file)
                    except Exception as e:
                        print(f'Error retrieving image {input_file} on retry {i_retry}: {e}')
                        continue
                    print('Succeeded on retry {}'.format(i_retry))
                    success = True
                    break
            if not success:
                raise
        try:
            image = Image.open(BytesIO(response.content))
        except Exception as e:
            print(f'Error opening image {input_file}: {e}')
            raise

    else:
        print("trying to open image")
        image = Image.open(input_file)
    if image.mode not in ('RGBA', 'RGB', 'L', 'I;16'):
        raise AttributeError(
            f'Image {input_file} uses unsupported mode {image.mode}')
    if image.mode in ('RGBA', 'L'):
        print("trying to convert image")
        # PIL.Image.convert() returns a converted copy of this image
        image = image.convert(mode='RGB')

    # Alter orientation as needed according to EXIF tag 0x112 (274) for Orientation
    #
    # https://gist.github.com/dangtrinhnt/a577ece4cbe5364aad28
    # https://www.media.mit.edu/pia/Research/deepview/exif.html
    #
    try:
        exif = image._getexif()
        orientation: int = exif.get(274, None)  # 274 is the key for the Orientation field
        if orientation is not None and orientation in IMAGE_ROTATIONS:
            image = image.rotate(IMAGE_ROTATIONS[orientation], expand=True)  # returns a rotated copy
    except Exception:
        pass

    return image

My question is whether there are other handler steps that could be running before preprocess.

Error logs

There are no tracebacks from the torchserve container, before or during the preprocess handler.

Installation instructions

This is my Dockerfile

FROM pytorch/torchserve:0.5.3-cpu
RUN whoami
RUN ls -la /home/venv/bin/pip
USER root
# RUN pip install --upgrade pip && pip install opencv-python ipython
# commit id https://github.com/ultralytics/yolov5/blob/9286336cb49d577873b2113739788bbe3b90f83c/requirements.txt
RUN pip install gitpython ipython "matplotlib>=3.2.2" numpy==1.23.4 opencv-python==4.6.0.66 \
    Pillow==9.2.0 psutil "PyYAML>=5.3.1" "requests>=2.23.0" scipy==1.9.3 "thop>=0.1.1" \
    torch==1.10.0 torchvision==0.11.1 "tqdm>=4.64.0" "tensorboard>=2.4.1" "pandas>=1.1.4" \
    "seaborn>=0.11.0"
USER model-server

Model Packaging

full custom handler: https://gist.github.com/rbavery/351563cd36e23216243d3587c14a0a55

Model packaging step. The custom handler and the non-torchserve local test both use torch hub to load the model.

torch-model-archiver --model-name mdv5 --version 1.0.0 --serialized-file models/megadetectorv5/md_v5a.0.0.pt --extra-files index_to_name.json --extra-files /root/.cache/torch/hub/ultralytics_yolov5_master/ --handler mdv5_handler.py
mkdir -p model_store
mv mdv5.mar model_store/megadetectorv5-yolov5-1-batch-1280-1280.mar

config.properties

I don't think I changed any of these. I start the server within docker with:

torchserve --start --model-store /app/model_store --no-config-snapshots --models mdv5=/app/megadetectorv5-yolov5-1-batch-1280-1280.mar

Versions

I'm using torchserve via Docker, so I'm not sure this applies. The container is pytorch/torchserve:0.5.3-cpu.

Repro instructions

Below is copied from the readme. The s3 bucket with the weights isn't publicly accessible, so I'm mostly looking to document the issue and ask whether this could be related to image processing steps that occur before the preprocess handler.

Setup Instructions

Download weights and torchscript model

From this directory, run:

aws s3 sync s3://animl-model-zoo/megadetectorv5/ models/megadetectorv5/

Export yolov5 weights as torchscript model

first, clone and install yolov5 dependencies and yolov5 following these instructions: https://docs.ultralytics.com/tutorials/torchscript-onnx-coreml-export/

Then, if running locally, make sure to install the correct versions of torch and torchvision, the same versions used to save the torchscript megadetector model; these are needed to load the torchscript model. Check the Dockerfile for versions.

The image size needs to be the same as in mdv5_handler.py for good performance. Run this from this directory:

python ../../../yolov5/export.py --weights models/megadetectorv5/md_v5a.0.0.pt --img 1280 1280 --batch 1 

This will create models/megadetectorv5/md_v5a.0.0.torchscript

Run model archiver

First, pip install torch-model-archiver, then:

torch-model-archiver --model-name mdv5 --version 1.0.0 --serialized-file models/megadetectorv5/md_v5a.0.0.torchscript --extra-files index_to_name.json --handler mdv5_handler.py
mkdir -p model_store
mv mdv5.mar model_store/megadetectorv5-yolov5-1-batch-1280-1280.mar

The .mar file is what is served by torchserve.

Serve the torchscript model with torchserve

bash docker_mdv5.sh

Return predictions in normalized coordinates with a category integer and confidence score:

curl http://127.0.0.1:8080/predictions/mdv5 -T ../../input/sample-img-fox.jpg

Possible Solution

No response

msaroufim added the help wanted, triaged and preprocessing labels Jan 4, 2023
mreso (Collaborator) commented Jan 11, 2023

Hi, this might be a long shot, but the JPEG standard is loose enough that two compliant decoders can produce different images at the pixel level.
Just in case you're using Windows to test locally, this might be of interest: python-pillow/Pillow#3833

Otherwise, did you check that the libjpeg versions are equivalent between docker and local?

And did you check whether opening the image/bytestream outside of TorchServe, but inside docker, gives the same result/image as locally? If it does not, that takes a lot of unknowns out of the equation.
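
One quick way to compare the versions, for example (a sketch; PIL.features.version is available in recent Pillow releases, including the 9.2 pinned in the Dockerfile):

    from PIL import Image, features

    # libjpeg version this Pillow build decodes with
    print(features.version("jpg"))
    # the same information straight off the C extension
    print(Image.core.jpeglib_version)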

rbavery (Author) commented Jan 11, 2023

Thanks very much for this suggestion! I am using an Ubuntu WSL environment to test locally. I'll check the libjpeg versions, and will also check on opening the image.

rbavery (Author) commented Jan 12, 2023

I think the libjpeg difference might be it! The versions are different.

In torchserve:

    libjpeg-b1f3a3b7.so.62.3.0 => /usr/local/lib/python3.8/dist-packages/PIL/../Pillow.libs/libjpeg-b1f3a3b7.so.62.3.0 (0x00007f8bbc2d8000)

In the local WSL env:

    libjpeg.so.9 => /root/miniconda3/lib/python3.9/site-packages/PIL/../../../libjpeg.so.9 (0x00007fdb9f68c000)

The correct result comes from the WSL environment, where Pillow is installed from conda; I think that's because that is where the original author of the model installed Pillow from. What I might look into next is how to get Torchserve to use the Python from conda, since that seems like the quickest way to resolve this version mismatch.
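
A quick check to confirm the two builds decode the same bytes differently (a sketch to run in both environments on the same file; the path is illustrative):

    import hashlib

    import numpy as np
    from PIL import Image

    # Hash the decoded pixel buffer: different digests mean the two libjpeg
    # builds turn identical JPEG bytes into different pixels
    pixels = np.asarray(Image.open("sample-img-fox.jpg"))
    print(hashlib.md5(pixels.tobytes()).hexdigest())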

rbavery (Author) commented May 23, 2023

this was the issue ^

rbavery closed this as completed May 23, 2023
Ankur-singh commented
I am also trying to implement a handler for a yolov5 model, but I am getting an error that the response object type is not supported. Can you please tell me what the response format is? Or could you share your code?
