
PandasCodec unexpected behavior #737

Closed
pablobgar opened this issue Sep 22, 2022 · 9 comments · Fixed by #1162

@pablobgar (Contributor) commented Sep 22, 2022

Hi,

I am having some problems with PandasCodec and grpc requests.

On the client side, I have the following code to send a pandas df:

data = pd.DataFrame(data={'col1': ["test_string", "test_string"], 'col2': ["test_string", "test_string"]})

i_request = PandasCodec.encode_request(data)
i_request.parameters = types.Parameters(content_type="pd")

inference_request_g = converters.ModelInferRequestConverter.from_types(
    i_request, model_name=model_name, model_version=None
)

grpc_channel = grpc.insecure_channel("localhost:8081")
grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
response = grpc_stub.ModelInfer(inference_request_g)

And I have this code fragment to receive it on the server side:

async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
    if payload.parameters.content_type == "pd":
        inference_data = PandasCodec.decode_request(payload)
        print(inference_data)

I get this output:

             col1            col2
0  b'test_string'  b'test_string'
1  b'test_string'  b'test_string'

The same dataframe, but with binary strings. I was expecting to get back the same dataframe that I have on the client, since I am using the same codec for encoding and decoding.

Is this expected behavior, or am I doing something wrong?
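For what it's worth, the conversion I was expecting can be done by hand. This is a minimal pure-pandas sketch (the `decode_bytes_columns` helper is my own, and it assumes the bytes are UTF-8 encoded), not a proper fix:

```python
import pandas as pd

def decode_bytes_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with any bytes values decoded as UTF-8 strings."""
    out = df.copy()
    for col in out.columns:
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
    return out

decoded = decode_bytes_columns(
    pd.DataFrame({"col1": [b"test_string", b"test_string"],
                  "col2": [b"test_string", b"test_string"]})
)
print(decoded)
```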

Thanks for your help.

@adriangonz (Contributor)

Hey @pablobgar ,

Thanks for providing that high level of detail! 👍

The PandasCodec will look at the content_types from each individual column (i.e. from each entry in the inputs list) to do any further conversion if required. For inputs with datatype BYTES, it will leave them as they are, unless they are flagged with content_type: "str", in which case it will convert them as strings.
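As a rough illustration of that rule (this is a simplified sketch for clarity, not MLServer's actual code):

```python
# Simplified sketch of the per-input decoding rule described above:
# BYTES inputs are left as raw bytes unless the input is flagged with
# content_type "str". Not MLServer's actual implementation.

def decode_input(datatype, content_type, data):
    if datatype == "BYTES" and content_type == "str":
        return [v.decode("utf-8") if isinstance(v, bytes) else v for v in data]
    # No per-input conversion requested: leave the data as-is
    return data

raw = [b"test_string", b"test_string"]
as_bytes = decode_input("BYTES", None, raw)   # stays as bytes
as_str = decode_input("BYTES", "str", raw)    # decoded into strings
```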

Therefore, my guess is that either the PandasCodec.encode_request() method is not adding that str label to the string columns, or that the ModelInferRequestConverter.from_types() method is not adding the content_type label to the gRPC payload. To clarify this point, could you share the contents of the i_request and inference_request_g payloads?

PS: within an MLModel instance, you can use the self.decode_request(payload) helper method, which will look at the content_type flags for you (i.e. without the need for an explicit if content_type == "...", then ....).

@pablobgar (Contributor, Author) commented Sep 23, 2022

Hi @adriangonz

Thanks for your quick response!!

This is the content of i_request:

{'id': None, 'parameters': {'content_type': 'pd', 'headers': None}, 'inputs': [{'name': 'col1', 'shape': [2], 'datatype': 'BYTES', 'parameters': None, 'data': [b'test_string', b'test_string']}, {'name': 'col2', 'shape': [2], 'datatype': 'BYTES', 'parameters': None, 'data': [b'test_string', b'test_string']}], 'outputs': None}

And this is inference_request_g:

model_name: "test"
parameters {
  key: "content_type"
  value {
    string_param: "pd"
  }
}
inputs {
  name: "col1"
  datatype: "BYTES"
  shape: 2
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}
inputs {
  name: "col2"
  datatype: "BYTES"
  shape: 2
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}

@adriangonz (Contributor)

Hey @pablobgar ,

Right, that seems aligned with what was described in my comment above. The PandasCodec.encode_request method is not setting the right content type at the individual input level.

Could you try explicitly setting str as the content type for the inputs of i_request? As in,

from mlserver.codecs import StringCodec
from mlserver.types import Parameters

for input in i_request.inputs:
    input.parameters = Parameters(content_type=StringCodec.ContentType)

@pablobgar (Contributor, Author)

This is the resulting request:

model_name: "arte"
parameters {
  key: "content_type"
  value {
    string_param: "pd"
  }
}
inputs {
  name: "col1"
  datatype: "BYTES"
  shape: 2
  parameters {
    key: "content_type"
    value {
      string_param: "str"
    }
  }
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}
inputs {
  name: "col2"
  datatype: "BYTES"
  shape: 2
  parameters {
    key: "content_type"
    value {
      string_param: "str"
    }
  }
  contents {
    bytes_contents: "test_string"
    bytes_contents: "test_string"
  }
}

But I have the same result on the server:

             col1            col2
0  b'test_string'  b'test_string'
1  b'test_string'  b'test_string'

@adriangonz (Contributor)

Hey @pablobgar,

Thanks for trying that out.

Instead of calling the PandasCodec.decode_request(payload) method directly, could you call the self.decode_request(payload, default_codec=PandasCodec) method instead? This method should ensure that each input gets decoded appropriately first.

@pablobgar (Contributor, Author)

Hi @adriangonz

I get the same result with self.decode_request.

It seems that decoding is never performed when the datatype is BYTES and PandasCodec is used:
https://github.com/SeldonIO/MLServer/blob/1.1.0/mlserver/codecs/pandas.py#L17-L19

@adriangonz (Contributor)

The decoding of the individual columns should occur within the self.decode_request method in the MLModel base class though... That one calls this decode_inference_request helper, which makes a first pass through the entries of the inputs array to see if there is any decoding needed at the input level (e.g. like decoding bytes into strings):

for request_input in inference_request.inputs:
    decode_request_input(request_input, metadata_inputs)

So it's strange that it still doesn't work for you...

Just to be on the safe side, I've added a small test case in the PR linked below, which should handle the same use case. Do you see any differences from your custom code?

https://github.com/SeldonIO/MLServer/pull/752/files#diff-3b8a58a4d021803b3171b886bb9162fd659e671131f3f61036f9210cb5d0bc7cR186-R204

@pablobgar (Contributor, Author)

Hi @adriangonz

Now it is working correctly if I add Parameters(content_type=StringCodec.ContentType). Maybe I did something wrong when I tested it a few days ago, or I didn't test the combination of adding this and using the self.decode_request method, sorry.

This is my code:

data = pd.DataFrame(
    data={
        "col1": ["test_string", "test_string"],
        "col2": ["test_string", "test_string"],
    }
)

i_request = PandasCodec.encode_request(data)
i_request.parameters = types.Parameters(content_type="pd")

for input in i_request.inputs:
    input.parameters = types.Parameters(content_type=StringCodec.ContentType)

Where I still see a problem, though, is in PandasCodec.encode_request(). Shouldn't it add this parameter automatically for columns that contain strings?
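For illustration, here is a rough sketch of what such auto-tagging could look like (the `encode_request_inputs` helper is hypothetical, plain dicts stand in for mlserver.types.RequestInput, and always using FP64 for non-string columns is a simplification):

```python
# Hypothetical sketch of an encode step that tags string columns
# automatically. Not the actual PandasCodec implementation.
import pandas as pd

def encode_request_inputs(df: pd.DataFrame) -> list:
    inputs = []
    for col in df.columns:
        values = df[col].tolist()
        is_str = all(isinstance(v, str) for v in values)
        inputs.append({
            "name": col,
            "shape": [len(values)],
            # Simplification: anything non-string is encoded as FP64 here
            "datatype": "BYTES" if is_str else "FP64",
            # Tag string columns so the server knows to decode them back:
            "parameters": {"content_type": "str"} if is_str else None,
            "data": [v.encode("utf-8") for v in values] if is_str else values,
        })
    return inputs

inputs = encode_request_inputs(
    pd.DataFrame({"col1": ["test_string", "test_string"], "col2": [1.5, 2.5]})
)
```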

@adriangonz (Contributor)

Hey @pablobgar ,

Thanks for the update!

On the last point, I totally agree. And TBH, the PandasCodec should also be able to decode strings. To put it another way: since it's the only codec that knows when a series should be encoded as a string, it should also be able to decode a bytes array back into a string series.

We'll keep this issue open to tackle this last point.
