Add support for unicode string inputs to Workflow Transform in Triton #345

oliverholworthy · 2023-05-10T14:55:42Z

Add support for unicode string inputs to Workflow Transform in Triton.

Adds a test for running a Workflow with non-ASCII characters in string inputs.

We currently get the following error from the .astype("str") call if we pass string inputs with non-ascii characters.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

This happens because a string like "椅子" passed to a Triton model is received as np.array([b'\xe6\xa4\x85\xe5\xad\x90'], dtype=object). Calling .astype(str) on that array raises the UnicodeDecodeError shown above.
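A minimal sketch of the failure described above, using NumPy only. The helper name `try_astype_str` is hypothetical and exists just to capture the exception; the object array is built by hand to stand in for what Triton delivers, so no Triton server is involved.

```python
import numpy as np

text = "椅子"
# UTF-8 encoding of the input string; these are the bytes quoted in the description.
encoded = text.encode("utf-8")

# Object array of byte strings, standing in for the tensor received from Triton.
received = np.array([encoded], dtype=object)


def try_astype_str(arr):
    """Hypothetical helper: return the converted array, or the error if one is raised."""
    try:
        return arr.astype(str)
    except UnicodeDecodeError as exc:
        return exc


result = try_astype_str(received)
print(type(result).__name__, result)
```

With the NumPy behaviour reported in this PR, `result` holds the UnicodeDecodeError, because the bytes are decoded as ASCII rather than UTF-8.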

We can coerce an array of byte strings to unicode strings with np.char.decode(out.astype(bytes)), where out = np.array([b'\xe6\xa4\x85\xe5\xad\x90'], dtype=object). ~~However, it appears we can safely remove the line that is performing the coercion. (It doesn't appear to break any existing tests at least.)~~
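The coercion can be tried in isolation with a short sketch like the one below. The variable names `as_bytes` and `decoded` are illustrative, and the encoding is passed explicitly here for clarity even though the expression quoted above relies on the default.

```python
import numpy as np

out = np.array([b"\xe6\xa4\x85\xe5\xad\x90"], dtype=object)

# astype(bytes) turns the object array into a fixed-width bytes array.
as_bytes = out.astype(bytes)

# np.char.decode decodes each element; UTF-8 is named explicitly.
decoded = np.char.decode(as_bytes, encoding="utf-8")
print(decoded, decoded.dtype)
```

The result is a unicode string array rather than an array of byte strings, which is the form the rest of the conversion expects.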

github-actions · 2023-05-10T15:07:13Z

Documentation preview

https://nvidia-merlin.github.io/systems/review/pr-345

oliverholworthy · 2023-05-10T15:27:57Z

merlin/systems/triton/__init__.py

@@ -150,9 +150,6 @@ def _convert_tensor(t):
    out = t.as_numpy()
    if len(out.shape) == 2:
        out = out[:, 0]
-    # cudf doesn't seem to handle dtypes like |S15 or object that well
-    if is_string_dtype(out.dtype):
-        out = out.astype("str")


I tried changing this to out = np.char.decode(out.astype(bytes)) which worked for the new test being added here. And then I wondered if this was needed at all. Looking to see if the tests pass without this now.

My impression is that this does (or did) cover a real edge case, which may not be adequately covered by tests. This piece of code was inherited from the old serving code in NVT, which was TBH not very well tested.

I've updated this, keeping the string type coercion but using np.char.decode(out.astype(bytes)) instead of out.astype("str").

It appears we do need this because cudf doesn't accept an array of byte strings as a type when constructing a DataFrame.

oliverholworthy · 2023-05-10T15:46:47Z

merlin/systems/triton/__init__.py

@@ -150,9 +150,6 @@ def _convert_tensor(t):
    out = t.as_numpy()
    if len(out.shape) == 2:
        out = out[:, 0]


Unrelated to this change: it's unclear to me why we'd want to remove dimensions from the input here.

This code has existed for a long time, and I think is related to the perennial inconsistency around list formats that has plagued the Merlin code base. Way back in the before times, sometimes you'd get a proper 1d array/tensor and sometimes you'd get a 2d array/tensor that only contained one row. The legacy serving code from NVT that Systems is based on (and still trying to clean up and/or shed) had all kinds of issues like this and mostly solved them by hacking around the inconsistent formats instead of standardizing.
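For reference, a small sketch of what the `out[:, 0]` step in the diff does to a 2D array. The input arrays are made up for illustration and are not taken from the Merlin tests.

```python
import numpy as np

# Hypothetical input: a single column delivered with shape (3, 1) instead of (3,).
column_2d = np.array([[1], [2], [3]])

out = column_2d
if len(out.shape) == 2:
    out = out[:, 0]  # keep the first column, dropping the second dimension

# The same indexing on a wider array keeps only the first column
# and discards the rest without raising.
wide = np.array([[1, 2], [3, 4]])
first_column = wide[:, 0]
print(out, first_column)
```

The second case is one reason the step is worth questioning: any 2D input with more than one column loses data silently.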

Add test for workflow with unicode string inputs

042a138

oliverholworthy added the enhancement New feature or request label May 10, 2023

oliverholworthy added this to the Merlin 23.05 milestone May 10, 2023

oliverholworthy self-assigned this May 10, 2023

oliverholworthy marked this pull request as draft May 10, 2023 14:55

Reformat test_ensemble.py

980c1a5

Remove string dtype coercion

926a6ce

oliverholworthy commented May 10, 2023


oliverholworthy marked this pull request as ready for review May 10, 2023 15:44

oliverholworthy requested review from jperez999 and karlhigley May 10, 2023 15:45

oliverholworthy commented May 10, 2023


oliverholworthy and others added 4 commits May 12, 2023 21:06

Add string coercion to ensure we have unicode string array

02d25a7

Merge branch 'main' into workflow-with-unicode-string-inputs

ddf4d36

Merge branch 'main' into workflow-with-unicode-string-inputs

295b58b

Merge branch 'main' into workflow-with-unicode-string-inputs

f8b17fa

oliverholworthy modified the milestones: Merlin 23.05, Merlin 23.06 Jun 6, 2023

oliverholworthy and others added 4 commits June 13, 2023 17:56

Merge branch 'main' into workflow-with-unicode-string-inputs

dcbe4da

Merge branch 'main' into workflow-with-unicode-string-inputs

5644cd3

Merge branch 'main' into workflow-with-unicode-string-inputs

a65aad2

Use LambdaOp instead of Categorify in string test

2c866a4
