Describe the bug
I noticed that when we use the max_size parameter in the Categorify op, the generated unique category parquet files do not correctly map the original values, the encoded values, and the nulls of each column.
Steps/Code to reproduce bug
Please run the example below.
Read back the unique parquet files and you will see that, for the C1 column, null values are not mapped to 0; they are mapped to 1 instead. I am not sure whether this is expected, since we use the max_size param. For the C2 column, index 0 is mapped to NA and its size is 0 because there are no nulls in C2, but the transformed dataset also contains 0s, which correspond to the less frequent items, and this is not reflected in the index mapping of the unique.C2.parquet file.
import pandas as pd

# read back the unique categories for C1
unique_C1 = pd.read_parquet('./categories/unique.C1.parquet')
print(unique_C1)
# read back the unique categories for C2
unique_C2 = pd.read_parquet('./categories/unique.C2.parquet')
print(unique_C2)
Expected behavior
unique.C1.parquet and unique.C2.parquet files should reflect the index-value mapping correctly.
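To make the expectation concrete, here is a minimal sketch of a consistency check between a transformed column and its mapping file. It is not NVTabular code: `size_mismatches` is a hypothetical helper, the toy data is made up, and it assumes that the workflow was fit and transformed on the same data and that the size stored for an index is meant to equal the number of rows encoded to that index.

```python
from collections import Counter


def size_mismatches(encoded_values, mapping_sizes):
    """Compare encoded values against the sizes recorded in a mapping.

    encoded_values: iterable of integer codes from the transformed column.
    mapping_sizes:  dict of {index: size} as read from the mapping file.
    Returns {index: (recorded_size, observed_count)} for every index where
    the two disagree; recorded_size is None if the index is not in the mapping.
    """
    observed = Counter(encoded_values)
    all_codes = set(observed) | set(mapping_sizes)
    return {
        code: (mapping_sizes.get(code), observed.get(code, 0))
        for code in sorted(all_codes)
        if mapping_sizes.get(code) != observed.get(code, 0)
    }


# Hypothetical toy data shaped like the C2 case: index 0 has size 0 in the
# mapping, yet the encoded column contains two 0s.
encoded_c2 = [0, 1, 2, 1, 0, 1]
mapping_c2 = {0: 0, 1: 3, 2: 1}
print(size_mismatches(encoded_c2, mapping_c2))  # {0: (0, 2)}
```

An empty result would mean that every encoded index is accounted for in the mapping. The inputs could be built from the transformed dataframe and from the parquet files read back above, using whatever column names those files actually contain.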
Environment details (please complete the following information):
docker pull & docker run commands used: merlin-tensorflow-training:22.04 with the latest NVT main branch pulled.