[BUG] Categorify(max_size) is not generating the mappings in the unique parquet files as expected #1517

Closed
rnyak opened this issue Apr 18, 2022 · 1 comment
Labels: bug

rnyak (Contributor) commented Apr 18, 2022

Describe the bug
I noticed that when we use max_size in the Categorify op, the generated unique category parquet files do not correctly map the original values, the encoded indices, and the null values for the columns.

Steps/Code to reproduce bug

Please run the example below.

import cudf
import numpy as np
import nvtabular as nvt
from nvtabular import Workflow, ops

gdf = cudf.DataFrame(
    {
        "C1": [1, np.nan, 3, 4, 3] * 5,
        "C2": [1, 1, 2, 3, 6] * 5,
    }
)

cat_features = ["C1", "C2"] >> ops.Categorify(max_size=4)

train_dataset = nvt.Dataset(gdf)

workflow = Workflow(cat_features)
workflow.fit(train_dataset)
tmp = workflow.transform(train_dataset).to_ddf().compute()
print(tmp)

Reading back the unique parquet files, you will see that for the C1 column, null values are not mapped to 0; instead they were mapped to 1. I am not sure this is expected, since we use the max_size param. For the C2 column, index 0 is mapped to NA and its size is 0 (there are no nulls in the C2 column), yet we also have 0s in the transformed dataset, which correspond to the less frequent items; this is not reflected in the unique.C2.parquet index mapping.

import pandas as pd

# read back the unique categories
unique_C1 = pd.read_parquet('./categories/unique.C1.parquet')
print(unique_C1)

unique_C2 = pd.read_parquet('./categories/unique.C2.parquet')
print(unique_C2)
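
As a quick sanity check (a hypothetical snippet, not part of the original report), one can compare the encoded output against the raw nulls; per the behavior described above, the nulls come back as 1 rather than the expected sentinel 0:

# Hypothetical consistency check: every null in the raw input should have
# been encoded to the na_sentinel index 0 in the transformed output.
null_rows = gdf["C1"].isna()            # boolean mask over the raw data
encoded_for_nulls = tmp["C1"][null_rows]
print(encoded_for_nulls.unique())       # expected: [0]; observed per this report: [1]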

Expected behavior

The unique.C1.parquet and unique.C2.parquet files should reflect the index-to-value mapping correctly.

Environment details (please complete the following information):

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of NVTabular install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

merlin-tensorflow-training:22.04 with the latest NVT main branch pulled.

rnyak added the bug label Apr 18, 2022
jperez999 (Contributor) commented

This is a bug that trips up here: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/nvtabular/ops/categorify.py#L1076. That call re-sorts the values based on count without respecting the na_sentinel location. Solution inbound.
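
To illustrate the failure mode, here is a minimal pandas sketch (the column names C1 and C1_size are assumptions based on the size column mentioned above; this is not the actual NVTabular implementation). A naive sort by count moves the null sentinel row away from index 0, while pinning that row first and sorting only the rest preserves the mapping:

import pandas as pd

# Hypothetical value/count table before sorting, with the null sentinel
# row at index 0 (counts match the C1 example above: 3 appears 10 times,
# and 1, 4, and the nulls appear 5 times each).
df = pd.DataFrame({"C1": [None, 1.0, 3.0, 4.0], "C1_size": [5, 5, 10, 5]})

# Naive re-sort by count: the null row drifts away from index 0, which
# matches the behavior reported in this issue.
naive = df.sort_values("C1_size", ascending=False).reset_index(drop=True)

# Sketch of a fix: pin the sentinel row at index 0 and only sort the
# remaining rows by count.
sentinel, rest = df.iloc[:1], df.iloc[1:]
fixed = pd.concat(
    [sentinel, rest.sort_values("C1_size", ascending=False)]
).reset_index(drop=True)

print(naive)  # null row no longer at index 0
print(fixed)  # null row stays at index 0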
