[BUG] Categorify(max_size) is not generating the mappings in the unique parquet files as expected #1517

Closed
rnyak opened this issue Apr 18, 2022 · 1 comment
Labels: bug

rnyak (Contributor) commented Apr 18, 2022

Describe the bug
I noticed that when we use max_size in the Categorify op, the generated unique category parquet files do not correctly map the original values, the encoded indices, and the null values for the columns.

Steps/Code to reproduce bug

Please run the example below.

import cudf
import numpy as np
import nvtabular as nvt
from nvtabular import Workflow, ops

gdf = cudf.DataFrame(
    {
        "C1": [1, np.nan, 3, 4, 3] * 5,
        "C2": [1, 1, 2, 3, 6] * 5,
    }
)

cat_features = ["C1", "C2"] >> ops.Categorify(max_size=4)

train_dataset = nvt.Dataset(gdf)

workflow = Workflow(cat_features)
workflow.fit(train_dataset)
tmp = workflow.transform(train_dataset).to_ddf().compute()
print(tmp)

Reading back the unique parquet files, you will see that for the C1 column, null values are not mapped to 0; instead they were mapped to 1. I am not sure this is expected, since we use the max_size param. For the C2 column, index 0 is mapped to NA and its size is 0 (there are no nulls in the C2 column), yet we also have 0s in the transformed dataset, which correspond to the less frequent items; this is not reflected in the unique.C2.parquet index mapping.

import pandas as pd

# read back the unique categories
unique_C1 = pd.read_parquet('./categories/unique.C1.parquet')
print(unique_C1)

unique_C2 = pd.read_parquet('./categories/unique.C2.parquet')
print(unique_C2)
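
As a quick sanity check (a hypothetical snippet, not part of the original report), one can compare the encoded output against the raw nulls; per the behavior described above, the nulls come back as 1 rather than the expected sentinel 0:

# Hypothetical consistency check: every null in the raw input should have
# been encoded to the na_sentinel index 0 in the transformed output.
null_rows = gdf["C1"].isna()            # boolean mask over the raw data
encoded_for_nulls = tmp["C1"][null_rows]
print(encoded_for_nulls.unique())       # expected: [0]; observed per this report: [1]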

Expected behavior

The unique.C1.parquet and unique.C2.parquet files should reflect the index-to-value mapping correctly.

Environment details (please complete the following information):

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of NVTabular install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

merlin-tensorflow-training:22.04 with the latest NVT main branch pulled.

rnyak added the bug label Apr 18, 2022
jperez999 (Contributor) commented

This is a bug that trips up here: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/nvtabular/ops/categorify.py#L1076. That call re-sorts the values based on count without respecting the na_sentinel location. Solution inbound.
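
To illustrate the failure mode, here is a minimal pandas sketch (the column names C1 and C1_size are assumptions based on the size column mentioned above; this is not the actual NVTabular implementation). A naive sort by count moves the null sentinel row away from index 0, while pinning that row first and sorting only the rest preserves the mapping:

import pandas as pd

# Hypothetical value/count table before sorting, with the null sentinel
# row at index 0 (counts match the C1 example above: 3 appears 10 times,
# and 1, 4, and the nulls appear 5 times each).
df = pd.DataFrame({"C1": [None, 1.0, 3.0, 4.0], "C1_size": [5, 5, 10, 5]})

# Naive re-sort by count: the null row drifts away from index 0, which
# matches the behavior reported in this issue.
naive = df.sort_values("C1_size", ascending=False).reset_index(drop=True)

# Sketch of a fix: pin the sentinel row at index 0 and only sort the
# remaining rows by count.
sentinel, rest = df.iloc[:1], df.iloc[1:]
fixed = pd.concat(
    [sentinel, rest.sort_values("C1_size", ascending=False)]
).reset_index(drop=True)

print(naive)  # null row no longer at index 0
print(fixed)  # null row stays at index 0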
