Update the `Categorify` operator to set the domain max correctly #1641
Click to view CI ResultsGitHub pull request #1641 of commit 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4, no merge conflicts. Running as SYSTEM Setting status of 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4615/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4^{commit} # timeout=10 Checking out Revision 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10 Commit message: "Update `Categorify` operator to set the domain max correctly" > git rev-list --no-walk c2a5b743c7a0b458be7af4ca96da091887a044b9 # timeout=10 First time build. 
Skipping changelog. [nvtabular_tests] $ /bin/bash /tmp/jenkins14816013642204511087.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1432 items |
Click to view CI ResultsGitHub pull request #1641 of commit 729eb88f3ebd2064c0eea2acb040ed23aa0e5191, no merge conflicts. Running as SYSTEM Setting status of 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4616/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 729eb88f3ebd2064c0eea2acb040ed23aa0e5191^{commit} # timeout=10 Checking out Revision 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 # timeout=10 Commit message: "Update `DropLowCardinality` to handle changes to `Categorify` domain" > git rev-list --no-walk 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10 [nvtabular_tests] $ 
/bin/bash /tmp/jenkins1109161135988901750.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1432 items |
Documentation preview |
rerun tests |
Click to view CI ResultsGitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. Running as SYSTEM Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4629/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10 Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 Commit message: "Merge branch 'main' into categorify-domain-max" > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10 First time build. Skipping changelog. 
[nvtabular_tests] $ /bin/bash /tmp/jenkins5697026500764221364.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1430 items / 1 skipped |
rerun tests |
Click to view CI ResultsGitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. Running as SYSTEM Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4630/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10 Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 Commit message: "Merge branch 'main' into categorify-domain-max" > git rev-list 
--no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins11669948025439148038.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1430 items / 1 skipped |
rerun tests |
Click to view CI ResultsGitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. Running as SYSTEM Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4631/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10 Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 Commit message: "Merge branch 'main' into categorify-domain-max" > git rev-list 
--no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins9182884185066325902.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1430 items / 1 skipped |
Goal
Reduce the resulting `int_domain.max` property by one on a `ColumnSchema` after transforming with `Categorify`, so that it matches the data correctly.
Motivation / Context
This PR was motivated by work on NVIDIA-Merlin/Merlin#479.
We are using the `domain.max` to compute the vocab size / cardinality when creating embedding tables in Merlin Models. This off-by-one error causes confusion when creating embedding tables of the correct shape from pretrained embedding data.
Example
dataset
dataset.schema
After the Categorify op, these ids are transformed to integers {1, 2} with 0 reserved for out-of-vocabulary. So we have a cardinality of 3 (including the zero).
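The mapping described above can be sketched in plain Python. This is only an illustration of the encoding scheme, not the `Categorify` implementation: the raw ids `"a"` and `"b"`, the helper names `fit_vocab` and `encode`, and the first-appearance ordering are all assumptions made for the example.

```python
OOV_INDEX = 0  # index reserved for out-of-vocabulary values


def fit_vocab(raw_ids):
    """Assign each distinct raw id an integer starting at 1 (hypothetical helper)."""
    vocab = {}
    for raw_id in raw_ids:
        if raw_id not in vocab:
            vocab[raw_id] = len(vocab) + 1
    return vocab


def encode(raw_ids, vocab):
    """Map raw ids to integers, sending unseen ids to the OOV index."""
    return [vocab.get(raw_id, OOV_INDEX) for raw_id in raw_ids]


raw_ids = ["a", "b", "a"]  # hypothetical raw ids
vocab = fit_vocab(raw_ids)
encoded = encode(raw_ids + ["unseen"], vocab)

# Two known ids plus the reserved zero gives a cardinality of 3.
cardinality = len(vocab) + 1

print(encoded)       # [1, 2, 1, 0]
print(cardinality)   # 3
print(max(encoded))  # 2
```

The largest encoded value is 2 while the cardinality is 3, which is the distinction the rest of this description relies on.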
transformed_dataset:
transformed_dataset.schema
However, with the current implementation, the `int_domain.max` value after the transform in this example is 3, which is the cardinality rather than the largest value in the data. The maximum integer value actually produced is 2, one less than the cardinality.
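To make the impact concrete, here is a minimal sketch of how a consumer might size an embedding table from the domain max. It assumes the consumer computes the number of rows as `domain_max + 1`; the function `embedding_table_shape` and the pretrained values are hypothetical and are not taken from Merlin Models.

```python
def embedding_table_shape(domain_max, embedding_dim):
    """Return (rows, dim), assuming rows = domain_max + 1 (indices 0..domain_max)."""
    return (domain_max + 1, embedding_dim)


# Hypothetical pretrained embeddings: one row per index 0 (OOV), 1 and 2.
pretrained = [
    [0.0, 0.0],  # index 0, out-of-vocabulary
    [0.1, 0.2],  # index 1
    [0.3, 0.4],  # index 2
]
pretrained_shape = (len(pretrained), len(pretrained[0]))

# Domain max equal to the cardinality (3): one row too many.
shape_before = embedding_table_shape(domain_max=3, embedding_dim=2)

# Domain max equal to the largest encoded value (2): matches the pretrained data.
shape_after = embedding_table_shape(domain_max=2, embedding_dim=2)

print(pretrained_shape)  # (3, 2)
print(shape_before)      # (4, 2)
print(shape_after)       # (3, 2)
```

Under that assumption, only the reduced domain max yields a table whose shape lines up with the pretrained rows.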