Checkpointing DLRMv2 #639
Comments
Can you check the "Snapshot Content Access" section in https://pytorch.org/torchsnapshot/main/getting_started.html to see if the example meets your needs? For example: t_cat_0_weight = snapshot.read_object(path="0/model/sparse_arch/embedding_bag_collection/embedding_bags/t_cat_9")
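The suggestion above can be sketched as a small helper. This is a minimal sketch, not the maintainers' code: the helper names `embedding_table_path`, `read_embedding_table` and `list_snapshot_entries` are hypothetical, the path layout is copied from the example in the comment, and the exact key (for instance whether it ends in `.weight`) depends on how the app state was saved, so listing the manifest first is a reasonable way to confirm it.

```python
def embedding_table_path(table, rank=0):
    """Build a logical snapshot path in the layout used by the example above.

    Assumption: "<rank>/model/sparse_arch/embedding_bag_collection/embedding_bags/<table>"
    is the layout of your snapshot; check it against the manifest.
    """
    return f"{rank}/model/sparse_arch/embedding_bag_collection/embedding_bags/{table}"


def list_snapshot_entries(snapshot_dir):
    """Return the sorted logical paths stored in a snapshot."""
    # Imported lazily so the path helper above works without torchsnapshot installed.
    from torchsnapshot import Snapshot

    snapshot = Snapshot(path=snapshot_dir)
    return sorted(snapshot.get_manifest().keys())


def read_embedding_table(snapshot_dir, table, rank=0):
    """Read one embedding table from a snapshot directory."""
    from torchsnapshot import Snapshot

    snapshot = Snapshot(path=snapshot_dir)
    return snapshot.read_object(path=embedding_table_path(table, rank))


if __name__ == "__main__":
    # Only the path is printed here; reading requires a real snapshot directory.
    print(embedding_table_path("t_cat_9"))
```

For very large tables, reading a whole table into host memory may not be practical, so checking the entry names with `list_snapshot_entries` before reading anything is the cheaper first step.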
For TorchRec-based DLRM models, you simply can For more info, see:
Regarding
There are 26 categories / embedding tables in the model with row counts given by From that, we can see that the largest tables are for cat_0, cat_9, cat_19, cat_20 and cat_21. These are also the categories you see in the "sharded" directory you mentioned. I believe that the suffixes in the file names are start indices (offsets) for a given category as they never exceed When running https://github.com/mlcommons/training/blob/master/recommendation_v2/torchrec_dlrm/dlrm_main.py you could also add
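To check the offset interpretation against the file names in the "sharded" directory, the names can be parsed and grouped per table. This is a minimal sketch under one assumption that the thread does not confirm: the two trailing integers are taken to be the row and column start offsets. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches names such as
# "model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_1048576_0".
# Assumption: the trailing integers are <row_offset>_<col_offset>.
SHARD_NAME = re.compile(r"\.(?P<table>t_cat_\d+)\.weight_(?P<row>\d+)_(?P<col>\d+)$")


def parse_shard_name(name):
    """Return (table, row_offset, col_offset) for one shard file name."""
    match = SHARD_NAME.search(name)
    if match is None:
        raise ValueError(f"unrecognized shard file name: {name}")
    return match["table"], int(match["row"]), int(match["col"])


def row_offsets_by_table(names):
    """Group shard file names by table and sort the row offsets numerically."""
    offsets = defaultdict(list)
    for name in names:
        table, row, _col = parse_shard_name(name)
        offsets[table].append(row)
    return {table: sorted(rows) for table, rows in offsets.items()}


if __name__ == "__main__":
    prefix = "model.sparse_arch.embedding_bag_collection.embedding_bags."
    sample = [
        prefix + "t_cat_0.weight_0_0",
        prefix + "t_cat_0.weight_10000000_0",
        prefix + "t_cat_0.weight_1048576_0",
        prefix + "t_cat_10.weight_2097152_0",
    ]
    for table, rows in row_offsets_by_table(sample).items():
        print(table, rows)
```

Sorting numerically matters because a plain directory listing orders the names lexicographically, which places weight_10000000_0 before weight_1048576_0.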
@mailvijayasingh is this still an issue?
Closing as it has been more than a year since the last comment.
Copying the issue link that I had posted on dlrm repo:
facebookresearch/dlrm#346 (comment)
The content goes here:
I tried to use torchsnapshot to save checkpoints of the model in the torchrec implementation. I made the following changes in dlrm_main.py for this purpose.
I did get some weights saved in the embedding_shards directory; however, I am not sure how to interpret the saved directory.
In the embedding_shards directory, I see two directories: batched and sharded.
batched has 8 files (the names are UUIDs), with a total size of 196 GB.
sharded has the following files, with a total size of 98 GB:
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_10000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_1048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_11048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_12097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_13145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_14194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_15000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_16048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_17097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_18145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_19194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_20000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_2097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_21048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_22097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_23145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_24194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_25000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_26048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_27097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_28145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_29194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_30000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_31048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_3145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_32097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_33145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_34194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_35000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_36048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_37097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_38145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_39194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_4194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_5000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_6048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_7097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_8145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_9194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_1048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_2097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_11.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_10000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_1048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_11048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_12097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_13145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_14194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_15000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_16048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_17097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_18145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_19194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_20000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_2097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_21048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_22097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_23145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_24194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_25000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_26048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_27097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_28145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_29194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_30000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_31048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_3145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_32097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_33145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_34194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_35000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_36048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_37097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_38145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_39194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_4194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_5000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_6048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_7097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_8145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_9194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_10000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_1048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_11048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_12097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_13145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_14194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_15000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_16048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_17097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_18145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_19194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_20000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_2097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_21048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_22097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_23145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_24194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_25000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_26048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_27097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_28145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_29194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_30000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_31048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_3145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_32097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_33145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_34194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_35000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_36048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_37097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_38145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_39194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_4194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_5000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_6048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_7097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_8145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_9194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_10000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_1048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_11048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_12097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_13145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_14194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_15000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_16048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_17097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_18145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_19194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_20000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_2097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_21048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_22097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_23145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_24194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_25000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_26048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_27097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_28145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_29194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_30000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_31048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_3145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_32097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_33145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_34194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_35000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_36048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_37097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_38145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_39194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_4194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_5000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_6048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_7097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_8145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_9194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_22.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_0_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_10000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_1048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_11048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_12097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_13145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_14194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_15000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_16048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_17097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_18145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_19194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_20000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_2097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_21048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_22097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_23145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_24194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_25000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_26048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_27097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_28145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_29194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_30000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_31048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_3145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_32097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_33145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_34194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_35000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_36048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_37097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_38145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_39194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_4194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_5000000_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_6048576_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_7097152_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_8145728_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_9194304_0
take
within rank 0 or using it the way pointed out in the code snippet above is sufficient? cc: @erichan1