I'm glad to see that you've achieved good results, and I am planning further research based on this work. However, I'm currently facing challenges with the pre-training process using the code provided.
Firstly, I would like to know if the "4M_corpus.tsv" file provided on GitHub is the same dataset used in the paper. This file seems to contain a total of 5 million image-text pairs, which differs from the pre-training log you provided.
[Count of image-text pairs in "4M_corpus.tsv"]
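A count like the one above can be reproduced with a short script. This is only a sketch: it assumes the corpus has one image-text pair per row, no header row, and the image path in the first column (`image_column=0` is a guess, not something confirmed in this thread). It reports both the number of pairs and the number of distinct images, since those two numbers differ whenever an image has several captions.

```python
import csv


def count_pairs_and_images(tsv_path, image_column=0):
    """Return (number of rows, number of distinct image paths) in a TSV corpus.

    Assumes one image-text pair per row and no header row; `image_column`
    is a hypothetical column index for the image path.
    """
    pairs = 0
    images = set()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:
                continue  # skip blank lines
            pairs += 1
            if len(row) > image_column:
                images.add(row[image_column])
    return pairs, len(images)
```

Calling `count_pairs_and_images("4M_corpus.tsv")` would then show whether the 5 million figure refers to pairs or to unique images.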
On the other hand, the pre-training log for ptp-blip shows the following:
In reality, when I trained using the "4M_corpus.tsv" provided by your team, the total count of image-text pairs exceeded 5 million. We conducted the pre-training with the same experimental setup (as mentioned in the pretrain_concated_pred_4M.yaml file). However, we encountered the phenomenon of gradient explosion, as shown in the image.
Our setup includes four A6000 GPUs, with a batch size of 75 per GPU, resulting in a total batch size of 300 per step (compared to 600 in the paper). However, this configuration led to gradient explosion, hindering the progress of training.
We then used gradient accumulation to match the paper's effective batch size of 600 per step. However, the gradients still exploded.
The main cause of the explosion seems to be the "ita" loss, which was unstable and did not decrease consistently. The language modeling (LM) loss decreased steadily, so the unstable "ita" loss suggests a potential issue with the image data.
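For reference, the batch-size arithmetic works out as 4 GPUs x 75 samples x 2 accumulation steps = 600. The sketch below checks that arithmetic and shows, on a toy one-parameter squared-error model (not the PTP model), that averaging micro-batch gradients reproduces the full-batch gradient. All function names are illustrative. The equivalence holds for losses that are a mean over independent samples. It may not hold exactly for a contrastive loss such as "ita" if its negatives depend on what is in each micro-batch.

```python
def effective_batch_size(per_gpu_batch, num_gpus, accum_steps):
    """Samples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * accum_steps


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_steps):
    """Average the gradients of equal-sized micro-batches.

    Assumes len(xs) is divisible by accum_steps.
    """
    micro = len(xs) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro, (i + 1) * micro
        # Scale each micro-batch gradient so the sum is an average.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total


if __name__ == "__main__":
    print(effective_batch_size(per_gpu_batch=75, num_gpus=4, accum_steps=2))
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 5.0]
    print(mse_grad(0.5, xs, ys), accumulated_grad(0.5, xs, ys, accum_steps=2))
```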
If you have any insights or advice regarding the potential causes of the loss explosion during my pre-training, I would greatly appreciate your guidance.
It's hard to pinpoint the problem because I did not encounter it with the 4M-image setting (for COCO and VG, one image may have multiple captions), but I suspect the file names may not match.
Could you please check whether your CC3M paths are the same as ours (we slightly modified the download code)? If the paths do not match, images may be paired with the wrong captions.
Does training run normally with only the 2.8M images (2G) of CC3M?
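A quick way to test the path hypothesis is to check whether every image path listed in the corpus exists locally. The following is a rough sketch under assumptions not confirmed in this thread: the image path is in the first column of the TSV, there is no header row, and paths are relative to a local `image_root` directory (both parameter names are hypothetical).

```python
import csv
import os


def find_missing_images(tsv_path, image_root="", image_column=0, limit=20):
    """Return up to `limit` image paths from the TSV that are not on disk.

    `image_root` and `image_column` are assumptions about the local layout.
    """
    missing = []
    checked = set()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) <= image_column:
                continue  # blank or malformed row
            rel_path = row[image_column]
            if rel_path in checked:
                continue  # one image may appear with several captions
            checked.add(rel_path)
            if not os.path.isfile(os.path.join(image_root, rel_path)):
                missing.append(rel_path)
                if len(missing) >= limit:
                    break
    return missing
```

An empty result only shows that the files exist. It does not prove that each file is the image its caption describes, so spot-checking a few image-caption pairs by eye is still worthwhile.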
I used img2dataset to collect the CC3M images, but it seems that there may have been some mismatch in the file paths, which could have caused the issue.
Regarding your feedback, I attempted pre-training solely on the CC3M dataset (3M_corpus.tsv). However, I encountered the same problem of exploding loss. This leads me to believe that there might be an issue with the CC3M dataset I collected. As a result, I am currently in the process of re-collecting the dataset using the CC3M download code provided by FingerRec. I will make sure to share the pre-training results using this dataset with you once it is completed.
(ptp-blip pre-training log referenced in the original post: https://huggingface.co/sail/PTP/blob/main/4M_pretrain.txt)