Class Error Problem #26
Hi @Hatins,
Hi, @orrzohar
Hi @Hatins, If that is the case, could you please write the hyperparameters here for future reference? Best,
Hi @orrzohar
@Hatins @orrzohar I am also having the same issue, but I am only using 1 GPU (I do not have a Slurm setup, but I do have a single system with 2 GPUs). I configured the hyperparameters as lr: 8e-6 and lr_backbone: 8e-8. How can I resolve this issue?
Hi @Hatins, @Sidd1609, could you try the OW-DETR hyperparameters if you use the same batch size as them, namely the stock Deformable DETR defaults?
OW-DETR and PROB are both exactly the DDETR model for the known classes during training (notice that the revised inference scheme is used at test time/evaluation only, while the class_error is calculated during training, and objectness plays no role here). The only other option is that you have an issue with your dataset configuration, but that seems rather unlikely. Best,
Hi @orrzohar

```python
parser = argparse.ArgumentParser('Deformable DETR Detector', add_help=False)
parser.add_argument('--lr', default=2e-4, type=float)
parser.add_argument('--lr_backbone_names', default=["backbone.0"], type=str, nargs='+')
parser.add_argument('--lr_backbone', default=2e-5, type=float)
parser.add_argument('--lr_linear_proj_names', default=['reference_points', 'sampling_offsets'], type=str, nargs='+')
parser.add_argument('--lr_linear_proj_mult', default=0.1, type=float)
parser.add_argument('--batch_size', default=2, type=int)
parser.add_argument('--weight_decay', default=1e-4, type=float)
parser.add_argument('--epochs', default=51, type=int)
parser.add_argument('--lr_drop', default=40, type=int)
parser.add_argument('--lr_drop_epochs', default=None, type=int, nargs='+')
parser.add_argument('--clip_max_norm', default=0.1, type=float,
                    help='gradient clipping max norm')
```

So I would like to use those parameters and test it tonight and return the results as soon as possible!
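For anyone reproducing this, here is a minimal, self-contained sketch of overriding those defaults with the reduced values discussed later in this thread (lr 8e-5, lr_backbone 8e-6, batch_size 2 on smaller-memory GPUs). The abridged parser below is only an illustration, not the repo's actual entry point:

```python
import argparse

# Abridged version of the parser above, used only to show how the defaults
# could be overridden with the smaller-batch values from this thread.
parser = argparse.ArgumentParser('Deformable DETR Detector', add_help=False)
parser.add_argument('--lr', default=2e-4, type=float)
parser.add_argument('--lr_backbone', default=2e-5, type=float)
parser.add_argument('--batch_size', default=2, type=int)

args = parser.parse_args(['--lr', '8e-5', '--lr_backbone', '8e-6', '--batch_size', '2'])
print(args.lr, args.lr_backbone, args.batch_size)  # 8e-05 8e-06 2
```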
Hi @orrzohar thank you so much for your explanation, I will definitely test it out with these OW-DETR hyperparameters. I did notice the similarities with OW-DETR, and I am a little worried I might not be able to run with a batch size of more than 2 (due to the constraint of a single GPU).
I'm sorry, I can't do the experiment this time because the machine is occupied by others. But from the results (4 epochs) I have so far, the class error still fluctuates between 50 and 95. Is that normal? When you train, does the value decrease smoothly without fluctuating?
Hi @Hatins, I am not sure, as I never trained PROB with a system like yours. I can share the evaluation curves if that helps, and then you could tell whether the mAP/U-Recall have similar trends to mine. They may be different due to the different batch_size, but they should hopefully have a similar trend. LMK how this does / if the class error issue is resolved - I'll update the codebase accordingly to support systems such as yours, as this seems to be a common machine. Best,
Hi @Sidd1609, That is fine - OW-DETR was trained with a batch size of 2, so this should hopefully work. Best,
Hi @orrzohar, here is the error on the machine with 12 GB of VRAM: I will update my results after the epochs finish.
Hi @orrzohar @Sidd1609 By the way, I would like to ask about the main metrics in the paper. Are mAP and U-Recall represented by the K_AP50 and U_R50 curves in wandb? If so, could you help me verify whether my trend is correct? If not, could you tell me which curve is the target? Interestingly, it is worth mentioning that U_R50 did not increase as training time increased. OK, that's all; I think you can close the issue now. Thank you once again for your patient responses. I feel incredibly fortunate to have encountered such a responsible author!
Hi @Hatins, I am happy to help! Yes, these are very similar to the curves I got on my machine: I actually looked into why this happens, and it has to do with PROB learning a good representation of objectness very early on (which is why U-Recall initially jumps; if you plot U-Recall within epoch 1, you will see it increase from ~0 to 19). Then, as training progresses, it starts declining as the model makes more known-object predictions and therefore fewer unknown-object predictions (e.g., ~U-Recall@100 goes down to ~U-Recall@80). I will update the README with this hyperparameter setup & machine type for future users. If you encounter any new issues, do not hesitate to reach out,
Hi @Sidd1609, I am going to close this issue now - please let me know if, when training on a 1-GPU system, you see the same dynamics, so I can update the README to include your system (as well as details about your system; are you using a single TITAN 12G?) Best,
Hi @orrzohar, with `parser.add_argument('--lr', default=8e-5, type=float)` I do observe a decline in class_error and loss. However, the trend is not smooth; class_error keeps fluctuating between 0 and 100, and the loss stays within 18-21 after 10 epochs.
Hi @Sidd1609
Hi @Hatins yes, sharing the results from wandb for 100 epochs
Hi @Sidd1609
Hi @Hatins, Notice that as he is using only two GPUs, his "batch size" (= batch_size * number of GPUs) is 1/2 of yours - so 80K steps for @Sidd1609 is ~40k steps for you. Also, you show a grouped plot while Sidd does not - so you are showing the average of 4 such processes; perhaps that's why it looks smoother? @Sidd1609, how are the K/PK/CK mAP and U-Recall progressing? Are they close to the figures I sent earlier? If so, I am not sure how critical the class_error is. If they are not, then I echo @Hatins' question - what lrs did you try? Best,
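For reference, a minimal sketch of the effective-batch-size point being made here; N_IMAGES is a hypothetical dataset size used only for illustration:

```python
# Optimizer steps per epoch scale inversely with the effective batch size
# (batch_size * num_gpus), so the same step count covers different amounts
# of data on different machines. N_IMAGES is hypothetical.
N_IMAGES = 80_000

def steps_per_epoch(n_images: int, batch_size: int, num_gpus: int) -> int:
    effective_batch = batch_size * num_gpus  # images consumed per optimizer step
    return n_images // effective_batch

print(steps_per_epoch(N_IMAGES, batch_size=2, num_gpus=1))  # 40000 steps per epoch
print(steps_per_epoch(N_IMAGES, batch_size=2, num_gpus=4))  # 10000 steps per epoch
```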
Hi @Hatins and @orrzohar, yes, as Orr mentioned, I am using only 1 GPU to train the model. I ran the tests using the following lr (scaled according to the batch size and GPU I have): 8e-5. Also, I was running the model for only 1 experience, which means I would not have unknown classes; right now I am running it for 1+1 (2 experiences) instead of 0+2 (1 experience). Sharing the plots for K/PK/CK mAP and U-Recall; some plots are similar to the ones you shared. When I try to use the model for predictions, I do not get good results even with lower thresholds.
@orrzohar another question: does the size of the images matter here? That is, do you have any parameter that handles the image size that I can modify and experiment with? And would it be possible to share a visualization script for testing the model?
Hi @Sidd1609, I am a little confused. Your known mAP is over 80! This means you are not discussing the class_error on M-OWODB or S-OWODB, but on your custom dataset. Then you can no longer expect the class_error to follow the same trend, as it will be heavily influenced by factors such as the dataset size, the number of classes, and the annotation error rate. However, your known mAP is quite high, so you don't have a significant issue with class_error. As for U-Recall, you most likely have a bug in your code such that you effectively have 0 unknown objects annotated during evaluation. Let me know if I am wrong and you are reporting values on M/S-OWODB; if that is the case, then please let me know how you increased K_AP50 to over 80. Best,
Hi @orrzohar, you are right, those were for the custom dataset; however, I ran on M-OWODB and the results look the same.
Hi @Sidd1609, Could you please explain your concern with class_error if the known mAP is high? class_error is only there to help oversee training and is not a performance metric. As you are using a smaller batch size, of course the class error will be more erratic. Class error is calculated per batch, so rather than computing class_error over 8 images, you compute class_error over 2 images. Let's say you have 3 objects per image, so 6 objects total. If you get one prediction correct, then the class error jumps from 100% to 83%. Meanwhile, if your batch size is 8, then for a single correct prediction the class_error would only move from 100% to 96%. So I do not expect the same behavior for class_error. HOWEVER, what is important is the known mAP (the class_error is ONLY for the known classes). If you have high known mAP, this is irrelevant. Also, notice that 40k steps on 1 GPU are 10k steps on Hatins' 4-GPU machine, so this view is also quite zoomed in.
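To make the arithmetic concrete, here is a minimal sketch, assuming the usual DETR-style definition of class_error (100 minus the top-1 accuracy over the objects matched in the current batch):

```python
# Per-batch class_error: with fewer matched objects per batch, each single
# prediction moves the metric by a larger amount, hence the noisier curve
# at small batch sizes.
def class_error(num_correct: int, num_objects: int) -> float:
    return 100.0 * (1.0 - num_correct / num_objects)

OBJECTS_PER_IMAGE = 3  # as in the example above

# one correct prediction, batch of 2 images (6 objects): 100 -> ~83.3
print(class_error(1, 2 * OBJECTS_PER_IMAGE))
# one correct prediction, batch of 8 images (24 objects): 100 -> ~95.8
print(class_error(1, 8 * OBJECTS_PER_IMAGE))
```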
Hi @orrzohar, I see; I believed that the class error would influence the learning in some way by tracking which classes were correctly matched to the GT. I did observe in the code that the computation of class_error is related to the batch size. I understand your point and it makes sense. Thank you so much for your help. I must say the codebase is very user-friendly for extending to custom datasets; I really appreciate your effort and work. Regards
Hi @Sidd1609, Happy to help! Just to verify: the same hyperparameters that worked for Hatins worked for your system as well? If so, I will add your system to the README.
Hi @Hatins,
Hi @Rzx520
I cannot reach the exact values the author reported for Task 1 (U-Recall: 19.4, mAP: 59.5). May I ask what your final results are? I would like to know whether there is a significant difference. @Hatins
As you can see, I got the same results as @orrzohar shows in the paper. I wonder how many cards you used with batch_size = 2. I think if you use a single card, the result may be worse than what I got (I used four cards with batch_size = 3) @Rzx520. By the way, what are your final results? Are they far from the author's results?
I'm sorry, but I'm still training. Because of the long training time, I would like to know whether you used a batch size different from the one provided by the author, because I'm afraid the batch size will have a significant impact on the results, and I don't have that many GPUs. I still want to ask: do you mean that the results when you use batch size 3 are the same as the author's? Haha @Hatins
Yeah, at first I used a machine with four TITAN cards (12 GB) to train, but that machine always reported errors when running, so I used another machine with four RTX 3090s (24 GB) to run with batch_size = 3 or 4. In my opinion, if you just use a single card with batch size = 2, you do have the potential to get a poorer result. Maybe you can ask @Sidd1609 for a reference, since he was using a single card with batch size = 2, which is the same as yours. @Rzx520
Yes, I have been getting errors while using four 12 GB GPUs. I want to know what errors you were getting. I always get an error in the second epoch, after training for one epoch, and training stops. @Hatins
@Rzx520
Indeed, the same error was reported, and every time only one card stops running. This can be resolved by using batch=1, but I think the performance may differ significantly. @Hatins
Yes, @Rzx520, when I use the RTX 3090, whether I set batch=3 or 4, it works well with the same parameters as OW-DETR. And as you say, setting batch=1 may decrease the result significantly. You know, transformers always need more memory /(ㄒoㄒ)/~~.
Hi @orrzohar, yes, I did use the same configuration as the OW-DETR paper; the only difference was the batch size, which I set to 2.
Hi @Sidd1609, I would like to know what GPU you are using and what your final evaluation results are. Are they significantly different from the author's? Thank you.
Hi @Rzx520, I am using "one" RTX 3090. But please note that my results were reported on a custom dataset, not on the VOC-based splits that Orr reported.
It is indeed like this. We can see that U_R50 decreases even as training time increases. I am quite puzzled: why not choose the model with the highest U_R50? Haha @orrzohar
I used four cards with batch_size = 3,the result is : {"train_lr": 1.999999999999943e-05, "train_class_error": 15.52755644357749, "train_grad_norm": 119.24543388206256, "train_loss": 5.189852057201781, "train_loss_bbox": 0.2700958194790585, "train_loss_bbox_0": 0.29624945830832017, "train_loss_bbox_1": 0.27978440371434526, "train_loss_bbox_2": 0.275065722955665, "train_loss_bbox_3": 0.27241891570675625, "train_loss_bbox_4": 0.27063051075218725, "train_loss_ce": 0.18834440561282928, "train_loss_ce_0": 0.27234036786085974, "train_loss_ce_1": 0.23321395799885028, "train_loss_ce_2": 0.20806531186409408, "train_loss_ce_3": 0.19453731594314128, "train_loss_ce_4": 0.18820172232765492, "train_loss_giou": 0.3351372324140976, "train_loss_giou_0": 0.3679243937037491, "train_loss_giou_1": 0.3483400315024699, "train_loss_giou_2": 0.34171414935044225, "train_loss_giou_3": 0.3379105142249501, "train_loss_giou_4": 0.3368650070453053, "train_loss_obj_ll": 0.02471167313379382, "train_loss_obj_ll_0": 0.034151954339996814, "train_loss_obj_ll_1": 0.03029250531194649, "train_loss_obj_ll_2": 0.0288731191750343, "train_loss_obj_ll_3": 0.028083207809715446, "train_loss_obj_ll_4": 0.026900355121292352, "train_cardinality_error_unscaled": 0.44506890101437985, "train_cardinality_error_0_unscaled": 0.6769398279525907, "train_cardinality_error_1_unscaled": 0.5726976196583499, "train_cardinality_error_2_unscaled": 0.4929900999093851, "train_cardinality_error_3_unscaled": 0.46150593285633223, "train_cardinality_error_4_unscaled": 0.45256225438417086, "train_class_error_unscaled": 15.52755644357749, "train_loss_bbox_unscaled": 0.054019163965779084, "train_loss_bbox_0_unscaled": 0.059249891647616536, "train_loss_bbox_1_unscaled": 0.055956880831476395, "train_loss_bbox_2_unscaled": 0.055013144572493046, "train_loss_bbox_3_unscaled": 0.054483783067331704, "train_loss_bbox_4_unscaled": 0.05412610215448962, "train_loss_ce_unscaled": 0.09417220280641464, "train_loss_ce_0_unscaled": 0.13617018393042987, "train_loss_ce_1_unscaled": 0.11660697899942514, "train_loss_ce_2_unscaled": 0.10403265593204704, "train_loss_ce_3_unscaled": 0.09726865797157064, "train_loss_ce_4_unscaled": 0.09410086116382746, "train_loss_giou_unscaled": 0.1675686162070488, "train_loss_giou_0_unscaled": 0.18396219685187454, "train_loss_giou_1_unscaled": 0.17417001575123495, "train_loss_giou_2_unscaled": 0.17085707467522113, "train_loss_giou_3_unscaled": 0.16895525711247505, "train_loss_giou_4_unscaled": 0.16843250352265265, "train_loss_obj_ll_unscaled": 30.889592197686543, "train_loss_obj_ll_0_unscaled": 42.68994404527915, "train_loss_obj_ll_1_unscaled": 37.86563257517548, "train_loss_obj_ll_2_unscaled": 36.09139981038161, "train_loss_obj_ll_3_unscaled": 35.10401065181873, "train_loss_obj_ll_4_unscaled": 33.62544476769816, "test_metrics": {"WI": 0.05356004827184098, "AOSA": 5220.0, "CK_AP50": 58.3890380859375, "CK_P50": 25.75118307055908, "CK_R50": 71.51227713815234, "K_AP50": 58.3890380859375, "K_P50": 25.75118307055908, "K_R50": 71.51227713815234, "U_AP50": 2.7862398624420166, "U_P50": 0.409358215516747, "U_R50": 16.530874785591767}, "test_coco_eval_bbox": [14.451444625854492, 14.451444625854492, 77.8148193359375, 57.15019607543945, 66.93928527832031, 49.282108306884766, 27.985671997070312, 70.54130554199219, 55.28901290893555, 82.7206039428711, 26.307403564453125, 65.15182495117188, 21.9127197265625, 77.91541290283203, 73.61457061767578, 67.8846206665039, 49.1287841796875, 36.78118896484375, 69.1879653930664, 53.060150146484375, 79.1402359008789, 
59.972835540771484, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.7862398624420166], "epoch": 40, "n_parameters": 39742295} the authors' results is : |
I saw your picture showing that U-Recall is only around 18; is this before or after the change? Have you reached 19.4 since then? @Hatins
Hi, @orrzohar,
I'm facing the same issue as described in issue #24. I followed the installation instructions and used four TITAN (12 GB) GPUs to run the code. I set the batch size to 2 and reduced the learning rate from 2e-4 to 8e-5, as well as the lr_backbone from 2e-5 to 8e-6.
However, even after 2 epochs, the class_error metric remains quite high, ranging from 80 to 100. Is this normal? If not, what could be causing this problem?
The value keeps fluctuating and is unstable.