
Hello, when training on my own dataset, why do the d0 and d1 outputs become NaN late in training? #47

Open

yihong-97 opened this issue Jan 10, 2021 · 4 comments

@yihong-97
Hello, while training on my own dataset I found that at around 156K iterations, the outputs d0 and d1 become NaN, which makes the losses l0 and l1 NaN:
```
[epoch: 308/100000, batch: 2456/ 4085, ite: 156707] train loss: 2.115248, tar: 0.097755
l0: 0.090264, l1: 0.090268, l2: 0.094700, l3: 0.108498, l4: 0.157684, l5: 0.269343, l6: 0.561054

[epoch: 308/100000, batch: 2464/ 4085, ite: 156708] train loss: 2.114669, tar: 0.097731
l0: 0.110660, l1: 0.110660, l2: 0.116909, l3: 0.147194, l4: 0.230913, l5: 0.414125, l6: 0.675684

[epoch: 308/100000, batch: 2472/ 4085, ite: 156709] train loss: 2.115880, tar: 0.097773
l0: 0.101519, l1: 0.101512, l2: 0.107206, l3: 0.128373, l4: 0.198813, l5: 0.377140, l6: 0.674387

[epoch: 308/100000, batch: 2480/ 4085, ite: 156710] train loss: 2.116687, tar: 0.097785
l0: 0.092943, l1: 0.092937, l2: 0.097863, l3: 0.117802, l4: 0.182888, l5: 0.299898, l6: 0.505494

[epoch: 308/100000, batch: 2488/ 4085, ite: 156711] train loss: 2.115991, tar: 0.097769
l0: 0.104595, l1: 0.104529, l2: 0.109673, l3: 0.131785, l4: 0.201885, l5: 0.407138, l6: 0.842563

[epoch: 308/100000, batch: 2496/ 4085, ite: 156712] train loss: 2.118025, tar: 0.097791
l0: nan, l1: nan, l2: 2.413359, l3: 2.419617, l4: 2.441422, l5: 2.419549, l6: 2.403301

[epoch: 308/100000, batch: 2504/ 4085, ite: 156713] train loss: nan, tar: nan
l0: nan, l1: nan, l2: 2.489194, l3: 2.498003, l4: 2.527905, l5: 2.497976, l6: 2.474765
```

@xuebinqin
Owner

xuebinqin commented Jan 10, 2021 via email

We use `return F.sigmoid(d0)` in the network definition. This may not be reliable in some cases. You can try to return only `d0` and then replace the current BCE loss with `BCEWithLogitsLoss`. It may help to solve the issue. In addition, it is also good to check your inputs to see whether they are all valid.
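
For concreteness, here is a minimal sketch of what that change could look like, assuming the forward pass is edited to return the raw (pre-sigmoid) side outputs d0–d6; the `side_output_loss` helper and its signature are illustrative stand-ins, not the repo's actual loss function:

```python
import torch
import torch.nn as nn

# BCEWithLogitsLoss fuses the sigmoid into the loss via a log-sum-exp
# formulation, so log(sigmoid(x)) stays finite even when the logits
# saturate; sigmoid followed by BCELoss can hit log(0) and produce NaN.
bce_with_logits = nn.BCEWithLogitsLoss()

def side_output_loss(d0, d1, d2, d3, d4, d5, d6, labels):
    # d0..d6 are assumed to be raw logits from the seven side outputs.
    losses = [bce_with_logits(d, labels) for d in (d0, d1, d2, d3, d4, d5, d6)]
    return losses[0], sum(losses)

if __name__ == "__main__":
    logits = [torch.randn(2, 1, 64, 64) for _ in range(7)]  # fake side outputs
    labels = torch.rand(2, 1, 64, 64)                       # fake masks in [0, 1]
    l0, total = side_output_loss(*logits, labels)
    print(l0.item(), total.item())

# Note: since forward() no longer applies the sigmoid, inference code would
# apply it explicitly, e.g. pred = torch.sigmoid(d0).
```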

@yihong-97
Author

The problem still occurs at the same number of steps after modifying the network to return `d0` directly and switching to `BCEWithLogitsLoss`. The inputs are valid, and it is worth noting that only `d0` and `d1` become NaN; the other outputs are normal.

@xuebinqin
Owner

xuebinqin commented Jan 11, 2021 via email

There are several other options you can try, for example: (1) add `torch.nn.utils.clip_grad_norm` just after `loss.backward()`; (2) change the dataloader to normalize your input images as `image = (image - image.min() + 1e-8)/(image.max() - image.min() + 1e-8)`; etc.
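
A minimal sketch of a training step with both suggestions applied. `net`, `optimizer`, `image`, and `label` are illustrative stand-ins (the real U-2-Net forward returns seven outputs, and the normalization would normally live in the dataloader), and `max_norm=1.0` is an assumed value, not a recommendation from the thread. Note that current PyTorch names the clipping function `torch.nn.utils.clip_grad_norm_` (trailing underscore; the underscore-less spelling is deprecated):

```python
import torch
import torch.nn as nn

def train_step(net, optimizer, image, label,
               loss_fn=nn.BCEWithLogitsLoss(), max_norm=1.0):
    # (2) Epsilon-padded min-max normalization, as suggested above: a
    # constant (zero-range) image can no longer cause a divide-by-zero.
    image = (image - image.min() + 1e-8) / (image.max() - image.min() + 1e-8)

    optimizer.zero_grad()
    pred = net(image)            # assumed single-output net for brevity
    loss = loss_fn(pred, label)
    loss.backward()
    # (1) Clip the global gradient norm right after backward() and before
    # step(), so one exploding batch cannot push the weights to inf/NaN.
    torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()
```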

@yihong-97
Author

Thank you very much. I'll try these options.
