
Accuracy is low for examples/train_math_net with cuda #58

Open
npuichigo opened this issue May 5, 2024 · 5 comments

Comments

@npuichigo

cargo run --release --features cuda
Iter 20649 Loss: 6.49 Acc: 0.17
Iter 20650 Loss: 6.48 Acc: 0.17
Iter 20651 Loss: 6.49 Acc: 0.17
Iter 20652 Loss: 6.48 Acc: 0.17
Iter 20653 Loss: 6.49 Acc: 0.17
Iter 20654 Loss: 6.48 Acc: 0.17
Iter 20655 Loss: 6.47 Acc: 0.17
Iter 20656 Loss: 6.48 Acc: 0.17
Iter 20657 Loss: 6.48 Acc: 0.17
Iter 20658 Loss: 6.48 Acc: 0.17
Iter 20659 Loss: 6.48 Acc: 0.17
Iter 20660 Loss: 6.48 Acc: 0.17
Iter 20661 Loss: 6.47 Acc: 0.17
Iter 20662 Loss: 6.47 Acc: 0.17
Iter 20663 Loss: 6.47 Acc: 0.17
@jafioti (Owner) commented May 5, 2024

Agreed, I'm seeing the same thing. Will fix.

@swfsql (Contributor) commented Jun 24, 2024

I've added a small PR with a temporary fix, which may indicate how the Cuda training (in general) is going wrong.

@jafioti (Owner) commented Jun 25, 2024

Hmm, very interesting. Your changes trigger a copy-back of the data to the CPU rather than keeping it on the GPU; I wonder why that makes it accurate. Sorry I haven't gotten around to looking at this in depth. I'll have time this weekend to check it out, and access to a CUDA machine.

@swfsql (Contributor) commented Jun 25, 2024

Could it be that the initial CudaCopyToDevice calls (made at the start of every iteration) are always overwriting the latest GPU weight values with the (static, initial) CPU weight values?
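
Here's a rough sketch of the failure mode I have in mind (plain Rust with made-up names, not luminal's actual graph API):

```rust
// Hypothetical illustration only -- names and structure are invented, not luminal's API.
// If a host->device copy op is re-executed every iteration from the *initial*
// host-side weights, each GPU-side optimizer step gets clobbered on the next pass.
fn train(host_weights: Vec<f32>, iters: usize) {
    // Stand-in for a CUDA buffer, initialised once from the host weights.
    let mut device_weights = host_weights.clone();

    for _ in 0..iters {
        // Suspected bug: the copy-to-device op runs again each iteration and
        // overwrites the weights the optimizer just updated on the device.
        device_weights.copy_from_slice(&host_weights);

        // Forward/backward/optimizer step all happen on the device buffer...
        for w in device_weights.iter_mut() {
            *w -= 0.01; // placeholder "gradient step"
        }
        // ...but host_weights is never updated, so the step is lost next iteration.
    }
}
```

If that's what is happening, the model would keep retraining from its initial weights every iteration, which would match the loss plateauing around 6.48.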

@jafioti (Owner) commented Jun 28, 2024

I don't think so; ops don't get run if the destination tensor is already produced, so the copy to device shouldn't be run as long as the CUDA buffers weren't getting deleted first.
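
To sketch the intended behaviour (a hypothetical executor, not luminal's real scheduler): an op only runs when its destination tensor hasn't been produced yet, so the copy-to-device node should fire once and then be skipped on later iterations, as long as its output buffer sticks around.

```rust
use std::collections::HashMap;

// Hypothetical executor sketch -- not luminal's actual code.
// An op is skipped when its destination tensor already exists, so a
// copy-to-device op only runs again if its output buffer was freed first.
fn execute_op(
    op_id: usize,
    outputs: &mut HashMap<usize, Vec<f32>>,
    run_op: impl Fn() -> Vec<f32>,
) {
    if outputs.contains_key(&op_id) {
        return; // destination already produced: skip re-running the op
    }
    outputs.insert(op_id, run_op());
}
```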
