
Accuracy is low for examples/train_math_net with cuda #58

Open
npuichigo opened this issue May 5, 2024 · 5 comments

Comments

@npuichigo

cargo run --release --features cuda
Iter 20649 Loss: 6.49 Acc: 0.17
Iter 20650 Loss: 6.48 Acc: 0.17
Iter 20651 Loss: 6.49 Acc: 0.17
Iter 20652 Loss: 6.48 Acc: 0.17
Iter 20653 Loss: 6.49 Acc: 0.17
Iter 20654 Loss: 6.48 Acc: 0.17
Iter 20655 Loss: 6.47 Acc: 0.17
Iter 20656 Loss: 6.48 Acc: 0.17
Iter 20657 Loss: 6.48 Acc: 0.17
Iter 20658 Loss: 6.48 Acc: 0.17
Iter 20659 Loss: 6.48 Acc: 0.17
Iter 20660 Loss: 6.48 Acc: 0.17
Iter 20661 Loss: 6.47 Acc: 0.17
Iter 20662 Loss: 6.47 Acc: 0.17
Iter 20663 Loss: 6.47 Acc: 0.17
@jafioti (Owner) commented May 5, 2024

Agreed, I'm seeing the same thing. Will fix.

@swfsql (Contributor) commented Jun 24, 2024

I've added a small PR with a temporary fix, which may indicate how the Cuda training (in general) is going wrong.

@jafioti (Owner) commented Jun 25, 2024

Hmm, very interesting. Your changes trigger a copy-back of the data to the CPU rather than keeping it on the GPU; I wonder why that makes it accurate. Sorry I haven't gotten around to looking at this in depth. I'll have time this weekend to check it out, and access to a CUDA machine.

@swfsql (Contributor) commented Jun 25, 2024

Could it be that the initial CudaCopyToDevice calls (made at the start of every iteration) are always overwriting the latest GPU weight values with the (static, initial) CPU weight values?
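
Here's a rough sketch of the failure mode I have in mind (plain Rust with made-up names, not luminal's actual graph API):

```rust
// Hypothetical illustration only -- names and structure are invented, not luminal's API.
// If a host->device copy op is re-executed every iteration from the *initial*
// host-side weights, each GPU-side optimizer step gets clobbered on the next pass.
fn train(host_weights: Vec<f32>, iters: usize) {
    // Stand-in for a CUDA buffer, initialised once from the host weights.
    let mut device_weights = host_weights.clone();

    for _ in 0..iters {
        // Suspected bug: the copy-to-device op runs again each iteration and
        // overwrites the weights the optimizer just updated on the device.
        device_weights.copy_from_slice(&host_weights);

        // Forward/backward/optimizer step all happen on the device buffer...
        for w in device_weights.iter_mut() {
            *w -= 0.01; // placeholder "gradient step"
        }
        // ...but host_weights is never updated, so the step is lost next iteration.
    }
}
```

If that's what is happening, the model would keep retraining from its initial weights every iteration, which would match the loss plateauing around 6.48.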

@jafioti (Owner) commented Jun 28, 2024

I don't think so; ops don't get run if the destination tensor is already produced, so the copy to device shouldn't be run as long as the CUDA buffers weren't getting deleted first.
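
To sketch the intended behaviour (a hypothetical executor, not luminal's real scheduler): an op only runs when its destination tensor hasn't been produced yet, so the copy-to-device node should fire once and then be skipped on later iterations, as long as its output buffer sticks around.

```rust
use std::collections::HashMap;

// Hypothetical executor sketch -- not luminal's actual code.
// An op is skipped when its destination tensor already exists, so a
// copy-to-device op only runs again if its output buffer was freed first.
fn execute_op(
    op_id: usize,
    outputs: &mut HashMap<usize, Vec<f32>>,
    run_op: impl Fn() -> Vec<f32>,
) {
    if outputs.contains_key(&op_id) {
        return; // destination already produced: skip re-running the op
    }
    outputs.insert(op_id, run_op());
}
```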
