Patch for Cambricon MLUs test #1747

Merged 14 commits into huggingface:main on Jun 6, 2024

Conversation

huismiling (Contributor):

Cambricon MLUs have been supported by peft since huggingface/peft/pull/1687, but pytest still has some failing cases, mostly related to numerical tolerances. This patch fixes those tolerance-related failures.
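For illustration only, a minimal sketch of the kind of device-aware tolerance check the description refers to; the helper name, device check, and tolerance values are assumptions for this sketch, not the PR's actual diff.

import torch

def assert_close(actual: torch.Tensor, expected: torch.Tensor, device_type: str) -> None:
    # Hypothetical helper: MLU kernels may accumulate slightly different rounding
    # error than CPU/CUDA, so a looser tolerance is applied on that device.
    atol, rtol = (1e-4, 1e-4) if device_type == "mlu" else (1e-6, 1e-6)
    assert torch.allclose(actual, expected, atol=atol, rtol=rtol)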

@@ -464,13 +464,13 @@ def _test_merge_layers_fp16(self, model_id, config_cls, config_kwargs):
        if ("gpt2" in model_id.lower()) and (config_cls != LoraConfig):
            self.skipTest("Merging GPT2 adapters not supported for IA³ (yet)")

-       model = self.transformers_class.from_pretrained(model_id, torch_dtype=torch.float16)
+       model = self.transformers_class.from_pretrained(model_id)
        config = config_cls(
huismiling (Contributor, Author):

torch_dtype=torch.float16 leads to an error.
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

BenjaminBossan (Member):

Hmm, I cannot replicate this, whether with or without GPU. The idea of this test is exactly to check that this error does not occur with fp16, so not using this dtype is counter-productive. Is this only occurring with MLU devices?

huismiling (Contributor, Author):

The reproduction code is as follows.
The issue can be reproduced using PyTorch 2.1, but it executes normally with PyTorch 2.3.

import torch

# On PyTorch 2.1 this raises: RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
a = torch.rand(4, 4).to(torch.float16)
b = torch.rand(4, 4).to(torch.float16)
a @ b

BenjaminBossan (Member):

Okay, so instead of changing the dtype, how about skipping the test when an old PyTorch version is detected?
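A self-contained sketch of that idea; the 2.1 cutoff mirrors the condition this PR eventually adds, and the test body is only a stand-in for the real merge test.

import pytest
import torch
from packaging import version

def test_merge_layers_fp16_sketch():
    # Skip on PyTorch versions where fp16 matmul is not implemented on CPU.
    if version.parse(torch.__version__) <= version.parse("2.1"):
        pytest.skip("fp16 matmul on CPU is not supported by this PyTorch version")
    a = torch.rand(4, 4, dtype=torch.float16)
    b = torch.rand(4, 4, dtype=torch.float16)
    assert (a @ b).dtype == torch.float16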

huismiling (Contributor, Author):

Hmm, maybe we can use fp16 with PyTorch >= 2.3 and fp32 with PyTorch < 2.3?
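A tiny sketch of that alternative (not the approach ultimately taken):

import torch
from packaging import version

# Use fp16 only when the installed PyTorch is new enough to support it on CPU.
dtype = (
    torch.float16
    if version.parse(torch.__version__) >= version.parse("2.3")
    else torch.float32
)
print(f"the merge test would run in {dtype}")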

BenjaminBossan (Member):

We really don't need to test merging with fp32 here, as it's tested extensively in other tests. This test is very specifically for merging with fp16, so if we don't use fp16, we can skip it.

huismiling (Contributor, Author):

Ha, got it! I will fix it.

huismiling (Contributor, Author):

Hmm, I found that it is the "cpu" device that leads to the error.
When the device is changed to self.torch_device, as in my fix, the MLU tests pass with torch.float16.
@BenjaminBossan Will this test use the "cpu" device? If not, there is no need to skip the test for PyTorch 2.1.

model = model.to(device="cpu", dtype=torch.float16)
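For context, a runnable sketch of the variant described above, where the model is moved to an auto-detected device instead of a hard-coded "cpu"; the detection logic is a simplified assumption, not necessarily how the test suite resolves self.torch_device.

import torch

# Simplified device detection (assumption): prefer an accelerator if one is present.
if torch.cuda.is_available():
    torch_device = "cuda"
elif hasattr(torch, "mlu") and torch.mlu.is_available():
    torch_device = "mlu"
else:
    torch_device = "cpu"

model = torch.nn.Linear(4, 4)
# The change under discussion: target the detected device rather than "cpu".
model = model.to(device=torch_device, dtype=torch.float16)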

BenjaminBossan (Member):

So IIRC, there is an error when using CPU + float16 + old PyTorch. If we change either of those variables, there is no error. On CI, we have a new PyTorch version, so it passes, despite using CPU.

If we switch to self.torch_device, it depends, because the device is auto-inferred based on what is available. On our CI, this would still be CPU. On yours, it might not be, but then we would no longer test the original intent, namely that float16 works on CPU.

I assume this fails on your CI because it uses an older PyTorch version, which is why I suggested simply skipping the test on older PyTorch versions. If you want, you could add a specific test for merging float16 on MLU, which would be skipped if the device is not available.
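If such a dedicated test were added, it might look roughly like the sketch below; the availability check assumes torch_mlu exposes a torch.mlu namespace, and the body is a stand-in for the real merge logic.

import pytest
import torch

def is_mlu_available() -> bool:
    # Assumption: torch_mlu registers a torch.mlu namespace when installed.
    return hasattr(torch, "mlu") and torch.mlu.is_available()

@pytest.mark.skipif(not is_mlu_available(), reason="test requires a Cambricon MLU")
def test_merge_layers_fp16_on_mlu():
    a = torch.rand(4, 4, dtype=torch.float16, device="mlu")
    b = torch.rand(4, 4, dtype=torch.float16, device="mlu")
    assert (a @ b).dtype == torch.float16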

huismiling (Contributor, Author), Jun 5, 2024:

@BenjaminBossan Got it! Skipping test_merge_layers_fp16 on PyTorch 2.1 when the device is CPU should be OK.

huismiling (Contributor, Author):

@BenjaminBossan Hi, could you help review this PR? Thanks.

HuggingFaceDocBuilderDev (bot):

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

BenjaminBossan (Member):

Thanks @huismiling, could you please run make style on this PR?

huismiling (Contributor, Author):

make style has been done. Thanks, @BenjaminBossan.

BenjaminBossan (Member) left a review:

Thanks for these adjustments. Are the results from the tests on MLU visible somewhere?

Overall, the changes LGTM; I just have one concern about the float16 test. Please check my comment.

huismiling (Contributor, Author):

The results from the tests on MLU are as follows; more details are in the attached mlu_pytest.txt.

== 4368 passed, 1483 skipped, 1 xfailed, 2981 warnings in 5991.03s (1:39:51) ===

BenjaminBossan (Member) left a review:

Thanks for adding the skipping logic. There's just a small typo there, the rest should be good.

@@ -464,13 +465,16 @@ def _test_merge_layers_fp16(self, model_id, config_cls, config_kwargs):
        if ("gpt2" in model_id.lower()) and (config_cls != LoraConfig):
            self.skipTest("Merging GPT2 adapters not supported for IA³ (yet)")

+       if (self.torch_device in ["cpu"]) and (version.parse(torch.__version__) <= version.parse(2.1)):
BenjaminBossan (Member):

Suggested change:
-       if (self.torch_device in ["cpu"]) and (version.parse(torch.__version__) <= version.parse(2.1)):
+       if (self.torch_device in ["cpu"]) and (version.parse(torch.__version__) <= version.parse("2.1")):

huismiling (Contributor, Author):

Sorry for that. Fixed!

BenjaminBossan (Member) left a review:

Fantastic, thanks for the PR. Keep us in the loop if tests should start failing for Cambricon MLUs.

BenjaminBossan merged commit 63a536b into huggingface:main on Jun 6, 2024. 14 checks passed.