generate_w_clip ERROR #2

Closed

wuxingywa opened this issue May 27, 2024 · 10 comments

@wuxingywa

Very good work!

I encountered the following error while running this function call:

all_generated_str = generate_w_clip(model, tokenizer, vis_processor, text, image, device=device, verbose=False, final_res_num=1, args=args,
                                    return_max_probs=True, return_clip_scores=True, clip_scorer=clip_scorer)

GPU RTX 3090
python 3.9.19
torch 2.0.0

Traceback (most recent call last):
  File "root/code/CLIP-Guided-Decoding/clip_decode.py", line 90, in <module>
    all_generated_str = generate_w_clip(model, tokenizer, vis_processor, text, image, device=device, verbose=False, final_res_num=1, args=args,
  File "root/code/CLIP-Guided-Decoding/gen/clip_guided.py", line 412, in generate_w_clip
    outputs = _generate_original(model, tokenizer, image_processor, question, image, args=args,  device=device, spec=spec, clip_scorer=clip_scorer, **kwargs)
  File "root/code/CLIP-Guided-Decoding/\gen\clip_guided.py", line 208, in _generate_original
    outputs = generate_wrapper.generate(
  File "root/code/CLIP-Guided-Decoding/model_utils/generate_wrapper.py", line 33, in generate
    final_output = self.process_output(outputs, start_index=input_ids.shape[1])
  File "root/code/CLIP-Guided-Decoding/model_utils/generate_wrapper.py", line 84, in process_output
    output_token_probs = scores[i1, i2, token_indexs].t() # out: (num_samples, num_tokens)
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [16, 3], [16, 3], [6, 3]
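
For reference, the IndexError itself comes from PyTorch's rule that all advanced-indexing tensors must broadcast to a common shape. Below is a minimal, self-contained sketch of that failure; the tensors are dummies, only the index shapes are taken from the traceback above, and the vocabulary size is made up:

import torch

scores = torch.randn(16, 3, 32000)                  # hypothetical (..., vocab_size) score tensor
i1 = torch.zeros(16, 3, dtype=torch.long)
i2 = torch.zeros(16, 3, dtype=torch.long)
token_indexs = torch.zeros(6, 3, dtype=torch.long)  # built from output tokens of the wrong length

# (16, 3) and (6, 3) cannot be broadcast together, so this raises the same
# "IndexError: shape mismatch: indexing tensors could not be broadcast together ..."
output_token_probs = scores[i1, i2, token_indexs].t()
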
@d-ailin (Owner) commented May 27, 2024

Thanks for your interest!

May I know the version of transformers and which model (e.g., LLaVA, InstructBLIP, or mPLUG-Owl) you are using?

@wuxingywa (Author)

I am using llava-v1.5, and the transformers version is 4.31.0, the one provided in your code.

@d-ailin (Owner) commented May 27, 2024

It seems LLaVA has changed some of its interface code (e.g., output_ids no longer contains input_ids). I might still need some time to adapt the code to the latest version of LLaVA. If possible, could you run the code with LLaVA v1.1.3 first? It is the exact model version used in our experiments.

git clone --depth 1 --branch v1.1.3 https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .

After installing LLaVA, you can then install the custom transformers package. Thanks!
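
For context, here is a rough, self-contained sketch of the interface change described above; the tensors are dummies and the names are illustrative, not the repository's actual wrapper code:

import torch

input_ids = torch.randint(0, 100, (1, 5))            # a 5-token prompt

# Older LLaVA (e.g. v1.1.3): generate() returns prompt + new tokens, so the wrapper
# slices the completion off with start_index = input_ids.shape[1].
old_output_ids = torch.randint(0, 100, (1, 5 + 3))   # 5 prompt tokens + 3 new tokens
completion = old_output_ids[:, input_ids.shape[1]:]  # shape (1, 3) -- as expected

# Newer LLaVA: generate() returns only the new tokens, so the same slice drops real
# tokens (here it leaves an empty tensor), and the token indices no longer line up
# with the per-step scores, which surfaces as the IndexError in process_output above.
new_output_ids = torch.randint(0, 100, (1, 3))
completion = new_output_ids[:, input_ids.shape[1]:]  # shape (1, 0) -- mismatch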

@wuxingywa (Author)

Thank you for your reply

I installed LLaVA v1.1.3 following your instructions (with the weights from https://huggingface.co/liuhaotian/llava-v1.5-7b), but there is still a problem:

Traceback (most recent call last):
  File "root/code/CLIP-Guided-Decoding/clip_decode", line 90, in <module>
    all_generated_str = generate_w_clip(model, tokenizer, vis_processor, text, image, device=device, verbose=False, final_res_num=1, args=args,
  File "/root/gen/clip_guided.py", line 412, in generate_w_clip
    outputs = _generate_original(model, tokenizer, image_processor, question, image, args=args,  device=device, spec=spec, clip_scorer=clip_scorer, **kwargs)
  File "/root/gen/clip_guided.py", line 266, in _generate_original
    clip_score = clip_scorer.get_clip_score(sentence, image)
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/lib/clip_utils.py", line 44, in get_clip_score
    text_features = clip_vis_enc_model.encode_text(text_tokens.to(device))
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/open_clip/model.py", line 363, in encode_text
    features = self.text(text)
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/open_clip/transformer.py", line 685, in forward
    x = x + self.positional_embedding[:seq_len].to(cast_dtype)
RuntimeError: The size of tensor a (77) must match the size of tensor b (64) at non-singleton dimension 1

Process finished with exit code 1

I do not know whether the problem is with the open_clip version (open-clip-torch==2.24.0) or with the ViT-SO400M-14-SigLIP-384 weights (https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384/tree/main).

By the way, my clip_scorer is created locally like this:

clip_scorer = CLIPModel(model_pretrain="/root/model/ViT-SO400M-14-SigLIP-384/open_clip_pytorch_model.bin",device=device)

@suiyize commented May 28, 2024

I met the same error

@d-ailin (Owner) commented May 28, 2024

Thanks for your feedback!

This bug was previously mentioned in mlfoundations/open_clip#660 (comment). It could potentially be solved by updating the timm package, e.g.:

pip install git+https://github.com/rwightman/pytorch-image-models.git

Please let me know if this solves the issue. Thanks!

@suiyize commented May 28, 2024

Oh no, I just tried timm 0.9.8, 0.9.9, 0.9.10, 0.9.11, 0.9.12, 0.9.16, 1.0.3, and the latest dev version (1.0.4.dev), and it doesn't seem to fix the problem; it's still the same bug.

@d-ailin (Owner) commented May 28, 2024

It is a bit weird XD. Just in case you are using inference.ipynb, could you restart the Python kernel used in the notebook and rerun the code so that the updated Python environment takes effect?

@suiyize commented May 29, 2024

I figured out what the problem is; you just need to make a slight change in the following code:

    def get_clip_score(self, text, image):
        clip_vis_enc_model = self.model
        device = self.device
        clip_vis_enc_preprocess = self.preprocess
        clip_vis_enc_tokenizer = self.tokenizer

        clip_vis_enc_model.eval()
        clip_vis_enc_model.to(device)

        with torch.no_grad():
            # text_tokens = clip_vis_enc_tokenizer([text])
            # Pass the model's context length explicitly so the tokenizer pads to the length
            # the text tower expects (64 for this SigLIP model) instead of the default 77.
            text_tokens = clip_vis_enc_tokenizer([text], context_length=clip_vis_enc_model.context_length)

            image_input = clip_vis_enc_preprocess(image).to(device)
            image_features = clip_vis_enc_model.encode_image(image_input.unsqueeze(0).to(device))
            text_features = clip_vis_enc_model.encode_text(text_tokens.to(device))
            image_features = F.normalize(image_features, p=2, dim=-1)
            text_features = F.normalize(text_features, p=2, dim=-1)

            scores = text_features @ image_features.T

        return scores.squeeze().item()
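
For anyone reproducing this outside the repository's CLIPModel wrapper, here is a standalone sketch of the same fix. The model name and local checkpoint path follow this thread; whether the explicit context_length is needed depends on the installed open_clip/timm versions:

import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-SO400M-14-SigLIP-384",
    pretrained="/root/model/ViT-SO400M-14-SigLIP-384/open_clip_pytorch_model.bin",
)
tokenizer = open_clip.get_tokenizer("ViT-SO400M-14-SigLIP-384")

# In the environment from this thread the tokenizer pads to 77 tokens by default,
# while the SigLIP text tower only has 64 positional embeddings -- hence the
# RuntimeError above. Passing the model's context length explicitly avoids it.
text_tokens = tokenizer(["a photo of a dog"], context_length=model.context_length)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)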

@d-ailin (Owner) commented May 29, 2024

I will update the code accordingly later, and thanks for your time! :)

@d-ailin d-ailin closed this as completed May 30, 2024