generate_w_clip ERROR #2

Closed

wuxingywa opened this issue May 27, 2024 · 10 comments

@wuxingywa

Very good work!

I encountered the following error while running this function call:

all_generated_str = generate_w_clip(model, tokenizer, vis_processor, text, image, device=device, verbose=False, final_res_num=1, args=args,
                                    return_max_probs=True, return_clip_scores=True, clip_scorer=clip_scorer)

GPU RTX 3090
python 3.9.19
torch 2.0.0

Traceback (most recent call last):
  File "root/code/CLIP-Guided-Decoding/clip_decode.py", line 90, in <module>
    all_generated_str = generate_w_clip(model, tokenizer, vis_processor, text, image, device=device, verbose=False, final_res_num=1, args=args,
  File "root/code/CLIP-Guided-Decoding/gen/clip_guided.py", line 412, in generate_w_clip
    outputs = _generate_original(model, tokenizer, image_processor, question, image, args=args,  device=device, spec=spec, clip_scorer=clip_scorer, **kwargs)
  File "root/code/CLIP-Guided-Decoding/\gen\clip_guided.py", line 208, in _generate_original
    outputs = generate_wrapper.generate(
  File "root/code/CLIP-Guided-Decoding/model_utils/generate_wrapper.py", line 33, in generate
    final_output = self.process_output(outputs, start_index=input_ids.shape[1])
  File "root/code/CLIP-Guided-Decoding/model_utils/generate_wrapper.py", line 84, in process_output
    output_token_probs = scores[i1, i2, token_indexs].t() # out: (num_samples, num_tokens)
IndexError: shape mismatch: indexing tensors could not be broadcast together with shapes [16, 3], [16, 3], [6, 3]
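
For reference, the IndexError itself comes from PyTorch's rule that all advanced-indexing tensors must broadcast to a common shape. Below is a minimal, self-contained sketch of that failure; the tensors are dummies, only the index shapes are taken from the traceback above, and the vocabulary size is made up:

import torch

scores = torch.randn(16, 3, 32000)                  # hypothetical (..., vocab_size) score tensor
i1 = torch.zeros(16, 3, dtype=torch.long)
i2 = torch.zeros(16, 3, dtype=torch.long)
token_indexs = torch.zeros(6, 3, dtype=torch.long)  # built from output tokens of the wrong length

# (16, 3) and (6, 3) cannot be broadcast together, so this raises the same
# "IndexError: shape mismatch: indexing tensors could not be broadcast together ..."
output_token_probs = scores[i1, i2, token_indexs].t()
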
@d-ailin (Owner) commented May 27, 2024

Thanks for your interest!

May I know the version of transformers and which model (e.g., LLaVA, InstructBLIP, or mPLUG-Owl) you are using?

@wuxingywa (Author)

I am using llava-v1.5, and the transformers version is 4.31.0, the one provided in your code.

@d-ailin (Owner) commented May 27, 2024

It seems LLaVA has changed some of its interface code (e.g., output_ids no longer contains input_ids). I might still need some time to adapt the code to the latest version of LLaVA. If possible, could you run the code with LLaVA v1.1.3 first? It is the exact model version used in our experiments.

git clone --depth 1 --branch v1.1.3 https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .

After installing LLaVA, you can then install the custom transformers package. Thanks!
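
For context, here is a rough, self-contained sketch of the interface change described above; the tensors are dummies and the names are illustrative, not the repository's actual wrapper code:

import torch

input_ids = torch.randint(0, 100, (1, 5))            # a 5-token prompt

# Older LLaVA (e.g. v1.1.3): generate() returns prompt + new tokens, so the wrapper
# slices the completion off with start_index = input_ids.shape[1].
old_output_ids = torch.randint(0, 100, (1, 5 + 3))   # 5 prompt tokens + 3 new tokens
completion = old_output_ids[:, input_ids.shape[1]:]  # shape (1, 3) -- as expected

# Newer LLaVA: generate() returns only the new tokens, so the same slice drops real
# tokens (here it leaves an empty tensor), and the token indices no longer line up
# with the per-step scores, which surfaces as the IndexError in process_output above.
new_output_ids = torch.randint(0, 100, (1, 3))
completion = new_output_ids[:, input_ids.shape[1]:]  # shape (1, 0) -- mismatch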

@wuxingywa (Author)

Thank you for your reply

I installed LLaVA v1.1.3 following your instructions (with the weights from https://huggingface.co/liuhaotian/llava-v1.5-7b), but there is still a problem:

Traceback (most recent call last):
  File "root/code/CLIP-Guided-Decoding/clip_decode", line 90, in <module>
    all_generated_str = generate_w_clip(model, tokenizer, vis_processor, text, image, device=device, verbose=False, final_res_num=1, args=args,
  File "/root/gen/clip_guided.py", line 412, in generate_w_clip
    outputs = _generate_original(model, tokenizer, image_processor, question, image, args=args,  device=device, spec=spec, clip_scorer=clip_scorer, **kwargs)
  File "/root/gen/clip_guided.py", line 266, in _generate_original
    clip_score = clip_scorer.get_clip_score(sentence, image)
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/lib/clip_utils.py", line 44, in get_clip_score
    text_features = clip_vis_enc_model.encode_text(text_tokens.to(device))
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/open_clip/model.py", line 363, in encode_text
    features = self.text(text)
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/CGD/lib/python3.9/site-packages/open_clip/transformer.py", line 685, in forward
    x = x + self.positional_embedding[:seq_len].to(cast_dtype)
RuntimeError: The size of tensor a (77) must match the size of tensor b (64) at non-singleton dimension 1

Process finished with exit code 1

I do not know whether the problem is with the open_clip version (open-clip-torch==2.24.0) or with the ViT-SO400M-14-SigLIP-384 weights (https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384/tree/main).

By the way, my clip_scorer is created locally like this:

clip_scorer = CLIPModel(model_pretrain="/root/model/ViT-SO400M-14-SigLIP-384/open_clip_pytorch_model.bin",device=device)

@suiyize commented May 28, 2024

I met the same error

@d-ailin (Owner) commented May 28, 2024

Thanks for your feedback!

This bug was previously mentioned in mlfoundations/open_clip#660 (comment). It could potentially be solved by updating the timm package, e.g.:

pip install git+https://github.com/rwightman/pytorch-image-models.git

Please let me know if this solves the issue. Thanks!

@suiyize commented May 28, 2024

Oh no, I just tried timm 0.9.8, 0.9.9, 0.9.10, 0.9.11, 0.9.12, 0.9.16, 1.0.3, and the latest dev version (1.0.4.dev), and it doesn't seem to fix the problem; it's still the same bug.

@d-ailin (Owner) commented May 28, 2024

It is a bit weird XD. Just in case you are using inference.ipynb, could you restart the Python kernel used in the notebook and rerun the code so that the updated Python environment takes effect?

@suiyize commented May 29, 2024

I figured out what the problem is; you just need to make a slight change in the following code:

    def get_clip_score(self, text, image):
        clip_vis_enc_model = self.model
        device = self.device
        clip_vis_enc_preprocess = self.preprocess
        clip_vis_enc_tokenizer = self.tokenizer

        clip_vis_enc_model.eval()
        clip_vis_enc_model.to(device)

        with torch.no_grad():
            # text_tokens = clip_vis_enc_tokenizer([text])
            # Pass the model's context length explicitly so the tokenizer pads to the length
            # the text tower expects (64 for this SigLIP model) instead of the default 77.
            text_tokens = clip_vis_enc_tokenizer([text], context_length=clip_vis_enc_model.context_length)

            image_input = clip_vis_enc_preprocess(image).to(device)
            image_features = clip_vis_enc_model.encode_image(image_input.unsqueeze(0).to(device))
            text_features = clip_vis_enc_model.encode_text(text_tokens.to(device))
            image_features = F.normalize(image_features, p=2, dim=-1)
            text_features = F.normalize(text_features, p=2, dim=-1)

            scores = text_features @ image_features.T

        return scores.squeeze().item()
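
For anyone reproducing this outside the repository's CLIPModel wrapper, here is a standalone sketch of the same fix. The model name and local checkpoint path follow this thread; whether the explicit context_length is needed depends on the installed open_clip/timm versions:

import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-SO400M-14-SigLIP-384",
    pretrained="/root/model/ViT-SO400M-14-SigLIP-384/open_clip_pytorch_model.bin",
)
tokenizer = open_clip.get_tokenizer("ViT-SO400M-14-SigLIP-384")

# In the environment from this thread the tokenizer pads to 77 tokens by default,
# while the SigLIP text tower only has 64 positional embeddings -- hence the
# RuntimeError above. Passing the model's context length explicitly avoids it.
text_tokens = tokenizer(["a photo of a dog"], context_length=model.context_length)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)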

@d-ailin (Owner) commented May 29, 2024

I will update the code accordingly later, and thanks for your time! :)

@d-ailin d-ailin closed this as completed May 30, 2024