
GPT Neo past_key_values unexpected behaviour #11787

Closed
edwinagnew opened this issue May 20, 2021 · 6 comments · Fixed by #13491
Labels: WIP

Comments

@edwinagnew

I have been successfully using the GPT2LMHeadModel module for text generation for some time, and I recently tried to reuse the code to generate with GPTNeoForCausalLM. Though the documentation appears identical, I get the error "ValueError: not enough values to unpack (expected 2, got 1)" for the line output, past = self.model(context, past_key_values=past, use_cache=True).values() (which works fine for GPT-2).

Is this a bug or has the documentation been copied incorrectly? Would appreciate any tips for fixing.

Many thanks
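
For context, here is a minimal sketch of the kind of incremental-decoding loop described above (the gpt2 checkpoint, the prompt, and the greedy token choice are illustrative assumptions, not taken from the report); the unpacking line is the one that works for GPT-2 but reportedly fails for GPT-Neo:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = tokenizer("The quick brown fox", return_tensors="pt").input_ids
past, generated = None, []

with torch.no_grad():
    for _ in range(10):
        # First step feeds the whole prompt; later steps feed only the newest token,
        # with past_key_values carrying the cached context.
        output, past = model(context, past_key_values=past, use_cache=True).values()
        next_token = output[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        generated.append(next_token.item())
        context = next_token

print(tokenizer.decode(generated))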

@edwinagnew edwinagnew changed the title from "GPT Neo past_key_vlaues unexpected behaviour" to "GPT Neo past_key_values unexpected behaviour" on May 20, 2021
@Express50

I encountered a similar problem when trying to use GPT-Neo with PPLM (https://github.com/uber-research/PPLM). It seems that GPT-Neo's past_key_values returns and consumes key-value tensors as well as (I'm guessing) feed-forward tensors:

from transformers import AutoTokenizer, GPTNeoForCausalLM

# Checkpoint/prompt assumed; the shapes below match the 125M model and a 3-token prompt
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

inputs = tokenizer("Hello world!", return_tensors='pt')
outputs = model(**inputs)
past = outputs.past_key_values

for idx, p in enumerate(past):
    print(f'{idx}: {tuple(elem.shape for elem in p)}')

# output
# 0: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 1: (torch.Size([1, 3, 768]),)
# 2: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 3: (torch.Size([1, 3, 768]),)
# 4: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 5: (torch.Size([1, 3, 768]),)
# 6: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 7: (torch.Size([1, 3, 768]),)
# 8: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 9: (torch.Size([1, 3, 768]),)
# 10: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 11: (torch.Size([1, 3, 768]),)

GPT-2 correctly returns just the key-value tensors:

# 0: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 1: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 2: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 3: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 4: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 5: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 6: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 7: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 8: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 9: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 10: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))
# 11: (torch.Size([1, 12, 3, 64]), torch.Size([1, 12, 3, 64]))

@Express50

After some more testing, the above seems to be caused by the local attention layers in GPT-Neo's default configuration. When specifying config = GPTNeoConfig(attention_types=[[["global"], 24]]), I get past_key_values shaped like GPT-2's:

# 0: (torch.Size([1, 16, 3, 128]), torch.Size([1, 16, 3, 128]))
# 1: (torch.Size([1, 16, 3, 128]), torch.Size([1, 16, 3, 128])) 
# 2: (torch.Size([1, 16, 3, 128]), torch.Size([1, 16, 3, 128])) 
# 3: (torch.Size([1, 16, 3, 128]), torch.Size([1, 16, 3, 128])) 
# 4: (torch.Size([1, 16, 3, 128]), torch.Size([1, 16, 3, 128])) 
# ...
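
For reference, a minimal sketch that reproduces this check with a randomly initialized model (using the default config sizes and the gpt2 tokenizer are assumptions on my part, not taken from the comment above):

import torch
from transformers import AutoTokenizer, GPTNeoConfig, GPTNeoForCausalLM

# Default GPT-Neo config sizes (24 layers, 16 heads, hidden size 2048), but with
# every layer forced to global attention instead of alternating global/local.
config = GPTNeoConfig(attention_types=[[["global"], 24]])
model = GPTNeoForCausalLM(config).eval()  # randomly initialized; only the cache shapes matter here

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # GPT-Neo uses the GPT-2 BPE vocabulary
inputs = tokenizer("Hello world!", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, use_cache=True)

for idx, layer_past in enumerate(outputs.past_key_values):
    print(f"{idx}: {tuple(t.shape for t in layer_past)}")
# Every layer now yields a (key, value) pair of shape
# (batch_size, num_heads, sequence_length, head_dim), matching GPT-2.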

I do think the documentation for past_key_values should be updated since it currently says: "with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)"

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Express50

Hi @patil-suraj, just checking whether there has been any progress on this issue or on pull request #11630? That PR seems to fix the problem in my use case.

@finetunej

The different shape for the local attention layers is due to the sequence folding done in the current local attention implementation.
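
For intuition, here is a toy sketch of what that folding looks like (the fold_into_blocks helper, the block length, and the shapes are hypothetical illustrations, not the actual transformers code):

import torch

# Local attention reshapes the sequence into fixed-size blocks so each token only
# attends within a local window. Because of this folding, the layer caches plain
# hidden states of shape (batch, seq_len, hidden) instead of per-head (key, value)
# tensors of shape (batch, num_heads, seq_len, head_dim), hence the differing cache entries.
def fold_into_blocks(hidden_states: torch.Tensor, block_length: int) -> torch.Tensor:
    batch, seq_len, hidden = hidden_states.shape
    assert seq_len % block_length == 0, "pad the sequence to a multiple of block_length"
    return hidden_states.view(batch, seq_len // block_length, block_length, hidden)

x = torch.randn(1, 256, 768)          # (batch, seq_len, hidden)
print(fold_into_blocks(x, 64).shape)  # torch.Size([1, 4, 64, 768])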

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@LysandreJik LysandreJik reopened this Aug 30, 2021
@github-actions github-actions bot closed this as completed Sep 8, 2021
@patil-suraj patil-suraj reopened this Sep 9, 2021
@patil-suraj patil-suraj added the WIP label Sep 9, 2021