Add truncate option to tokenize (#126)
* Add truncate_text option to tokenize

This makes it possible to run tokenize on texts whose encoding is longer than the
context length, without having to guess beforehand how many characters to cut.

* add doc, rename to just "truncate", use eot_token

Co-authored-by: Jong Wook Kim <[email protected]>
rom1504 and jongwook authored Jul 19, 2021
1 parent db20393 commit a2737ac
Showing 1 changed file with 9 additions and 2 deletions.
11 changes: 9 additions & 2 deletions clip/clip.py
@@ -180,7 +180,7 @@ def patch_float(module):
     return model, _transform(model.input_resolution.item())


-def tokenize(texts: Union[str, List[str]], context_length: int = 77) -> torch.LongTensor:
+def tokenize(texts: Union[str, List[str]], context_length: int = 77, truncate: bool = False) -> torch.LongTensor:
     """
     Returns the tokenized representation of given input string(s)
@@ -192,6 +192,9 @@ def tokenize(texts: Union[str, List[str]], context_length: int = 77) -> torch.Lo
     context_length : int
         The context length to use; all CLIP models use 77 as the context length
+    truncate: bool
+        Whether to truncate the text in case its encoding is longer than the context length
     Returns
     -------
     A two-dimensional tensor containing the resulting tokens, shape = [number of input strings, context_length]
@@ -206,7 +209,11 @@ def tokenize(texts: Union[str, List[str]], context_length: int = 77) -> torch.Lo

     for i, tokens in enumerate(all_tokens):
         if len(tokens) > context_length:
-            raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
+            if truncate:
+                tokens = tokens[:context_length]
+                tokens[-1] = eot_token
+            else:
+                raise RuntimeError(f"Input {texts[i]} is too long for context length {context_length}")
         result[i, :len(tokens)] = torch.tensor(tokens)

     return result
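
The truncation branch in the hunk above can be illustrated in isolation. The following is a minimal sketch in plain Python lists, not the code from clip/clip.py: the helper name `fit_to_context` and the token ids `SOT = 1` and `EOT = 2` are placeholders assumed for illustration, and the zero padding stands in for the zero-initialised `result` tensor.

```python
from typing import List


def fit_to_context(tokens: List[int], context_length: int, eot_token: int,
                   truncate: bool = False) -> List[int]:
    """Return `tokens` padded with zeros or truncated to `context_length` ids.

    Hypothetical helper mirroring the loop body in the diff; it is not part
    of the clip package.
    """
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(
                f"Input is too long for context length {context_length}"
            )
        # Keep the first `context_length` ids (slicing copies the list), then
        # overwrite the last one so the sequence still ends with end-of-text.
        tokens = tokens[:context_length]
        tokens[-1] = eot_token
    # Right-pad with zeros up to the fixed context length.
    return tokens + [0] * (context_length - len(tokens))


# Placeholder ids, assumed for this sketch only.
SOT, EOT = 1, 2

encoded = [SOT, 10, 11, 12, 13, 14, EOT]  # 7 ids, longer than the context

short = fit_to_context([SOT, 10, EOT], context_length=5, eot_token=EOT)
long_ = fit_to_context(encoded, context_length=5, eot_token=EOT, truncate=True)

print(short)  # [1, 10, 2, 0, 0]
print(long_)  # [1, 10, 11, 12, 2]
```

Overwriting the last kept id with `eot_token` means a truncated sequence still ends with the end-of-text marker, which is presumably why the commit message notes "use eot_token". With the signature shown in the diff, a caller opts in by passing `truncate=True` to `tokenize`; the default `truncate=False` keeps the earlier behaviour of raising `RuntimeError`.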
